PG Performance
FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2021
Comparing database optimisation techniques in PostgreSQL
Indexes, query writing and the query optimiser
ELIZABETH INERSJÖ
KTH
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
© 2021
Abstract
Databases are all around us, and ensuring their efficiency is of great importance.
Database optimisation has many parts and many methods; two of these parts
are database tuning and query optimisation. These can be further split into
methods such as indexing. Indexing techniques have been studied and compared
between Database Management Systems (DBMSs) to see how much they can
improve query execution time, and many guides have been written on how to
implement query optimisation and indexes. In this thesis, the question "How
does indexing and query optimisation affect response time in PostgreSQL?" is
posed, and it is answered by investigating previous studies and theory to find
different optimisation techniques and compare them to each other. The purpose
of this research is to provide more information about how optimisation
techniques can be implemented and to map out when each method should be
used. This was partly done to provide learning material for students, but also
for people who are starting to learn PostgreSQL. The work was carried out
through a literature study and an experiment performed on a database with
different table sizes, to see how the optimisations scale to larger systems.
What was found is that there are many use cases for optimisation, which
mainly depend on the query performed and the type of data. From both the
literature study and the experiment, the main take-away points are that indexes
can vastly improve performance, but can also degrade it if used incorrectly.
The main use cases for indexes are short queries and queries using
spatio-temporal data - although spatio-temporal data should be researched
further. Using the DBMS optimiser did not show any difference in execution
time for queries, while correctly implemented query tuning techniques
vastly improved execution time. The main use cases for query tuning are
long queries and nested queries. However, most systems benefit from some
sort of query tuning, as it does not have to cost much in terms of memory or
CPU cycles, whereas indexes add additional overhead and need some memory.
Implementing proper optimisation techniques could improve both costs and
help with environmental sustainability by utilising resources more
effectively.
Keywords
PostgreSQL, Query optimisation, Query tuning, Database indexing, Database
tuning, DBMS
Sammanfattning
Databases are all around us, and having efficient databases is very
important. Database optimisation has many different parts, two of which are
database tuning and query optimisation. These two parts can in turn be divided
into several methods, such as indexing. Indexing methods have been studied
before, and have also been compared between DBMSs (Database Management
Systems), to see how much an index can improve performance. Many books have
also been written on how to implement indexes and query optimisation. In this
bachelor's thesis the question "How does indexing and query optimisation
affect performance in PostgreSQL?" is posed. It is answered by examining
previous experiments and books, in order to find different optimisation
techniques and compare them with each other. The purpose of this work was to
implement and map out where and when these methods can be used, in order to
help students and people who want to learn PostgreSQL. This was done by
conducting a literature study and an experiment on a database with different
table sizes, in order to see how the methods scale to larger systems.
The results show that there are many different use cases for optimisation,
which depend on the queries and on the type of data in the database. From both
the literature study and the experiment, the results showed that indexing can
improve performance to varying degrees, in some cases very much, but that
performance can deteriorate if indexes are implemented incorrectly. The main
use cases for indexing are short queries and databases that use
spatio-temporal data - although spatio-temporal data should be investigated
further. Using the database system's optimiser showed neither improvement nor
deterioration, while a correct rewrite of a query could improve performance
considerably. The main use case for rewriting queries is long queries and
nested queries. However, many systems can benefit from rewriting queries for
performance, since it can cost very little in terms of memory and CPU, unlike
indexing, which needs more memory and creates so-called overhead. Implementing
optimisation techniques can both reduce operating costs and contribute to
sustainable development, by using resources more effectively.
Keywords
PostgreSQL, Query optimisation, DBMS, Query tuning, Database optimisation,
Indexing
Acknowledgements
I would like to thank Leif Lindbäck, the supervisor for this thesis, for making
this thesis possible. You helped me a lot with the planning and narrowing
down of the ideas, as well as providing me with an examiner.
I would also like to thank Thomas Sjöland for agreeing to be my examiner.
Lastly, I would like to thank my friend for helping me by answering
questions about report structure and for proofreading.
Thank you.
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Sustainability and ethics . . . . . . . . . . . . . . . . . . . . 4
1.5 Research Methodology . . . . . . . . . . . . . . . . . . . . . 4
1.6 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.7 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . 5
2 Background 6
2.1 Database systems . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Relational databases . . . . . . . . . . . . . . . . . . 6
2.1.2 Database management systems . . . . . . . . . . . . . 7
2.2 Structured query language . . . . . . . . . . . . . . . . . . . 8
2.2.1 Relational algebra . . . . . . . . . . . . . . . . . . . 8
2.2.2 PostgreSQL . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Queries . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.4 Views and materialised views . . . . . . . . . . . . . 11
2.3 Database tuning . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Database memory . . . . . . . . . . . . . . . . . . . 11
2.3.2 Indexing . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Index types . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 Tuning variables . . . . . . . . . . . . . . . . . . . . 21
2.4 Query optimisation . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 The query optimiser . . . . . . . . . . . . . . . . . . 23
2.4.2 The PostgreSQL optimiser . . . . . . . . . . . . . . . 23
2.5 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.1 Database performance tuning and query
optimization . . . . . . . . . . . . . . . . . . . . . . 26
3 Method 37
3.1 Research methods . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.1 Quantitative and qualitative methods . . . . . . . . . . 37
3.1.2 Inductive and deductive approach . . . . . . . . . . . 38
3.1.3 Subquestions . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Applied methods and research process . . . . . . . . . . . . . 39
3.2.1 The chosen methods . . . . . . . . . . . . . . . . . . 39
3.2.2 The process . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 Quality assurance . . . . . . . . . . . . . . . . . . . . 41
4 Experiment 42
4.1 Experiment design . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.2 Docker and the docker environment . . . . . . . . . . 43
4.1.3 Other software . . . . . . . . . . . . . . . . . . . . . 44
4.1.4 Method and purpose . . . . . . . . . . . . . . . . . . 44
4.1.5 Database design . . . . . . . . . . . . . . . . . . . . 44
4.1.6 Queries . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.7 Improved queries . . . . . . . . . . . . . . . . . . . . 49
4.1.8 Keys and indexing structure . . . . . . . . . . . . . . 50
4.1.9 The experiment tests . . . . . . . . . . . . . . . . . . 51
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.1 Other results . . . . . . . . . . . . . . . . . . . . . . 57
6 Discussion 63
6.1 The result . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.1.1 Reliability Analysis . . . . . . . . . . . . . . . . . . . 68
6.1.2 Dependability Analysis . . . . . . . . . . . . . . . . . 69
6.1.3 Validity Analysis . . . . . . . . . . . . . . . . . . . . 69
6.2 Problems and sources of error . . . . . . . . . . . . . . . . . 69
6.2.1 Problems . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2.2 Sources of error . . . . . . . . . . . . . . . . . . . . . 70
6.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4 Sustainability . . . . . . . . . . . . . . . . . . . . . . . . . . 72
References 80
C Indexes 91
D Detailed graphs 93
D.0.1 Baseline test . . . . . . . . . . . . . . . . . . . . . . 93
D.0.2 Improved queries . . . . . . . . . . . . . . . . . . . . 96
D.0.3 Hash index . . . . . . . . . . . . . . . . . . . . . . . 99
D.0.4 B-tree index . . . . . . . . . . . . . . . . . . . . . . . 100
List of Figures
D.13 Execution time for the B-tree index implemented for query 2. . 102
D.14 Execution time for the B-tree index implemented for query 3. . 102
D.15 Execution time for the B-tree index implemented for query 4. . 103
D.16 Execution time for the B-tree index implemented for query 5. . 103
List of acronyms and abbreviations
CD Compact Disc
I/O Input/Output
ID Identity Document
Chapter 1
Introduction
is used to handle the database [2]. The three-tier architecture can be seen in
Figure 1.1.
What does this look like in practice? Whenever a user requests a website,
the request is sent to the web server, which asks the database to retrieve
or operate on the necessary data, and the results are then finally displayed
to the user [2]. Applying this logic to a social media application: logging
into an account requires access to a database, and loading the post history,
or even the message history, also needs access to a database. The database is
used to store the related data and retrieve it efficiently [1, pg.4].
1.1 Background
Now that it has been concluded that databases are all around us and used in
variety of situations, it would be very noticeable if they were slow. This is
due to how it takes just three seconds for users to drop a website if it’s still
loading, according to Sitechecker [4], which is a company that offers resources
to analyse statistics on web pages. Their target audience is other companies
that have some type of web traffic to monitor, and offer customer stories and
ratings to prove that the product they are selling is reliable.
Databases are often connected to applications - these are called database
applications [1, pg.9] - such as social media. As development has brought
us faster and faster internet, internet speed can no longer be blamed for slow
1.2 Problem
There are several methods for optimising a database system and, as stated in
the introduction, ensuring efficiency and speed is important for many
different reasons. But as there are many methods of optimisation, which ones
should be used? That is a question this thesis aims to provide a starting
point for answering. Having a compiled document with methods, their use cases,
and how efficient they are in practice could simplify the process of choosing
methods. PostgreSQL specifically is a popular open-source DBMS, and providing
more information to the community could be valuable.
The research question is as follows: "How does indexing and query
optimisation affect response time in PostgreSQL?"
1.3 Purpose
The purpose of this report is to describe and compare different methods for
optimising database systems. The purpose of the project is to develop an
understanding of how database tuning and query optimisation operate. It is
also to create material that can be used for teaching purposes in database
courses. This report should also be able to serve as a starting point for
further experimentation and research.
1.6 Delimitations
Only a couple of optimisation methods are chosen for detailed study; these
methods are chosen based on the availability of information and the
delimitations of the performed experiment. The chosen areas are database
indexing - where the indexes are chosen based on the available data - using
the PostgreSQL optimiser, as well as query tuning.
The delimitations of the experiment are to use PostgreSQL as the database
system and query language; the methods evaluated are limited to software
improvements. The database has a simple design but contains a large amount of
data, and the number of queries, indexes, and query improvements are based on the
Chapter four describes the experiment parameters and how it was performed.
Chapter five compiles the results for the experiment and the literature study.
Chapter six discusses the result and the evaluation of the result and methods.
Chapter seven contains the conclusion, answers to the research question posed,
and reflections about the work.
Chapter 2
Background
This chapter provides the basic information needed to understand the rest of
the report, as well as some related works for the literature study. It starts
by briefly going over some basics of SQL and database systems, then moves on
to describing memory aspects of databases and indexes to provide a background
for tuning, as well as explaining what query optimisation is, before moving on
to the related works.
switching to more efficient parts. For example, the DBMS often uses caching
and buffers to improve performance. Caching means that data retrieved from
disk is stored for a while - there are different methods to decide for how
long - with the prediction that it might be used again. This speeds up the
process: if the cached data is used again, the Central Processing Unit (CPU)
does not need to wait for retrieval from disk and can use the cache instead.
The buffer helps to pipeline the process of retrieving data from disk to main
memory; it ensures that while the CPU works on data, the next data set can be
loaded into the buffer, so when the CPU is done it can immediately get the new
data. This is especially helpful if more data needs to be fetched than can fit
in main memory [1, pg.20, 541-558].
The DBMS consists of multiple parts. One of them is the query optimiser,
which ensures that an appropriately effective execution plan is chosen for
every query, based on variables such as the storage system and available
indexes. The execution plan is the code that is built for the query, and it
decides in what order the different parts of the query are executed
[1, pg.655-658]. This will be described further later in this chapter.
2.2.2 PostgreSQL
PostgreSQL is an open-source object-relational database system that uses
SQL and offers features such as foreign keys - reference keys that link tables
together - updatable views, and more [10]. Views are described in the next
subsection.
PostgreSQL can also be extended by its users through new data types,
functions, index methods, and more [10]. Its architecture is a client/server
model, and a session consists of a server process - which manages database
files, accepts connections to the database from the client side, and performs
database operations requested by the clients - and the client application that
requests database actions for the server to perform. Like a typical
client/server application, the server and client do not need to be connected
to the same network and can communicate through normal internet procedures.
This is important to keep in mind, as files on the client side might not be
accessible on the server side. Like most servers, PostgreSQL can handle
multiple client connections at once [11].
Earlier it was mentioned that PostgreSQL is a relational database management
system. This means that it is a system for managing data stored in
relations - the mathematical term for a table. There are multiple ways of
organising databases [12], but relational databases are the focus of this
report. Each table in a relational database system contains a collection of
named rows, and each row has a collection of named columns that contain a
specified data type. These tables are then grouped into database schemas.
There can be multiple databases in one server, just like there can be multiple
schemas in a database. A collection of databases managed by one PostgreSQL
server is called a database cluster [12]. Another aspect of PostgreSQL is that
it supports automatic handling of foreign keys, by accepting or rejecting a
value depending on whether it exists in the referenced table. This means that
PostgreSQL will complain if the value in the foreign key column does not match
a row in the referenced table, which is done to maintain the referential
integrity of the data. The behaviour of the foreign key can be tuned to the
application by the developer [13]; this can be done by specifying the
deletion of referenced objects, the order of deletion, and other things [14].
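As a minimal sketch of how such behaviour might be specified (the table layout
is simplified, and the column names are assumptions rather than the
experiment's exact schema):

    -- Parent table; a referenced column must be unique (here, a primary key).
    CREATE TABLE titles (
        title_id text PRIMARY KEY
    );

    -- Child table with a tuned foreign key: deleting a title automatically
    -- deletes its ratings row, instead of rejecting the deletion.
    CREATE TABLE ratings (
        title_id text REFERENCES titles (title_id) ON DELETE CASCADE,
        rating   numeric
    );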
2.2.3 Queries
Here, some query concepts used in the experiment are explained.
Query operations
Two of the query operations that are used in the experiment need some closer
examination: the LIKE and IN operations. To describe them, the PostgreSQL
Tutorial website is used - a website dedicated to teaching PostgreSQL
concepts, with examples and explanations of how to use operations and build a
database [15].
The LIKE operation is used to pattern-match strings. This is done using
wildcards, which in PostgreSQL are '%' for any sequence of characters and '_'
for any single character. For example, the pattern 'Jen%' matches any string
that starts with 'Jen', while 'Jen_' matches any string starting with 'Jen'
followed by exactly one more character [16].
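As a small illustrative sketch (the person table and its name column are
assumptions, loosely based on the experiment's schema described in chapter 4):

    -- '%' matches any sequence of characters, '_' exactly one character.
    SELECT name FROM person WHERE name LIKE 'Jen%';  -- 'Jen', 'Jennifer', ...
    SELECT name FROM person WHERE name LIKE 'Jen_';  -- 'Jens', but not 'Jen'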
The IN operator is used to match a value against a list of values. It does
this by returning true if the compared value matches one of the values stated
in the IN list. It is equivalent to combining equality comparisons with OR,
although PostgreSQL executes IN queries faster than the corresponding OR
queries [17].
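A sketch of this equivalence, using the experiment's crew table (the column
name is an assumption):

    -- Using IN ...
    SELECT * FROM crew WHERE category IN ('actor', 'director', 'writer');

    -- ... is logically equivalent to chaining equality tests with OR:
    SELECT * FROM crew
    WHERE category = 'actor' OR category = 'director' OR category = 'writer';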
Nested queries
A query that executes multiple queries in one contains an inner query - also
called a subquery - and an outer query [18]. Often these types of queries can
be split into multiple separate queries. PostgreSQL executes such a query by
first executing the inner query, then passing its result on to the outer
query, and lastly executing the outer query [18].
A correlated inner query is evaluated once for each row processed by the
outer query, which differs from how a normal nested query executes, according
to Geeks for Geeks, a website dedicated to teaching programming languages
through examples [19]. As mentioned in the previous paragraph, in a normal
nested query the inner query is executed first and then the outer query. It
can also be said that a correlated query is driven by the outer query, as the
result of the inner query depends on the outer query [19]. This works
similarly to how nested loops work in any other programming language.
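The difference can be sketched as follows (table and column names are
assumptions based on the experiment's schema):

    -- Plain nested query: the inner query runs first, and its result is
    -- then passed to the outer query.
    SELECT primary_title
    FROM titles
    WHERE title_id IN (SELECT title_id FROM ratings WHERE rating >= 9);

    -- Correlated subquery: the inner query references the outer row (t),
    -- so it is conceptually re-evaluated for every outer row.
    SELECT t.primary_title
    FROM titles t
    WHERE EXISTS (SELECT 1 FROM ratings r
                  WHERE r.title_id = t.title_id AND r.votes > 1000);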
is bringing data to the primary storage from the secondary storage, for the
execution of operations on the database. In some cases, a database can be
stored in primary memory - a so-called main-memory database - which is often
done for real-time applications. But because databases often store persistent
data, some of which needs to be read or handled multiple times while it is
stored, secondary storage is needed. Databases are also generally too big to
store on a single disk, which means that multiple disks need to be used, and
the benefits of secondary storage hardware often outweigh those of primary
storage [1, pg.542-544].
Typically, the database application only needs to process small amounts of
data from the database; hence, the data needs to be accessed on disk and
effectively moved to main memory to increase the speed of execution. As
mentioned earlier, this is partly done in hardware through buffers, as there
is a noticeable difference between how quickly the CPU can process data and
how quickly data moves from disk to main memory. Other ways to do this require
a basic understanding of how the data is stored in the database and of the
hardware.
The data on disk is stored as so-called files of records, in which a record is
a set of data values that describe entities, their attributes, and relations -
i.e. a table [1, pg.560]. Files of records are often stored in data blocks -
also called pages - which are fixed-size units of storage on a disk. This is
important to note, as the transmission of data from disk to main memory is
usually done on a per-block basis. By physically storing data in contiguous
blocks on disk, performance can be improved, since related data ends up near
each other, which can prevent the arm on the disk (HDD) from having to move
longer distances. This can be further improved by prediction, which is done by
reading multiple blocks of data at once and putting them in main memory. This
can reduce the search time on disk access, although it only works if the
application is likely to need consecutive blocks and the ordering of the file
organisation allows it [1, pg.561-563].
Files can be ordered in memory in different ways. Storing the files in a
specific order is called the file organisation [23], and it can be described
as the auxiliary relationship between the records that make up the file. It is
used to identify and access any given record [1, pg.545-546]. In the database,
there are two ways to store files: the primary file organisation and the
secondary file organisation. The primary file organisation decides how file
records are physically placed on disk. This is done by using different data
structures such as heaps, hash structures, and B-trees. For example, a heap
file would not store the records in any particular order, but instead place
them as a heap would order them. Unlike the primary file organisation, the
secondary file organisation is a logical access structure that improves access
to file records based on other fields than those used for the primary file
organisation. This is often done through indexing [1, pg.545-546, 604-611].
There can be different types of records in a file; the type is decided by the
collection of field names and their corresponding data types contained in the
record. This means that records in files can be of fixed or variable length.
If a file has variable-length records, it can affect the efficiency of
indexing and search algorithms. This is due to the way files consist of
sequences of records: with fixed-length records, it is simple to calculate the
start of each field in a record based on the relative starting point of the
record in the file. Algorithms handling variable-length records therefore
often need to be more complex, which can affect the speed of execution
[1, pg.560-561]. The different ways in which variable-length files can arise
are as follows:
• The file records are of the same type, but one or more of the fields have
different sizes.
• The file records are of the same type, but one or more of the fields have
multiple values for each record; this is called a repeating field.
• The file records are of the same type, but one or more of the fields are not
mandatory.
• The file contains records of different record types, which leads to the
records being of different sizes. This often happens in clusters of related
records.
[1, pg.560-561]
As mentioned earlier, there are heap files and ordered files, which are the
main ways of storing records in a file. Heap files store records in a heap
structure, while ordered files can use many different data structures for
storage. The main benefit of using ordered files is that search algorithms
other than linear search can be used when searching for a record. Ordered
files are rarely used, though, unless a primary index is implemented
[1, pg.567-572]. The main data structures implemented for ordered files are
hash tables, hash maps, and B-trees, each of which has its pros and cons and
is chosen depending on what the file is used for [1, pg.583]. These data
structures are described in more detail later in this chapter.
2.3.2 Indexing
An index is a supplementary access structure, used to quickly find and
retrieve a record based on specified requirements. Indexes are stored as files
on disk and contain a secondary access path to reach records without having to
physically order the files [1, pg.601-602]. Without an index, a query would
have to scan an entire table to find the entries it is searching for. In a big
table, going through every element sequentially would be very inefficient
compared to using an index; in a B-tree index, for example, the search only
needs to go a couple of levels deep in the tree [1, pg.601-602]. In PostgreSQL
the index is handled by the DBMS, which among other things updates the index
when a table changes. The downside of using indexes is that updating them as
tables change adds overhead to the data manipulation operations. This means
that updating a table indirectly adds to the execution time of the data
manipulation operations [24], which is an important aspect to keep in mind
when deciding whether an index should be built on a table or not [1, pg.601].
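In PostgreSQL this trade-off is expressed very directly; a minimal sketch
(the index and column names are illustrative):

    -- Build a secondary access path on one column; from now on, the DBMS
    -- must also maintain the index on every INSERT, UPDATE and DELETE.
    CREATE INDEX idx_titles_premiered ON titles (premiered);

    -- If the maintenance overhead outweighs the lookup benefit, drop it.
    DROP INDEX idx_titles_premiered;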
Indexes are based on an index field, which can be any field in a file, or
multiple fields in the file. Multiple indexes can also be created on the same
file. As mentioned earlier, indexes are data structures used to improve search
performance, and many data structures can be used to construct them. The data
structure is chosen based on many different factors; one such factor is which
queries are expected to be run against the index. Indexes can be separated
into two main areas, single-level indexing and multilevel indexing [1,
pg.601], which are described below.
Single-level indexes
Single-level indexing using ordered elements follows the same idea as a book
index, which has a text title and the page it can be found on. This is
analogous to how the index has the index field and a field containing the
pointers to where the data can be found on disk. The index built on a file -
with multiple fields - with a specified record structure is usually based on
only one field. As mentioned earlier, the index stores the index field and a
list of pointers to each disk block that contains a record with the same index
field. The values - index fields - in an ordered index are also sorted, so
that a binary search can be performed to quickly find the desired data. How
efficient is this? If a comparison is made in the case of having both the data
file and the index file sorted, the index file is often smaller than the data
file. This means that searching through the index is still
Multi-level indexes
The idea behind a multilevel index is to reduce the part of the index that is
searched by the blocking factor (bfri) of the index - also called the fan-out
(fo). During a multilevel index search, the area that is searched is reduced
by a factor of fo, which, if larger than two, makes it more efficient than
binary search. The multi-level index works by viewing the index file as an
ordered file with a distinct value for each entry. The index file counts as
the first level of the multi-level index, and the second level is defined as a
primary index created on the first level. A block anchor is created for the
second level so that it has an entry for each block of the first level. The
blocking factor remains the same for every level of the multi-level index, as
the size of the entries remains the same - a field value and a block address.
This process is then repeated: level three is another primary index created on
the second level, et cetera. More levels are only needed if a level needs more
than one block of storage, and since each level reduces the number of entries
by a factor of fo, each level requires less storage. This also means that only
one disk block is accessed per level; thus, for a multi-level index with t
levels, only t disk blocks are accessed during a search, which increases the
speed of searches. Lastly, the last level of the index is called the top index
level, and the multi-level index can use primary, secondary and cluster
indexes [1, pg.613-614]. Multi-level indexes still suffer from the issues of
insertion and deletion of records. Dynamic multi-level indexes aim to solve
this by leaving space in blocks for the insertion of new entries, and by using
appropriate insertion/deletion algorithms for creating/deleting index blocks
as the data file grows or shrinks. This is often done by using B+-trees -
which function like B-trees but have their leaf nodes connected as well - as
the data structure [1, pg.613-614].
B-trees
B-trees are balanced search trees that are useful for equality and range
queries on data that can be ordered [25]. The PostgreSQL query planner will
consider using a B-tree if any comparison operator is used in the query.
B-tree indexes are also useful for retrieving data in sorted order, due to the
nature of B-trees [25]. PostgreSQL also supports multi-column B-trees. They
are most effective when there are constraints on the leading columns, but can
be used with any subset of the index's columns. The rule is that equality
constraints on the leading columns, plus any inequality constraint on the
first column that lacks an equality constraint, restrict the part of the index
that is scanned. Constraints on columns to the right of these index columns
are checked in the index, so fewer accesses to the table are made, but they do
not reduce the portion of the index that has to be scanned [26]. A visual
representation of a B-tree index can be seen in Figure 2.1.
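A sketch of a multi-column B-tree and a query shaped to exploit it (the column
names are assumed from the experiment's titles table):

    -- B-tree is the default index type in PostgreSQL.
    CREATE INDEX idx_titles_type_premiered ON titles (type, premiered);

    -- Equality on the leading column plus an inequality on the next column
    -- restricts the portion of the index that has to be scanned; the B-tree
    -- can also return the rows already sorted by premiered.
    SELECT primary_title
    FROM titles
    WHERE type = 'movie' AND premiered >= 2000
    ORDER BY premiered;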
Hash indexes
Hash indexes are a secondary index structure that accesses a file through
hashing a search key - which cannot be the primary key for the file's
organisation system [1, pg.633]. PostgreSQL supports persistent, on-disk hash
indexes that are crash-recoverable. One of the benefits of using a hash index
is that any data type can be indexed by it, as it only stores the hash value
of the data being indexed; thus, there is no size constraint for the data
column being indexed [28]. The use cases for hash indexes are limited, though,
as they only support single-column indexes and cannot check uniqueness, nor
can they perform range operations. They are best used for SELECT- and
UPDATE-heavy workloads that use equality scans over large tables. Another
pitfall of the hash structure is the problem of overflow; therefore, hash
indexes are most useful for mostly unique data. Because the inherent nature of
the hash structure makes expansion difficult, it is most useful for tables
with few if any insertions [28].
A hash index can be implemented in different ways [1, pg.633], but in
PostgreSQL it is done using buckets [28]. These buckets have a certain depth
and are split when there are insertions into the index [1, pg.633-635]. An
example figure of this can be seen in Figure 2.2.
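A minimal sketch of a hash index and the equality scan it is suited for (the
names are illustrative):

    CREATE INDEX idx_ratings_title_hash ON ratings USING hash (title_id);

    -- Only equality comparisons can use the hash index; a range predicate
    -- on the same column cannot.
    SELECT * FROM ratings WHERE title_id = 'tt0000001';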
GiST indexes
A GiST index is a type of index that can be tweaked by the developer, as there
are many different kinds of index strategies that can be implemented [25]. It
is based on the balanced-tree access method, used for arbitrary indexing
schemes. The main advantage of using GiST is that it allows a custom data type
with an appropriate access structure to be developed by a data type expert - a
programmer who does not have to be a database administrator [30]. How a GiST
index is used depends on what operator class is implemented, but the standard
for PostgreSQL is to include several operator classes for two-dimensional
geometric data types [25]. The operator class defines what operators can be
used on the columns in the index, for example comparison operations between
different data types [31]. GiST indexes can optimise nearest-neighbour
searches, but this depends on the operator classes defined [25]. A
multi-column GiST index can be used with query conditions that use any subset
of the index's columns. Adding additional columns restricts the entries
returned by the index. The first column is used to determine how much of the
index needs to be scanned; the index is not very effective if the first column
has only a few distinct values [26].
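A sketch of a GiST index with the built-in operator class for two-dimensional
points (the places table is hypothetical and not part of the experiment):

    CREATE TABLE places (
        name     text,
        location point
    );
    CREATE INDEX idx_places_location ON places USING gist (location);

    -- Nearest-neighbour search: order by distance (<->) to a given point;
    -- the GiST index can satisfy this ordering directly.
    SELECT name
    FROM places
    ORDER BY location <-> point '(59.35, 18.07)'
    LIMIT 5;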
SP-GiST indexes
SP-GiST indexes expand on GiST indexes by permitting the implementation of
various non-balanced disk-based data structures, such as radix trees, tries,
et cetera [25]. They support partitioned trees, which allow non-balanced tree
structures to be developed. The generally desired property of these structures
is that they divide the search into pieces of equal size [32]. The standard
operator class for an SP-GiST index in PostgreSQL is one for two-dimensional
points [25].
GIN indexes
GIN indexes are similar to the previous two, although they differ by using a
standard operator class for standard array operators [25]. GIN is specially
designed for when the items to be indexed are composite values, and the
queries performed need to search for element values within those composite
items. The word item refers to the composite value to be indexed, and the key
is an element value. The way GIN works is that it stores sets of pairs of a
key and a posting list, where the posting list is a set of row Identity
Documents (IDs) in which the key occurs. Each key value is only stored once,
even though the same row ID can occur in multiple posting lists [33].
Multi-column GIN indexes work similarly to multi-column GiST indexes; the main
difference is that the search effectiveness does not depend on which index
columns the query conditions use [26].
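As a sketch, assuming a genres column stored as a text array (in the
experiment it is a plain string, so this representation is an assumption):

    CREATE INDEX idx_titles_genres ON titles USING gin (genres);

    -- "Which titles list Comedy among their genres?" - the GIN index looks
    -- up the key 'Comedy' and returns its posting list of rows.
    SELECT primary_title FROM titles WHERE genres @> ARRAY['Comedy'];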
BRIN indexes
BRIN indexes store summaries of the values in a table over consecutive
physical block ranges [25]. They are designed to handle very large tables that
have columns with some natural correlation to where the rows are physically
stored within the table. BRIN indexes can answer queries via regular bitmap
index scans, returning all tuples in all pages - within a specified range - if
the summary information stored by the index matches the query conditions. This
summary information needs to be updated when new pages of data are filled.
This is not done when a new page is created; instead it happens when a
summarisation run is invoked. On the other hand, changing values in the table
can make an index tuple in the summary inaccurate; to solve this,
de-summarisation can be run [34]. The operator class that BRIN uses depends on
the implemented strategies. For data with a linear storage order, the data in
the index usually corresponds to the minimum and maximum values of the columns
for each block range, which makes some operations more suitable than others.
But as different types of data can be stored in this type of index, the
operations need to be chosen based on the type of data [25]. Multi-column BRIN
indexes, like GIN, have no dependence on which column is used in the query
condition, although there are few reasons why a multi-column BRIN index would
be used [26].
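A minimal sketch (the correlation between premiered and physical row order is
an assumption made for illustration):

    -- Per-block-range summaries (e.g. min/max) instead of per-row entries,
    -- which keeps the index very small even on very large tables.
    CREATE INDEX idx_titles_premiered_brin ON titles USING brin (premiered);

    -- Summarise ranges that were filled after the index was created.
    SELECT brin_summarize_new_values('idx_titles_premiered_brin');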
indexing common values, since queries on common values most often do not use
indexes anyway. This reduces the size of the index, so that many table
operations are sped up when performed via the index [36].
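A hedged sketch of such a partial index (the choice of predicate is
illustrative):

    -- Skip the common, rarely queried case and index only the rows that
    -- selective queries actually filter on.
    CREATE INDEX idx_ratings_popular ON ratings (rating) WHERE votes > 1000;

    -- Only queries whose condition implies the index predicate can use it:
    SELECT * FROM ratings WHERE votes > 1000 AND rating >= 8;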
All indexes in PostgreSQL are secondary indexes, which means that the table
rows they reference can be anywhere on the PostgreSQL data heap. Accessing the
data from an index scan therefore involves random access, which can be slow
depending on the disk drive. To make this more efficient, something called an
index-only scan is supported, meaning that a query can be answered without
accessing the heap: index entries are returned instead of consulting the heap
entries. In PostgreSQL, only B-trees, GiSTs and SP-GiSTs can support
index-only scans, and only B-trees always have built-in support for it [37].
One requirement for an index-only scan to be possible is that the query must
only reference columns stored in the index; otherwise, heap access is needed.
Another requirement is that each row retrieved is visible to the query's
Multi-Version Concurrency Control (MVCC) snapshot [37]. MVCC is what
PostgreSQL uses for concurrency control. It works by showing each query and
transaction a snapshot of how the database looked some time ago, no matter how
the data looks at the exact moment of querying. This protects the transaction
from seeing inconsistent data that could be caused by other concurrent
transactions [38]. The visibility information is not stored in the index, but
PostgreSQL keeps track of data that is old enough that it must be visible to
all future transactions. This means there is a loophole allowing data that
does not change often to use index-only scans [37]. To use this feature
effectively, a covering index can be used. This type of index is designed to
include the columns needed by a specific query. Sometimes a query needs
columns that are not part of the search condition; PostgreSQL supports this by
adding a payload that is not part of the search key with the INCLUDE keyword
[37]. This can also be used to solve the problem of missing values in indexes,
as discussed for combining indexes.
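A minimal sketch of a covering index using INCLUDE (the names are
illustrative):

    -- votes is carried as payload, not as part of the search key, so the
    -- query below can be answered by an index-only scan without heap access
    -- (subject to the MVCC visibility requirement described above).
    CREATE INDEX idx_ratings_rating_inc ON ratings (rating) INCLUDE (votes);

    SELECT rating, votes FROM ratings WHERE rating = 10;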
• Some indexes are updated too frequently because the index attribute
changes too often.
[1, pg.640]
To figure out whether any of these issues apply to the database, many DBMSs
have commands for tracing how a query is executed. After doing that, the
issues can be solved by dropping, creating, or changing indexes (to or from
cluster indexes), or by rebuilding the indexes. All of these options can
improve performance if the trace is read correctly. The reason rebuilding
indexes can improve performance is that, when there have been many deletions
on the index key, the index pages may contain unused space, which can be
reclaimed during a rebuild. Rebuilding can also solve overflow issues caused
by insertions [1, pg.640].
There are different types of scan nodes depending on the type of scan that is
performed. If the query has other operations, such as joins, sorting, et
cetera, there will be nodes above the scan nodes - meaning that the tree grows
upwards [39]. As there are different ways to perform these operations, other
nodes can also appear. The output of EXPLAIN shows a line for each node in the
plan tree, its type, and the estimated cost of executing that node. The costs
are estimated in arbitrary units that depend on the planner's cost parameters.
The cost of an upper-level node includes the cost of all its child nodes.
An important thing to keep in mind is that the planner only considers
things it cares about in the cost; transmitting the result is not one of them.
This is important to note, as there can be other things that affect efficiency
that the planner does not account for [39], which means that optimising a
query is not necessarily the best solution to every efficiency problem.
To check the accuracy of the planner's estimates, the command EXPLAIN
ANALYZE can be used. This causes the EXPLAIN command to execute the query and
then display the actual row count and run time for each plan node alongside
the estimates. For executed plans the unit is milliseconds, instead of the
arbitrary cost units used by the estimates that EXPLAIN shows. EXPLAIN also
has other options, among them a BUFFERS option that further helps with
analysing run-time statistics, by showing which I/O operations are the most
sensitive [39].
It is also important to note that with EXPLAIN ANALYZE, transactions need to
be rolled back when the analysed statement modifies data, as the statement is
actually executed [39]. There are other pitfalls to using EXPLAIN ANALYZE as
well, such as the statistics deviating from normal run-time execution. One
reason this happens is that no output rows are delivered to a client, which
means that transmission time and I/O conversion costs are not taken into
account. Another issue is that the overhead of EXPLAIN ANALYZE can be
significant: different operating systems have different speeds for their
gettimeofday() operations, so the measurement itself can take longer than the
actual execution time. The last pitfall to keep in mind is that EXPLAIN
results cannot be generalised between different tables; the same result cannot
be expected to apply to a large table when tested on a small one [39].
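A sketch of how these commands might be used (the query shape is
illustrative):

    -- Estimated plan only; costs are in arbitrary planner units.
    EXPLAIN SELECT * FROM ratings WHERE rating >= 9;

    -- Actually executes the query: adds true row counts and per-node times
    -- in milliseconds, plus per-node I/O statistics with BUFFERS.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT t.primary_title, r.rating
    FROM titles t
    JOIN ratings r ON r.title_id = t.title_id
    WHERE r.rating >= 9;

    -- For data-modifying statements, wrap the run in a transaction and roll
    -- back, since EXPLAIN ANALYZE really executes the statement.
    BEGIN;
    EXPLAIN ANALYZE DELETE FROM ratings WHERE votes = 0;
    ROLLBACK;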
The query planner looks at statistics to make good estimates, and it does this
for specific variables. For single-column statistics, important factors are
the total number of entries in each table and index, as well as the disk
blocks they occupy. This information is kept as part of the table in pg_class,
under the names reltuples and relpages. These two columns are not updated very
often, so they often contain old values. VACUUM or ANALYZE can be used to
update them; the values are updated incrementally as these operations are run
[40].
A common issue for slow queries is that the columns used in the query are
correlated, while the planner assumes that multiple conditions are independent
[40]. PostgreSQL supports multivariate statistics to help with this. This is
done by creating statistics objects with the CREATE STATISTICS command, which
registers an interest in a multivariate statistics object; the data collection
is still done with ANALYZE. There are different ways to handle multivariate
statistics, but the extended statistics supported in PostgreSQL are:
functional dependencies, multivariate N-distinct counts, and multivariate Most
Common Value (MCV) lists [40]. Functional dependencies are the simplest of the
extended statistics. Column a is functionally dependent on column b if
knowledge of the value in b is sufficient to determine the value in column a.
For example, with a column for social security number and also a birth-month
column, the birth month can be derived from the social security number, i.e.
the birth month is functionally dependent on the social security number. The
reason functional dependencies have their own statistics tool is that the
existence of functional dependencies affects the accuracy of estimates in
queries [40]. One important thing to note is that, as of PostgreSQL version
13, functional-dependency statistics are limited to simple equality queries
[40].
Multivariate N-distinct counts in PostgreSQL help improve the estimates of the
number of distinct values when combining more than one column - such as in
GROUP BY (a, b) operations. It is only advisable to create these objects if
combinations of columns are actually grouped; otherwise ANALYZE cycles are
wasted. Multivariate MCV lists improve the accuracy of estimates for queries
with conditions on multiple columns. This is done by ANALYZE collecting MCV
lists on combinations of columns, so the MCV list contains the most common
values collected by ANALYZE in the specified columns. It is not recommended to
do this liberally, as MCV lists are stored - unlike the information collected
by N-distinct counts - and can then take up too much memory. It is advised to
only use MCV lists on columns that are actually used in conditions together
[40].
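A sketch of declaring extended statistics on two columns that are queried
together (the column pairing is an assumption):

    -- Register interest in multivariate statistics; ANALYZE collects them.
    CREATE STATISTICS titles_type_genres_stats (ndistinct, dependencies, mcv)
        ON type, genres FROM titles;
    ANALYZE titles;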
The planner can be controlled with JOIN clauses [41]. As there are many JOIN
orders between tables that produce the same result for a query, the more
efficient ones need to be chosen. Since JOINs deal with the cartesian product,
the less calculation and processing needed for the same result, the better.
The number of JOIN possibilities grows exponentially with the number of tables
involved, and the PostgreSQL optimiser will then switch from exhaustive search
to genetic probabilistic search by limiting the number of possibilities
considered. This takes less time for the search but might not result in the
best possible plan [41]. There is less freedom for outer joins than for inner
joins for the planner [41].
on the queries that are being executed, the authors have defined different
query types, which are the following:
• Point queries return one record or parts of a record based on an equality
selection.
• Range queries return a set of records whose values lie within an interval.
• Prefix match queries are queries that use AND and LIKE statements to match
strings or sets of characters.
• Extremal queries are queries that obtain the set of records that have the
minimum or maximum of some attribute values.
• Join queries are queries that link two or more tables. There are different
types of join queries. For joins that use an equality statement (equijoins),
the optimisation process is simpler; for join queries that are not equijoins,
the system will try to execute the select statement before joining. This is
because non-equijoins often need full table scans, even when an index is
present.
The authors then go on to describe index types, how they function, and which
queries benefit most from them. There are clustering indexes - also called
primary indexes - and non-clustering indexes. These have been described
earlier in the background and will not be discussed further in this section.
They describe B-trees as good indexes for range, prefix match, and
ordering queries. They state that one benefit of using a clustering B-tree is
that the need for an ORDER BY statement can be removed, which is good to keep
in mind if sorting queries are often used on that table. Generally, though,
non-clustering indexes work best if the index covers all attributes necessary
for a query, since the query can then circumvent the need to access the table
entirely if all the information it needs is present in the index. They further
state that B-trees are useful for partial match, point, multipoint, range, and
general join queries, and that hash indexes are good for point, multipoint,
and equijoin queries [43].
The authors then describe composite indexes and their benefits. A composite
index is an index based on multiple attributes as its key. Having a dense
composite index can sometimes answer a query entirely without accessing the
table. It is best used when a query is based on most of the key attributes in
the index, rather than only one or a few of them. The main disadvantage of
this type of index is the large key size, as there are many more attributes
that may need to be updated when the table is updated. They conclude the
chapter by stating that indexes should be avoided on small tables, dense
indexes should be used for critical queries, and indexes should not be used
when the cost of updates and inserts exceeds the time saved in queries [43].
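A hedged sketch of such a composite index over the experiment's crew table
(the column choice is illustrative):

    -- Multi-attribute key; if the query touches only indexed attributes,
    -- it can be answered from the index without visiting the table.
    CREATE INDEX idx_crew_composite ON crew (title_id, person_id, category);

    SELECT person_id, category FROM crew WHERE title_id = 'tt0000001';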
The next part of the book describes query tuning and some tips on how to
implement optimisation. The authors promote tuning over indexing, writing that
inserting indexes can have a harmful global effect, while rewriting a query
can only have positive effects if done well. But what is a bad query? How is
that determined? The authors state that a query is bad when it requires too
many disk accesses and does not use the relevant indexes. They follow this up
with some tips for improving queries. One of them is to avoid DISTINCT, as it
creates overhead due to sorting. DISTINCT is only needed if the fields
returned do not contain a key, since the result is then a subset of the
relation created by the FROM and WHERE clauses. It is not needed when every
table mentioned returns fields that contain a key of the table in the select
statement - a so-called privileged table - or when every unprivileged table is
joined with a privileged one, in which case the unprivileged table is said to
reach the privileged one [43]. They also caution that many systems do not
handle subqueries well, and that the use of temporaries can cause operations
to be executed in a sub-optimal manner. Complicated correlated subqueries
often execute inefficiently and should be rewritten. A benefit of using
temporaries, however, is that they can subvert the need for an ORDER BY
statement when there are queries with slightly different bind variables. They
also warn against using HAVING statements if a WHERE statement is enough, and
encourage studying the idiosyncrasies of the system. Some systems might not
use indexes when an OR statement is involved; to circumvent this, a union can
be used. They state that the ordering of tables in the FROM statement can
affect the order of joins, especially if more than five tables are used. They
discourage the use of views, as it can lead to writing inefficient queries
[43]. Finally, rewriting nested queries is highly encouraged by the authors,
as query optimisers do not perform well on many nested queries [43].
of the large tables. This means that almost all rows contribute to the output,
even if the output size is small. The way to optimise these types of queries
is to avoid multiple full table scans and to reduce the size of the result as
soon as possible. Indexes are not needed here and should not be used, the
authors state. For joins, a hash join is most likely the better algorithm when
dealing with long queries. If GROUP BY is used in a long query, the filtering
needs to be applied first in most cases to ensure efficiency. There are times
when GROUP BY can reduce the size of the data set, but the rule of thumb is to
let the optimiser apply the selections first. Set operations can sometimes be
used to prompt alternative execution plans. This can be done by replacing NOT
EXISTS and NOT IN with EXCEPT, EXISTS and IN with INTERSECT, and using UNION
instead of multiple complex selection criteria with OR [21].
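One of these rewrites, sketched on the experiment's tables (note that NOT IN
and EXCEPT are only equivalent when the compared column is non-null and
duplicates do not matter):

    -- Anti-join phrased with NOT IN ...
    SELECT title_id FROM titles
    WHERE title_id NOT IN (SELECT title_id FROM ratings);

    -- ... rephrased with EXCEPT, which can prompt a different plan:
    SELECT title_id FROM titles
    EXCEPT
    SELECT title_id FROM ratings;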
The authors then describe the pitfalls of views, whose main use is for
encapsulation purposes. Materialised views, on the other hand, can help
improve performance, because their data is actually stored and indexes can be
created on them. A materialised view should be created if the data it is based
on does not update often, if it is not critical to have up-to-date data, if
the data in the materialised view is read often, and if many queries could
make use of it.
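A minimal materialised-view sketch (the aggregation is chosen for
illustration):

    -- The result is stored physically, can be indexed, and is refreshed
    -- explicitly when staleness becomes unacceptable.
    CREATE MATERIALIZED VIEW title_rating_summary AS
    SELECT t.type, avg(r.rating) AS avg_rating, count(*) AS n
    FROM titles t
    JOIN ratings r ON r.title_id = t.title_id
    GROUP BY t.type;

    CREATE INDEX idx_trs_type ON title_rating_summary (type);
    REFRESH MATERIALIZED VIEW title_rating_summary;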
After this section, the authors discuss partitioned tables. Their main use is
to optimise table scanning: if a query uses values in the range of the
partitioning, only one partition needs to be scanned. This means that the
partition key should be chosen to satisfy a search condition. Indexes can be
applied to these tables, and they are beneficial for short queries [21].
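A sketch of range partitioning in PostgreSQL (the partitioned copy of the
titles table is hypothetical):

    CREATE TABLE titles_part (
        title_id  text,
        premiered int
    ) PARTITION BY RANGE (premiered);

    CREATE TABLE titles_2000s PARTITION OF titles_part
        FOR VALUES FROM (2000) TO (2010);
    CREATE TABLE titles_2010s PARTITION OF titles_part
        FOR VALUES FROM (2010) TO (2020);

    -- Only the titles_2010s partition has to be scanned here:
    SELECT count(*) FROM titles_part WHERE premiered BETWEEN 2012 AND 2015;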
After this, multidimensional and spatial searches are discussed. The authors
state that spatial data often requires range queries - finding all the data
located at a certain distance or closer to a specified point in space - and
nearest-neighbour queries, which find a variable number of objects closest to
the specified point. These queries cannot be supported by one-dimensional
indexes, or even by multiple such indexes. This is where GiST indexes come
into play. They describe GiST indexes as representing points and search
conditions as rectangles, where all points within the rectangle or polygon are
returned as the result [21].
Lastly, the book concludes with the ultimate optimisation algorithm for
queries, which summarises the points brought up in this section.
The steps are the queries performed; the second column shows the execution
time without indexes, and the third column the time after indexes and
clustering were implemented. The result for the prepared query execution was
that no major difference was noticeable on single queries. The prepared query
was done by running EXPLAIN ANALYZE to ensure that the optimiser's statistics
were up to date [45].
so if the wildcard appears in the first character, then a full table search
has to be made; it explores all the avenues without filtering. The trie, on
the other hand, uses the non-wildcard characters in the search for filtering.
In this report's experiments, the trie had better search performance than
the B+-tree when it came to exact matches - around 150% better - and it also
scaled better than the B+-tree. For prefix matches the B+-tree outperformed
the trie, due to the inherent nature of having the keys sorted in the leaf
nodes, which allows the tree to answer prefix match queries very efficiently.
For exact matches the B+-tree scales better as well; this is because the trie
consists of more nodes and more node splits than the B+-tree.
The kd-tree and R-tree comparison was done over a two-dimensional point
data set. The kd-tree performed 300% better than the R-tree when it came to
point search and 125% better when it came to range search, although the R-tree
had better insertion time and a better index size. This is because the kd-tree
has a node size (bucket size) of one, so every insertion causes a node split.
This leads to a very large number of nodes, and the clustering technique that
SP-GiST uses to reduce the tree page height costs index page utilisation.
The PMR quadtree was compared to the R-tree for indexing line-segment data
sets. The R-tree had better insertion and search performance.
The nearest-neighbour search for the kd-tree and the point quadtree was
better than for the trie. This is because the trie performs the NN search
character by character, while for the kd-tree and the point quadtree the NN
search is based on partitions.
Chapter 3
Method
This chapter describes the research methods and the methods used for testing
the optimisation techniques. The first section describes the methodologies
used for the research and how they were applied to the project. The
sub-questions for the project are then presented, as well as the research
approach.
Qualitative methods often use data sets that are small and selective, while
the quantitative approach uses large and randomly collected data sets. The
purpose of the random collection of data is to be able to draw general
conclusions [50].
3.1.3 Subquestions
The research question posed in chapter one is "how do indexing and query
optimisation affect response time for a PostgreSQL database?". To clarify what
this means, the question can be divided into the following sub-questions.
• What methods of indexing are there and what are their use cases?
• How does query optimisation work and how can queries be optimised?
• How does indexing, the query optimiser, and query tuning compare to
each other?
Pre-study
As seen in Figure 3.1, the first step of the method is to conduct the
pre-study. It is conducted to gain a basic understanding of the research area,
as well as to develop the aim and research question formed in this report; the
information found is presented in the background of this report. This is done
to ensure that the necessary skills and knowledge for this report are known
and that the researcher in question has that knowledge. By learning about
databases and how to optimise them in more detail, the delimitations are
formed, and the research question is worded in such a way that it includes the
sub-questions.
• Ensuring validity. Making sure that the research has been conducted
according to the rules of the project and that the meaning of the result is
easily discernible, as well as making sure that any testing instruments
measure the correct things [49].
Chapter 4
Experiment
This chapter goes into detail about the experiment performed for this project.
It describes the hardware and software used, the database design, and the
queries performed. The improved queries can be seen in a section below. The
database schema and the index design can be found in appendixes A and C at the
end of this report. The details in this chapter should be sufficient to ensure
replicability.
4.1.1 Hardware
The following list presents the relevant hardware used to run the database
environment, for the purpose of replicability of the experiment:
• SSD: Kingston A400 SSD, 960 GB, 500MB/s read and 450MB/s write
updated to the latest version as of 2021-10-06 and the time package is installed
(apt install time).
F, which is a database filled with movies, games, TV shows and other media.
It contains information about the people who have worked on the media, and how
the media is rated. The ratings are collected from users on the IMDb website.
As the figure shows, the database only has six tables, with a couple of
attributes in each table. The person table contains the person_id, which is a
string of characters identifying the row in the table, and is also the primary
key. It also contains the name of the person, their date of birth, and their
date of death - which is null if the person is still alive. This table has a
one-to-many relationship with the crew table, as one person can be multiple
crew members. The crew table has a title_id - the id of the media the crew
member worked on - the person_id linking to the person, a category - the title
of their work, i.e. actor, director, writer, etc. - and a job column. From
looking at the data in the database file, the job column mostly contains null
values, with the exception of producers, who have a repeat of 'producer', and
writers, who have what they write - poem, play, book - and its title in a
string.
The next group of tables contains information about the media in the database.
The akas table is an overview of the media in the database: it contains basic
information such as the title, the region it was produced in, what language it
is in, what type it is - IMDb display title, original, alternative, or null -
what attributes it has - information about the title, mostly null - and a
boolean value for whether the title is an original title or not. This is
further divided into an episodes table, which shows information about the
episodes in a show, i.e. the episode number and season number. More
information about the titles can be found in the titles table. It shows what
type of media it is, the original and the primary title of the media, if it is
adult rated, when it premiered, when it ended (mainly for shows), how long the
runtime is, and what genres the media belongs to. The genre column contains a
string of all the genres the media belongs to. The titles table also has its
primary key on the title_id column. The last table is the ratings table, which
contains the average rating for the title and how many votes it has received;
it also has a primary key on the title_id column. Thus, there is a one-to-one
relationship between titles and ratings, a one-to-many relationship between
titles and akas, and a one-to-many relationship between titles and episodes.
It is important to note that the database does not contain any foreign keys;
the only key constraints that exist are the primary keys that can be seen in
Figure 4.2.
In Figure 4.3 the different amounts of data used for the testing can be seen.
The data is calculated by dividing the original number of rows - seen under
the column named 1 - in each table by ten for each iteration. The first n rows
are taken from the original file and placed in separate files, which are used
to fill the database.
Table name    1             2            3           4
Akas          1436745 rows  143675 rows  14368 rows  1437 rows
Crew          9990049 rows  999005 rows  99901 rows  9990 rows
Episodes      593366 rows   59337 rows   5934 rows   593 rows
People        3571826 rows  357183 rows  35718 rows  3572 rows
Ratings       362285 rows   36229 rows   3623 rows   362 rows
Titles        2294719 rows  229472 rows  22947 rows  2295 rows
Figure 4.3: The table sizes in the database.
4.1.6 Queries
The queries used for the experiment are listed below, each with a small
description of what it tests and why it was chosen. The reason for using a
smaller number of queries is that each one is tested more extensively, both
for scaling and performance, rather than trying to find queries that cover
specific cases.
-- how many movies are in the database?
SELECT COUNT(DISTINCT title_id)
FROM titles
WHERE type IN ('movie', 'video');
Query 1
Query 1 is used to test multi-point queries. This query is chosen because both
hash indexes and B-tree indexes are good choices for point queries, so a
difference in performance would be noticeable here. This query is also easy to
improve when it comes to tuning, which is another reason for choosing it.
-- how much content of each type is in the database and what are the types?
SELECT type, COUNT(*)
FROM titles
GROUP BY (type)
ORDER BY (type) ASC;
Query 2
Query 2 looks at all the types that there are in the table and how many of
each type there are. This is a large query, which is why it was chosen. Seeing
if it can be improved by any of the methods for this query type would be very
useful.
-- list all actors/actresses playing in a spiderman movie
SELECT DISTINCT name
FROM (SELECT primary_title, original_title,
             crew.title_id, person_id, category
      FROM crew
      INNER JOIN titles ON
          titles.title_id = crew.title_id
      WHERE primary_title LIKE 'Spider-Man%'
         OR original_title LIKE 'Spider-Man%') AS a
INNER JOIN people ON
    a.person_id = people.person_id
WHERE a.category = 'actor'
   OR a.category = 'actress';
Query 3
Query 3 lists all the actors and actresses that have played in a Spider-Man
movie. This query is chosen partly because it has an inner query that could
easily be transformed into a materialised view, so comparing the two in
performance is of interest.
-- get the second-highest rating
SELECT DISTINCT rating
FROM ratings
WHERE rating = (
    SELECT MAX(rating) FROM ratings
    WHERE rating != (
        SELECT MAX(rating) FROM ratings));
Query 4
Query 4 gets the second-highest rating for the media in the database. This
query is chosen because it has a correlated inner query, which can be likened
to an inner loop in other programming languages; seeing whether any of the
methods - the optimiser or an index in particular - can improve it is
therefore of great interest.
-- find all movies made between 2000 and 2010
SELECT primary_title, premiered
FROM titles
WHERE type LIKE 'movie'
  AND premiered BETWEEN 2000 AND 2010
ORDER BY premiered ASC;
Query 5
Query 5 gets all the movies that premiered between 2000 and 2010. This query
is chosen to test range queries.
The difference between improved query 1 and query 1 can be seen in the SELECT
statement and the WHERE clause. The improved query does not use DISTINCT, as
it is deemed unnecessary: primary keys are unique by nature, so DISTINCT would
just add extra filtering overhead. The WHERE clause differs in that the
improved query uses OR instead of IN, which checks the column value against a
list of values. This is done to see if IN and OR differ in performance.
Technically, as stated in the background, the IN statement should execute
faster than OR, so this is not an improvement as such, but rather a variation
of the query to see if it makes a difference in performance.
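The improved query 1 listing is not reproduced in this section; based on the
description above and the EXPLAIN output in appendix E, a sketch of it would
look as follows:

-- a sketch of improved query 1, as described above: no DISTINCT,
-- and OR instead of IN
SELECT COUNT(title_id)
FROM titles
WHERE type = 'movie'
   OR type = 'video';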
The improved query 2 differs from query 2 by counting the column title_id
instead of *. This is done to test a statement from the literature study: the
source states that switching * to the column to be counted should improve
performance.
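Based on that description, a sketch of improved query 2 would be:

-- a sketch of improved query 2, counting the title_id column instead of *
SELECT type, COUNT(title_id)
FROM titles
GROUP BY (type)
ORDER BY (type) ASC;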
-- list all actors/actresses playing in a spiderman movie
CREATE MATERIALIZED VIEW q3
AS
SELECT primary_title, original_title,
       crew.title_id, person_id, category
FROM crew
INNER JOIN titles ON
    titles.title_id = crew.title_id
WHERE primary_title LIKE 'Spider-Man%'
   OR original_title LIKE 'Spider-Man%';

SELECT DISTINCT name FROM people
INNER JOIN q3 ON
    q3.person_id = people.person_id
WHERE q3.category = 'actor'
   OR q3.category = 'actress';
Improved query 3
The improved query 3 builds a materialised view instead of using an inner
query. A source from the literature study states that the query planner can
have a difficult time optimising queries with inner loops, and it is stated in
the background that running queries on materialised views can give better
query performance.
• The queries: used for baseline measuring, to decide whether the other
results are slower or faster.
• Improved queries: the original queries tuned for better performance.
• General indexes: running the baseline queries with indexes built based on
the key constraints.
• Personalised indexes: running the baseline queries with indexes built based
on the columns used by the queries.
Chapter 5
Results and Analysis
This chapter summarises the result of the literature study and presents the
result from the experiment.
5.1.1 Theory
In the report 'Database performance tuning and query optimization' [42] the
main take-away points are that indexing can be the solution to many
performance issues, but maintaining an index can cause overhead when updating
tables. It can also cause CPU and I/O usage to increase, which in turn
increases the cost of writing data to disk [42]. The book 'Database tuning
principles, experiments, and troubleshooting techniques' [43] develops this
further. First, it describes how a database administrator should think to
improve a database with a three-step technique:
• Start-up costs are high, running cost is low: improving execution time
often costs memory or processing power.
The book then continues to describe how queries can be divided into types, and
which indexes suit which query type. The query types are described as:
• Point queries return one record or parts of a record based on an equality
selection.
• Range queries return a set of records whose values are within an interval.
• Prefix match queries are queries that use AND and LIKE statements to match
strings or sets of characters.
• Extremal queries are queries that obtain a set of records that return the
minimum or maximum of attribute values.
• Join queries are queries that link two or more tables. There are different
types of join queries. For joins that use an equality statement (equijoins),
the optimisation process is simpler; for join queries that are not equijoins,
the system will try to execute the select statement before joining. This is
because non-equijoins often need full table scans, even when there is an
index present.
B-tree indexes are particularly good for range, prefix match, partial match,
point, multipoint, general join, and ordering queries. Clustering B-trees are
good for getting rid of the ORDER BY statement, due to the ordering nature of
B-trees in combination with physical storage. Non-clustering indexes are best
used to cover all the attributes necessary for a query, as the DBMS can then
use an index-only scan. Another type of index is the composite index, whose
main use case is to ensure minimal table accesses for queries that use many of
the key attributes in the index. There can, however, be an issue with updates:
since this type of index uses many attributes for its key, the chance that the
index has to be updated when the table is updated is higher. The major tips
from this book are that indexes should be avoided on smaller tables, that
dense indexes should be used on critical queries to make use of the index-only
scan, and that an index should only be built if the time saved in execution is
larger than the cost of updating the index [43].
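As an illustration of the composite index idea (a hypothetical index, not one
built in the experiment), an index covering all the attributes of Query 5
would allow an index-only scan for it:

-- hypothetical composite index covering Query 5: the key columns match
-- the WHERE clause, and INCLUDE adds the selected column so the table
-- itself never has to be read
CREATE INDEX titles_type_prem_idx ON public.titles
    USING BTREE (type, premiered) INCLUDE (primary_title);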
Some methods for improving queries are getting rid of the * and instead using
the column name in operations, making sure that the HAVING clause is executed
after restricting the data with the SELECT statements, and minimising the
number of subquery blocks in a nested query [42].
Query tuning should be considered before implementing indexes, as inserting
indexes can have harmful global effects. In comparison, rewriting a query can
only have positive effects, if done correctly [43]. Tips for rewriting queries
are:
• Do not use DISTINCT unnecessarily as it creates an overhead due to
sorting.
• Study the idiosyncrasies of the system. Some systems might not use indexes
when there is an OR statement involved; to circumvent this, a union can be
used (see the sketch after this list).
• The ordering of tables in the FROM statement can affect the order of joins,
especially if more than five tables are used.
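As a sketch of the union rewrite mentioned in the list above (a constructed
example, not a query from the experiment):

-- hypothetical rewrite of an OR filter as a UNION, for systems that
-- will not use an index when OR is present
SELECT title_id FROM titles WHERE type = 'movie'
UNION
SELECT title_id FROM titles WHERE type = 'video';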
A query can access data in different ways. The main ways are the full table
scan, the index-only scan, and index access. For smaller values of
selectivity, index access is preferable, as it is faster than a full table
scan; conversely, if selectivity is high, a full table scan is preferable. The
best option is an index-only scan, if the query allows it, although this
depends entirely on the index used [21].
The book then describes short and long queries and how they can be tuned.
Short queries benefit from using restrictive indexes and are most efficient
with unique indexes, as these have fewer values to go through. One thing to
keep in mind with short queries is that column transformations make it
impossible to perform an index search on the transformed attribute (see the
sketch after this paragraph). LIKE statements also do not utilise indexes, so
they should be avoided and can instead be replaced by equivalent OR
statements. Further indexing tips are that indexes should not be used when the
table is small, when the majority of the rows in a table are needed to execute
a query, or when a column transformation is used. A tip to force a query to
use an index is to use the ORDER BY operation [21].
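As a constructed example of the column transformation problem (not one of the
experiment queries):

-- the arithmetic on premiered prevents an index search on that column
SELECT primary_title FROM titles WHERE premiered - 1 = 1999;

-- rewritten so an index on premiered can be used
SELECT primary_title FROM titles WHERE premiered = 2000;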
The way to optimise long queries is by avoiding multiple full table scans and
reducing the size of the result as soon as possible. Indexes are not needed
here and should not be used. Another tip is that hash join is most likely the
better algorithm for joining long queries. If GROUP BY is used by a long
query, the filtering usually needs to be applied first; there are cases where
GROUP BY can reduce the size of the data set, but the rule of thumb is to
apply the SELECT statements first for the optimiser. Lastly, set operations
can sometimes be used to prompt alternative execution plans. Another tip to
improve execution time is to use materialised views, but a materialised view
should only be created if the data it is based on does not update often, if it
is not critical to have up-to-date data, if the data in the materialised view
is read often, and if many queries could make use of it [21].
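This is because a materialised view in PostgreSQL is a stored snapshot: it
must be refreshed explicitly when the underlying data changes, which is why
rarely updated data suits it best. For example, the view q3 from the
experiment would be refreshed with:

-- re-run the stored query and replace the view's contents
REFRESH MATERIALIZED VIEW q3;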
Lastly, multidimensional and spatial searches are discussed. Spatial data
often requires range queries, which means finding all the data located at a
certain distance or closer to a specified point in space, and
nearest-neighbour queries, which find a variable number of objects closest to
the specified point. These queries cannot be supported by one-dimensional
indexes or even multiple indexes, and must instead use special indexes, such
as the GiST.
As the kd-tree has a bucket size of one, every insertion causes a node split.
The R-tree had better insertion and search performance than the SP-GiST PMR
quadtree. Lastly, the nearest neighbour search for the kd-tree and the point
quadtree was better than for the trie. This is because the trie performs the
NN search character by character, while for the kd-tree and the point quadtree
the NN search is based on partitions.
5.2 Results
In Figure 5.1, Figure 5.2, and Figure 5.5 the baseline, the improved query,
and the baseline query run on the personalised B-tree schema can be seen. The
result for the ANALYZE command is omitted because it remained the same as the
baseline query result; the same reason applies to the generic B-tree and Hash
indexes.
In Figure 5.3 the generic B-tree and Hash indexes are also included; the
optimiser result is once again omitted for the reason stated above.
In Figure 5.4 only the baseline query and the personalised B-tree results can
be seen. The same reason for omitted results applies as for the first
paragraph. There is no improved query result because that was not tested here.
Detailed results for how the queries were executed, the EXPLAIN output, and
more detailed graphs can be seen in appendixes D and E.
The graph for query 1 (Figure 5.1) shows the execution times (y-axis) for the
query, the improved query, and the query executed with a B-tree index, and
how they scale with increasing amounts of data in the table. The x-axis shows
the number of rows in the table, with logarithmic growth. The query executed
with the B-tree shows similar performance to just executing the query, but has
a slight improvement when scaled to a larger data set. The results for the
general indexes and for the query executed with the ANALYZE command were
omitted, as they did not show any difference from just executing the query.
The graph for query 2 (Figure 5.2) shows the execution times (y-axis) for the
query, the improved query, and the query executed with a B-tree index, and
how they scale with increasing amounts of data in the table. The x-axis shows
the number of rows in the table, with logarithmic growth. The query executed
with the B-tree shows similar performance to just executing the query, but has
a slight improvement when scaled to a larger data set. The improved query also
has similar performance but scales worse. The results for the general indexes
and for the query executed with the ANALYZE command were omitted, as they did
not show any difference from just executing the query.
The graph for query 3 (Figure 5.3) shows the execution times (y-axis) for the
query, the improved query, the generic B-tree and hash indexes, and the query
executed with a personalised B-tree index, and how they scale with increasing
amounts of data in the table. The x-axis shows the number of rows in the
table, with logarithmic growth. The generic indexes show similar performance
to each other and scale better than just the query; the same can be said for
the improved query. The personalised B-tree line is difficult to see as it
lies behind the query line, which means they had similar performance. The
ANALYZE result was omitted, as it executed like just running the query.
The graph for query 4 (Figure 5.4) shows the execution times (y-axis) for the
query and for the query executed with a B-tree index, and how they scale with
increasing amounts of data in the table. The x-axis shows the number of rows
in the table, with logarithmic growth. The query executed with the B-tree
shows similar performance to just executing the query at first, but shows a
large improvement when it comes to scaling. The results for the general
indexes and for the query executed with the ANALYZE command were omitted, as
they did not show any difference from just executing the query. This query was
not tuned, which is why there is no improved query result.
The graph for query 5 (Figure 5.5) shows the execution times (y-axis) for the
query, the improved query, and the query executed with a B-tree index, and
how they scale with increasing amounts of data in the table. The x-axis shows
the number of rows in the table, with logarithmic growth. The query executed
with the B-tree shows similar performance to just executing the query, but
scales worse. The results for the general indexes and for the query executed
with the ANALYZE command were omitted, as they did not show any difference
from just executing the query.
Chapter 6
Discussion
This chapter discusses and provides explanations for the results of the
experiment, using the information provided in the background as well as the
appendixes. It compares the results to the literature study and analyses the
reliability and validity of the results. It also discusses the problems faced
during the thesis, how they were solved, and which problems could not be
solved. It brings up the sources of error to consider as well as the
limitations of the results, and reiterates what sustainability and ethical
effects the results may have.
Query 1
As can be seen in Figure 5.1, the query and the query executed on the B-tree
have similar execution times, but the B-tree query scales slightly better. The
improved query, on the other hand, shows a very big improvement; for a clearer
view of the result, see Figure D.6 in appendix D. The difference between the
normal query and the improved query is the usage of DISTINCT, and of IN versus
OR. The EXPLAIN output file (in appendix E) shows that the IN statement is
filtered first for any strings that match 'movie' or 'video'.
Query 2
In Figure 5.2 the results for executing the query, the improved query, and the
query on the B-tree index can be seen. What is most notable is that the
improved query was not improved at all. It differs from the original query by
using COUNT(title_id) instead of COUNT(*). The difference in performance is
minimal, yet the sources from the literature study stated that changing the *
should improve performance. The query performed on the index scaled better.
The EXPLAIN output shows that the query first performs a parallel sequential
scan, which is then partially hashed based on type, sorted on type, and
finally merged into the result. The improved query shows the same output as
the normal query, which means that technically both should have the same
performance. Indeed, until the largest data set was used, the performance was
very similar, and for the largest data set the performance only differs by
less than a millisecond. This could be a source of error, potentially due to
caching, or an effect of the implemented materialised view, which might be
occupying some memory.
The query performed on the B-tree index uses the index with the type column as
the indexing column. It first performs an index-only scan, then groups the
types, and finally merges the result, using two workers. This explains why it
is faster than performing the query without indexes, as an index-only scan is
faster than a sequential scan. It also does not perform any sorting, as the
B-tree already has the data sorted, which saves additional time.
As stated before, the Hash index could not be implemented on the type column,
mainly due to the nature of Hash structures: they do not work well with many
entries that share the same Hash key. The result for the generalised indexes
was also omitted, as the query did not use the index.
Query 3
As Figure 5.3 shows, the query executed on the generic B-tree and on the
generic Hash index has the same execution times for both, and both show slight
improvement compared to executing the query without an index. The tuned query
also shows slight improvement when it comes to scaling. The other B-tree index
has the same performance as just executing the query. The improved query shows
a big improvement compared to the other tests. It should also be noted that
building the materialised view for the improved query took less than a minute.
The EXPLAIN output is quite long but can be summarised as follows. The query
uses an index condition on the primary key of the people table and does an
index scan using that key. After this, the query filters the movie titles and
gathers the titles searched for; this is done with a parallel sequential scan
on titles, which is then hashed. Another parallel sequential scan is run on
crew, the result is hashed, and a hash join is applied to form the result of
the nested loop. This is then sorted by name and the result is merged.
The improved query, on the other hand, uses an index scan on the primary key
of the people table and uses the cached key of the materialised view (as it
also contains person_id) for memoisation purposes. The set is then filtered on
the crew conditions and a nested sequential scan is performed on the
materialised view. The result is then sorted.
The query on the generic B-tree and Hash indexes shows similar outputs for
both. They both do an index scan based on the person_id column, then filter
the category on the crew table and use the index condition for title_id (as it
is the primary key of titles). They then both do a parallel sequential scan on
titles and use two nested loops - in comparison to just executing the query,
which only uses one nested loop - and then sort to gather the result.
Lastly, the personalised B-tree performs an index scan using the person_id
column, then filters the title and performs a parallel sequential scan on the
titles table, parallel hashing the result of this part of the query. It then
continues to filter the crew table for the rows needed, does a parallel hash
join - like just performing the query does - and then sorts. This means that
the personalised B-tree performs almost exactly like just performing the
query, except that it uses the implemented index instead of the primary key
constraint, which would explain why they have the same execution times.
Query 4
Figure 5.4 shows the results for the query and for the query using the B-tree
index. What can be seen is that by using the index the query scales a lot
better.
The query begins by performing a parallel sequential scan on the ratings table
with a variable filter ($3). This is then repeated, but the result is saved as
ratings_2, which compares the result from the previous sequential scan to find
the second-highest rating. After this, another parallel sequential scan is
performed to gather all the media with this rating.
When the query is executed with the index, the same thing happens, but instead
of sequential scans, backward index-only scans are used, and only twice. This
explains how the performance could improve by so much; as stated earlier,
index-only scans are a lot quicker than sequential scans.
Query 5
In Figure 5.5 the query, the improved query, and the query performed on the
B-tree index can be seen. The B-tree index scales worse than the query, whilst
the improved query scales slightly better.
The query is executed by first filtering the premiered column on the desired
values and then performing a parallel sequential scan on titles with this
filter. After this, it sorts the result based on the premiere dates. The
improved query does the exact same thing, which means that technically they
should have the same performance.
With the index, on the other hand, a bitmap index scan is performed on the
type column, the result is filtered on the premiered column as specified in
the query and applied to a bitmap heap scan on the titles table, and the
result is then sorted. The combined cost of the bitmap index scan and the
bitmap heap scan is lower than that of the parallel sequential scan, which
means that there should be a slight improvement. The reason why there is no
improvement could be down to different things: the query planner may be
inaccurate in its cost estimates due to outdated statistics, or, as mentioned
in related works, there may be elements of this query that the query optimiser
does not take into consideration, something that might have to be fixed
manually with a statistics object or similar.
The most likely explanation for why the execution time was higher is that,
beyond a certain selectivity, the heap access (also called index access) has a
higher cost than doing a full table scan. For this result, despite doing an
index scan to find the correct entries, the heap scan takes more time than a
table scan would.
Optimiser results
As stated in the results, the optimiser result was omitted from the graphs.
This is because it performed almost exactly like just performing the query,
sometimes some milliseconds better and sometimes some milliseconds worse. This
is assumed to be measurement error, as the measurements were done at different
times and memory and cache usage could have changed between them. The lack of
change in execution time could be because the statistics remained the same
when just executing the query, so the query plan did not have to change. For
further research, testing the ANALYZE command when executing queries on
indexes might be of more interest, as inserting an index could cause the
statistics in the query planner to become out of date.
in the result was that by having primary keys, a type of index search could
still be performed, so it would not be far-fetched to believe that
implementing more of these key constraints in the correct places would improve
query execution time - although that is something that should be further
tested. Lastly, for the experiment section of the literature study, the
ANALYZE result showed no major difference between using it or not, which was
the same result gathered in this thesis' experiment.
Some of the methods gathered in the literature study were tested when tuning
queries to optimise performance. Removing DISTINCT in Query 1 greatly improved
performance. Changing * to the column to be counted did not improve
performance for Query 2; in the experiment it actually worsened performance
slightly, although the EXPLAIN output showed that the execution strategy and
the predicted time would remain the same - dismissing any idea of measurement
errors - so in this case the change did not improve the query. Avoiding
subqueries was also tested, in Query 3, by creating a materialised view. This
improved performance greatly, especially when scaling the query over a larger
data set.
Although idiosyncrasies were not studied in detail, the result for Query 1
suggests that either the lack of DISTINCT or the addition of OR (or both)
caused the planner to choose a plan that made use of two workers instead of
one, which is the most likely reason for the improved performance, as that was
the main difference in the EXPLAIN output. The described relation between full
table scan, index-only scan, and index access could also be seen in Query 3
and Query 5: the most likely reason the indexes did not improve performance
was high selectivity, which makes the index access (also called heap access)
less efficient than a full table scan.
The queries were also tested as they were constructed, to ensure that they
gathered what they were supposed to, so that the measurements would be
accurate.
6.2.1 Problems
One issue at the beginning, realised at a later point, was the delimitations.
At first, they were too few and too imprecise. Both of these problems were
solved as the project went along and the research for the background and the
literature study was found and analysed. The background showed the extent of
the area of database optimisation, which caused more delimitations to be
formed, and the literature study showed how other studies formulated problems
and described the problem area, which made it easier to form the research
questions for this thesis.
Another issue found during this study was that there are many different
indexes for PostgreSQL, as mentioned in the background. The information found
about them pointed towards their main use case being a specific type of data -
spatio-temporal - or specific types of operations - nearest neighbour search,
finding coordinates within an area, etc. Due to the time constraints of this
thesis, there was no time to experiment with these types of indexes, which
meant that they were also made a delimitation of the experiment. However, a
couple of research papers were found within the area of spatio-temporal
indexing, of which a few contained tests done with PostgreSQL. One of these
reports was more relevant than the others and was included in the literature
study result. This was done to have some relevant information about the use
cases of one of the indexes, and how it compares to other indexes used in the
same problem area.
More problems were found when the experiment was being planned. First, there
was no documentation of the data in the database; therefore, some of the data
was looked over to check what each attribute actually meant, and to see if
there were any key constraints. At first, the constraints were not found at
all, because they were at the end of the file. When they were found, a plan
was made to see if they could be changed to include more foreign keys. Due to
the nature of the data in the database, and the decision to split the data
into multiple versions of the database, creating foreign keys was difficult
and had to be foregone entirely, which meant that only the original key
constraints were used for the database and the indexes implemented. Another
issue was with query tuning: since that was a somewhat novel concept to the
author, formulating better queries was difficult and in some cases almost
impossible. This can be seen in the case of Query 4, which was not tuned due
to a lack of knowledge.
a readable operating system environment, which is why the exact times should
be taken with a grain of salt. Clearing the cache between queries would not
simulate a real database, but would instead measure a worst-case benchmark; in
a real-life scenario the performance would most likely be better than
measured, so the results can be seen as a conservative estimate. As mentioned
earlier, there were also issues with query tuning, which means that the
queries might not have been tuned very well, and the results might therefore
not show the full extent of how query tuning can improve execution time.
Another smaller source of error is that the tests were run on different days
and at different times of day (query and improved query on one day, index and
optimiser on another). Cache and memory performance can differ between runs,
which could lead to minor differences in execution time.
It should also be noted that for Query 3 the two smaller data sets did not
return any results, so the scaling could be inaccurate for them. For Query 4,
the ratings table is a lot smaller than the other tables, which means that
scaling differences in the beginning could be a lot smaller than for the other
results. It would be of interest to have a larger ratings table to see if that
reasoning is correct.
6.3 Limitations
There are some limitations to the results that are worth mentioning. One of
them was brought up in the background: a database is often not a standalone
product, but is most often connected to some sort of application, which acts
as an interface, and a server. The results of this thesis do not take into
account how application or server issues play into efficiency, nor do they
test how hardware affects efficiency. Both are limitations that should be
taken into consideration when optimising a database.
Another limitation is that PostgreSQL indexes other than the B-tree and Hash
index could not be tested, and the Hash index could not be tested in all cases
either. This means that the full extent of the improvement that indexes can
bring to a database could not be accurately measured. Although, it could be
argued that PostgreSQL using the B-tree as its standard indexing structure
suggests that the B-tree is most often a suitable index.
6.4 Sustainability
As mentioned in the introduction, optimising a database system has an
environmental effect, as it reduces the resources a database uses. Shorter
response times and efficient use of hardware lessen the total computing time,
which could reduce the wear on hardware as well as energy usage. An ethical
problem related to database efficiency is that it becomes easier for people to
compile data from different data sets, which can then be presented or used to
discern information that can cause privacy issues.
Chapter 7
Conclusions and Future work
This chapter summarises the result and the information discussed in the
discussion chapter, answers the research question, and answers the
sub-questions posed.
7.1 Conclusion
The purpose of this thesis was to investigate how indexing and query
optimisation affect the response time of a PostgreSQL database, in order to
further research in the area and to provide information for database
administrators and students alike, as one of the aims was to provide course
material for database courses.
To summarise the findings of the experiment and the literature study, the
research question and the subquestions are answered below.
1. What methods of indexing are there and what are their use cases?
2. How does query optimisation work and how can queries be optimised?
3. How does indexing, the query optimiser, and query tuning compare to
each other?
Subquestion 1
As discovered in the background and literature study, there are many types of
indexes in PostgreSQL. The methods of implementing indexes differ depending on
whether they are primary or secondary indexes; as PostgreSQL only uses
secondary indexes, the focus lies there to answer this question. The method to
implement indexes is to look at the queries, analyse their frequency and what
type of queries they are, and look at the table to see what type of data is in
it and how often it gets updated. The index structure to choose depends on the
data types and on the types of queries that are supposed to use the index.
Thereafter it can be determined what should go into the index and whether it
should be sparse or dense, as well as whether the index needs to cover all
data in a table - if it does not, a partial index can be used. If the query
uses multiple tables, or other columns than what is indexed, determine if a
composite index can be used.
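As a sketch of a partial index on the experiment data (a hypothetical index,
not one used in the experiment), indexing only the movie rows of the titles
table would look like this:

-- hypothetical partial index: only rows with type 'movie' are indexed,
-- keeping the index small for queries that filter on movies
CREATE INDEX titles_movies_idx ON public.titles (premiered)
    WHERE type = 'movie';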
The use cases of indexes are mainly determined by the indexing structures
used. In most cases, the type of query or data type can determine which index
should be used. For example, as mentioned in the background, the Hash index is
suitable for point and multi-point queries. B-trees are also good for these,
but extend to range queries, prefix matches, and ordering queries. SP-GiST,
GiST, GIN and BRIN are mostly used for implementing special data types into a
database. In the literature study, SP-GiST was described as mainly being used
for spatio-temporal data, and depending on how these indexes are implemented -
i.e. what data structures are used - they can be useful for different types of
queries. That result recommended the SP-GiST trie implementation for regular
expression matches, B+-trees for prefix and exact match queries, and the
SP-GiST kd-tree for point and range searches; if insertion time and index size
are critical, the R-tree works better. This is also the reason to use an
R-tree over an SP-GiST PMR quadtree. Nearest neighbour searches also benefit
from a kd-tree implementation.
A generalisation based on the gathered results would be that a B-tree index is
more versatile and suits more situations than a hash index, but if implemented
incorrectly it can instead slow down execution. Removing DISTINCT from a
query, where possible, makes the query scale a lot better. Smaller data sets
(in this thesis, tables with less than 100 000 rows) rarely show a difference
in execution time, whether an index is implemented or a query is tuned.
Subquestion 2
Query optimisation can be separated into two parts: query tuning, and using
the query optimiser. The query optimiser is part of the DBMS and works with
statistics over the database and the query planner to ensure that a good query
plan is chosen. This is done by looking at specific factors, such as CPU
cycles and I/O accesses, combining them into a single unit, and then comparing
this unit between plans. The PostgreSQL optimiser can update the statistics by
running the ANALYZE command for a query, and can be improved by implementing
supported statistics objects for multivariate statistics. This is necessary
because there are use cases where the optimiser does not work well, such as
for correlated columns in queries. The query planner optimises a query by
setting up a plan tree, with plan nodes, in which each plan node contains the
cost of the planned execution (in the special unit). To avoid an infinite
number of plans, and to ensure that the optimised query is equivalent to the
starting query, heuristic rules are used.
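As a sketch of these two operations (the statistics object name and column
choice are illustrative, not taken from the experiment):

-- update the planner statistics for the titles table
ANALYZE titles;

-- hypothetical extended statistics object, telling the planner that
-- two columns are correlated
CREATE STATISTICS titles_type_prem_stats (dependencies)
    ON type, premiered FROM titles;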
Query tuning, on the other hand, uses techniques and the skills of the query
writer. It is done by manually rewriting queries to better make use of the
database resources, and is entirely based on the knowledge the query writer
has about the database and the query language used, as different types of
queries benefit from different optimisation techniques. Summarised techniques
from the literature study result are:
• Temporaries can cause execution to be slow, but can also subvert the
need for using ORDER BY operations.
• Depending on the system, some operations can cause the query to not
use indexes. These idiosyncrasies need to be studied.
• Index-only scans are always faster than full table scans, but index access
can be slower than full table scans if the selectivity of the query is high.
• Short queries benefit from using restrictive indexes, especially when the
indexes are unique as well.
• Long queries do not benefit from indexes, and are instead optimised by
ensuring that few full table scans are done. It is also beneficial to reduce
the size of the result as soon as possible.
Subquestion 3
From the information stated, indexing and query tuning are closely entwined.
The purpose of both query optimisation and indexing is to improve efficiency,
although this can be done in different ways. Indexing can be used for the
ordering of files, which is one of the main differences. Another difference is
that, because indexes have to be implemented on the database as an auxiliary
structure, query optimisation can be a less invasive procedure for improving
execution time on a sensitive database, such as a database that cannot afford
more memory allocation or whose tables change often. The experiment also shows
that there is only so much an index can do if the query is bad; query
optimisation and indexing thus have areas where they are entwined in achieving
good execution time. The query optimiser is always running as well, although
its accuracy can be improved by specific operations based on data type, query,
and other factors. This means that the optimiser overlaps with both indexes
and query tuning.
Subquestion 4
From subquestion 3, it can be argued that no single method is superior to the
others, as this cannot be generalised; it all depends on the situation: how
the database looks, whether the database structure can change, how much memory
is available, and whether some queries have priority. Based on the results, it
can, however, be split into some cases.
Implementing indexes for spatio-temporal data improves execution time for
queries - though this should be complemented by studying how query
optimisation affects it. B-tree indexes are more well-rounded in their use
cases, and in the experiment worked really well for improving a correlated
subquery. Query tuning worked really well for a nested query (by using a
materialised view), as well as for a large query (selecting many rows in a
table) - which was also stated in the literature study, as long queries
benefit more from query optimisation than from indexes. Based on the
literature study result, query optimisation works better in cases of column
transformation, and for short queries, using indexes is more beneficial.
7.2 Future work
• Testing more of the different query types mentioned in [43], to see how
they interact with indexes, the optimiser, and query tuning.
• Have larger data sets and different types of data, to be able to generalise
conclusions.
• Implement one of the other PostgreSQL indexes, to see how they affect
performance.
This is mainly motivated by filling in the gaps left by the limitations of
this thesis. Having this information, and more information in general, would
make a stronger case for the conclusions of this thesis, as well as further
map out optimisation techniques in general. During this thesis, it was
somewhat difficult to find other published research focusing specifically on
PostgreSQL and how to optimise a database within it; complementing this thesis
with any of the above suggestions would contribute more detailed and specific
information to the PostgreSQL community.
7.3 Reflections
This chapter describes some reflections on the work, suggestions to others,
what I would change about the work, and the impact of the work done, as well
as some other thoughts about the project.
I would say that doing the pre-study was an integral part of this thesis, so
for other thesis students, I would recommend conducting a pre-study to collect
basic knowledge about what information is out there within their research
area, to ensure that what they are doing is possible and within the
delimitations.
Some things I would change if I were to redo this work are to summarise the
literature study result before conducting the experiment, as it would have
saved me time compared to going back and forth in the report to find the
information I needed. Also, when writing, I would spend the time formulating
and editing as I go, instead of writing down the necessary information and
then having to go back and edit large sections at a time. I believe it would
have been faster to write it well the first time. This could have given me
more time for the experiment, so that I could have tested more scenarios.
7.3.2 Impact
I believe the impact of the results of this thesis is somewhat small on a
socio-economic scale. I think it could have a larger impact on students, as it
summarises a lot of information and tests it on a specific database system,
which they could use for their own learning purposes. I also believe it could
help database administrators who have started working with PostgreSQL. If
continued, this research has the potential of having a high impact on the
PostgreSQL community, in the sense of making PostgreSQL even more accessible.
This could lead to more people and companies using PostgreSQL for their
relational databases.
As mentioned in the discussion and the background, optimisation can reduce
environmental costs: less usage of hardware leads to less wear, and software
optimisation could reduce the need to upgrade hardware. Another environmental
improvement is that needing less execution time could lead to less energy
usage overall. This could also be argued to help companies keep unnecessary
costs down.
References
Appendix A
The database schema
--
-- PostgreSQL database dump
--

-- Dumped from database version 13.0 (Debian 13.0-1.pgdg100+1)
-- Dumped by pg_dump version 13.0 (Debian 13.0-1.pgdg100+1)

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;

SET default_tablespace = '';

SET default_table_access_method = heap;

--
-- Name: akas; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.akas (
    title_id character varying NOT NULL, -- PRIMARY KEY
    title character varying,
    region character varying,
    language character varying,
    types character varying,
    attributes character varying,
    is_original_title integer
);


ALTER TABLE public.akas OWNER TO postgres;

--
-- Name: crew; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.crew (
    title_id character varying,  -- REFERENCES public.akas
    person_id character varying, -- REFERENCES public.people
    category character varying,
    job character varying
);


ALTER TABLE public.crew OWNER TO postgres;

--
-- Name: episodes; Type: TABLE; Schema: public; Owner: postgres
--

-- (column definitions omitted in this excerpt)

ALTER TABLE public.episodes OWNER TO postgres;

--
-- Name: people; Type: TABLE; Schema: public; Owner: postgres
--

-- (column definitions omitted in this excerpt)

ALTER TABLE public.people OWNER TO postgres;

--
-- Name: ratings; Type: TABLE; Schema: public; Owner: postgres
--

-- (column definitions omitted in this excerpt)

ALTER TABLE public.ratings OWNER TO postgres;

--
-- Name: titles; Type: TABLE; Schema: public; Owner: postgres
--

-- (column definitions omitted in this excerpt)

ALTER TABLE public.titles OWNER TO postgres;

--
-- Data for Name: akas; Type: TABLE DATA; Schema: public; Owner: postgres
--
Keys
--
-- Name: people people_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.people
    ADD CONSTRAINT people_pkey PRIMARY KEY (person_id);


--
-- Name: ratings ratings_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.ratings
    ADD CONSTRAINT ratings_pkey PRIMARY KEY (title_id);


--
-- Name: titles titles_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.titles
    ADD CONSTRAINT titles_pkey PRIMARY KEY (title_id);
Appendix B
The script template
The commented lines (5-7) were used when the ANALYZE command was run; this was
to ensure that the latest statistics were used every time the script was run.
The commented line 21 did not work due to permission errors, as mentioned in
the report.
 1 #!/bin/bash
 2
 3 # execute ./loop 1 when in the right docker image
 4 LIMIT=100
 5 #---------------------------------------------#
 6 # Uncomment to execute the sql files that has the
 7 # ANALYZE command
 8 #---------------------------------------------#
 9
16 for (( i = 0; i < LIMIT; i++ ));
17
18 do
19 # FORMAT BELOW
20 # /usr/bin/time -o <outputfile> -a -f %e psql -U <
Appendix C
Indexes
B-tree indexes
The commented indexes are the generic indexes that were first tested.
-- CREATE INDEX titles_b ON public.titles USING BTREE (title_id);
-- CREATE INDEX akas_b ON public.akas USING BTREE (title_id);
-- CREATE INDEX crew_b ON public.crew USING BTREE (title_id);
-- CREATE INDEX people_b ON public.people USING BTREE (person_id);
-- CREATE INDEX ratings_b ON public.ratings USING BTREE (title_id);
-- CREATE INDEX episodes_b ON public.episodes USING BTREE (show_title_id);

CREATE INDEX titles_b ON public.titles USING BTREE (type);
CREATE INDEX titlesprem_b ON public.titles USING BTREE (premiered);
CREATE INDEX akas_b ON public.akas USING BTREE (title_id);
CREATE INDEX crew_b ON public.crew USING BTREE (category);
CREATE INDEX people_b ON public.people USING BTREE (person_id);
CREATE INDEX ratings_b ON public.ratings USING BTREE (rating);
CREATE INDEX episodes_b ON public.episodes USING BTREE (show_title_id);
Hash indexes
The commented indexes are the personalised indexes that could not be generated
for the large database.
CREATE INDEX titles_b ON public.titles USING HASH (title_id);
CREATE INDEX akas_b ON public.akas USING HASH (title_id);
CREATE INDEX crew_b ON public.crew USING HASH (title_id);
CREATE INDEX people_b ON public.people USING HASH (person_id);
CREATE INDEX ratings_b ON public.ratings USING HASH (title_id);
CREATE INDEX episodes_b ON public.episodes USING HASH (show_title_id);

-- CREATE INDEX titles_b ON public.titles USING HASH (type);
-- CREATE INDEX titlesprem_b ON public.titles USING HASH (premiered);
-- CREATE INDEX akas_b ON public.akas USING HASH (title_id);
-- CREATE INDEX crew_b ON public.crew USING HASH (category);
-- CREATE INDEX people_b ON public.people USING HASH (person_id);
-- CREATE INDEX ratings_b ON public.ratings USING HASH (rating);
-- CREATE INDEX episodes_b ON public.episodes USING HASH (show_title_id);
Appendix D
Detailed graphs
Figure D.12: Execution time for the B-tree index implemented for query 1.
Figure D.13: Execution time for the B-tree index implemented for query 2.
Figure D.14: Execution time for the B-tree index implemented for query 3.
Figure D.15: Execution time for the B-tree index implemented for query 4.
Figure D.16: Execution time for the B-tree index implemented for query 5.
Appendix E
EXPLAIN output
The EXPLAIN output was generated for the largest database only. Not every test is shown for every query, because much of the output would be repeated: if an index was created for a query but was not used by its execution plan, the plan is identical to the one produced without the index.
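For reference, each plan below was produced with a plain EXPLAIN, which reports the planner's cost estimates without executing the query. A minimal sketch, assuming query 1 counts the titles of type movie or video, as the plans that follow suggest:

-- Plain EXPLAIN: prints the estimated plan only, the query is not run.
EXPLAIN
SELECT count(*)
FROM public.titles
WHERE type IN ('movie', 'video');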
movies (q1)
Aggregate  (cost=63728.93..63728.94 rows=1 width=8)
  ->  Seq Scan on titles  (cost=0.00..63007.91 rows=288405 width=10)
        Filter: ((type)::text = ANY ('{movie,video}'::text[]))

improved:
Finalize Aggregate  (cost=49960.60..49960.61 rows=1 width=8)
  ->  Gather  (cost=49960.39..49960.60 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=48960.39..48960.40 rows=1 width=8)
              ->  Parallel Seq Scan on titles  (cost=0.00..48667.96 rows=116973 width=10)
                    Filter: (((type)::text = 'movie'::text) OR ((type)::text = 'video'::text))

personalised btree:
Aggregate  (cost=41827.01..41827.02 rows=1 width=8)
  ->  Bitmap Heap Scan on titles  (cost=3172.32..41105.89 rows=288446 width=10)
        Recheck Cond: ((type)::text = ANY ('{movie,video}'::text[]))
[...]
types (q2):
Finalize GroupAggregate  (cost=49668.25..49670.78 rows=10 width=16)
  Group Key: type
  ->  Gather Merge  (cost=49668.25..49670.58 rows=20 width=16)
        Workers Planned: 2
        ->  Sort  (cost=48668.22..48668.25 rows=10 width=16)
              Sort Key: type
              ->  Partial HashAggregate  (cost=48667.96..48668.06 rows=10 width=16)
                    Group Key: type
                    ->  Parallel Seq Scan on titles  (cost=0.00..43887.97 rows=955997 width=8)

improved:
Finalize GroupAggregate  (cost=49668.25..49670.78 rows=10 width=16)
  Group Key: type
  ->  Gather Merge  (cost=49668.25..49670.58 rows=20 width=16)
        Workers Planned: 2
        ->  Sort  (cost=48668.22..48668.25 rows=10 width=16)
              Sort Key: type
              ->  Partial HashAggregate  (cost=48667.96..48668.06 rows=10 width=16)
                    Group Key: type
                    ->  Parallel Seq Scan on titles  (cost=0.00..43887.97 rows=955997 width=18)

personalised btree:
Finalize GroupAggregate  (cost=1000.45..34694.65 rows=10 width=16)
  Group Key: type
  ->  Gather Merge  (cost=1000.45..34694.45 rows=20 width=16)
        Workers Planned: 2
[...]
join (q3)
Unique  (cost=192351.95..192429.28 rows=650 width=14)
  ->  Gather Merge  (cost=192351.95..192427.66 rows=650 width=14)
        Workers Planned: 2
        ->  Sort  (cost=191351.93..191352.61 rows=271 width=14)
              Sort Key: people.name
              ->  Nested Loop  (cost=48670.51..191340.98 rows=271 width=14)
                    ->  Parallel Hash Join  (cost=48670.08..191208.32 rows=271 width=10)
                          Hash Cond: ((crew.title_id)::text = (titles.title_id)::text)
                          ->  Parallel Seq Scan on crew  (cost=0.00..138537.62 rows=1524040 width=20)
                                Filter: (((category)::text = 'actor'::text) OR ((category)::text = 'actress'::text))
                          ->  Parallel Hash  (cost=48667.96..48667.96 rows=170 width=10)
                                ->  Parallel Seq Scan on titles  (cost=0.00..48667.96 rows=170 width=10)
                                      Filter: (((primary_title)::text ~~ 'Spider-Man%'::text) OR ((original_title)::text ~~ 'Spider-Man%'::text))
                    ->  Index Scan using people_pkey on people  (cost=0.43..0.49 rows=1 width=24)
                          Index Cond: ((person_id)::text = (crew.person_id)::text)
JIT:
  Functions: 19
  Options: Inlining false, Optimization false, Expressions true, Deforming true

improved:
Unique  (cost=524.38..524.70 rows=64 width=14)
  ->  Sort  (cost=524.38..524.54 rows=64 width=14)
        Sort Key: people.name
        ->  Nested Loop  (cost=0.44..522.46 rows=64 width=14)
              ->  Seq Scan on q3  (cost=0.00..5.17 rows=64 width=10)
                    Filter: (((category)::text = 'actor'::text) OR ((category)::text = 'actress'::text))
              ->  Memoize  (cost=0.44..8.46 rows=1 width=24)
                    Cache Key: q3.person_id
                    ->  Index Scan using people_pkey on people  (cost=0.43..8.45 rows=1 width=24)
                          Index Cond: ((person_id)::text = (q3.person_id)::text)

generic btree:
Unique  (cost=51267.26..51344.59 rows=650 width=14)
  ->  Gather Merge  (cost=51267.26..51342.97 rows=650 width=14)
        Workers Planned: 2
        ->  Sort  (cost=50267.24..50267.92 rows=271 width=14)
              Sort Key: people.name
              ->  Nested Loop  (cost=0.86..50256.29 rows=271 width=14)
                    ->  Nested Loop  (cost=0.43..50123.63 rows=271 width=10)
                          ->  Parallel Seq Scan on titles  (cost=0.00..48669.99 rows=170 width=10)
                                Filter: (((primary_title)::text ~~ 'Spider-Man%'::text) OR ((original_title)::text ~~ 'Spider-Man%'::text))
                          ->  Index Scan using crew_b on crew  (cost=0.43..8.53 rows=2 width=20)
                                Index Cond: ((title_id)::text = (titles.title_id)::text)
                                Filter: (((category)::text = 'actor'::text) OR ((category)::text = 'actress'::text))
                    ->  Index Scan using people_b on people  (cost=0.43..0.49 rows=1 width=24)
                          Index Cond: ((person_id)::text = (crew.person_id)::text)

generic hash:
Unique  (cost=54417.46..54494.79 rows=650 width=14)
  ->  Gather Merge  (cost=54417.46..54493.16 rows=650 width=14)
        Workers Planned: 2
        ->  Sort  (cost=53417.43..53418.11 rows=271 width=14)
              Sort Key: people.name
              ->  Nested Loop  (cost=0.00..53406.48 rows=271 width=14)
                    ->  Nested Loop  (cost=0.00..53389.68 rows=271 width=10)
                          ->  Parallel Seq Scan on titles  (cost=0.00..48669.99 rows=170 width=10)
                                Filter: (((primary_title)::text ~~ 'Spider-Man%'::text) OR ((original_title)::text ~~ 'Spider-Man%'::text))
                          ->  Index Scan using crew_b on crew  (cost=0.00..27.74 rows=2 width=20)
                                Index Cond: ((title_id)::text = (titles.title_id)::text)
                                Filter: (((category)::text = 'actor'::text) OR ((category)::text = 'actress'::text))
                    ->  Index Scan using people_b on people  (cost=0.00..0.06 rows=1 width=24)
                          Index Cond: ((person_id)::text = (crew.person_id)::text)
[...]
second high (q4):
Gather  (cost=14182.88..19908.84 rows=3981 width=8)
  Workers Planned: 1
  Params Evaluated: $3
  InitPlan 2 (returns $3)
    ->  Finalize Aggregate  (cost=13182.87..13182.88 rows=1 width=8)
          InitPlan 1 (returns $1)
            ->  Finalize Aggregate  (cost=6327.97..6327.98 rows=1 width=8)
                  ->  Gather  (cost=6327.86..6327.97 rows=1 width=8)
                        Workers Planned: 1
                        ->  Partial Aggregate  (cost=5327.86..5327.87 rows=1 width=8)
                              ->  Parallel Seq Scan on ratings ratings_1  (cost=0.00..4795.09 rows=213109 width=8)
          ->  Gather  (cost=6854.78..6854.89 rows=1 width=8)
                Workers Planned: 1
                Params Evaluated: $1
                ->  Partial Aggregate  (cost=5854.78..5854.79 rows=1 width=8)
[...]
interval (q5):
Gather Merge  (cost=53532.97..58336.94 rows=41174 width=24)
  Workers Planned: 2
  ->  Sort  (cost=52532.95..52584.42 rows=20587 width=24)
        Sort Key: premiered
        ->  Parallel Seq Scan on titles  (cost=0.00..51057.95 rows=20587 width=24)
[...]
improved:
Gather Merge  (cost=53532.97..58336.94 rows=41174 width=24)
  Workers Planned: 2
  ->  Sort  (cost=52532.95..52584.42 rows=20587 width=24)
        Sort Key: premiered
        ->  Parallel Seq Scan on titles  (cost=0.00..51057.95 rows=20587 width=24)
              Filter: ((premiered >= 2000) AND (premiered <= 2010) AND ((type)::text = 'movie'::text))
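Plain EXPLAIN only reports the planner's estimates. A hedged sketch of how actual run times and buffer usage could be inspected for a query shaped like q5 (the SELECT list is assumed; this is not how the thesis measurements were collected):

-- EXPLAIN ANALYZE executes the query and reports real timings;
-- BUFFERS adds shared-buffer hit and read counts per plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT primary_title, premiered
FROM public.titles
WHERE premiered BETWEEN 2000 AND 2010
  AND type = 'movie'
ORDER BY premiered;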
Appendix F
Database link
https://canvas.kth.se/courses/19966/files/3413108/download
TRITA-EECS-EX-2021:821