Leveling Up with SQL: Advanced Techniques for Transforming Data into Insights
Mark Simon
Ivanhoe, VIC, Australia
Table of Contents
Introduction
One-to-One Relationships
One-to-Maybe Relationships
Multiple Values
Many-to-Many Relationships
Joining Many-to-Many Tables
Summarizing Multiple Values
Combining the Joins
Many-to-Many Relationships Happen All the Time
Another Many-to-Many Example
Inserting into Related Tables
Adding a Book and an Author
Adding a New Sale
Review
Types of Relationships
Joining Tables
Views
Inserting into Related Tables
Summary
Coming Up
Summary
Simple CTEs
Parameter Names
Multiple CTEs
Recursive CTEs
Coming Up
Chapter 10: More Techniques: Triggers, Pivot Tables, and Variables
Understanding Triggers
Some Trigger Basics
Preparing the Data to Be Archived
Creating the Trigger
Pros and Cons of Triggers
Pivoting Data
Pivoting the Data
Manually Pivoting Data
Using the Pivot Feature (MSSQL, Oracle)
Working with SQL Variables
Code Blocks
Updated Code to Add a Sale
Review
Triggers
Pivot Tables
SQL Variables
Summary
Index
About the Author
Mark Simon has been involved in training and education
since the beginning of his career. He started as a teacher
of mathematics, but quickly pivoted into IT consultancy
and training because computers are much easier to work
with than high school students. He has worked with and
trained in several programming and coding languages and
currently focuses mainly on web development and database
languages. When not involved in work, you will generally
find him listening to or playing music, reading, or just
wandering about.
About the Technical Reviewer
Aaditya Pokkunuri is an experienced senior cloud database
engineer with a demonstrated history of working in the
information technology and services industry with 13 years
of experience.
He is skilled in performance tuning, MS SQL Database
Server Administration, SSIS, SSRS, PowerBI, and SQL
development.
He possesses in-depth knowledge of replication,
clustering, SQL Server high availability options, and ITIL
processes.
His expertise lies in Windows administration tasks, Active Directory, and Microsoft
Azure technologies.
He also has extensive knowledge of MySQL, MariaDB, and MySQL Aurora database
engines.
He has expertise in AWS Cloud and is an AWS Solution Architect Associate and AWS
Database Specialty.
Aaditya is a strong information technology professional with a Bachelor of Technology
in Computer Science and Engineering from Sastra University, Tamil Nadu.
Acknowledgments
The sample data includes information about books and authors from Goodreads
(www.goodreads.com/), particularly from their lists of classical literature over the past
centuries. Additional author information was obtained, of course, from Wikipedia
(www.wikipedia.org/).
The author makes no guarantees that the information is correct or even copied
correctly. Certainly, the list of books should not in any way be interpreted as an
endorsement or even an indication of personal taste. After all, it’s just sample data.
Introduction
In the early 1970s, a new design for managing databases was being developed based on
the original work of E. F. Codd. The underlying model was known as the relational model
and described a way of collecting, accessing, and manipulating data using
mathematical principles.
Over the following decade, the SQL language was developed, and, though it doesn’t follow the
relational model completely, it attempts to make the database accessible using a simple
language.
The SQL language has been improved, enhanced, and further developed over the
years, and in the late 1980s, the language was developed into a standard of both ANSI
(the American National Standards Institute) and ISO (the International Organization for
Standardization, and, that’s right, it doesn’t spell ISO).
The takeaways from this very brief history are
• SQL is a developing language, and there are new features and new
techniques being added all the time.
The second half of the third point is worth stressing. Nobody quite sticks to the SQL
standards. There are many reasons for this, some good, some bad. But you’ll probably
find that the various dialects of SQL are about 80–90% compatible, and the rest we’ll fill
you in on as we go.
In this book, you’ll learn about using SQL to a level which goes beyond the basics.
Some things you’ll learn about are newer features in SQL; some are older features that
you may not have known about. We’ll look at a few non-standard features, and we’ll also
look at using features that you already know about, but in more powerful ways.
This book is not for the raw beginner—we assume you have some knowledge and
experience in SQL. If you are a raw beginner, then you will get more from my previous
book, Getting Started with SQL and Databases;1 you can then return to this book full of
confidence and enthusiasm with a good solid grounding in SQL.
If you have the knowledge and experience, the first chapter will give you a quick
overview of the sort of knowledge you should have.
The first chapter will go into the details of getting your DBMS software and sample
database ready. It will also give you an overview of the story behind the sample database.
Notes
While you’re writing SQL to work with the data, there’s a piece of software at the other
end responding to the SQL. That software is referred to generically as a database server,
and, more specifically, as a DataBase Management System, or DBMS to its friends. We’ll
be using that term throughout the book.
The DBMSs we’ll be covering are PostgreSQL, MariaDB, MySQL, Microsoft SQL
Server, SQLite, and Oracle. We’ll assume that you’re working with reasonably current
versions of the DBMSs.
Chapter 1 will go into more details on setting up your DBMS, as well as downloading
and installing the sample database.
Source Code
All source code used in this book can be downloaded from github.com/apress/
leveling-up-sql.
1
https://link.springer.com/book/10.1007/978-1-4842-9493-2
CHAPTER 1
Getting Ready
If you’re reading this book, you’ll already know some SQL, either through previous study
or through bitter experience, or, more likely, a little of both. In the process, there may be
a few bits that you’ve missed, or forgotten, or whose point you couldn’t quite see.
We’ll assume that you’re comfortable enough with SQL to get the basic things
done, which mostly involves fetching data from one or more tables. You may even have
manipulated some of that data or even the tables themselves.
We won’t assume that you consider yourself an expert in all of this. Have a look in
the section “What You Probably Know Already” to check the sort of experience we think
you already have. If there are some areas you’re not completely sure about, don’t panic.
Each chapter will include some of the background concepts which should take you to
the next level.
If all of this is a bit new to you, perhaps we can recommend an introductory book. It’s
called Getting Started with SQL and Databases by Mark Simon, and you can learn more
about it at https://link.springer.com/book/10.1007/978-1-4842-9493-2.
• BookWorks will then procure the books and ship them to customers
at some point.
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1_1
To manage all of this, the database tables look something like Figure 1-1.
In real life, there’s more to the story. For example, we haven’t included payment or
shipping methods, and we haven’t included login credentials. There’s no stock either,
although we’ll presume that the books are ordered on demand.
But there’s enough in this database for us to work with as we develop and improve
our SQL skills.
Setting Up
You can sit in a comfortable chair with a glass of your favorite refreshment and a box of
nice chocolates and read this book from cover to cover. However, you’ll get more from
this book if you join in on the samples.
• PostgreSQL
• MariaDB/MySQL
• Microsoft SQL Server
• SQLite
• Oracle
PostgreSQL, MariaDB/MySQL, and SQLite are all free. Microsoft SQL Server and
Oracle are paid products, but have free versions.
MariaDB is a spin-off of MySQL, which is why they’re treated together. They are
almost identical in features, but you’ll find a few places where they’re not identical.
It’s possible—even likely—that you already have the DBMS installed. Just make
sure that
If you can’t make changes to the database, you can still work with most of the book,
and you’ll just have to nod your head politely as you’re reading Chapter 2, in which we
make a few changes to the database. You might also have some difficulty in creating
views, which we cover in Chapter 6 and in other chapters.
Database Client
You’ll also need a database client. All the major DBMS vendors have their own free
client, and there are plenty of free and paid third-party alternatives.
1. For your DBMS, create your sample database. If you can’t think
of a better name, bookworks is a fine name. For most DBMSs,
you can run
2. Using the preceding link, select the options for your DBMS.
For this sample, you should select the “Book Works” sample
(Step 2), as well as the additional Towns and Countries tables
(Step 4).
• Writing SQL
• Basic SQL
• Data Types
• SQL Clauses
• Calculating Columns
• Joins
• Aggregates
• Manipulating Data
• Set Operations
This is a summary of what you will have encountered in the prior book. Some of
these topics will be pushed further in the following chapters.
• A value is the content of the data. It may be NULL, which means that
you don’t have the value, and it may be duplicated because, well,
these things happen.
We may use the term “information” loosely to refer to data, but it’s really not the
same thing.
Database Tables
SQL databases store data in one or more tables. In turn, a table presents the data in rows
and columns. You get the picture in Figure 1-2.
A row is an instance of the data, such as a book in the books table or a customer in
the customers table. Columns are used for details, such as the given name of a customer
or the title of a book. Figure 1-3 gives the idea.
• Row order is not significant: You can sort them if you like, but the row
order has no real significance.
• Rows are unique: You don’t have two rows describing the same thing.
• Columns are of a single type: You can’t mix types in a single column.
One important consequence of this is that columns should never be used to hold
multiple values, either singly or in combination. This means that such values belong in
a separate related table, one value per row.
There are a few additional rules, but they are more fine-tuning of the basic principles.
SQL uses the term “table” in two overlapping ways: for the data stored in the
database and for the tabular data generated by a query. When we need to refer to the
generated table data, we’ll use the term virtual table to make the point clear.
Writing SQL
SQL is a simple language which has a few rules and a few recommendations for
readability:
• SQL is relaxed about using extra spacing. You should use as much
spacing as required to make your SQL more readable.
• The SQL language is case insensitive, as are the column names. Table
names may be case sensitive, depending on the operating system.
Microsoft SQL is relaxed about the use of semicolons, and many MSSQL
developers have got in the bad habit of forgetting about them. However, Microsoft
strongly encourages you to use them, and some SQL may not work properly if
you get too sloppy. See https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql#transact-sql-syntax-conventions-transact-sql.
If you remember to include semicolons, you’ll stay out of trouble.
Remember, some parts of the language are flexible, but there is still a strict syntax to
be followed.
Basic SQL
The basic statement used to fetch data from a table is the SELECT statement. In its simplest
form, it looks like this:
SELECT ...
FROM ...;
• The SELECT statement will select one or more columns of data from
a table.
Calculated columns should be named with an alias; noncalculated columns can also
be aliased.
A comment is additional text for the human reader which is ignored by SQL:
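As a sketch, here is a query against the sample customers table with a calculated column, an alias, and a comment. (The `||` concatenation operator is standard, but note that MSSQL uses `+` and MySQL/MariaDB normally use the concat() function instead.)

```sql
-- A comment: fetch customers with a friendlier name column
SELECT
    id,
    givenname || ' ' || familyname AS fullname,  -- calculated column, aliased
    postcode                                     -- noncalculated; alias optional
FROM customers;
```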
Data Types
Broadly, there are three main data types:
• Numbers
• Strings
Number literals are represented bare: they do not have any form of quotes.
Numbers are compared in number line order and can be filtered using the basic
comparison operators.
String literals are written in single quotes. Some DBMSs also allow double quotes,
but double quotes are more correctly used for column names rather than values.
• In some DBMSs and databases, upper and lower case may not match.
• Trailing spaces should be ignored, but aren’t always.
SQL Clauses
For the most part, we use up to six clauses in a typical SELECT statement. SQL clauses are
written in a specific order. However, they are processed in a slightly different order, as in
Figure 1-4.
The important thing to remember is that the SELECT clause is the last to be evaluated
before the ORDER BY clause. That means that only the ORDER BY clause can use values
and aliases produced in the SELECT clause. (SQLite is the exception here: you can
indeed use aliases in the other clauses.)
As we’ll see later in the book, there are additional clauses which are extensions to the
ones we have here.
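The evaluation order can be sketched with the sample customers table. The alias is created in the SELECT clause, so only ORDER BY can see it:

```sql
-- The alias fullname exists only after the SELECT clause is evaluated
SELECT id, givenname || ' ' || familyname AS fullname
FROM customers
ORDER BY fullname;          -- OK: ORDER BY is evaluated after SELECT

-- This would fail in most DBMSs (SQLite excepted),
-- because WHERE is evaluated before SELECT:
-- SELECT id, givenname || ' ' || familyname AS fullname
-- FROM customers
-- WHERE fullname LIKE 'A%';
```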
SELECT columns
FROM table
WHERE conditions;
The conditions are one or more assertions, expressions which evaluate to true or not
true. If an assertion is not true, it’s not necessarily false either. Typically, if the expression
involves NULL, the result will be unknown, which is also not true.
• NULLs will always fail a comparison, such as the equality operator (=).
Testing for NULL requires the special expression IS NULL or IS NOT NULL.
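For example, assuming the postcode column in the sample customers table allows NULLs:

```sql
-- Correct: IS NULL is the special test for missing values
SELECT id, givenname, familyname
FROM customers
WHERE postcode IS NULL;

-- WHERE postcode = NULL would return no rows at all:
-- a comparison with NULL is never true
```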
Multiple Assertions
You can combine multiple assertions with the logical AND and OR operators. If you
combine them, AND takes precedence over OR.
The IN operator will match from a list. It is the equivalent of multiple OR expressions.
It can also be used with a subquery which generates a single column of values.
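A sketch of both points, using the sample customers table (the state values here are illustrative):

```sql
-- AND binds more tightly than OR; parentheses make the intent explicit
SELECT id, town, state
FROM customers
WHERE (state = 'VIC' OR state = 'NSW') AND postcode IS NOT NULL;

-- IN is the equivalent of the multiple OR expressions above
SELECT id, town, state
FROM customers
WHERE state IN ('VIC', 'NSW') AND postcode IS NOT NULL;
```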
Wildcard Matches
Strings can be compared more loosely using wildcard patterns and the LIKE operator.
• Some DBMSs allow you to use LIKE with non-string data, implicitly
converting them to strings for comparison.
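In a wildcard pattern, % matches any run of characters (including none), and _ matches exactly one character. A sketch against the sample customers table:

```sql
SELECT givenname, familyname
FROM customers
WHERE familyname LIKE 'Mc%'      -- family names beginning with Mc
   OR givenname LIKE '_o%';      -- given names whose second letter is o
```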
SELECT columns
FROM table
-- WHERE ...
ORDER BY ...;
The ORDER BY clause is both the last to be written and the last to be evaluated.
• Sorting does not change the actual table, just the order of the results
for the present query.
• You can sort using multiple columns, which will effectively group the
rows; column order is arbitrary, but will affect how the grouping is
effected.
• Some DBMSs will sort upper and lower case values separately.
Limiting Results
A SELECT statement can also include a limit on the number of rows. This feature has been
available unofficially for a long time, but is now an official feature.
SELECT ...
FROM ...
ORDER BY ... OFFSET ... ROWS FETCH FIRST ... ROWS ONLY;
SELECT ...
FROM ...
ORDER BY ... LIMIT ... OFFSET ...;
This is supported in PostgreSQL (which also supports OFFSET ... FETCH), MariaDB/
MySQL, and SQLite.
MSSQL also has a simple TOP clause added to the SELECT clause.
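Putting the variations side by side, sketched against the sample customers table (note that TOP on its own has no offset):

```sql
-- Standard syntax (PostgreSQL, MSSQL, Oracle):
SELECT id, familyname
FROM customers
ORDER BY familyname
OFFSET 10 ROWS FETCH FIRST 5 ROWS ONLY;

-- PostgreSQL, MariaDB/MySQL, SQLite:
SELECT id, familyname
FROM customers
ORDER BY familyname
LIMIT 5 OFFSET 10;

-- MSSQL only:
SELECT TOP 5 id, familyname
FROM customers
ORDER BY familyname;
```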
Sorting Strings
Sorting alphabetically is, by and large, meaningless. However, there are techniques to
sort strings in a more meaningful order.
Calculating Columns
In SQL, there are three main data types: numbers, strings, and dates. Each data type has
its own methods and functions to calculate values:
• For numbers, you can do simple arithmetic and calculate with more
complex functions. There are also functions which approximate
numbers.
• For dates, you can calculate an age between dates or offset a date.
You can also extract various parts of the date.
• For strings, you can concatenate them, change parts of the string, or
extract parts of the string.
• For numbers and dates, you can generate a formatted string which
gives you a possibly more friendly version.
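A small sketch of such calculations. The price column on the books table is hypothetical here, and function names vary between DBMSs (for example, substr() is SUBSTRING in MSSQL):

```sql
SELECT
    title,
    round(price * 1.1, 2) AS price_with_tax,   -- arithmetic plus an approximating function
    substr(title, 1, 10) AS short_title        -- extracting part of a string
FROM books;
```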
Aliases
Every column should have a distinct name. When you calculate a value, you supply this
name as an alias using AS. You can also do this with noncalculated columns to provide a
more suitable name.
Aliases and other names should be distinct. They should also follow standard
column naming rules, such as not being the same as an SQL keyword and not having
special characters.
If, for any reason, a name or an alias needs to break the naming rules, you can always
wrap the name in double quotes ("double quotes") or whatever the DBMS supplies as
an alternative.
Some DBMSs have an alternative to double quotes, but you should prefer double
quotes if possible.
Subqueries
A subquery is an additional SELECT statement used as part of the main query.
A column can also include a value derived from a subquery. This is especially useful
if you want to include data from a separate related table. If the subquery involves a value
from the main table, it is said to be correlated. Such subqueries can be costly, but are
nonetheless a useful technique.
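A sketch of a correlated subquery. It assumes the sales table carries a customerid foreign key; the exact column names in the sample database may differ:

```sql
-- For each customer, count their sales;
-- the subquery refers to the main table, so it is correlated
SELECT
    id, givenname, familyname,
    (SELECT count(*)
     FROM sales
     WHERE sales.customerid = customers.id) AS number_of_sales
FROM customers;
```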
Casting a Value
You may be able to change the data type of a value, using cast():
• You can change within a main type to a type with more or less detail.
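For example (type names vary by DBMS; Oracle would use NUMBER rather than int, and the postcode cast assumes the values are all digits):

```sql
SELECT
    cast(postcode AS int) AS postcode_number,   -- string to number
    cast(3.14159 AS decimal(4, 2)) AS approx    -- more detail to less
FROM customers;
```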
Views
You can save a SELECT statement into the database by creating a view. A view allows you
to save a complex statement as a virtual table, which you can use later in a simpler form.
Views are a good way of building a collection of useful statements.
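A minimal sketch, using the sample customers table; the view name is our own invention:

```sql
-- Save a SELECT statement as a virtual table
CREATE VIEW customer_locations AS
SELECT id, givenname, familyname, town, state, postcode
FROM customers;

-- Later, use it like any other table, in a simpler form
SELECT * FROM customer_locations WHERE state = 'VIC';
```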
Joins
Very often, you will create a query which involves data from multiple tables. Joins
effectively widen tables by attaching corresponding rows from the other tables.
The basic syntax for a join is
SELECT columns
FROM table JOIN table;
There is an older syntax using the WHERE clause, but it’s not as useful for most joins.
Although tables are joined pairwise, you can join any number of tables to get results
from any related tables.
When joining tables, it is best to distinguish the columns. This is especially important
if the tables have column names in common:
The ON Clause
The ON clause is used to describe which rows from one table are joined to which rows
from the other, by declaring which columns from each should match.
The most obvious join is from the child table’s foreign key to the parent table’s
primary key. More complex joins are possible.
You can also create ad hoc joins which match columns which are not in a fixed
relationship.
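A sketch of an ON clause, with table aliases to distinguish the columns. For illustration only, it assumes a hypothetical authorid foreign key on the books table; as we'll see, the sample database actually treats books and authors more elaborately:

```sql
SELECT b.title, a.givenname, a.familyname
FROM books AS b
JOIN authors AS a ON b.authorid = a.id;   -- child's foreign key to parent's primary key
```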
Join Types
The default join type is the INNER JOIN. The INNER is presumed when no join type is
specified:
• An INNER JOIN results only in child rows for which there is a parent.
Rows with a NULL foreign key are omitted.
There is also a CROSS JOIN, which combines every row in one table with every row
in the other. It’s not generally useful, but can be handy when you cross join with a single
row of variables.
Aggregates
Instead of just fetching simple data from the database tables, you can generate various
summaries using aggregate queries. Aggregate queries use one or more aggregate
functions and imply some groupings of the data.
Aggregate queries effectively transform the data into a secondary summary table.
With grand total aggregates, you can only select summaries. You cannot also select
nonaggregate values.
The main aggregate functions include
• min() and max() which fetch the first or last of the values in sort order
• sum(), avg(), and stdev() (or stddev()) which perform the sum,
average, and standard deviation on a column of numbers
When it comes to working with numbers, not all numbers are used in the same way,
so not all numbers should be summarized.
For strings, you also have
In all cases, aggregate functions only work with values: they all skip over NULL.
You can control which values in a column are included:
• You can use DISTINCT to count only one instance of each value.
• You can use CASE ... END to work as a filter for certain values.
Without a GROUP BY clause, or using GROUP BY (), the aggregates are grand totals:
you will get one row of summaries.
You can also use GROUP BY to generate summaries in multiple groups. Each group is
distinct. When you do, you get summaries for each group, as well as additional columns
with the group values themselves.
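Both forms sketched against the sample customers table:

```sql
-- Grand total: one row of summaries, no nonaggregate columns allowed
SELECT count(*) AS number_of_customers
FROM customers;

-- Grouped: one summary row per state, plus the group value itself
SELECT state, count(*) AS number_of_customers
FROM customers
GROUP BY state;
```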
Aggregates are not limited to single tables:
In many cases, it makes sense to work with your aggregates in more than one step.
For that, it’s convenient to put your first step into a common table expression, which is a
virtual table which can be used with the next step.
When grouping your data, sometimes you want to filter some of the groups. This is
done with a HAVING clause, which you add after the GROUP BY clause.
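Both ideas in one sketch: a common table expression holds the first step, and HAVING filters the groups inside it:

```sql
WITH state_counts AS (
    SELECT state, count(*) AS number_of_customers
    FROM customers
    GROUP BY state
    HAVING count(*) > 10        -- keep only the bigger groups
)
SELECT *
FROM state_counts
ORDER BY number_of_customers DESC;
```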
• Column names
• Data types
A table design can be changed afterward, such as adding triggers or indexes. More
serious changes, such as adding or dropping columns, can be effected using ALTER
TABLE statements.
Data Types
There are three main types of data:
• Numbers
• Strings
• Dates
There are many variations of the preceding types which make data storage and
processing more efficient and help to validate the data values.
There are also additional types such as boolean or binary data, which you won’t see
so much in a typical database.
Constraints
Constraints define what values are considered valid. Standard constraints include
• NOT NULL
• UNIQUE
• DEFAULT
Foreign Keys
A foreign key is a reference to another table and is also regarded as a constraint, in that it
limits values to those which match the other table.
The foreign key is defined in the child table.
A foreign key also affects any attempt to delete a row from the parent table. By
default, the parent row cannot be deleted if there are matching child rows. However, this
can be changed to either (a) setting the foreign key to NULL or (b) cascading the delete to
all of the children.
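A sketch of a child table carrying the foreign key. The table here is purely illustrative, not the actual sample schema, and autonumbering syntax varies by DBMS:

```sql
CREATE TABLE sales (
    id INT PRIMARY KEY,
    customerid INT,
    FOREIGN KEY (customerid) REFERENCES customers (id)
        ON DELETE SET NULL   -- or ON DELETE CASCADE; the default blocks the delete
);
```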
Indexes
Since tables are not stored in any particular order, they can be time-consuming to search.
An optional index can be added for any column you routinely search, which makes
searching much quicker.
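For example, if you routinely search the sample customers table by family name (the index name is our own convention):

```sql
CREATE INDEX ix_customers_familyname ON customers (familyname);
```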
Manipulating Data
Data manipulation statements are used to add or change data. In addition to the SELECT
statement, there are
Set Operations
In SQL, tables are mathematical sets of rows. This means that they contain no duplicates
and are unordered. It also means that you can combine tables and virtual tables with set
operations.
• UNION combines two or more tables and results in all of the rows, with
any duplicates filtered out. If you want to keep the duplicates, you use
the UNION ALL clause.
• EXCEPT (a.k.a. MINUS in Oracle) returns the rows in the first table
which are not also present in the second.
When applying a set operation, there are some rules regarding the columns in each
SELECT statement:
• Only the names and aliases from the first SELECT are used.
• Columns are matched by position and value, which means that if your various
SELECTs change the column order or select different columns, they
will still be combined as long as the corresponding types are compatible.
A SELECT can include any of the standard clauses, such as WHERE and GROUP BY, but
not the ORDER BY clause. You can, however, sort the final results with an ORDER BY at
the end.
Set operations can also be used for special techniques, such as creating sample data,
comparing result sets, and combining aggregates.
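A sketch of the rules above, using two columns from the sample customers table purely for illustration:

```sql
-- The alias comes from the first SELECT; columns match by position
SELECT town AS place FROM customers
UNION                 -- duplicates filtered out; UNION ALL would keep them
SELECT state FROM customers
ORDER BY place;       -- ORDER BY is allowed only at the very end
```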
Coming Up
As we said, we won’t presume that you’re an expert in all of this. As we introduce the
following chapters, we’ll also recap some of the basic principles to help you steady
your feet.
In the chapters that follow, we’ll have a good look at working with the
following ideas:
• How to improve the reliability and efficiency of the database tables
(Chapter 2)
• How the tables are related to each other and how to work with
multiple tables (Chapter 3)
• How to manipulate the values to get more value out of the values
(Chapter 4)
In Chapter 2, we’ll make a few changes to the database tables and even add a few
more tables to improve its overall design. It won’t be perfect, but it will show how a
database can be further developed.
CHAPTER 2
Working with Table Design
Of course, we won’t be able to make the table perfect: that would take a long time
and a lot of experience with the database. You mightn’t even be in a position to do this
with your database. However, we’ll be able to get a better understanding of what makes a
database work better.
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1_2
Developing a database table often starts with a rough idea of the sort of data to be
stored and then goes through a process of normalization: examining the features of the
table and making changes to fit in better with the requirements of a normal table.
There are various rules and levels of normalization. In general, a normalized
database meets the following requirements:
• Data is atomic.
• Row order is insignificant.
Of course, you will always see the rows in some sort of order, but
the row order is insignificant.
• Columns are of a single type.
You can’t mix data types in a single column. Strictly, you shouldn’t
be able to mix domains, which are sets of acceptable values, but
SQL has a hard time checking that.
One of the design problems that normalization addresses deals with multiple values.
If data is to be atomic, and columns are to be independent, how do you manage multiple
values? For example, how do you handle sales with multiple sale items or books with
multiple genres?
The solution is to put these values into a separate table, one value per row, and let
the table refer back to the first table. We will be looking at the relationships between
tables, especially in handling multiple values, in the next chapter. The basic idea will be
that an additional table will hold multiple values in multiple rows; it will then include a
foreign key, a reference to a primary key in the first table.
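As a sketch of the idea, here is how multiple genres per book might be split out into their own table. The names are illustrative, not necessarily those of the sample database:

```sql
-- One genre per row; bookid refers back to the books table
CREATE TABLE book_genres (
    bookid INT,
    genre  VARCHAR(40),
    PRIMARY KEY (bookid, genre),             -- each pairing recorded once
    FOREIGN KEY (bookid) REFERENCES books (id)
);
```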
In this chapter, we’ll look at some of these principles and see how our database
compares against them. When we do find shortcomings, we’ll make some changes
to the tables themselves and even add a few more tables to make the database more
conformant to these ideas. We’ll also look at ways of ensuring that the data is more
reliable and, to some extent, more efficient.
We’ll begin by addressing problems with interdependent columns, which will mean
changing some tables. It will also mean adding additional tables. To make these changes
less inconvenient, we’ll look at creating views to take in the additional tables.
We’ll also look at how to work toward a more reliable database by adding additional
constraints—data rules—to check what goes into the table in the first place.
Finally, we’ll look at adding indexes to improve the performance of the database.
When it comes to the question of multiple values, we’ll see more on that in the next
chapter, which deals with how tables are related with each other.
No database is perfect, and it won’t be our aim to make this one perfect: we’ll leave a
lot of work undone. It’s also quite likely not your job anyway. However, at least we’ll get a
better understanding of what makes a good database work.
SELECT
id, givenname, familyname,
street, town, state, postcode
FROM customers;
you will see that there is indeed a relationship between some of the address columns.
~ 303 rows ~
For example, if you change your address from one town to another, you will probably
also need to change the postcode and possibly the state. On top of that, people living in the
same town probably also have the same postcode; certainly, they will be in the same state.
This creates a maintenance problem.
The table also includes an autonumbering id column, which is the primary key. A
primary key is a column which uniquely identifies a row. In this case, it’s an arbitrary
number. The actual details will depend on the DBMS you’re using.
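The creation script itself isn't reproduced here, but a sketch of the towns table might look like this (PostgreSQL-flavored autonumbering; the column sizes are assumed):

```sql
CREATE TABLE towns (
    id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name VARCHAR(48),
    state VARCHAR(3),
    postcode CHAR(4),
    UNIQUE (name, state, postcode)  -- the combination must be unique
);
```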
Although there will be duplicated names, states, and postcodes, the combination
will be unique.
The preceding UNIQUE clause also creates an index, which will make searching the
table faster. You will learn more about indexes later.
You can run this script now to create and populate the towns table.
Depending on your DBMS, you may need to make sure that you are installing this
table into the correct database.
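The statement under discussion adds the townid column with a named foreign key; a sketch (the exact syntax varies slightly between DBMSs):

```sql
ALTER TABLE customers
ADD townid INT
CONSTRAINT fk_customers_town REFERENCES towns(id);
```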
Note that the townid column must match the data type of the id column in the towns
table, which, in this case, is an integer.
You’ll notice that it doesn’t actually use the term FOREIGN KEY. It’s the keyword
REFERENCES that makes it a foreign key: in this case, it references an id in the towns table.
You’ll also notice the naming of the foreign key using CONSTRAINT fk_customers_
town. Every constraint actually has a name, but you don’t have to name it yourself if
you’re prepared to allow the DBMS to make one up. If so, you can use a shorter form:
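A sketch of that shorter form, letting the DBMS generate the constraint name:

```sql
ALTER TABLE customers
ADD townid INT REFERENCES towns(id);
```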
If you already had the column, you could have added the foreign key constraint
retroactively with
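A sketch of that retroactive form, using standard FOREIGN KEY syntax:

```sql
ALTER TABLE customers
ADD CONSTRAINT fk_customers_town
FOREIGN KEY (townid) REFERENCES towns(id);
```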
By default, when you create a new column, it will be filled with NULLs. You could have
added a default value instead, but that would be pointless in this case, since every
customer lives somewhere different; in some cases, we don't have the customer's address at all.
SELECT
id, givenname, familyname,
town, state, postcode, -- existing data
(SELECT id FROM towns AS t WHERE -- new data
t.name=customers.town
AND t.postcode=customers.postcode
AND t.state=customers.state
) AS reference
FROM customers;
Some of these results will, of course, be NULL, as some of the customers have no
recorded address.
~ 303 rows ~
A subquery is a query within a query. In this case, it’s a simple way of looking up
something from another table.
This subquery is a correlated subquery: it is run for every row in the main query,
using values from the main query to compare to the subquery. That’s normally an
expensive type of query, but we won’t use it very much. It will also be useful for the
next step.
You will learn more about subqueries later.
Note that we have aliased the towns table in the subquery; that’s to make the code
easier to read and write. You could also have aliased the customers table, but that won’t
work for all DBMSs in the next step.
We don’t just want to look at the reference: we want to copy the reference into the
customers table. You do that with an UPDATE statement:
UPDATE customers
SET townid=(
SELECT id FROM towns AS t
WHERE t.name=customers.town
AND t.postcode=customers.postcode
AND t.state=customers.state
);
The UPDATE statement is used to change values in an existing table. You can
set the value to a constant value, a calculated value, or, as in this case, a value from
another table.
Here, the same subquery is used to fetch the id that will be copied into the
townid column.
Some DBMSs allow you to alias the customers table, which would make the UPDATE
statement a little simpler.
A correlated subquery can be expensive, and it’s normally preferable to use a join
if you can. We could have used a join for the SELECT statement, but not all DBMSs
cooperate so well with UPDATE statements. Here, the subquery is intuitive and
works well, and, since you’re only running this once, not too expensive.
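For comparison, in a DBMS that allows a join in an UPDATE, such as Microsoft SQL, the join form might look like this sketch:

```sql
-- MSSQL-style sketch: UPDATE with a join instead of a correlated subquery
UPDATE c
SET townid = t.id
FROM customers AS c
JOIN towns AS t
    ON t.name = c.town
    AND t.postcode = c.postcode
    AND t.state = c.state;
```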
SELECT
c.id, c.email, c.familyname, c.givenname,
c.street,
-- original values
c.town, c.state, c.postcode,
c.townid,
-- from towns table
t.name AS town, t.state, t.postcode,
c.dob, c.phone, c.spam, c.height
FROM customers AS c LEFT JOIN towns AS t ON c.townid=t.id;
If you’re doing this in Oracle, remember that you can’t use AS for the table aliases:
SELECT
...
FROM customers c LEFT JOIN towns t ON c.townid=t.id;
Note that
• We use the LEFT JOIN to include customers without an address.
• We alias the customers and towns tables for convenience.
• The towns table has a name column, instead of the town column.
However, in the context of the query, it makes sense to alias it to town.
• We’ve also included the c.townid column, which, though it’s
redundant, might make it easier to maintain.
Once you have checked that the SELECT statement does the job, you can create a
view. Of course, you should leave out the old town data, since the whole point is to use
the data from the joined data:
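For most DBMSs, the view might be created like this (Oracle users should drop the AS before the table aliases):

```sql
CREATE VIEW customerdetails AS
SELECT
    c.id, c.email, c.familyname, c.givenname,
    c.street,
    c.townid, t.name AS town, t.state, t.postcode,
    c.dob, c.phone, c.spam, c.height
FROM customers AS c LEFT JOIN towns AS t ON c.townid=t.id;
```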
In Microsoft SQL, you need to wrap the CREATE VIEW statement between a pair of GO
keywords:
-- MSSQL:
GO
CREATE VIEW customerdetails AS
SELECT
c.id, c.email, c.familyname, c.givenname,
c.street,
c.townid, t.name as town, t.state, t.postcode,
c.dob, c.phone, c.spam, c.height
FROM customers AS c LEFT JOIN towns AS t ON c.townid=t.id;
GO
Here, we use DROP COLUMN which removes one or more columns and, of course, all of
their data, so you would want to be sure that you don’t need it anymore. As you’ve seen
earlier, there are some variations in the syntax between DBMSs.
In Microsoft SQL, you will get an error that you can’t drop the postcode column
because there is an existing constraint. A constraint is an additional rule for a
valid value.
In this case, there is a constraint called ck_customers_postcode which requires that
postcodes comprise four digits only. You won’t need that constraint now, especially
since you’re going to remove the column.
To remove the constraint, run
-- MSSQL
ALTER TABLE customers
DROP CONSTRAINT ck_customers_postcode;
Once you have successfully removed the constraint, you can now remove the
columns:
Remember, if you drop the wrong column, it is very tricky or impossible to get
it back.
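The removal itself might look like this sketch (Microsoft SQL writes DROP COLUMN town, state, postcode instead, and SQLite would need the longer recreation process):

```sql
ALTER TABLE customers
DROP COLUMN town,
DROP COLUMN state,
DROP COLUMN postcode;
```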
Note that we’re reading from the customerdetails view, because the town data is no
longer in the customers table, though the townid is.
Now, change the customer’s townid to anything you like (as long as it’s no more than
the highest id in the towns table):
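For example (the id values here are purely hypothetical):

```sql
UPDATE customers
SET townid = 42   -- any existing towns.id
WHERE id = 1;     -- one customer

SELECT * FROM customerdetails WHERE id = 1;
```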
Of course, you can’t really update a view because it’s really just a SELECT statement
and doesn’t contain any data. Instead, the DBMS tries to work out which table the
particular column belongs to and passes the change on to the table. There are times
when it can’t work that out, such as when you try to update a calculated column. In that
case, the update will fail, and you’ll have to update the table directly.
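The first step, not shown above, would create the countries table itself; a sketch (the name column size is assumed):

```sql
CREATE TABLE countries (
    id CHAR(2) PRIMARY KEY,   -- two-letter country code
    name VARCHAR(48)
);
```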
2. Add a countryid column to the towns table, similar to the way you
added townid to the customers table. Remember, the data type
must match the preceding primary key:
-- MySQL / MariaDB
ALTER TABLE towns
ADD countryid CHAR(2) REFERENCES countries(id);
3. Update the towns table to set the value of countryid to 'au' for
Australia or whichever country you choose. This is much simpler
than setting it from a subquery:
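A sketch of that update, using the two-letter code for Australia:

```sql
UPDATE towns
SET countryid = 'au';
```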
4. You will have to modify your view. First, drop the old version:
-- Not Oracle:
DROP VIEW IF EXISTS customerdetails;
-- Oracle:
DROP VIEW customerdetails;
-- Not Oracle
CREATE VIEW customerdetails AS
SELECT
...
c.townid, t.name AS town, t.state, t.postcode,
n.name AS country
...
FROM
customers AS c
LEFT JOIN towns AS t ON c.townid=t.id
LEFT JOIN countries AS n ON t.countryid=n.id;
Note
• This includes an additional JOIN to the countries table; to
accommodate the longer clause, we have split the JOIN over
multiple lines.
• The alias for the countries table has been set to n (for Nation);
this is simply because we can’t use c as it is already in use.
Additional Comments
You may have noticed that we didn't do anything about the street address column.
Strictly speaking, this is also subject to the same issues as the rest of the address, so it
would have been better if we had done something similar.
However, street addresses are much more complicated, and we don’t have so many
customers, so we have left them as they are. This leaves us with an imperfect but much
improved design.
In all cases, of course, there’s no guarantee that the value is true—just valid.
If you want to get more specific in your definition of what is valid, there is also the
CHECK constraint. The CHECK is a miscellaneous constraint which allows you to set up
your own rules using an expression similar to a WHERE clause. Sometimes, these are
called business rules.
In this section, we’ll look at some weaknesses of the database and try to fill in some
of the design gaps by adding some constraints.
Much of the following will involve making changes to existing columns. If you’re
using SQLite, then, sadly, you can’t do that. SQLite has very limited ALTER TABLE
functionality, and you can’t make changes to existing columns. If you really need to
make such changes, you would have to go through a more complicated process of
dropping a column and creating a new one.
id  saleid  bookid  quantity  price
1   1       1403    1         11.5
2   1       1861    1         13.5
3   1       643     [NULL]    18
4   2       187     1         10
5   2       1530    1         12.5
6   2       1412    2         16
~ 13964 rows ~
However, through an oversight, the column allows NULL, which, if you look far
enough, you’ll find in a number of rows. That doesn’t make sense: you can’t have a sale
item if you don’t know how many copies it’s for.
It’s reasonable to guess that a missing quantity suggests a quantity of 1. You can
implement this guess using coalesce():
SELECT
id, saleid, bookid,
coalesce(quantity,1) AS quantity, price
FROM saleitems
ORDER BY saleid, id;
Now we’ll get the same results, except that the NULLs have been replaced with 1:
id  saleid  bookid  quantity  price
1   1       1403    1         11.5
2   1       1861    1         13.5
3   1       643     1         18
4   2       187     1         10
5   2       1530    1         12.5
6   2       1412    2         16
~ 13964 rows ~
As always with the coalesce() function, you need to check your assumptions. Is 1
really a reasonable guess? In this case, it’s unlikely to mean zero copies or any other
number, but it all really depends on the situation. For the exercise, we’ll just play along...
We certainly don’t want to keep doing this every time, so we’re going to fix the old
values and prevent the NULLs in the future.
What follows won’t work with SQLite. However, there is a section after this which is
what you might do to make the same changes in SQLite.
UPDATE saleitems
SET quantity=1
WHERE quantity IS NULL;
From here, we won’t need to use coalesce() on existing data, but we need to
prevent NULLs in the future.
-- PostgreSQL
ALTER TABLE saleitems ALTER COLUMN quantity SET NOT NULL;
-- MySQL/MariaDB
ALTER TABLE saleitems MODIFY quantity INT NOT NULL;
-- MSSQL
ALTER TABLE saleitems ALTER COLUMN quantity INT NOT NULL;
-- Oracle
ALTER TABLE saleitems MODIFY quantity NOT NULL;
-- Not Possible in SQLite
Earlier, the ALTER TABLE statement was used to add or remove a column. You can
also use it to make changes to an existing column. Here, we use it to add a NOT NULL
constraint.
As you’ve seen earlier, each DBMS has its own subtle variation on the ALTER TABLE
statement.
-- PostgreSQL
ALTER TABLE saleitems
ALTER COLUMN quantity SET DEFAULT 1;
-- MySQL/MariaDB
ALTER TABLE saleitems
MODIFY quantity INT DEFAULT 1;
-- MSSQL
ALTER TABLE saleitems
ADD DEFAULT 1 FOR quantity;
-- Oracle
ALTER TABLE saleitems
MODIFY quantity DEFAULT 1;
-- Not Possible in SQLite
The DEFAULT value is the value used if you don’t supply a value of your own. The
column doesn’t have to be NOT NULL, and NOT NULL columns don’t have to have a
DEFAULT. However, in this case, it’s a reasonable combination.
Again, note that each DBMS has its own subtle variation on the syntax.
CHECK (quantity>0)
You could also impose an upper limit by using the BETWEEN expression:
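For example, the upper-limit version might read like this sketch (the bounds are arbitrary illustrations):

```sql
CHECK (quantity BETWEEN 1 AND 100)
```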
-- PostgreSQL
ALTER TABLE saleitems
ADD CHECK (quantity>0);
-- MySQL/MariaDB
ALTER TABLE saleitems
MODIFY quantity INT CHECK(quantity>0);
-- MSSQL
ALTER TABLE saleitems
ADD CHECK(quantity>0);
-- Oracle
ALTER TABLE saleitems
MODIFY quantity CHECK(quantity>0);
-- PostgreSQL
ALTER TABLE saleitems
ALTER COLUMN quantity SET NOT NULL,
ALTER COLUMN quantity SET DEFAULT 1,
ADD CHECK (quantity>0);
-- MySQL/MariaDB
ALTER TABLE saleitems MODIFY quantity INT
NOT NULL
DEFAULT 1
CHECK(quantity>0);
-- Oracle
ALTER TABLE saleitems MODIFY quantity
DEFAULT 1
NOT NULL
CHECK(quantity>0);
-- Not Possible in MSSQL
-- Not Possible in SQLite
Since you don’t actually make this sort of change terribly often, you lose nothing if
you keep the steps separate.
• Rename a column
• Drop a column
However, that’s enough to make the changes we want, as long as we’re happy with a
different column order.
To make all of the preceding changes
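Steps 1 and 2 aren't shown above; roughly, they would rename the old column out of the way and add its replacement with the new constraints (a sketch; SQLite 3.25 or later supports RENAME COLUMN):

```sql
-- 1. Rename the existing column
ALTER TABLE saleitems RENAME COLUMN quantity TO oldquantity;

-- 2. Add the replacement column with its constraints
ALTER TABLE saleitems
ADD quantity INT NOT NULL DEFAULT 1 CHECK (quantity > 0);
```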
3. Copy the data from the old column to the new one:
UPDATE saleitems
SET quantity=oldquantity;
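The last step, also not shown, would drop the old column:

```sql
-- 4. Remove the old column
ALTER TABLE saleitems DROP COLUMN oldquantity;
```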
The new column will be at the end, which is not where the original was, but that’s not
really a problem.
Other Adjustments
As often happens in the development process, it's not hard to get something working; the
main effort goes into making it work just right. Here are some suggestions to improve
both the integrity and the performance of the database.
We’ll talk about indexes in the next section: they help in making the data easier to
search or sort.
Customers
height CHECK (height>0), or height BETWEEN 60 AND 260
dob CHECK (dob<current_timestamp)
registered CHECK (registered<current_timestamp)
Authors
names INDEX
dates CHECK (born<died)
gender CHECK (gender IN('m','f'))
CHECK (givenname IS NOT NULL OR familyname IS NOT NULL)
Books
authorid INDEX
title INDEX
published CHECK (published < year(current_timestamp))
price CHECK (price>=0)
Sales
Saleitems
saleid INDEX
bookid INDEX
quantity NOT NULL CHECK(quantity>0) DEFAULT 1
price CHECK(price>=0)
You’ll notice that some of the CHECK constraints aren’t associated with a single
column. Some constraints are more concerned with how one column relates to
another column.
We certainly won’t address all of these suggestions here. After all, this isn’t a
real working database, and it’s quite possibly not your job anyway. We’ll just look at
two more.
-- PostgreSQL
ALTER TABLE books ADD CHECK (price>=0);
-- MySQL/MariaDB
ALTER TABLE books MODIFY price INT CHECK(price>=0);
-- MSSQL
ALTER TABLE books ADD CHECK(price>=0);
-- Oracle
ALTER TABLE books MODIFY price CHECK(price>=0);
Again, to do this with SQLite, you can follow the steps for the quantity in saleitems
earlier.
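Before adding the constraint, you would check for rows that would violate it; a sketch:

```sql
SELECT * FROM authors WHERE born >= died;
```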
There shouldn’t be any. If there are, then you’re on your own. You’ll have to do your
own research on what the correct dates should be, or, if you’re desperate, you can set
them to NULL.
The next step would be to add the table constraint:
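A sketch of that constraint, using the rule suggested earlier and the naming pattern from before:

```sql
ALTER TABLE authors
ADD CONSTRAINT ck_authors_dates CHECK (born < died);
```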
Unlike adding a column constraint, the various DBMSs all use the same syntax—
except, of course, for SQLite. There is no simple method for adding a table constraint
in SQLite. Complex methods include dropping and recreating the whole table similar
to dropping and recreating a column or tampering with the internals of the database,
which is definitely not for the fainthearted.
Adding Indexes
SQL doesn’t define what order a table should be in. That leaves it up to the DBMS to
store the table in any way it deems most efficient.
The problem is that when searching for a particular row, it could be anywhere, and
the only way to find it is to look through the whole table and hope that it doesn’t take
too long.
If, on the other hand, the table were in order, it would be much easier to find what
you’re looking for. However, even if it’s in order, it’s just as likely to be in the order of the
wrong thing.
For example, even if the customers table is in, say, id order, it doesn’t help when
searching by familyname. If it’s in familyname order, it doesn’t help when searching
by phone.
The solution is to leave the table alone and then supplement the table with one or
more indexes. An index is an additional listing which is in search order, together with a
pointer to the matching row in the table.
For example, the customers table has an index for the familyname. When the time
comes to search on the familyname, the DBMS automatically looks up the index instead,
finds what it wants, and goes back to the real table to fetch the rest of the data.
There are two costs to having an index:
• An index takes additional storage.
• Every time you add or change a row in the table, each index will also
need to be updated.
For this reason, you will only find an index on a column if it has been specifically
requested in the table design. And you would only include an index if you considered the
improvement in search ability to be worth the cost in storage and management.
Another type of column which might be worth considering is a foreign key. That’s
because it will, of course, be heavily involved in searching and sorting.
Any other column would be a matter of judgment. At least it’s not hard to change
your mind about adding or removing an index at some point in the future.
Some DBMSs do include the ability to store the table in order of one column or
the other. This is called a clustered index or an index organized table. In some
DBMSs, such as Microsoft SQL, the clustering is permanent (the DBMS ensures
that the table is maintained in that order); in some others, it is temporary (the
DBMS sorts the table once, but you’ll have to do it again in the future).
Here, we’re ignoring clustering. In any case, you still can’t keep the table in
multiple orders, so you’ll need indexes anyway.
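The statement being described might look like this simple example (the index name is an assumption, following the pattern discussed shortly):

```sql
CREATE INDEX ix_customers_familyname
ON customers(familyname);
```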
The ON clause identifies the table and the columns you want listed.
It is possible to index multiple columns in a single statement, but that doesn’t create
multiple separate indexes. Instead, you create an index on the combined value. For
example:
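That combined index might be created like this sketch:

```sql
CREATE INDEX ix_authors_name
ON authors(familyname, givenname, othernames);
```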
This will create a single index of the authors’ familyname, givenname, and
othernames.
Even though the index is built around all three parts of the author’s name, it will still
be used if you just search for, say, the familyname. However, using a partial index that
way presumes that you’re at least using the first components of the index, which is why
the columns are in that order.
Note that in both statements earlier, the index has been given a name. There are
no rules for what that name should be, but developers have their own patterns. For
example, the preceding pattern is something like
ix_table_columns
This isn’t a rigid rule, but it makes things easier to work with.
Why does the index need a name anyway? Most of the time, you don’t really care.
However, there are two reasons:
• If you ever need to drop an index, you need to use its name to
identify it.
Even if you succeed in creating an anonymous index, the DBMS will automatically
assign its own name, which isn’t always a very pretty name.
Another index you might consider is on the foreign key authorid in the books table.
You can add it with
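A sketch, following the same naming pattern:

```sql
CREATE INDEX ix_books_authorid
ON books(authorid);
```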
Of course, you might also include an index on customer details or other details.
By grouping the names, you can count how many times they appear. Of course, since
you’re only interested in those that appear more than once, you can filter the results with
a HAVING clause:
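The query isn't shown above; it might look like this sketch:

```sql
SELECT familyname, givenname, count(*)
FROM customers
GROUP BY familyname, givenname
HAVING count(*) > 1;
```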
familyname  givenname  count
Free        Judy       2
Mate        Annie      2
Christmas   Mary       2
Tuckey      Ken        2
Ander       Corey      2
Dunnit      Ida        2
Bearer      Paul       2
Bell        Terry      2
In the preceding query, the rows are grouped by familyname and givenname and
summarized. The HAVING clause filters for those groups with more than one instance.
The SELECT clause then outputs those names and the number of instances.
You don’t need the count(*) in the SELECT clause, of course, but it helps to make the
result clearer.
Of course, it's no problem if you find duplicate family names: many people have the
same name as someone else. However, it can be a problem if you find duplicate phone numbers:
phone   count
[NULL]  17
In this case, there are no duplicates. What appear to be duplicates are NULLs, because
there are multiple NULLs in the table. They don’t count.
If you do find duplicates, then you have your work cut out for you in trying to work
out whether these duplicates are legitimate. You might even conclude that duplicate
phone numbers are OK, so you wouldn’t go ahead with the next step.
Assuming that duplicates are not OK, to protect against duplicates, you add a
UNIQUE INDEX:
-- Not MSSQL
CREATE UNIQUE INDEX uq_customers_phone
ON customers(phone);
Microsoft SQL has a quirk which regards multiple NULLs as duplicates,¹ so you will
need this workaround:
-- MSSQL
CREATE UNIQUE INDEX uq_customers_phone
ON customers(phone)
WHERE phone IS NOT NULL;
Note that this time the index name begins with uq as a reminder that this is a unique
index. Again, there are no rules for how to name the index, but this one follows a
common and understandable pattern.
Whether or not you really want to disallow duplicate phone numbers is another
question. Two customers from the same household or organization may well share
the same phone number, so disallowing them would be problematic. This is an
exercise in how to disallow duplicates, but not necessarily on whether to disallow
duplicates. That’s something best left to the needs of the individual database.
Review
A well-designed SQL database needs to follow a few rules to ensure that the data can be
relied upon. There is no guarantee that the data is true, but the data will at least be valid.
Normal Form
A table which follows certain design principles is said to be in a normal form. This
doesn’t mean that it’s commonplace, but rather that it is in a definitive form.
¹ This is odd, since constraints normally ignore NULLs, and NULL doesn't match NULL anyway.
• Data is atomic.
Multiple Values
One issue in developing tables is how to handle multiple values and recurring values. In
general, the solution is to have additional tables and to link them using foreign keys.
Altering Tables
When restructuring or hardening a database, you need to make changes to existing
tables and columns. The ALTER TABLE statement can be used to
• Drop columns
Constraints include adding NOT NULL, defaults, and additional CHECK constraints.
Views
A view is a saved SELECT statement. One reason to create a view is for the convenience
of having data from one or more tables in one place.
Sometimes, when you create a view with combined data, you end up with a result
which no longer follows all the rules of normalization. In the trade, this would be
referred to as denormalization.
Denormalized data is generally a bad way to maintain data, but very often a
convenient way to extract data. In this sense, it is the best of both worlds: the original
data is still intact in the original tables.
Some DBMSs include the ability to update data in a view. In fact, the update doesn't
affect the view at all, but is rather passed on to the underlying tables.
Indexes
An index is a supplement to a table which stores the selected data in order, together with
a reference to the data in the original table. Using the index, the DBMS can search for
data more quickly.
Indexes are automatically created for primary keys and unique columns. You can
add an index on any other column.
Indexes have some costs, so they shouldn’t be added for no reason. Costs include
storage and maintenance.
Unique indexes can be added to ensure that values in a particular column, or
combination of columns, are unique.
Summary
In this chapter, we focused on the properties of individual tables and looked for ways to
make the database more reliable and more efficient.
We looked at
The process of improving the database was, of course, incomplete, but it gives us a
better understanding of what makes a database more reliable and more efficient.
Coming Up
In this chapter, we’ve been focused on properties of individual tables, which help to
improve the integrity and efficiency of the tables.
In the next chapter, we’ll look more at how multiple tables interact.
CHAPTER 3
Table Relationships and Joins
A database is not just one table. Well, it can be of course, but any sophisticated database,
such as one which you would use to manage an online bookshop, will comprise a
number of tables, each handling a different collection of data.
While you can get some useful information from examining individual tables, you
will get so much more from combining tables.
In this chapter, we will look at working with multiple tables, how they are related to
each other, and how to combine them when the time comes.
Specifically, we’ll look at
We’ll look at why the database is structured this way with multiple tables and how we
can use joins to combine them into virtual tables.
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1_3
• Each table has one type of data only and doesn’t include data which
rightly belongs in another table.
That isn’t to say that the books table isn’t aware of the author at all.
We’ll look at that in a moment.
For example, if you were to include the author’s name and other
details in the books table, you would find yourself repeating the
same details for other books by the same author.
These two principles are related: if you mix author details with the books, violating
the first principle, you will end up repeating the details for multiple books, violating the
second principle.
The correct way to manage books and authors is to put author details in a separate
table and for the books table to include a reference to one of the authors. In this way, we
say that there is a relationship between the two tables.
The same would apply to the books and customers tables. Since the goal is for
customers to buy books, there should be a relationship between these tables as well.
However, this relationship is a little more complex, as we shall see later.
There are three main types of relationships:
One-to-Many Relationship
This is the most common type of relationship between two tables. The relationship is
between the primary key of one table and a foreign key in another. However, it’s actually
implemented as a reference from a foreign key to the primary key.
The relationship is used to indicate a number of possible scenarios. For example:
• One Author has written many Books.
Note that the use of the word many can imply any number from 0 to ∞.
In the preceding cases, one table is referred to as the one table, while the other is
referred to as the many table, which is not very informative. Sometimes, it is helpful to
think of the one table as the parent table, while the many table is the child table.
A one-to-many relationship is implemented as a reference from the child table to the
parent table, for example, for books and authors:
Note that while the child table has a reference to the parent table, the parent table
does not have a reference to the child table.
You can combine parent and child tables using a JOIN:
-- Not Oracle
SELECT
b.id, b.title, -- etc
a.givenname, a.familyname -- etc
FROM books AS b JOIN authors AS a ON b.authorid=a.id;
This will give you the books with their matching authors:
~ 1172 rows ~
Note that Oracle has a quirk which disallows using AS for table aliases. If you’re using
Oracle, you’ll need to remember that in the following examples which may include AS.
Remember, if there are anonymous books (books with a NULL for authorid), you will
need an outer join:
-- Not Oracle
SELECT
b.id, b.title, -- etc
a.givenname, a.familyname -- etc
FROM books AS b LEFT JOIN authors AS a ON b.authorid=a.id;
This will give you all of the books with or without their authors:
~ 1201 rows ~
Remember that SQLite doesn’t support RIGHT JOINs. If you want an outer
join, you need to put the unmatched row table on the left and make sure to use
LEFT JOIN.
In the previous example, we opted for a LEFT JOIN. When you join a child table to a
parent table, you generally have four options:
The first option is, of course, an INNER JOIN, or, more simply, JOIN.
The result would look like Figure 3-3.
You’ll notice that the join doesn’t include unmatched books or authors.
The second and third options are LEFT OUTER JOIN or RIGHT OUTER JOIN,
depending on whether the unmatched rows are on the left or the right; again, we
can simply write LEFT JOIN or RIGHT JOIN. In this case, a LEFT JOIN would include
unmatched books, as in Figure 3-4.
Most DBMSs (not including SQLite or MySQL/MariaDB) have a fourth option: include
all of the unmatched parents and children. That’s called a FULL OUTER JOIN or FULL JOIN
to its friends. This would include unmatched rows from both sides as in Figure 3-5.
In this case, we went for the LEFT JOIN because the child table was on the left, and
we wanted all of them with or without matches.
Despite the apparent symmetry, not all joins are equal. When you join a child to a
parent table, the number of results will generally reflect the child table. That’s because
many of the children would share the same parent.
To get a fair estimate of how many results you might expect, therefore, you should
start by counting the rows.
To get the number of results in an INNER JOIN, you’ll need to count the number of
children which match a parent—that is, where the foreign key is NOT NULL:
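Following the pattern of the counts that come next, that query might be:

```sql
-- Matched Children
SELECT count(*) FROM books WHERE authorid IS NOT NULL;
```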
That should get you the number of rows for the INNER JOIN previously:
Count
1172
To count the number of unmatched child rows, you just need to count the ones
where the foreign key is NULL:
-- Unmatched Children
SELECT count(*) FROM books WHERE authorid IS NULL;
That will give you the number of rows missing in the INNER JOIN:
Count
29
If you add this to the number for the INNER JOIN, you’ll get the total number of
books, which is the number of rows in the child OUTER JOIN earlier.
To get the number of unmatched parent rows is trickier. You’ll need to count the
number of rows in the parent table whose primary key is not one of the foreign keys in
the child table:
-- Unmatched Parents
SELECT count(*) FROM authors
WHERE id NOT IN(SELECT authorid FROM books WHERE authorid IS NOT NULL);
Count
45
The subquery selects for the authors whose id does make an appearance in the
books table. The NOT IN expression selects for the others. The reason that the subquery
includes the WHERE authorid IS NOT NULL clause is due to a quirk in the behavior of NOT
IN with NULLs. This is explained later.
Now, you have all the numbers you need to estimate the number of rows in your join.
You can use the following combinations:
JOIN              Calculation
INNER JOIN        matched children: 1172
Child OUTER JOIN  matched + unmatched children: 1172 + 29 = 1201
FULL JOIN         child outer join + unmatched parents: 1201 + 45 = 1246
That's the number of rows you can expect from a child outer join: LEFT JOIN or
RIGHT JOIN, depending on where you put the child table.
Of course, that’s not necessarily the end of it. If you have an inner join, and there are
some NULL foreign keys, then you’ll end up with fewer than the estimate. If you opt for a
parent outer join, then there’ll be more rows if you have parents without matching children.
However, this is a good starting point.
To find customers in the other states, you can use NOT IN:
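The query itself isn't shown here; as a sketch, where the state column and the sample state values are assumptions:

```sql
SELECT * FROM customers
WHERE state NOT IN ('NSW', 'VIC');
```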
That’s as expected. However, if you include NULL in your list, things get messy. You
need to remember how IN(...) is interpreted. For example:
69
Chapter 3 Table Relationships and Joins
is equivalent to
That last term state=NULL will always fail, since NULL always fails a comparison, but
that’s OK if it matches one of the others.
However, the NOT IN version:
is equivalent to
When you negate a logical expression, you not only negate the individual terms, but
you also negate the operators between them.
Once again, the term state<>NULL always fails, but, since this is now ANDed with the
rest, it fails the whole expression.
The moral of this story is that you can’t use NOT IN if the list contains NULLs.
You can drop an existing view using DROP VIEW. For most DBMSs, you can use DROP
VIEW IF EXISTS if you’re not sure that it exists (yet). Not with Oracle, however.
Microsoft SQL has an additional quirk: CREATE VIEW must be the only statement in
its batch, so you need to set the statement apart with the GO keyword, which marks
the end of one batch and the beginning of another:
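As a sketch, using the bookdetails view that appears later in the chapter; the exact definition is an assumption based on the columns it's used with:

```sql
-- MSSQL
DROP VIEW IF EXISTS bookdetails;
GO
CREATE VIEW bookdetails AS
SELECT
    b.id, b.title, b.published, b.price, b.authorid,
    a.givenname, a.othernames, a.familyname,
    a.born, a.died, a.gender, a.home
FROM books AS b LEFT JOIN authors AS a ON b.authorid=a.id;
GO
```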
We’ve included the authorid in case you want to use it to get more author details.
Once you have saved a view, you can pretend it’s another table:
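For example, a minimal sketch:

```sql
SELECT * FROM bookdetails;
```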
You’ll get the same results as before with a little less effort.
One-to-One Relationships
A one-to-one relationship associates a single row of one table with a single row of
another. It is normally between two primary keys.
If every row in one table is associated with a row in another table, then you can
consider the second table as an extension of the first table. If that’s the case, why not just
put all of the columns in the same table? Reasons include the following:
• You want to add more details, but you don’t want to change the
original table.
• You want to add more details, but you can’t change the original table
(possibly because of permissions).
• The additional table contains details that may be optional: not all
rows in the original table require the additional columns.
• You want to keep some of the details in a separate table so that you
can add another layer of security to the additional details.
One-to-Maybe Relationships
Technically, a true one-to-one relationship requires a reference from both tables to each
other. Among other things, it is hard to implement as it might require adding both rows
at the same time.
Since a row from table A must reference a row from table B, you would need to
have the table B row in place before you add to table A. However, if a row from
table B must also reference a row from table A, then you need to add to table A
first. That’s clearly a contradiction.
One way to do this would be to defer the foreign key constraint until after you’ve
added to both tables in either order. Unfortunately, most DBMSs don’t let you do
this, so you’re stuck with this impossible situation.
Here, the secondary table contains additional data for some of the rows in the
main table.
¹ One to maybe: My term. Others call it a one-to-zero-or-one, which is less snappy.
Note that this relationship is implemented by making the id in the secondary table
both a primary key and a foreign key.
For example, the vip table includes additional features for some customers:
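A sketch of its definition; the discount column is assumed from its use later in the chapter:

```sql
CREATE TABLE vip (
    -- Both a primary key and a foreign key:
    id int PRIMARY KEY REFERENCES customers(id),
    discount decimal(3,2)
    -- etc
);
```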
You can see all of the customers, some of whom also have VIP data:
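The two queries are simply:

```sql
SELECT * FROM customers;
SELECT * FROM vip;
```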
~ 303 rows ~
~ 81 rows ~
You’ll notice that there aren’t as many rows in the vip table as in the customers table.
There might have been, if every customer were a VIP, but not in this case.
You can see how they relate using a join:
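Following the notes below, the join would be:

```sql
SELECT c.*, v.*
FROM customers AS c LEFT JOIN vip AS v ON c.id=v.id;
```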
~ 303 rows ~
This gives all of the customers, with either their VIP data or NULLs in the extra
columns.
Note
• We need the LEFT JOIN to include non-VIP customers. If you
wanted VIP customers only, a simple (inner) JOIN would be better.
• We could have used SELECT *, but using c.*, v.* allows you to
decide which tables you are most interested in.
As a special case, you can also select VIP customers only, without additional VIP
columns, using
SELECT c.*
FROM customers AS c JOIN vip AS v ON c.id=v.id;
Here, the inner join selects only VIP customers, and the c.* selects only the
customer columns.
Why you would want to do this is, of course, up to you.
Multiple Values
One major question in database design is how to handle multiple values. The principles
of properly normalized tables preclude multiple values in a row:
Note that you can do this if there is a clear distinction in the type of
phone number. For example, you could legitimately have separate
columns for fax (does anybody remember these?), mobile, and
landline numbers.
For example, suppose we wish to record multiple genres for a book. Here are two
attempted solutions which are not correct:
The idea is that the genre column would have multiple genres or
genre ids, delimited possibly by a comma. The problem is that the
data is not atomic, and this becomes very difficult to sort, search,
and update. You will also need extra work to use the data.
You cannot have multiple columns with the same name, so these
columns might be called genre1, genre2, etc. Here, the problems
are (a) you will either have too many or not enough columns,
(b) there is no “correct” column for a particular value, and (c)
searching and sorting are impractical.
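As sketches, the two incorrect designs would look something like this (the column names are assumptions):

```sql
-- Incorrect: delimited values in a single column
CREATE TABLE books (
    id int PRIMARY KEY,
    title varchar,
    genres varchar    -- e.g. 'Fantasy,Science Fiction'
);

-- Incorrect: repeating columns
CREATE TABLE books (
    id int PRIMARY KEY,
    title varchar,
    genre1 varchar,
    genre2 varchar,
    genre3 varchar    -- too many or not enough?
);
```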
The problem of recording genres is more complicated, because not only can one
book have multiple genres, but one genre can also apply to multiple books. This is an
example of a many-to-many relationship.
This cannot be achieved directly between the two tables; rather, it involves an
additional table between them.
If you have the courage to look at the script which created and populated the
sample database, you’ll find a table called booksgenres (not to be confused
with the bookgenres table, which is, of course, completely different) which does
indeed have the genres combined in a single column. This is, of course, cheating.
This is one case where you might break the rules for the purpose of transferring or
backing up the data only. However, the data should never stay in this format.
Many-to-Many Relationships
To represent a many-to-many relationship between tables, you will need another table
which links the two others.
Such a table is called an associative table or a bridging table.
It looks like Figure 3-7.
-- Book Table
CREATE TABLE books (
id int PRIMARY KEY,
title varchar,
-- etc
);
-- Genre Table
CREATE TABLE genres (
id int PRIMARY KEY,
genre varchar,
description varchar
-- etc
);
The genres table includes a surrogate primary key. It also contains the actual genre
name and a description so that the use of the particular genre is clear.
You can see what’s in the two tables with simple SELECT statements:
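For example, a sketch of the genres query, whose results follow:

```sql
SELECT id, genre FROM genres;
```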
id Genre
1 Biology
2 Ancient History
3 Academia
4 Science
5 College
6 Comics
~ 166 rows ~
Neither table refers to the other. Instead, you need an additional table.
The associative table will then link books with genres:
-- Associative Table
CREATE TABLE bookgenres (
bookid int REFERENCES books(id),
genreid int REFERENCES genres(id)
);
bookid genreid
456 8
789 8
123 52
456 38
789 38
123 80
456 94
356 1
789 113
123 9
1914 1
936 1
1198 1
918 1
456 35
789 68
456 146
789 80
456 101
456 145
1618 2
844 3
~ 8011 rows ~
This table has one job only: to record which books are related to
which genres.
Each column must be a foreign key to the corresponding table; otherwise, the whole point
of the association is lost. This association allows a book to be associated with multiple
genres and a genre to be associated with multiple books.
In the preceding table, for example, book 123 has multiple genres. Book 456 also
has several genres. Some of those genres appear for both books and, for all we know, other
books later on. That is, one book can have many genres, and one genre can associate
with many books.
There is one more requirement. The combination should be unique. There is no
point in associating a book with the same genre more than once. Since there is no
other data in the table, it would be appropriate to make the combination a compound
primary key:
-- Associative Table
CREATE TABLE bookgenres (
bookid int REFERENCES books(id),
genreid int REFERENCES genres(id),
PRIMARY KEY (bookid,genreid)
);
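The count that follows comes from counting the associative table:

```sql
SELECT count(*) FROM bookgenres;
```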
Count
8011
SELECT *
FROM
bookgenres AS bg
JOIN books AS b ON bg.bookid=b.id
JOIN genres AS g ON bg.genreid=g.id
;
This gives a very long list, because the bookgenres table is very long:
~ 8011 rows ~
Here, we started the join from the middle, since we’re focusing on the associative
table. You could just as readily have started on one end:
SELECT *
FROM
books AS b
JOIN bookgenres AS bg ON b.id=bg.bookid
JOIN genres AS g ON bg.genreid=g.id
;
You’ll get the same data, but, since the tables are in a different order, the columns
will also be in a different order.
In reality, you end up with too many columns, two of which are duplicated by the
join. You can simplify the result as
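A sketch of the simplified query, selecting just the three useful columns:

```sql
SELECT b.id, b.title, g.genre
FROM
    bookgenres AS bg
    JOIN books AS b ON bg.bookid=b.id
    JOIN genres AS g ON bg.genreid=g.id;
```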
id title Genre
~ 8011 rows ~
The book’s id is important because it is quite possible for different books to have the
same title.
Since you’ve now listed the specific columns, the column order won’t depend on the
order of the tables in the join.
By its very nature, the associative table cannot include NULLs for either the bookid or
the genreid. As such, there is no need for an outer join.
WITH cte AS (
SELECT b.id, b.title, g.genre
FROM bookgenres AS bg
JOIN books AS b ON bg.bookid=b.id
JOIN genres AS g ON bg.genreid=g.id
)
--etc
;
A Common Table Expression (CTE) saves the results of a SELECT statement into a
virtual table, so you can use those results in the next stage. You’ll see more on CTEs in
Chapters 7 and 9.
You can now summarize the CTE using two functions, count() and string_agg():
WITH cte AS (
SELECT b.id, b.title, g.genre
FROM
bookgenres AS bg
JOIN books AS b ON bg.bookid=b.id
JOIN genres AS g ON bg.genreid=g.id
)
SELECT
id, title,
count(*) AS count,
string_agg(genre, ', ') AS genres
FROM cte
GROUP BY id, title;
~ 1200 rows ~
-- Not SQLite
SELECT
b.id, b.title, b.published, b.price,
g.genre,
a.givenname, a.othernames, a.familyname,
a.born, a.died, a.gender, a.home
FROM
authors AS a
RIGHT JOIN books AS b ON a.id=b.authorid
LEFT JOIN bookgenres AS bg ON b.id=bg.bookid
JOIN genres AS g ON bg.genreid=g.id
;
~ 8011 rows ~
In the preceding example, we have included most of the columns from the four
tables, omitting the foreign keys and most of the other primary keys.
The tables are joined in a line from the authors table to the genres table. Since we
want all of the books, regardless of whether they have associated authors or genres, we
use two outer joins. As it turns out, we see examples of each of the three main join types.
SQLite doesn’t support the RIGHT JOIN, so this won’t work.
You can write the joins starting from the books table if you like:
SELECT
b.id, b.title, b.published, b.price,
g.genre,
a.givenname, a.othernames, a.familyname,
a.born, a.died, a.gender, a.home
FROM
books AS b
LEFT JOIN bookgenres AS bg ON b.id=bg.bookid
JOIN genres AS g ON bg.genreid=g.id
LEFT JOIN authors AS a ON b.authorid=a.id
;
Visually, this appears to put the emphasis on the books table, but it will give exactly
the same results as before.
This time, SQLite will be happy.
However, there is an alternative, which takes advantage of the view previously
created.
First, you can replace the references to the individual books and authors tables with
the bookdetails view:
SELECT
bd.id, bd.title, bd.published, bd.price,
g.genre,
bd.givenname, bd.othernames, bd.familyname,
bd.born, bd.died, bd.gender, bd.home
FROM
bookdetails AS bd
LEFT JOIN bookgenres AS bg ON bd.id=bg.bookid
JOIN genres AS g ON bg.genreid=g.id;
If you want to combine the genre names, you can do that in a CTE and join the
results with the view:
WITH cte AS (
SELECT bg.bookid, string_agg(g.genre,', ') AS genres
FROM bookgenres AS bg JOIN genres AS g ON bg.genreid=g.id
GROUP BY bg.bookid
)
SELECT *
FROM bookdetails AS b JOIN cte ON b.id=cte.bookid;
~ 1200 rows ~
You might want to filter the genres. You can do that inside the CTE:
WITH cte AS (
SELECT bg.bookid, string_agg(g.genre,', ') AS genres
FROM bookgenres AS bg JOIN genres AS g ON bg.genreid=g.id
WHERE g.genre IN('Fantasy','Science Fiction')
GROUP BY bg.bookid
)
SELECT *
FROM bookdetails AS b JOIN cte ON b.id=cte.bookid;
589 The Story of ... ... Fantasy ... Stanley Waterloo ...
96 Bee: The Pri ... ... Fantasy ... Anatole France ...
880 The Journey ... ... Fantasy, Science F… Ludvig Holberg ...
591 The Story of ... ... Fantasy ... E. Nesbit ...
1938 Histoire Com ... ... Science Fiction ... Cyrano de Bergerac ...
128 The Year 3000… ... Science Fiction ... Paolo Mantegazza ...
~ 163 rows ~
Note that the concatenated genres column has been aliased to genres. That’s the
same name as the genres table, so you might get confused. The good news is that SQL
doesn’t, so you can get away with it. On the other hand, if you’re worried about that
you can always use double quotes: AS "genres". Of course, you can also choose a
better alias.
A word of warning, however. When you start using views inside queries, you will have
to consider some possible side effects:
• Since the view is not part of the original database design, it may lead
to some confusion with other users: views look like tables, but they
aren't stored with the rest of the tables.
• If you have too many views in a query, the DBMS optimizer may not
be able to work out the most efficient plan for running the query.
This can be because some views produce more than you need for the
next query, and the optimizer may not be able to work out what you
really want.
• If there are any changes to the view, they will, of course, affect the
outcome of the query.
• These side effects will be more pronounced if you start to create
views using other views. It is often safer to create the new view from
scratch.
This doesn’t mean that you shouldn’t use views in your queries—that’s the whole
point of creating a view. It does mean, however, that you should be careful when piling
them up.
Another Many-to-Many Example
The sales tables form another many-to-many relationship: one sale can include many
books, and one book can appear in many sales, with the saleitems table doing the
associative work. There is a difference, however:
• The association isn’t the whole story. You also want to record other
sales details such as the date, the amount paid, and so on.
In this case, the sales tables are not purely associative, since they contain new data of
their own, but they are still doing an associative job.
There are a few extra tables, not part of the main database, which we can use to make
the concept clear:
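A sketch of those tables; the names and columns are taken from the queries that follow, and the real script also populates them:

```sql
CREATE TABLE multibooks (
    id int PRIMARY KEY,
    title varchar
);
CREATE TABLE multiauthors (
    id int PRIMARY KEY,
    givenname varchar,
    familyname varchar
);
-- The associative table
CREATE TABLE authorship (
    bookid int REFERENCES multibooks(id),
    authorid int REFERENCES multiauthors(id),
    PRIMARY KEY (bookid, authorid)
);
```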
You won’t need to run the preceding example, as the data is already in the
sample tables.
You can fetch the associated data using something like the following:
SELECT *
FROM
multibooks AS b
JOIN authorship AS ba ON b.id=ba.bookid
JOIN multiauthors AS a ON ba.authorid=a.id;
~ 31 rows ~
You can also combine the authors for each book using a CTE and an aggregate query:
WITH cte AS (
SELECT
ba.bookid,
string_agg(a.givenname||' '||a.familyname,' & ')
AS authors
FROM authorship AS ba JOIN multiauthors AS a
ON ba.authorid=a.id
GROUP BY ba.bookid
)
SELECT b.id, b.title, cte.authors
FROM multibooks AS b JOIN cte ON b.id=cte.bookid
ORDER BY b.id;
id title Authors
~ 23 rows ~
The main sample database doesn’t include multiple authors simply because it
doesn’t happen often enough with classic literature to make it worth complicating the
sample further.
However, the point is that whenever you have multiple values, you will need
additional tables rather than additional columns or compound columns. Multiple values
should appear in rows, not columns.
Inserting into Related Tables
When inserting into related tables, you generally need to add the parent row first.
This is because the child table will refer to the parent table.
• When adding to the parent table, you need to remember the new primary key.
Adding a Book and an Author
1. Check whether the author is already in the authors table.
2. If not, add the new author.
3. If you have just added a new author, fetch its primary key.
4. Add the new book, using the author's primary key as the foreign key.
First, we’ll check to see whether the author has already been added to the
authors table:
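A sketch of that check, filtering on the author we're about to add:

```sql
SELECT id, givenname, familyname
FROM authors
WHERE givenname='Agatha' AND familyname='Christie';
```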
Adding an Author
In principle, you would add the new author with the following statement:
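A sketch of that statement, using the same values as the variations below:

```sql
-- Don't run this yet!
INSERT INTO authors(givenname, othernames, familyname,
    born, died, gender, home)
VALUES('Agatha','Mary Clarissa','Christie',
    '1890-09-15','1976-01-12','f',
    'Tourquay, Devon, England');
```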
As the comment says, don’t run this statement yet. Because the author’s id is
autogenerated, we’ll need to get the new id after inserting the row. You can do a search
for it after adding the row, but it may be possible to have the DBMS tell you what the
new id is.
Different DBMSs have different methods of getting this id.
For PostgreSQL, you can simply use a RETURNING clause at the end of the INSERT
statement:
-- PostgreSQL
INSERT INTO authors(givenname, othernames, familyname,
born, died, gender,home)
VALUES('Agatha','Mary Clarissa','Christie',
'1890-09-15','1976-01-12','f',
'Tourquay, Devon, England')
RETURNING id; -- Take note of this!
For MySQL/MariaDB, Microsoft SQL, and SQLite, you run a separate function after
the event. Note that you should run both statements together by highlighting both before
you run:
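For example, the MariaDB/MySQL version; for Microsoft SQL, the final SELECT would use scope_identity(), and for SQLite, last_insert_rowid():

```sql
-- MySQL / MariaDB
INSERT INTO authors(givenname, othernames, familyname,
    born, died, gender, home)
VALUES('Agatha','Mary Clarissa','Christie',
    '1890-09-15','1976-01-12','f',
    'Tourquay, Devon, England');
SELECT last_insert_id();
```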
The additional SELECT statements earlier all fetch the newly generated id.
Oracle, on the other hand, makes it pretty tricky. It does support a RETURNING clause,
but only into variables. You can get the newly generated id, but that involves some extra
trickery in hunting for sequences. The simplest method really is to select the row you’ve
just inserted using data you’ve just entered:
-- Oracle
INSERT INTO authors(givenname, othernames, familyname,
born, died, gender,home)
VALUES('Agatha','Mary Clarissa','Christie',
'1890-09-15','1976-01-12','f',
'Tourquay, Devon, England');
SELECT id FROM authors
WHERE familyname='Christie' AND born='1890-09-15';
Of course, you don’t necessarily need to filter all of the new values: just enough to be
sure you’ve got the right one.
Adding a Book
After that, the rest is easy.
Whether or not you have just added the new author, you can simply search the
authors table to get the author id:
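A sketch, again filtering on the author's name:

```sql
SELECT id FROM authors
WHERE givenname='Agatha' AND familyname='Christie';
```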
Taking note of the id in particular, you can insert the book with the following
statement:
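A sketch of the INSERT; the title is a placeholder, and the authorid value is whatever the previous step gave you:

```sql
-- Placeholder title; authorid comes from the previous step
INSERT INTO books(title, authorid, published, price)
VALUES('The Mysterious Affair at Styles', 789, 1920, 16.00);
```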
Of course, you will need to supply the correct id in the preceding statement, either
from the INSERT statements in the previous section or from the SELECT statement earlier.
Note that we’ve picked an arbitrary value of 16.00 for the price. It didn’t need the
decimal part, of course, but it makes the purpose clearer.
Adding a New Sale
Suppose a customer now places an order with the following details:
Data         Value
Customer ID  42
Book IDs     123, 456, 789
Quantities   3, 2, 1
-- PostgreSQL
INSERT INTO sales(customerid, ordered)
VALUES (42,current_timestamp)
RETURNING id;
-- MSSQL
INSERT INTO sales(customerid, ordered)
VALUES (42,current_timestamp);
SELECT scope_identity();
-- MySQL / MariaDB
INSERT INTO sales(customerid, ordered)
VALUES (42,current_timestamp);
SELECT last_insert_id();
-- SQLite
INSERT INTO sales(customerid, ordered)
VALUES (42,current_timestamp);
SELECT last_insert_rowid();
-- Oracle
INSERT INTO sales(customerid, ordered, total)
VALUES (42,current_timestamp,0);
SELECT id FROM sales WHERE customerid=42 AND total=0;
For Oracle, we’ve taken a slightly different approach in including a dummy total
of 0. When the sale is fully added, the value shouldn’t be zero, so we’re using it as a
temporary placeholder to help identify the new sale.
Remember that we’ll need to remember the new sale id.
Next, you can add the individual sale items, using the new sale id:
-- Not Oracle
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES
( ... , 123, 3),
( ... , 456, 2),
( ... , 789, 1);
-- Oracle
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES ( ... , 123, 3);
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES ( ... , 456, 2);
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES ( ... , 789, 1);
The saleitems table also records the price, which you can now copy from the
books table:
UPDATE saleitems
SET price=(SELECT price FROM books WHERE
books.id=saleitems.bookid)
WHERE saleid = ... ;
The correlated subquery fetches the price from the books table for the matching
book (WHERE books.id=saleitems.bookid).
The WHERE clause in the main query ensures that only the new sale items get the
prices. This is important because you don’t want to copy the prices into the old sale
items: there might have been a price change since the older sales were completed, and
you shouldn’t let that change affect old transactions.
You can now calculate the total for the sale:
SELECT sum(quantity*price)
FROM saleitems
WHERE saleid = ... ;
The result would be correct, but it would also be incomplete. What’s missing is the
tax and any VIP discount applicable.
Let’s assume a tax of 10%—it varies from country to country, of course, so you might
want to make an adjustment. That means you’ll end up paying (1 + 10%) times the total:
    sum(quantity*price) * (1 + 0.1)
In real life, of course, you would simply write 1.1, but the preceding expression
is a reminder of where the value came from and how you might adapt it for different
tax rates.
The VIP discount depends on the customer. You can read that from the VIP table:
    SELECT 1 - discount FROM vip WHERE id = 42;
The reason you subtract it from 1 is that it’s a discount: it comes off the full price.
You can use that in a subquery with the calculated total:
SELECT
sum(quantity*price)
* (1 + 0.1)
* (SELECT 1 - discount FROM vip WHERE id = 42)
FROM saleitems
WHERE saleid = ... ;
That should be the amount payable, except not necessarily: some customers aren't VIPs, so the subquery might return
a NULL. That would destroy the whole calculation. Since a missing VIP value means no
discount, we should coalesce the subquery to 1:
SELECT
sum(quantity*price)
* (1 + 0.1)
* coalesce((SELECT 1 - discount FROM vip
WHERE id = 42),1)
FROM saleitems
WHERE saleid = ... ;
You can then use the whole calculation to update the total in the sales table:
UPDATE sales
SET total = (
SELECT
sum(quantity*price)
* (1 + 0.1)
* coalesce((SELECT 1 - discount FROM vip
WHERE id = 42),1)
FROM saleitems
WHERE saleid = ...
)
WHERE id = ... ;
There’s a lot going on here. First, the UPDATE query sets a value to a subquery, which,
in turn, uses a subquery to fetch a value. You’ll also find that the query uses the sale id
twice, once to filter the sale items and once to select the sale.
Review
A main feature of SQL databases is that there are multiple tables and that these tables are
related to each other.
Relationships are generally established through primary keys and foreign keys which
reference the primary keys in related tables. The foreign key is normally in the form of a
constraint, which guarantees that the foreign key references a valid primary key value in
the other table, if not necessarily the correct one.
There may also be ad hoc relationships which are not planned or enforced.
Types of Relationships
There are three main relationship types: one-to-many, one-to-one, and
many-to-many.
In any reasonably sized database, the fact that there are many
tables in one-to-many relationships results in many-to-many
relationships.
It’s a basic principle in a database that a column shouldn’t have multiple values and
that you shouldn’t have multiple columns doing the same job. The way to handle multiple
values is with additional tables, either in one-to-many or many-to-many relationships.
Joining Tables
When there is an established relationship between tables, you can combine their
contents using joins.
Sometimes, you may want to count the number of expected results to check whether
your join type matches what you want.
When you do join tables, you often end up with several rows with the same repeated
data coming from the parent table. You may be able to simplify this by grouping on
parent data and aggregating on the child data. Because you can only select what you
summarize, you may need to join the results again to get more details.
Views
Selecting what you want from multiple related tables can be inconvenient. You can save
your complex joined query in a view for future use and use it as you might a simple table
afterward.
Summary
In this chapter, we looked at how multiple tables are related through foreign keys
matching with primary keys. We also looked at different types of relationships and why
tables were designed this way.
Using this, we were able to combine tables using one or more joins to match rows
from one table to another. We looked at different types of joins and when you might
choose between them.
Coming Up
Most of the data we’ve worked with have been simple values, though in a few cases we
calculated values such as tax and discounts.
In the next chapter, we’re going to take a small detour and concentrate on
performing calculations in SQL.
CHAPTER 4
Working with Calculated Data
DBMSs vary widely in their ability to perform calculations. This is especially the
case with functions, which vary not only in scope but even in what the DBMSs
call them.
SQLite, in particular, has a very limited ability to perform calculations, especially
with functions.
In this chapter, we’ll be working with various types of data, including strings. If
you’re using MariaDB/MySQL, we thoroughly recommend that you set your session
to ANSI mode, so that string behavior works as for standard SQL.
You can begin your session with SET SESSION sql_mode = 'ANSI';
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1_4
Chapter 4 Working with Calculated Data
Calculation Basics
We’ll look at more details later, but here is an overview of how calculations work in SQL.
You can calculate values based on individual columns or multiple columns. For
example:
SELECT
height/2.54, -- single column
givenname||' '||familyname -- multiple columns
-- givenname+' '+familyname -- MSSQL
FROM customers;
?column? ?column?
~ 303 rows ~
You can also use values that don't come directly from a column, such as hard-coded
literals and subqueries:
SELECT
'active', -- hard-coded
(SELECT name FROM towns WHERE id=townid) -- sub query
FROM customers;
You get
?column? ?column?
~ 303 rows ~
You can also use functions to calculate values:
SELECT
upper(familyname) -- upper case function
FROM customers;
?column?
KNOTT
SHAW
ANDY
DOWNE
ISK
DIVERS
~ 303 rows ~
In all cases, you’ll notice that a calculated value doesn’t have a proper name.
Using Aliases
Calculated columns cause a minor inconvenience for SQL. Generally, each column should
have a distinct name, but SQL has no clear idea what to call the newly generated column.
Some SQL clients will leave the calculated column unnamed, while some will
generate a dummy name. When experimenting with simple SELECT statements, this is
OK, but when taking the statement seriously, such as when you plan to use the results
later, you will need to give each column a better name.
An alias is a new name for a column, whether it’s a calculated column or an original.
You create an alias using the AS keyword. For example:
SELECT
id AS customer,
height/2.54 AS height,
givenname||' '||familyname AS fullname,
-- givenname+' '+familyname AS fullname -- MSSQL
'active' AS status,
(SELECT name FROM towns WHERE id=townid) AS town,
length(email) AS length
-- len(email) AS length -- MSSQL
FROM customers;
~ 303 rows ~
Apart from the fact that each calculated column must have a distinct name, other
reasons to include aliases are as follows:
At this point, we’re not worried about whether the preceding aliases are the best
possible names for their columns; we’re just looking at how they work.
Alias Names
By and large, the rules for alias names are the same as those for the names of columns.
That means
• Aliases should not contain spaces, can’t start with a number, and
can’t contain other special characters.
If you really need to work around these rules, you can
enclose the alias in double quotes. For example:
SELECT
ordered AS "order",
shipped AS "shipped date"
FROM sales;
Here, the name order is an SQL keyword, while shipped date contains a space.
~ 5549 rows ~
You should resist the urge to do this. Aliases, as with column names, are for technical
rather than aesthetic use. A SELECT statement is not actually a report.
Some DBMSs offer alternatives to double quotes for special names:
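For example, Microsoft SQL also accepts square brackets, while MySQL/MariaDB (outside ANSI mode) uses backticks:

```sql
-- MSSQL
SELECT shipped AS [shipped date] FROM sales;

-- MySQL / MariaDB (without ANSI mode)
SELECT shipped AS `shipped date` FROM sales;
```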
Whatever names you choose, remember that they are meant to be purely functional.
Don’t get carried away trying to use upper and lower case, or spaces, or anything else
that might look better. That’s up to the software handling the output of your queries. In
SQL, you just need a suitable name to refer to the data.
AS Is Optional
You will discover soon enough that AS is optional:
SELECT
id customer,
height/2.54 height,
givenname||' '||familyname fullname,
-- givenname+' '+familyname fullname -- MSSQL
'active' status,
(SELECT name FROM towns WHERE id=townid) town,
length(email) length
-- len(email) length -- MSSQL
FROM customers;
Some developers justify leaving out the AS as it saves time or makes them look more
professional. However, you will also make this kind of mistake soon enough:
SELECT
id,
email
givenname, familyname,
height,
dob
FROM customers;
~ 303 rows ~
At first glance, this looks OK, as it is not a technical error. However, on closer
inspection, you’ll see that the email has been aliased to familyname, since there is no
comma between them. Aliasing one column to another is legitimate, though it’s not
often that you would really want to.
You can’t stop SQL from allowing this, but you can make mistakes like this slightly
easier to spot if you develop a pattern which always includes AS for aliases.
SQL processes the clauses of a SELECT statement in the following order:
1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY
This is different to the way you write SQL in that you write the SELECT clause first.
This creates a major point of confusion in a statement like this:
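A sketch of such a statement; the filter value is an assumption:

```sql
SELECT
    id, title,
    price*0.1 AS tax
FROM books
WHERE tax>1;    -- Error: tax isn't available yet
```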
This will result in an error, since although the price*0.1 AS tax expression is
written in the first clause, it isn’t actually processed until after the WHERE clause. As a
result, tax is not yet available for the WHERE clause.
It becomes more confusing if you alias a calculation to an original column name:
SELECT
id, title,
price*1.1 AS price -- adjust to include tax
FROM books
WHERE price<15;
This will work. Here, the price has been increased to include tax and aliased to the
original name, which is legitimate.
id title price
~ 521 rows ~
However, the WHERE clause will filter on the original price column, not the adjusted
version.
Again, there’s not much you can do about this directly, as you don’t have the option
to write the SELECT clause further down, and you can’t create aliases in any other clause.
Later, we will see how using Common Table Expressions can help preprocess
calculated columns.
It’s probably not a good idea to alias a calculation to an original column name if
you’re planning to use it later.
SQL has a clear idea of what it’s going to do with the aliased name, but the human
reader may well get confused.
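The point is easy to demonstrate with Python's sqlite3 module and a made-up books table (the prices here are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (id INTEGER, title TEXT, price REAL)")
con.executemany("INSERT INTO books VALUES (?,?,?)",
                [(1, 'A', 10.0), (2, 'B', 14.0), (3, 'C', 20.0)])

# price is aliased to an adjusted calculation, but the WHERE clause
# still filters on the original column:
rows = con.execute("""
    SELECT id, round(price*1.1, 2) AS price
    FROM books
    WHERE price < 15
""").fetchall()
# Book 2 is included even though its adjusted "price" (15.4) exceeds 15
```

Only the original, unadjusted price decides which rows survive the filter.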
If you are calculating with a single column which includes NULL, it makes sense that
the result is also a NULL. For example:
SELECT
id, givenname, familyname,
height/2.54 AS height -- sometimes NULL
FROM customers;
~ 303 rows ~
Since all we’re doing is converting a single value, it is perfectly acceptable to leave
NULLs as they are—if you don’t know what the height is in centimeters, then you still
don’t know what it is in inches. However, we’ll see shortly how you might sometimes
replace the NULL with something you feel is better.
On the other hand, this behavior becomes more of a nuisance if you’re calculating
on multiple columns, most of which are not NULL:
SELECT
id, givenname, othernames, familyname,
givenname||' '||othernames||' '||familyname AS fullname
-- MSSQL:
-- givenname+' '+othernames+' '+familyname AS fullname
FROM authors;
~ 488 rows ~
In the preceding example, most authors don’t have a value for othernames, so it is
NULL. Some don’t even have a givenname value. There’s nothing wrong for the most part
with the givenname or the familyname, but the NULL for othernames destroys the whole
calculation.
With Oracle, however, you won’t get NULLs. However, you will see extra spaces where
the missing names are.
Technically, the result is correct. If you don’t know some of the names, then you
don’t know the full name. However, that’s unhelpful.
Coalesce
SQL has a function called coalesce() which can replace NULL with a preferred
alternative. The word “coalesce” actually means to combine, but how it came to be the
name of this operation is one of those mysteries lost in the depths of ancient history.
The function is used this way:
coalesce(expression,planB)
If planB also happens to be NULL, coalesce() will try the next alternative and so on
until either there is a real value, or the alternatives have been exhausted.
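The fall-through behavior is easy to try with sqlite3 in Python; the contacts table and phone numbers below are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacts (id INTEGER, mobile TEXT, phone TEXT)")
con.executemany("INSERT INTO contacts VALUES (?,?,?)", [
    (1, '0400 111 222', '9555 1234'),  # has both numbers
    (2, None, '9555 1234'),            # mobile missing
    (3, None, None),                   # both missing
])

# coalesce() returns the first non-NULL argument, left to right:
rows = con.execute(
    "SELECT coalesce(mobile, phone, 'main office') FROM contacts ORDER BY id"
).fetchall()
```

Each row falls through the arguments until something real turns up.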
For example, some employees are missing a phone number:
SELECT
id, givenname, familyname,
phone
FROM employees;
~ 34 rows ~
It would be reasonable to replace these missing phone numbers with the company’s
main phone number:
SELECT
id, givenname, familyname,
coalesce(phone,'1300975711') -- coalesce to main number
FROM employees;
~ 34 rows ~
The thing about coalesce() is that you can’t always get away with it. You need to be
sure that your substitute makes sense and that your guess is a good one. There are many
times when it wouldn’t make sense, such as a missing price for a book or an author’s
date of birth; NULL is often the best thing you can do.
In Chapter 2, you guessed at a missing quantity using coalesce() and then fixed it so
that the quantity can’t be NULL in the future. Sometimes, that’s the best solution.
• You’ll want to replace the missing names with an empty string.
• You’ll also want to leave out the spaces after the missing names.
For the second point, we won’t coalesce just the name, but the combination of
the name and the space, which should also be NULL—except for Oracle, which doesn’t
behave the same way. We’ll take a different approach for Oracle.
To coalesce the names and spaces to an empty string, we can use
-- MSSQL
SELECT
id, givenname, othernames, familyname,
coalesce(givenname+' ','')
+ coalesce(othernames+' ','')
+ familyname AS fullname
FROM authors;
This gives us
~ 488 rows ~
Since Oracle will happily concatenate a NULL string, we can’t use coalesce().
Instead, we’ll use the ltrim() function. This function removes leading spaces from a
string. Since we’re adding a space to the end of the string, it will only be a leading space if
the name is empty. This gives us
-- Oracle
SELECT
id, givenname, othernames, familyname,
ltrim(givenname||' ')||ltrim(othernames||' ')
||familyname AS fullname
FROM authors;
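SQLite follows the standard NULL propagation, so the coalesce approach described above works there too. Here is a sketch via Python's sqlite3 module, with invented author names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE authors
               (id INTEGER, givenname TEXT, othernames TEXT, familyname TEXT)""")
con.executemany("INSERT INTO authors VALUES (?,?,?,?)", [
    (1, 'Arthur', 'Conan', 'Doyle'),
    (2, 'Mark', None, 'Twain'),
    (3, None, None, 'Voltaire'),
])

# Coalesce each name *together with its trailing space* to an empty
# string, so a missing name contributes nothing -- not even a space:
rows = con.execute("""
    SELECT coalesce(givenname||' ','')
        || coalesce(othernames||' ','')
        || familyname AS fullname
    FROM authors ORDER BY id
""").fetchall()
```

No stray NULLs, and no double spaces where a name is missing.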
One obvious use for calculations is in the WHERE clause. For example, you can find
books with shorter titles:
SELECT *
FROM books
WHERE length(title)<24; -- MSSQL: len(title)
giving
~ 762 rows ~
You may need this if your database is case sensitive and you need to match a string in
an unknown case:
SELECT *
FROM books
WHERE lower(title) LIKE '%journey%';
giving
880 777 The Journey of Niels Klim to the Wor … 1741 12.50
946 704 Following the Equator: A Journey Aro … 1897 19.50
1314 606 Mozart’s Journey to Prague 1856 17.00
1092 295 A Journey to the Western Islands of … 1775 14.50
502 [NULL] Journey to the Center of the Earth 1864 15.50
1454 914 A Sentimental Journey 1768 13.50
You can also compare against the result of a subquery. For example, to find customers who are shorter than average:
SELECT *
FROM customers
WHERE height<(SELECT avg(height) FROM customers);
giving
~ 128 rows ~
You can also use calculations in the ORDER BY clause, such as when you want to sort
by the length of a title:
SELECT *
FROM books
ORDER BY length(title); -- MSSQL: len(title)
which gives
~ 1200 rows ~
However, you’re likely to want to select what you’re sorting by, so it would make
more sense to calculate the value in the SELECT clause and sort by the result:
SELECT
books.*,
length(title) AS len -- MSSQL: len(title)
FROM books
ORDER BY len;
~ 1200 rows ~
Here’s an interesting use for coalesce in the ORDER BY clause. Some DBMSs support
NULLS FIRST or NULLS LAST to decide where to put the NULLs in the sort order. If your
DBMS doesn’t support it, you can coalesce the column to an extreme value. For example:
SELECT *
FROM customers
ORDER BY coalesce(height,0); -- NULLS FIRST
SELECT *
FROM customers
ORDER BY coalesce(height,1000); -- NULLS LAST
By coalescing all of the NULLs to an extreme value, SQL will sort them to one end or
the other accordingly.
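Here is the trick in sqlite3 (SQLite happens to sort NULLs first by default, so the second query is where the coalesce really earns its keep); the heights are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, height REAL)")
con.executemany("INSERT INTO customers VALUES (?,?)",
                [(1, 170.0), (2, None), (3, 160.0)])

# Coalescing to 0 pushes NULLs to the top; to 1000, to the bottom:
nulls_first = [r[0] for r in con.execute(
    "SELECT id FROM customers ORDER BY coalesce(height, 0)")]
nulls_last = [r[0] for r in con.execute(
    "SELECT id FROM customers ORDER BY coalesce(height, 1000)")]
```

The coalesced value is only used for sorting; the selected height is untouched.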
As for the FROM clause, you’ll need a calculation which generates a virtual table.
That’s usually going to be a view, a join, or even a subquery. A Common Table
Expression, in this context, is like a subquery. We’ll do more of that sort of thing later.
SQL, like most coding languages, needs some help with certain literal values to
distinguish them from other code. Numeric literals are entered as they are (bare)
because they obviously can’t be anything else.
String or date literals, on the other hand, are wrapped in single quotes (' ... ') to
mark them as such. That’s so that SQL can distinguish between strings and other words,
such as SQL keywords or table and column names.
The actual value of a string or date literal doesn’t include the quotes. However, the
quotes are required when writing them into the code.
For much of what follows, we’ll be using literals for examples.
Casting
The cast() function is used to interpret a value as a different data type. Recall that SQL
has three main data types: numbers, strings, and dates. You can use cast to do one of
two things:
• You can cast within a main type. For example, you can cast between
integer and decimal numbers or between dates and datetimes.
• You can cast between main types. For example, you can cast a number
or a date to a string.
For what follows, remember that SQLite doesn’t have a date type, so that’s one cast
you won’t have to worry about. Later, we’ll have a quick look at the equivalent in SQLite.
Casting within types is generally straightforward, and mostly affects the precision or length of the value.
If you cast a string to a longer type, one of two things will happen. If you cast it to a CHAR (fixed-length) type, the extra length will be padded with spaces. If you cast it to a VARCHAR type, the string itself will be unchanged, but it will now be permitted to grow to the longer length.
Casting between types is a different matter. Most DBMSs will automatically cast to a
string if necessary. For example:
-- Not MSSQL
SELECT id || ': ' || email
FROM customers;
?column?
42: [email protected]
459: [email protected]
597: [email protected]
186: [email protected]
352: [email protected]
576: [email protected]
~ 303 rows ~
As you see, MSSQL won’t do this automatically, possibly due to a confusion with
their concatenation operator (+). There you’ll have to force the issue:
-- MSSQL
SELECT cast(id as varchar(5)) + ': ' + email
FROM customers;
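SQLite will happily coerce the number for you, but you can make the cast explicit in the same spirit as the MSSQL workaround. A minimal sketch, with an invented id and email:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Explicitly cast the number to text before concatenating:
row = con.execute(
    "SELECT cast(42 AS text) || ': ' || 'may@example.net'"
).fetchone()
```

The explicit cast also documents your intention, even where the DBMS doesn't insist on it.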
You can do the same with dates, too. We’ll do that with the customers’ dates of birth, but we’ll run into the complication that some dates of birth are missing. Using coalesce should do the job:
-- Not MSSQL, Oracle
SELECT
id || ': ' || email
|| coalesce(' Born: ' || dob, '')
FROM customers;
For SQLite, it wasn’t much effort as we’ve stored the dates as a string anyway.
Here, we’ve coalesced the entire concatenated value ' Born: ' || dob. That’s
because we want to replace the whole expression with the empty string if the dob is
missing. Concatenating with a NULL should result in a NULL.
For Oracle, you again run into the quirk of NULL strings being treated as empty strings, so they won’t coalesce. We can work around it using CASE:
-- Oracle
SELECT
id || ': ' || email
|| CASE
WHEN dob IS NOT NULL THEN ' Born: ' || dob
END
FROM customers;
Basically, you can think of coalesce as a simplified CASE expression. With Oracle,
you need to spell it out more.
One reason you might want to change data types is to mix them with other values,
such as concatenating the preceding strings. We’ll also see casting being used when
we want to combine data from multiple tables or virtual tables, such as with joins
and unions.
Another reason to change data types is for sorting. String data normally sorts alphabetically, so strings which represent numbers or dates need to be cast to other types to sort correctly. For example:
-- Integers
SELECT * FROM sorting
ORDER BY numberstring;
SELECT * FROM sorting
ORDER BY cast(numberstring as int); -- not MySQL
-- ORDER BY cast(numberstring as signed); -- MySQL
In the sorting table, there are some values stored as strings which represent
numbers or dates. The only way to sort them properly is to cast them first.
Note that MySQL won’t let you cast to an integer directly. You have to use SIGNED
(which means the same thing) or UNSIGNED. MariaDB is OK with integers.
Not all casts from strings are successful, since the string may not resemble the
correct type. For example:
-- This works:
SELECT cast('23' as int) -- MySQL: as signed
-- FROM dual -- Oracle
;
-- This doesn’t:
SELECT cast('hello' as int) -- MySQL: as signed
-- FROM dual -- Oracle
;
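SQLite is the lenient outlier here: rather than raising an error, it casts a non-numeric string to 0. A quick check through Python's sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite never raises an error on a failed cast:
# a string that doesn't look like a number becomes 0
row = con.execute(
    "SELECT cast('23' AS int), cast('hello' AS int)"
).fetchone()
```

That silence can hide bad data, so be especially careful with string-to-number casts in SQLite.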
Numeric Calculations
A number is normally used to count something—it’s the answer to the question “how
many.” For example, how many centimeters in the customer’s height, or how many
dollars were paid for this item?
Numbers aren’t always used that way. Sometimes, they’re used as tokens or as codes.
The calculations you might perform on a number would depend on how the number is
being used.
Basic Arithmetic
You can always perform the basic operations on numbers:
SELECT
3*5 AS multiplication,
4+7 AS addition,
8-11 AS subtraction,
20/3 AS division,
20%3 AS remainder, -- Oracle: mod(20,3),
24/3*5 AS associativity,
1+2*3 AS precedence,
2*(3+4) + 5*(8-5) AS distributive
-- FROM dual -- Oracle
;
15 11 -3 6 2 40 7 29
Note that you’ll need to add FROM dual if you’re testing this in Oracle.
Also note that the usual rules of arithmetic apply: multiplication and division are evaluated before addition and subtraction, and parenthesized expressions are evaluated first.
Of course, these expressions work just the same whether the value is a literal or some
stored or calculated value.
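You can reproduce the whole row of results above through Python's sqlite3 module; note in particular that 20/3 gives 6, because both operands are integers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
row = con.execute("""
    SELECT 3*5, 4+7, 8-11, 20/3, 20%3, 24/3*5, 1+2*3, 2*(3+4) + 5*(8-5)
""").fetchone()
# 20/3 is integer division here; write 20/3.0 for a decimal result
```

Integer division is a common source of silently wrong results, so make at least one operand a decimal when you want fractions.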
Mathematical Functions
There are some mathematical functions as well. For the most part, the mathematical
functions won’t get a lot of use unless you’re doing something fairly specialized.
SELECT
pi() AS pi, -- Not Oracle
sin(radians(45)) AS sin45, -- Not Oracle
sqrt(2) AS root2, -- √2
log10(3) AS log3,
ln(10) AS ln10, -- Natural Logarithm
power(4,3) AS four_cubed -- 4³
-- FROM dual -- Oracle
;
So, now you can use SQL to find the length of a ladder leaning against a wall or the
distance between two ships lost at sea.
Approximation Functions
There are also functions which give an approximate value of a decimal number. Here is a
sample with variations between DBMSs:
SELECT
ceiling(200/7.0) AS ceiling,
-- SQLite: round(200/7.0 + 0.5),
-- Oracle: ceil(200/7.0),
floor(200/7.0) AS floor,
-- SQLite: round(200/7.0 - 0.5),
round(200/7.0,0) AS rounded_integer,
-- or round(200/7), -- not MSSQL
round(200/7.0,2) AS rounded_decimal
-- FROM dual -- Oracle
;
29 28 29 28.57
If you use the cast() function to convert to a narrower number type, you’ll also lose precision. However, exactly what happens depends on the DBMS:
SELECT
cast(234.567 AS int) AS castint,
-- cast(234.567 AS unsigned), -- MySQL
cast(234.567 AS decimal(5,2)) AS castdec
-- FROM dual -- Oracle
;
• With MSSQL, casting to a shorter decimal will round off the number,
but casting to an integer will truncate it. If you want the integer
truncated, you can use something like decimal(3,0).
• With SQLite, casting to an integer will truncate, while casting to a
decimal is ignored and retains the original value.
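The SQLite behavior is easy to confirm from Python: the integer cast truncates, while the decimal(5,2) is treated as mere NUMERIC affinity and the value passes through unchanged.

```python
import sqlite3

con = sqlite3.connect(":memory:")
row = con.execute(
    "SELECT cast(234.567 AS int), cast(234.567 AS decimal(5,2))"
).fetchone()
# int cast truncates; decimal(5,2) leaves the value as-is in SQLite
```

If you actually want two decimal places in SQLite, use round() instead of a cast.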
Formatting Numbers
Formatting functions change the appearance of a number. Unlike approximation and
other functions, the result of a formatting function is not a number but is a string; that’s
the only way you can change the way a number appears.
For numbers, most of what you want to do is change the number of decimal places,
display the thousands separator, and possibly currency symbols.
Again, the different DBMSs have wildly different functions. As an example, here are
some ways of formatting a number as currency with thousands separators:
-- PostgreSQL, Oracle
SELECT
to_char(total,'FM999G999G999D00') AS local_number,
to_char(total,'FML999G999G999D00') AS local_currency
FROM sales;
SELECT to_char(total,'FM$999,999,999.00') FROM sales;
-- MariaDB/MySQL
SELECT
format(total,2) AS local_number,
format(total,2,'de_DE') AS specific_number
FROM sales;
-- MSSQL
SELECT
format(total,'n') AS local_number,
format(total,'c') AS local_currency
FROM sales;
-- SQLite
SELECT printf('$%,d.%02d',total,round(total*100)%100)
FROM sales;
local_number local_currency
28.00 $28.00
34.00 $34.00
58.50 $58.50
50.00 $50.00
17.50 $17.50
13.00 $13.00
~ 5549 rows ~
Note that if you run a number through a formatting function, the result is no longer a number! If all you do is look at it, that doesn’t matter. However, if you plan to do any further calculations, or to sort the results, a formatted number is likely to backfire on you.
When all is said and done, formatting is probably something you won’t do much
in SQL. The main purpose of SQL is to get the data and prepare it for the next step.
Formatting comes last and is often done in other software.
String Calculations
A string is a string of characters, hence the name. In SQL, this is referred to as
character data.
Traditionally, SQL has two main data types for strings:
• CHAR(n): a fixed-length string
• VARCHAR(n): a variable-length string, up to a maximum length
In principle, CHAR() is more efficient for processing, since it’s always the same length, and the DBMS doesn’t need to work out the size of each value. VARCHAR() is supposed to be more efficient for storage.
In reality, modern DBMSs are much cleverer than their ancestors, and the difference
between the two types is not very important anymore. For example, PostgreSQL
recommends always using VARCHAR since it actually handles that type more efficiently.
Most DBMSs offer a third type, TEXT, which is, in principle, unlimited in length.
Again, modern DBMSs allow longer standard strings than they used to, so again this is
not so important. Microsoft has deprecated TEXT in favor of VARCHAR(MAX) which does
the same job.
A string literal is written between single quotes, as in 'hello'.
When working with strings, you normally simply want to save them and fetch them. However, you can also process the strings themselves, using the operations which follow.
Case Sensitivity
SQL will store the upper/lower case characters as expected, but you may have a hard
time searching for them. That’s because some databases ignore case, while others don’t.
How a database handles case is a question of collation. Collation refers to how it
interprets variations of letters. In English, the only variation to worry about is upper or
lower case, but other languages may have more variations, such as accented letters in
French or German.
Collation will have an impact on how strings are sorted and how they compare.
In English, you’re mainly worried about whether upper case strings match lower case
strings and possibly whether upper and lower case strings are sorted together or sorted
separately. In some other languages, the same questions might apply to whether
accented and nonaccented characters match and how they, too, are sorted.
You can set a collation when you create the database or a table, but if you don’t worry
about it, the DBMS will have a default collation for new databases.
In PostgreSQL, Oracle, and SQLite, the default collation is case sensitive, so upper
and lower case won’t match. With MySQL/MariaDB and MSSQL, the default collation is
case insensitive, so they will match.
If you’re not sure whether your particular database is case sensitive, you can try a simple test: compare a lower case string with its upper case equivalent.
If the database is case sensitive, you won’t get any rows, since a won’t match A; if it’s
not, you will get the whole table.
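Here is such a test run against SQLite through Python. Note one wrinkle: SQLite's = operator is case sensitive, but its LIKE operator is case insensitive for ASCII letters by default.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# 1 means the comparison matched, 0 means it didn't:
row = con.execute(
    "SELECT 'a' = 'A', upper('a') = 'A', 'a' LIKE 'A'"
).fetchone()
```

So in SQLite, which operator you use matters as much as the collation.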
Some DBMSs support NCHAR and NVARCHAR data types in addition to CHAR and
VARCHAR. If the database tables are set to use Unicode, then CHAR and VARCHAR
will do the job. Otherwise, you might use NCHAR and NVARCHAR to specify Unicode
on particular columns.
Concatenation
Concatenation means joining strings together. This is the simplest string operation and
the only one which can be done without a function.
The concatenation operator is usually ||. Microsoft SQL Server uses + instead. For
example:
SELECT
id,
givenname||' '||familyname AS fullname
-- givenname+' '+familyname AS fullname -- MSSQL
FROM customers;
Id fullname
42 May Knott
459 Rick Shaw
597 Ike Andy
186 Pat Downe
352 Basil Isk
576 Pearl Divers
~ 303 rows ~
Note that MySQL in traditional mode doesn’t support the concatenation operator in
any form. In ANSI mode, it supports the standard || operator.
Many DBMSs also support a non-standard function concat(string,string,...).
For example:
-- Not SQLite
SELECT
id,
concat(givenname,' ',familyname) AS fullname
FROM customers;
String Functions
Other operations with strings require functions. Here are some examples.
For the following examples, we’ve included SELECT * for context—except that in
Oracle you need to write SELECT table.* if you’re mixing it with other data, so
we’ve done that with all of the examples which include Oracle.
The length of a string is the number of characters in the string. To find the length, you can use length(string), or len(string) in MSSQL.
To replace part of a string, you can use the replace() function:
-- replace(original,search,replace)
SELECT books.*, replace(title,' ','-') AS hyphens
FROM books;
To change the case of a string:
SELECT books.*,
upper(title) AS upper,
lower(title) AS lower
FROM books;
-- PostgreSQL, Oracle
SELECT books.*, initcap(title) AS initcap FROM books;
To remove extra spaces at the beginning or the end of a string, you can use trim() to
remove from both ends, or ltrim() or rtrim() to remove from the beginning or end of
the string:
WITH vars AS (
SELECT ' abcdefghijklmnop ' AS string
-- FROM dual -- Oracle
)
SELECT
string,
ltrim(string) AS ltrim,
rtrim(string) AS rtrim,
trim(string) AS trim,
ltrim(rtrim(string)) AS same
FROM vars;
All modern DBMSs support trim(), but MSSQL didn’t until version 2017.
PostgreSQL also calls it btrim(). You may not notice when the spaces on the right are
trimmed.
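SQLite supports all three trimming functions; here's a quick look through Python's sqlite3 module, using an invented padded string (the repr makes the trimmed right-hand spaces visible):

```python
import sqlite3

con = sqlite3.connect(":memory:")
s = '  abcdefghijklmnop  '
row = con.execute(
    "SELECT ltrim(?), rtrim(?), trim(?)", (s, s, s)
).fetchone()
# ltrim keeps the trailing spaces, rtrim keeps the leading ones,
# and trim removes both
```

Comparing the reprs of the three results makes the otherwise invisible right-hand trim obvious.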
You can get a substring with substring() or substr(), depending on your DBMS:
WITH vars AS (
SELECT 'abcdefghijklmnop' AS string
-- FROM dual -- Oracle
)
SELECT
-- PostgreSQL, MariaDB/MySQL, Oracle, SQLite
substr(string,3,5) AS substr,
-- PostgreSQL, MariaDB/MySQL, MSSQL, SQLite
substring('abcdefghijklmnop',3,5) AS substring
FROM vars;
Some DBMSs include specialized functions to get the first or last part of a string. In
some cases, you can use a negative start to get the last part of a string:
WITH vars AS (
SELECT 'abcdefghijklmnop' AS string
-- FROM dual -- Oracle
)
SELECT
-- Left
-- PostgreSQL, MariaDB/MySQL, MSSQL:
left(string,4) AS lstring,
-- All DBMSs including SQLite and Oracle:
-- substr(string,1,4) AS lstring,
-- Right
-- PostgreSQL, MariaDB/MySQL, MSSQL:
right(string,4) AS rstring
-- MariaDB/MySQL, Oracle, SQLite:
-- substr(string,-4) AS rstring
FROM vars;
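In SQLite, substr() covers all three cases, including the negative start for the end of the string:

```python
import sqlite3

con = sqlite3.connect(":memory:")
row = con.execute("""
    SELECT substr('abcdefghijklmnop', 3, 5),
           substr('abcdefghijklmnop', 1, 4),
           substr('abcdefghijklmnop', -4)
""").fetchone()
# Middle five characters, first four, and last four respectively
```

A negative start counts back from the end, which saves computing the length yourself.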
Just note that if you spend a lot of time extracting substrings from your data, it’s
possible that you’re trying to store too much in a single value.
On the other hand, you can often use substrings to reformat raw data into something
more friendly.
Date Operations
From an SQL point of view, dates are problematic. That’s because, despite their
overwhelming presence in daily life, measuring dates is a mess.
One problem is that we measure dates using a number of incompatible cycles all
at the same time: the day, week, month, and year. To make things worse, we all live in
different time zones, so we can’t even agree on what time it is.
Most DBMSs have a number of related data types to manage dates, specifically date, which is for dates without times, and datetime, which includes the time. Generally, you can expect variations on these types, as well as the ability to include time zones.
The exception is SQLite, which expects you to use numbers or strings and run the
values through a few functions to do the date arithmetic.
There are a number of things you would expect to do with dates and times:
1. Enter and store a date/time
5. Add to a date/time
7. Format a date/time
SQLite has a completely different approach to working with dates. That’s partly
because it doesn’t actually support dates. As a result, SQLite will be missing from
much of the following discussion. The Appendix has some information on handling
dates in SQLite.
The normal way to enter a date or datetime literal is to use one of the following:
• date: '2013-02-15'
• datetime: '2013-02-15 09:20:00'
You can also omit the seconds or include decimal parts of a second.
The format is a variation of the ISO8601 format. In pure ISO8601 format, the time
would be written after a T instead of a space.
Note that with Oracle, datetime literals generally use a different format. To use the preceding formats, prefix the literal with date or timestamp, respectively:
SELECT *
FROM customers
WHERE dob<'1980-01-01'; -- Oracle dob<date '1980-01-01';
~ 133 rows ~
Note that in simple expressions like dob<'1980-01-01', SQL doesn’t get confused
about whether the expression is a date or a string: the context makes it clear.
SELECT
current_timestamp AS now,
current_date AS today -- Not MSSQL
-- FROM dual -- Oracle
;
As noted earlier, MSSQL doesn’t have a version of current_date. In any case, you
may have an existing datetime which you want to simplify to a date. The simplest way is
to cast the datetime:
-- Not Oracle
SELECT
current_timestamp AS now,
cast(current_timestamp as date) AS today
-- FROM dual -- Oracle
;
This won’t quite work with Oracle; it will let you do the cast all right, but it doesn’t
change anything. Instead, you should use the trunc() function:
-- Oracle
SELECT
current_timestamp AS now,
trunc(current_timestamp) AS today
FROM dual -- Oracle
;
This will still have a time component, but it’s set to 00:00.
For example, suppose you want to summarize the sales for each day:
SELECT *
FROM sales
ORDER BY ordered;
WITH cte AS (
SELECT
cast(ordered as date) AS ordered, total -- Not Oracle
-- trunc(ordered) AS ordered, total -- Oracle
FROM sales
)
SELECT ordered, sum(total)
FROM cte
GROUP BY ordered
ORDER BY ordered;
ordered sum
2022-05-04 43.00
2022-05-05 150.50
2022-05-06 110.50
2022-05-07 142.00
2022-05-08 214.50
2022-05-09 16.50
~ 389 rows ~
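In SQLite, the date() function strips the time part, so the same daily summary looks like this; the sales values are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (ordered TEXT, total REAL)")
con.executemany("INSERT INTO sales VALUES (?,?)", [
    ('2022-05-04 09:30:00', 28.0),
    ('2022-05-04 15:10:00', 15.0),
    ('2022-05-05 11:00:00', 150.5),
])

# date() discards the time, so the two morning/afternoon sales
# on 2022-05-04 fall into the same group:
rows = con.execute("""
    SELECT date(ordered) AS ordered, sum(total)
    FROM sales
    GROUP BY date(ordered)
    ORDER BY ordered
""").fetchall()
```

Grouping by the truncated date is what turns per-sale rows into per-day totals.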
To extract a part of a date or time, most DBMSs support the EXTRACT() function:
WITH chelyabinsk AS (
SELECT
timestamp '2013-02-15 09:20:00' AS datetime
-- FROM dual -- Oracle
)
SELECT
datetime,
EXTRACT(year FROM datetime) AS year,
EXTRACT(month FROM datetime) AS month,
EXTRACT(day FROM datetime) AS day,
-- not Oracle or MariaDB/MySQL:
EXTRACT(dow FROM datetime) AS weekday,
EXTRACT(hour FROM datetime) AS hour,
EXTRACT(minute FROM datetime) AS minute
FROM chelyabinsk;
Note that Oracle and MariaDB/MySQL don’t have a direct way of extracting the day
of the week, which can be a problem if, say, you want to use it for grouping. However, as
you will see later, you can use a formatting function to get the day of the week, as well as
the preceding values.
PostgreSQL also includes a function called date_part('part',datetime) as an
alternative to the preceding function.
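SQLite has no EXTRACT at all; there, the equivalent is strftime(), which returns each part as a string. A sketch using the same Chelyabinsk datetime:

```python
import sqlite3

con = sqlite3.connect(":memory:")
row = con.execute("""
    SELECT strftime('%Y', '2013-02-15 09:20:00'),
           strftime('%m', '2013-02-15 09:20:00'),
           strftime('%d', '2013-02-15 09:20:00'),
           strftime('%w', '2013-02-15 09:20:00'),  -- day of week, 0 = Sunday
           strftime('%H', '2013-02-15 09:20:00')
""").fetchone()
# Every part comes back as a string, zero-padded where appropriate
```

Cast the results if you need actual numbers for arithmetic or grouping.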
For MSSQL, you can use the datepart() and datename() functions instead:
WITH chelyabinsk AS (
SELECT cast('2013-02-15 09:20' as datetime) AS datetime
)
SELECT
datepart(year, datetime) AS year, -- aka year()
datename(year, datetime) AS yearstring,
datepart(month, datetime) AS month, -- aka month()
datename(month, datetime) AS monthname,
datepart(day, datetime) AS day, -- aka day()
datename(weekday, datetime) AS weekday
FROM chelyabinsk;
Formatting a Date
As with numbers, formatting a date generates a string.
For both PostgreSQL and Oracle, you can use the to_char function. Here are two
useful formats:
-- PostgreSQL
WITH vars AS (SELECT timestamp '1969-07-20 20:17:40' AS moonshot)
SELECT
moonshot,
to_char(moonshot,'FMDay, DDth FMMonth YYYY') AS fulldate,
to_char(moonshot,'Dy DD Mon YYYY') AS shortdate
FROM vars;
-- Oracle
WITH vars AS (
SELECT timestamp '1969-07-20 20:17:40' AS moonshot FROM dual
)
SELECT
moonshot,
to_char(moonshot,'FMDay, ddth Month YYYY') AS fulldate,
to_char(moonshot,'Dy DD Mon YYYY') AS shortdate
FROM vars;
You’ll notice that there is a slight difference in the format codes between PostgreSQL
and Oracle.
For MariaDB/MySQL, there is the date_format() function:
For Microsoft SQL, the format() function can also be used for dates:
SQLite has very limited formatting functionality, and you certainly can’t get month
or weekday names without some additional trickery. It’s usually better to leave the date
alone and let the host application do what is needed.
You can learn more about the format codes at
• PostgreSQL: www.postgresql.org/docs/current/functions-
formatting.html#FUNCTIONS-FORMATTING-DATETIME-TABLE
• Oracle: https://docs.oracle.com/en/database/oracle/oracle-
database/21/sqlrf/Format-Models.html
• MariaDB: https://mariadb.com/kb/en/date_format/
• MySQL: https://dev.mysql.com/doc/refman/8.0/en/date-and-
time-functions.html
Date Arithmetic
Generally, the two things you want to do with dates are add an interval to get another date and find the interval between two dates.
-- PostgreSQL
SELECT
date '2015-10-31' + interval '4 months' AS afterthen,
current_timestamp + interval '4 months' AS afternow,
current_timestamp + interval '4' month -- also OK
;
-- Oracle
SELECT
add_months('31 Oct 2015',4) AS afterthen,
current_timestamp + interval '4' month AS afternow,
add_months(current_timestamp,4) -- also OK
FROM dual;
-- MariaDB/MySQL
SELECT
date_add('2015-10-31',interval 4 month) AS afterthen,
date_add(current_timestamp,interval 4 month)
AS afternow,
current_timestamp + interval '4' month -- also OK
;
You’ll notice that PostgreSQL and Oracle use the addition operator, while MariaDB/
MySQL uses a special function. Oracle also has a special function to add months.
For Microsoft SQL, you use dateadd, specifying the units and number of units:
-- MSSQL
SELECT
dateadd(month,4,'2015-10-31') AS afterthen,
dateadd(month,4,current_timestamp) AS afternow
;
SQLite uses the strftime() function to convert from a string, together with
modifiers to adjust the date:
-- SQLite
SELECT
strftime('%Y-%m-%d','2015-10-31','+4 month')
AS afterthen,
strftime('%Y-%m-%d','now','+4 month') AS afternow
;
The other thing you’ll want to do is calculate the difference between two dates. Here
again, every DBMS does it differently. For example, to find the age of your customers,
you can use
-- PostgreSQL
SELECT
dob,
age(dob) AS interval,
date_part('year',age(dob)) AS years,
extract(year from age(dob)) AS samething
FROM customers;
-- MariaDB/MySQL
SELECT
dob,
timestampdiff(year,dob,current_timestamp) AS age
FROM customers;
-- Oracle
SELECT
dob,
trunc(months_between(current_timestamp,dob)/12)
AS age
FROM customers;
-- SQLite
SELECT
dob,
cast(
strftime('%Y.%m%d', 'now')
- strftime('%Y.%m%d', dob)
as int) AS age
FROM customers;
For PostgreSQL, you’ll get the following results. The other DBMSs won’t have the
age column:
[NULL] [NULL] 0 0
1945-07-03 77 years 10 mons 29 days 77 77
1998-08-09 24 years 9 mons 23 days 24 24
1990-04-12 33 years 1 mon 19 days 33 33
1960-01-13 63 years 4 mons 19 days 63 63
[NULL] [NULL] 0 0
~ 303 rows ~
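The SQLite trick above deserves a closer look: %Y.%m%d turns a date into a number like 1990.0412, and subtracting two such numbers, then truncating with cast, gives whole years. Here it is with fixed dates instead of 'now', so the result is predictable:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# An asking date after the birthday, and one before it:
ages = con.execute("""
    SELECT cast(strftime('%Y.%m%d', '2023-06-01')
              - strftime('%Y.%m%d', '1990-04-12') AS int),
           cast(strftime('%Y.%m%d', '2023-01-01')
              - strftime('%Y.%m%d', '1990-04-12') AS int)
""").fetchone()
# 33 once the birthday has passed, 32 before it
```

The fractional part encodes the month and day, so the truncation correctly accounts for whether the birthday has occurred yet.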
Of the preceding calculations, MSSQL has a simple function, datediff(), which is too simple. All it does is calculate the difference between the years, which is way out if the date of birth is at the end of the year but the asking date is at the beginning of the year. Getting a more correct result takes a lot more work.
To derive a value from one or more conditions, you can use a CASE expression:
SELECT
id,title,
CASE
WHEN price<13 THEN 'cheap'
WHEN price<=17 THEN 'reasonable'
WHEN price>17 THEN 'expensive'
-- ELSE NULL
END AS price
FROM books;
id Title price
~ 1200 rows ~
Note that if all conditions fail, the result will be NULL, as suggested by the commented-out ELSE above. If you want an alternative to NULL, use an ELSE expression:
SELECT
id,title,
CASE
WHEN price<13 THEN 'cheap'
WHEN price<=17 THEN 'reasonable'
WHEN price>17 THEN 'expensive'
ELSE ''
END AS price
FROM books;
Also, note that the CASE expression is short-circuited: once it finds a match, it stops
evaluating.
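Here is the price-banding CASE run through Python's sqlite3 module; the prices are invented, and the ELSE 'unpriced' label is my own addition to show what happens to a NULL price:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (id INTEGER, price REAL)")
con.executemany("INSERT INTO books VALUES (?,?)",
                [(1, 12.0), (2, 15.0), (3, 20.0), (4, None)])

# Conditions are tested top to bottom; the first match wins.
# A NULL price fails every comparison and falls through to ELSE:
rows = con.execute("""
    SELECT id,
        CASE
            WHEN price<13 THEN 'cheap'
            WHEN price<=17 THEN 'reasonable'
            WHEN price>17 THEN 'expensive'
            ELSE 'unpriced'
        END AS price
    FROM books ORDER BY id
""").fetchall()
```

The short-circuiting is why a price of 12 is 'cheap' and never reaches the 'reasonable' test, even though 12<=17 is also true.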
There is also a simple form of the CASE expression, which compares a single value against alternatives:
SELECT
c.id,
givenname||' '||familyname AS name,
-- givenname+' '+familyname AS name, -- MSSQL
CASE status
WHEN 1 THEN 'Gold'
WHEN 2 THEN 'Silver'
WHEN 3 THEN 'Bronze'
END AS status
FROM customers AS c LEFT JOIN VIP ON c.id=vip.id;
-- Oracle:
-- FROM customers c LEFT JOIN VIP ON c.id=vip.id;
id Name status
~ 303 rows ~
This form isn’t much shorter, but it makes the intention clear.
You can also use the IN expression:
SELECT
id, givenname, familyname,
CASE
WHEN state IN('QLD','NSW','VIC','TAS') THEN 'East'
WHEN state IN ('NT','SA') THEN 'Central'
ELSE 'Elsewhere'
END AS region
FROM customerdetails;
~ 303 rows ~
You can also use CASE as a more verbose alternative to coalesce():
SELECT
id, givenname, familyname,
coalesce(phone,'-') AS coalesced,
CASE
WHEN phone IS NOT NULL THEN phone
ELSE '-'
END AS cased
FROM customers;
~ 303 rows ~
It’s not necessarily a convenient alternative, of course, but it helps to appreciate the
overlapping use of the two. It’s particularly useful with Oracle, where you can happily
concatenate a NULL without ending up with a NULL, so it’s hard to coalesce otherwise.
• If the sale has been shipped:
  • If it took more than 14 days to ship, its status is Shipped Late
  • Else Shipped
• If the sale has not yet been shipped:
  • If it was ordered within the last 7 days, its status is Current
  • If it was ordered within the last 14 days, its status is Due
  • Else Overdue
Before we get going, however, note that some sales have no ordered value:
That might be, for example, if the customer never checked out the order. We
probably should get rid of them, but, for now, we’ll just filter them out with WHERE ordered IS NOT NULL.
The first thing you’ll have to do is to calculate the difference between dates. This
varies between DBMSs:
~ 5295 rows ~
Note that with SQLite, the simplest way to get an age is to convert dates to a Julian
date, which is the number of days since Noon, 24 November 4714 BC. Long story.
You know by now that you can’t use the calculated values in other parts of the
SELECT clause, so that’s awkward if you need them. You can, however, do the query in
two steps.
If you put the preceding query in a Common Table Expression, you can then use the
results in the main query.
First, you need to distinguish between those which have been shipped and those
which haven’t:
WITH salesdata AS (
-- one of the above queries WITHOUT the semicolon
)
SELECT
salesdata.*,
CASE
WHEN shipped IS NOT NULL THEN
-- One of two statuses
ELSE
-- One of three statuses
END AS status
FROM salesdata;
WITH salesdata AS (
-- one of the above queries WITHOUT the semicolon
)
SELECT
salesdata.*,
CASE
WHEN shipped IS NOT NULL THEN
CASE
WHEN shipped_age>14 THEN 'Shipped Late'
ELSE 'Shipped'
END
ELSE
CASE
WHEN ordered_age<7 THEN 'Current'
WHEN ordered_age<14 THEN 'Due'
ELSE 'Overdue'
END
END AS status
FROM salesdata;
~ 5295 rows ~
Summary
Data in an SQL table should be stored in its purest, simplest form. However, this data can
be recalculated to increase its usefulness.
Calculations can take a number of forms:
• Results of a subquery
Aliases
All calculated values should be renamed with an alias. The word AS is optional, but is
recommended to reduce confusion.
You can also alias noncalculated columns if the new name makes more sense.
Aliases are given in the SELECT clause, which is evaluated last before ORDER BY. For
most DBMSs, this means that you can’t use the alias in any other clause but the
ORDER BY.
NULLs
A table may, of course, include NULLs in various places. As a rule, a NULL will wipe out any
calculation, leaving NULL in its wake.
You can bypass NULLs with the coalesce() function which replaces NULL with an
alternative value. You might also use a CASE ... END expression.
Casting Types
SQL works with three main data types:
• Numbers
• Strings
• Dates and times
You may need to change the data type. This is done with the cast() function.
When you cast within a major type, the effect is to change the precision or size of
the type.
When you cast between major types, it is usually for compatibility. While casting to
a string is usually possible and often automatic, casting from a string may not always
succeed. Different DBMSs have different reactions to an unsuccessful cast.
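A sketch of both kinds of cast, using the books table's price column (Oracle would use its own varchar2 and number types):

```sql
SELECT
    cast(price AS decimal(6,2)) AS exact_price,  -- within a type: precision
    cast(price AS varchar(10)) AS price_string,  -- number to string
    cast('42' AS int) AS answer                  -- string to number: may fail
FROM books;
```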
• Mathematical functions
• Approximation functions
There are also formatting functions which generate a formatted result as a string.
• Formatting a date
Coming Up
Now that we’ve worked with table data, we can start looking at analyzing it.
The next chapter will look at summarizing data with aggregate functions and
grouping. We’ll cover how data is aggregated in SQL, the basic aggregate functions, and
summarizing into one or more groups.
We’ll also look at combining aggregates at various levels, as well as some basic
statistics on numerical data.
CHAPTER 5
Aggregating Data
Databases store data. That’s obvious, but the data itself is pretty inert—you save it, you
retrieve it, and you sometimes change it. That’s OK for some things, but sometimes you
want the data to work a little harder.
You can put the data to work when you start to summarize it. You can then see
trends, see where it’s going, or just get an overview of the data.
Aggregate functions are used to calculate summaries of data. They have three
contexts:
You’ll learn about window functions in Chapter 8. In this chapter, we look at how to
calculate summaries, either wholly or in groups, using SQL’s built-in aggregate functions.
• count
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1_5
Chapter 5 Aggregating Data
There are various other aggregate functions, depending on the DBMS, but the
preceding ones are fairly typical.
For example:
-- Book Data
SELECT
-- Count Rows:
count(*) AS nbooks,
-- Count Values in a column:
count(price) AS prices,
-- Cheapest & Most Expensive
min(price) AS cheapest, max(price) AS priciest
FROM books;
nbooks prices cheapest priciest
1201 1096 10 20
-- Customer Data
SELECT
-- Count Rows:
count(*) AS ncustomers,
-- Count Values in a column:
count(phone) AS phones,
-- Height Statistics
stddev_samp(height) AS sd -- MSSQL: stdev(height)
FROM customers;
ncustomers phones sd
All of these functions are applicable to numbers, but only the following may be used
for other data, such as strings and dates:
• count
For example:
SELECT
-- Count Values in a column:
count(dob) AS dobs,
-- Earliest & Latest
min(dob) AS earliest, max(dob) AS latest
FROM customers;
gives you
• Any table with a WHERE clause will be filtered before the aggregates are
applied.
NULL
Aggregate functions do not include NULLs. The only time this is not obvious is when
using the sum() function. However, it is significant to note that the sum of no values,
or of nothing but NULLs, is NULL, not 0.
To put it another way, there is a world of difference between NULL on one hand and 0
or '' on the other.
We’ll take advantage of this fact when we look at aggregate filters later.
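For example, because count() and avg() both skip NULLs, the following two "averages" can differ; a sketch using the customers table:

```sql
SELECT
    count(*) AS nrows,                     -- all rows
    count(height) AS heights,              -- only non-NULL heights
    avg(height) AS mean,                   -- sum(height)/count(height)
    sum(height)/count(*) AS not_the_mean   -- divides by ALL rows
FROM customers;
```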
Understanding Aggregates
Using aggregates sometimes runs into a few problems and seems to have a few quirky
rules. It all makes more sense if you understand how aggregates really work.
When you aggregate data, the original data is effectively transformed into a new
virtual table, with summaries for one or more groups.
For example, the query
SELECT
count(*) AS rows,
count(phone) AS phones
FROM customers;
can be regarded as
SELECT
count(*) AS rows,
count(phone) AS phones
FROM customers
GROUP BY () -- PostgreSQL, MSSQL, Oracle only
;
Note that the clause GROUP BY () doesn’t work for all DBMSs, such as MariaDB/
MySQL or SQLite. That doesn’t matter, since the grouping is happening anyway.
The thing is, with or without the GROUP BY () clause, SQL will generate the virtual
summary table as soon as it finds an aggregate function in the query.
In the preceding example, the data is summarized into a single virtual summary
table of one row. In turn, this virtual table has grand totals for every column as in
Figure 5-1.
This is why you can’t include individual row data with an aggregate query. For
example, this won’t work:
SELECT
id, -- oops
count(*) AS rows,
count(phone) AS phones
FROM customers;
You’ll get an error message basically telling you that you can’t use the id in
the query.
Note that in MariaDB/MySQL in traditional mode, you can indeed run this statement
successfully. However, the DBMS will pick the first id it can find, and that really
has no meaningful value. It’s mainly useful if you can be sure that all of the
non-aggregate values are the same.
When you include a more meaningful GROUP BY clause, the result is similar,
except that
For example:
SELECT
town, state, -- grouping columns
count(phone) AS phones, -- summaries for each group:
min(dob) AS oldest
FROM customerdetails
GROUP BY town, state;
~ 92 rows ~
(You may get a group of NULLs either at the beginning or the end, because we haven’t
filtered out the NULL addresses.)
In the overall scheme of things, the (virtual) GROUP BY clause appears after the FROM
and possibly WHERE clauses and is evaluated at that point:
SELECT ...
FROM ...
WHERE ...
GROUP BY ...
-- SELECT
ORDER BY ...
As usual, SELECT is evaluated last before ORDER BY, even though it is written first, as
in Figure 5-3.
SQL neither knows nor cares about the actual meaning of the data, so there are no
checks over whether you should apply these aggregate functions to particular columns.
Distinct Values
Most aggregate functions can be applied to distinct values, but it is probably statistically
invalid. However, it can be meaningful if you count distinct values, such as in the
following example:
SELECT
count(state) AS addresses,
count(DISTINCT state) AS states
FROM customerdetails;
This will count how many distinct states are in the customer details. That’s not to say
that you can’t count the state column anyway, as it indicates the number of rows which
have any address information at all:
addresses states
278 8
Be careful, though. It’s possible that the column doesn’t give the whole picture. For
example, if you try
you’d get a result, but it might be open to misinterpretation. What you’re getting is
distinct town names, but many of these town names appear in more than one state. You
shouldn’t interpret this as meaning distinct towns.
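To count distinct towns properly, you can make the state part of the counted value. A sketch (MSSQL would use + for the concatenation):

```sql
SELECT
    count(DISTINCT town) AS town_names,            -- distinct names only
    count(DISTINCT state || ':' || town) AS towns  -- distinct state+town pairs
FROM customerdetails;
```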
As for the other aggregate functions, generally, it is meaningless to apply any other
statistical calculation to only one of each sample.
Aggregate Filter
Normally, aggregate functions apply to the whole table or to the whole group. For
example, count(*) will count all the rows in the table or group.
A relatively new feature allows you to apply an aggregate function to some of the
rows. This can be applied multiple times in the query.
For example, the following will count all the customers in the customers table:
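That is a plain, unfiltered count:

```sql
SELECT count(*) AS ncustomers
FROM customers;
```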
Suppose you want to separate the customers into the younger and older customers.
You might instinctively try something like this:
-- PostgreSQL:
SELECT
count(*) FILTER (WHERE dob<'1980-01-01') AS older,
count(*) FILTER (WHERE dob>='1980-01-01') AS younger
FROM customers;
older younger
133 106
SELECT
count(CASE WHEN dob<'1980-01-01' THEN 1 END) AS old,
count(CASE WHEN dob>='1980-01-01' THEN 1 END) AS young
FROM customers;
This uses the CASE expression to separate the dob values. They will either be 1 or
NULL, and the count() function counts only the 1s.
You can also use this technique with other aggregate functions. For example:
-- New Standard
SELECT
sum(total),
sum(total) FILTER (WHERE ordered <'...') AS older,
sum(total) FILTER (WHERE ordered>='...') AS newer
FROM sales;
-- Alternative
SELECT
sum(total),
sum(CASE WHEN ordered<'...' THEN total END) AS older,
sum(CASE WHEN ordered>='...' THEN total END) AS newer
FROM sales;
Here, the value is either total or NULL, and sum() politely ignores the NULLs.
If you’re interested in filtering for different categories, however, you might get more
of what you want with grouping.
You can also group by a derived value. For example, you can group your customers
by their month of birth:
-- PostgreSQL, Oracle
SELECT EXTRACT(month FROM dob) as monthnumber,
count(*) AS howmany
FROM customerdetails
GROUP BY EXTRACT(month FROM dob)
ORDER BY monthnumber;
-- MSSQL
SELECT month(dob) AS monthnumber, count(*) AS howmany
FROM customerdetails
GROUP BY month(dob)
ORDER BY monthnumber;
-- MySQL / MariaDB
SELECT month(dob) AS monthnumber, count(*) AS howmany
FROM customerdetails
GROUP BY month(dob)
ORDER BY monthnumber;
-- SQLite
SELECT strftime('%m',dob) as monthnumber,
count(*) AS howmany
FROM customerdetails
GROUP BY strftime('%m',dob)
ORDER BY monthnumber;
In this example, the month number is called monthnumber, which is also used to sort
the results.
monthnumber howmany
1 19
2 14
3 17
4 23
5 24
6 15
7 27
8 18
9 18
10 24
11 17
12 23
[NULL] 64
Note that the calculation appears twice, once in the SELECT clause and once in the
GROUP BY clause. This is because the SELECT is evaluated after GROUP BY, so, alas, its alias
is not yet available to GROUP BY.
This is not a real problem, as the SQL optimizer will happily reuse the calculation, so
it’s not really doing it twice.
Unfortunately, the month number isn’t very friendly, so we could use the month name.
However, inconveniently, the month name is in the wrong sort order, so we will need both:
-- Not SQLite
-- PostgreSQL, Oracle
SELECT EXTRACT(month FROM dob) as monthnumber,
to_char(dob,'Month') AS monthname,
count(*) AS howmany
FROM customerdetails
GROUP BY EXTRACT(month FROM dob), to_char(dob,'Month')
ORDER BY monthnumber;
-- MSSQL
SELECT month(dob) AS monthnumber,
datename(month,dob) AS monthname, count(*) AS howmany
FROM customerdetails
GROUP BY month(dob), datename(month,dob)
ORDER BY monthnumber;
-- MySQL / MariaDB
SELECT month(dob) AS monthnumber,
monthname(dob) AS monthname, count(*) AS howmany
FROM customerdetails
GROUP BY month(dob), monthname(dob)
ORDER BY monthnumber;
1 January 19
2 February 14
3 March 17
4 April 23
5 May 24
6 June 15
7 July 27
8 August 18
9 September 18
10 October 24
11 November 17
12 December 23
[NULL] [NULL] 64
As you see, you can’t quite do this in SQLite since it doesn’t have a function to get the
month name.
Technically, grouping by both is redundant, since there is only one month name per
month. However, we need both so that we can display one, but order by the other.
Although repeating the calculations is not a problem, it does make the query less readable
and harder to maintain. We can take advantage of using a Common Table Expression:
WITH cte AS (
...
)
SELECT monthname, count(*)
FROM cte
GROUP BY monthnumber, monthname
ORDER BY monthnumber;
You can use GROUP BY with any calculated field, but note that
• the calculation may not give you the groupings you really want, such as ranges of values;
• the calculation will need to be repeated in the SELECT clause if you want to display it.
The first point can be addressed with CASE expressions. The second point can be
alleviated with the use of Common Table Expressions.
CASE
WHEN dob<'1980-01-01' THEN 'older'
WHEN dob IS NOT NULL then 'younger'
-- ELSE NULL
END
Remember that some dobs may be NULL, so you need to filter them to get the younger
ones. Remember, too, that the default ELSE is NULL, so we don’t need to include it.
To count them, we could include this in the GROUP BY clause as follows:
SELECT count(*)
FROM customers
GROUP BY CASE
WHEN dob<'1980-01-01' THEN 'older'
WHEN dob IS NOT NULL then 'younger'
END;
Count
64
133
106
but it’s useless without some sort of labels. We can do this by repeating the calculation in
the SELECT clause:
SELECT
CASE
WHEN dob<'1980-01-01' THEN 'older'
WHEN dob IS NOT NULL THEN 'younger'
END AS agegroup,
count(*)
FROM customers
GROUP BY CASE
WHEN dob<'1980-01-01' THEN 'older'
WHEN dob IS NOT NULL THEN 'younger'
END;
agegroup count
[NULL] 64
older 133
younger 106
but from the point of view of coding, it’s worse than the calculated columns in the
previous section, so this would definitely benefit from the use of a Common Table
Expression:
WITH cte AS (
SELECT
*,
CASE
WHEN dob<'1980-01-01' THEN 'older'
WHEN dob IS NOT NULL then 'younger'
END AS agegroup FROM customers
)
SELECT agegroup,count(*)
FROM cte
GROUP BY agegroup;
WITH salesdata AS (
-- PostgreSQL, MariaDB / MySQL, Oracle
SELECT
ordered, shipped, total,
current_date - cast(ordered as date) AS ordered_age,
shipped - cast(ordered as date) AS shipped_age
FROM sales
-- MSSQL
SELECT
ordered, shipped, total,
datediff(day,ordered,current_timestamp)
AS ordered_age,
datediff(day,ordered,shipped) AS shipped_age
FROM sales
-- SQLite
SELECT
ordered, shipped, total,
julianday('now')-julianday(ordered) AS ordered_age,
julianday(shipped)-julianday(ordered) AS shipped_age
FROM sales
)
SELECT
ordered, shipped, total,
CASE
WHEN shipped IS NOT NULL THEN
CASE
WHEN shipped_age>14 THEN 'Shipped Late'
ELSE 'Shipped'
END
ELSE
CASE
WHEN ordered_age<7 THEN 'Current'
WHEN ordered_age<14 THEN 'Due'
ELSE 'Overdue'
END
END AS status
FROM salesdata;
~ 5549 rows ~
If you want to summarize this into status groups, you can again put the whole
statement into a CTE and then summarize the CTE. You already have one CTE to
precalculate the age, so we’ll need another to hold the preceding results:
WITH
salesdata AS (
-- as above
),
statuses AS (
SELECT
ordered, shipped, total,
CASE
WHEN shipped IS NOT NULL THEN
CASE
WHEN shipped_age>14 THEN 'Shipped Late'
ELSE 'Shipped'
END
ELSE
CASE
WHEN ordered_age<7 THEN 'Current'
WHEN ordered_age<14 THEN 'Due'
ELSE 'Overdue'
END
END AS status
FROM salesdata
)
SELECT status, count(*) AS number
FROM statuses
GROUP BY status;
Status Number
Due 94
Current 78
Shipped 3808
Overdue 1273
Shipped Late 296
• You can include a number at the beginning of each string and then
use ORDER BY. That’s cheating and won’t look right.
• You can have another table with the status values and a position
number and then join this table to the main query. That’s
complicated, but may be useful in some cases.
• You can duplicate the CASE expression with numbers instead of the
strings and ORDER BY that column instead. Unfortunately, there’s no way
to get two columns out of a single CASE expression. That’s really messy.
We’ll take a different approach here, since it’s easy to implement and doesn’t
otherwise affect the results.
Most DBMSs include a function to find a substring in a larger string. It has various
names and forms:
-- Postgresql
POSITION(substring IN string)
-- MariaDB / MySQL & SQLite
INSTR(string,substring)
-- Oracle
INSTR(string,substring)
-- MSSQL
CHARINDEX(substring,string)
In this case, we can find the position of the status string inside a longer string with
the status values in order:
'Shipped,Shipped Late,Current,Due,Overdue'
The commas aren’t necessary, but they make the string more readable. What’s more
important is that the status strings are in your preferred order, and the position function
will return a lower value for strings it finds earlier. The rest is up to the ORDER BY clause.
We can order the preceding query using the positioning function like this:
WITH
salesdata AS (
-- as above
),
statuses AS (
-- as above
)
SELECT status, count(*) AS number
FROM statuses
GROUP BY status
-- Postgresql
ORDER BY POSITION(status IN
'Shipped,Shipped Late,Current,Due,Overdue')
-- MariaDB / MySQL & SQLite
ORDER BY INSTR(
'Shipped,Shipped Late,Current,Due,Overdue', status)
-- Oracle
ORDER BY INSTR(
'Shipped,Shipped Late,Current,Due,Overdue', status)
-- MSSQL
ORDER BY CHARINDEX(status,
'Shipped,Shipped Late,Current,Due,Overdue')
;
Status number
Shipped 3808
Shipped Late 296
Current 78
Due 94
Overdue 1273
You can use this technique for any nonalphabetical string order, such as days of the
week or colors in the rainbow.
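For example, day names could be sorted the same way. A sketch in PostgreSQL syntax; the visits table and dayname column here are hypothetical:

```sql
SELECT dayname, count(*) AS number
FROM visits
GROUP BY dayname
ORDER BY POSITION(dayname IN
    'Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday');
```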
Group Concatenation
There is an additional function which can be used to aggregate string data. This function
will concatenate strings with an optional delimiter.
DBMS Function
PostgreSQL, MSSQL string_agg(expression, delimiter)
SQLite group_concat(expression, delimiter)
MariaDB / MySQL group_concat(expression SEPARATOR delimiter)
Oracle listagg(expression, delimiter)
For example, you can get a list of all the books for each author this way:
SELECT
a.id, a.givenname, a.familyname,
-- PostgreSQL, MSSQL
string_agg(b.title, '; ') AS works
-- SQLite
-- group_concat(b.title, '; ') AS works
-- Oracle
-- listagg(b.title, '; ') AS works
-- MariaDB / MySQL
-- group_concat(b.title SEPARATOR '; ') AS works
FROM authors AS a LEFT JOIN books AS b ON a.id=b.authorid
GROUP BY a.id, a.givenname, a.familyname;
You’ll get something like this:
146 Washington Irving Rip Van Wink ...; Tales of the ...; The ...
963 Richard Marsh The Beetle ...
390 Jean Racine Andromaque ...; Britannicus ...; Bérénice ...
~ 488 rows ~
The works column has all of the book titles concatenated with a ; between them.
Note that the GROUP BY clause uses the author id but includes the redundant author
names to allow them to be selected.
Be careful, though. It’s easy to get carried away with this function, and you’ll see that
the list of books can be very long, and the concatenated string can be very, very long.
Normally, you think of the word total as adding values and subtotal as a total of
a subgroup. This would imply using the sum() function. In this discussion, we’ll
use the terminology more loosely and use the word for any aggregates, such as
count(). Here, a subtotal would imply counting a subgroup.
In the preceding example, there are four possible totals that you could get:
• The count() of each town group. In this example, it’s not so useful,
since some town names are duplicated across states, so you’d be
combining values which shouldn’t be. However, in other examples,
this would be useful.
Apart from the last one, the others would all be considered subtotals at some level.
When we work with the example shortly, we’ll aggregate by three columns, and
there’ll be eight combinations, so eight totals and subtotals we can calculate.
Modern SQL allows you to generate a result set which is a combination of totals and
subtotals of table data and aggregate data. Depending on the DBMS, this might include a
modification of the GROUP BY clause:
• CUBE is also a specialized version of GROUPING SETS which produces all of the
possible subtotals. In the preceding example, it’s all four of the possible totals.
Here, we’ll have a look at generating such a summary. However, rather than work
with customers’ addresses, we’ll have a look at sales data.
• The customer id
Note that all three columns are independent of each other, unlike the state and town
in the original example. That means totaling any combination is meaningful.
To prepare the data, we can use the following query:
SELECT
-- PostgreSQL, Oracle
to_char(s.ordered,'YYYY-MM') AS ordered,
-- MariaDB / MySQL
-- date_format(s.ordered,'%Y-%m') AS ordered,
-- MSSQL
-- format(s.ordered,'yyyy-MM') AS ordered,
-- SQLite
-- strftime('%Y-%m',s.ordered) AS ordered,
s.total, c.id, c.state
FROM sales AS s JOIN customerdetails AS c
ON s.customerid=c.id
WHERE s.ordered IS NOT NULL;
2022-05 28 28 NSW
2022-05 34 27 NSW
2022-05 58.5 1 WA
2022-05 50 26 VIC
2022-05 17.5 26 VIC
2022-07 15 105 VIC
~ 5295 rows ~
When working with this, you could use this in a CTE, but it’s not quite convenient, so
we’ll save it as a view instead:
If you’re using Microsoft SQL, remember to surround your CREATE VIEW statement
with a pair of GO batch separators:
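A sketch of the MSSQL version, using the salesdata name the later queries rely on; other DBMSs drop the GOs and use their own date-format expression from the query above:

```sql
GO
CREATE VIEW salesdata AS
SELECT
    format(s.ordered,'yyyy-MM') AS ordered,
    s.total, c.id, c.state
FROM sales AS s JOIN customerdetails AS c
    ON s.customerid=c.id
WHERE s.ordered IS NOT NULL;
GO
```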
To begin with, we’ll generate the summaries separately and combine them with a
UNION clause.
~ 1802 rows ~
The next step is to generate summaries for the state and customer ids:
-- state summaries
SELECT
~ 266 rows
and
Don’t worry about the missing column names, as we’ll get them from the UNION.
The reason to include all those NULLs is to line up the columns when you combine
them in a UNION.
-- grand total
SELECT
NULL, NULL, NULL, count(*) AS nsales,
sum(total) AS total
FROM salesdata
-- GROUP BY ()
;
? ? ? nsales total
Note that this includes the commented out GROUP BY () clause, just as a reminder
that this is a grand total; of course, you don’t need it.
The UNION clause can be used to combine the results of multiple SELECT
statements. The only requirement is that they match in the number and types of
columns.
-- grand total
UNION
SELECT NULL, NULL, NULL, count(*), sum(total)
FROM salesdata
-- Sort
ORDER BY state,id,ordered;
~ 2077 rows
Note that only the first query has aliases for the number of sales and the total; in a
UNION, the column names for the first query apply to the whole result. You can alias the
rest if it makes you feel better, but it won’t make any difference.
When combining different levels of summaries, the higher-level summaries will have
NULL instead of actual values. This is correct, but inconvenient:
• When sorted, NULL may appear at the beginning or the end of the list.
The SQL standard is ambivalent on this, and different DBMSs have
different opinions, while some give you a choice.
To resolve the sorting problem, we can add a contrived value to force a sorting order:
FROM salesdata
-- Sort
ORDER BY state_level, state, id_level, id,
ordered_level, ordered;
To get the results in the right order, we have introduced the extra values state_level,
id_level, and ordered_level, so that we can push the totals below the other values.
~ 2077 rows ~
To eliminate the sorting columns from the result set, you can turn this into a
Common Table Expression:
WITH cte AS (
-- UNION query above
)
SELECT state, id, ordered, nsales, total
FROM cte
ORDER BY state_level,state,id_level,id,ordered_level,ordered;
This isn’t so much work to get the results, but there may be a simpler method.
SELECT columns
FROM table
GROUP BY GROUPING SETS ((set),(set));
Recall that the previous example combined SELECT statements grouped by state,
customer id, and ordered date, plus a grand total. This can be generated as follows:
SELECT state, id, ordered, count(*) AS nsales, sum(total) AS total
FROM salesdata
GROUP BY GROUPING SETS ((state,id,ordered),(state,id),(state),());
The CUBE variation works best when you don’t have too many grouping columns and
when they’re all unrelated to each other. Remember, three columns would give you eight
possible combinations. You can calculate the number of possibilities as 2^n, where n is the
number of columns: in this case, 2^3 = 8. With four columns, you would have
16 possible totals and subtotals, which might start to get overwhelming.
Both forms will give you the same result. Note that MSSQL gives you the choice to
use either form.
ROLLUP makes an important assumption that the columns form some sort of
hierarchy. In the case of the customer state and the customer id, that’s obvious. Whether
you consider the ordered date as the end of the hierarchy is up to you.
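A ROLLUP version of the earlier summary might look like this (a sketch against the salesdata view):

```sql
SELECT state, id, ordered,
    count(*) AS nsales, sum(total) AS total
FROM salesdata
GROUP BY ROLLUP (state, id, ordered);
```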
You can see the hierarchy in the results and in the fact that this matches the GROUPING
SETS example earlier. You will get results for
1. (state, id, ordered) values
2. (state, id) values
3. (state) values
4. () – grand totals
Clearly, using ROLLUP is a much simpler way to get these results, and you probably
won’t miss the flexibility of GROUPING SETS very much.
Note that while MySQL does support grouping(), MariaDB doesn’t support the
grouping() function!
In PostgreSQL and MySQL, you can use grouping() with multiple columns. This will
give you a combined level which is a binary combination of the 1s and 0s. In MSSQL and
Oracle, you would use the grouping_id() function for that.
To solve the first problem, that of meaningless NULL markers, we have to be more
creative with the SELECT clause. In this case, we can use coalesce to pick up the NULL and
supply an alternative value:
-- PostgreSQL, MSSQL;
SELECT
coalesce(state,'National Total') AS state,
coalesce(cast(id as varchar),state||' Total') AS id,
coalesce(ordered,'Total for '||cast(id as varchar))
AS ordered,
count(*), sum(total)
FROM salesdata
GROUP BY ROLLUP (state,id,ordered)
ORDER BY grouping(state), state,
grouping(id), id, grouping(ordered), ordered;
-- NOT Oracle
This will give you something meaningful for the summary rows.
SELECT
coalesce(state,'National Total') AS state,
grouping(state) AS statelevel,
CASE
WHEN state IS NULL THEN NULL
WHEN id IS NULL THEN 'Total for '||state
ELSE cast(id AS varchar(3))
END AS id,
grouping(id) AS idlevel,
CASE
WHEN id IS NULL THEN NULL
WHEN ordered IS NULL THEN
'Total for '||cast(id as varchar(3))
ELSE ordered
END AS ordered,
grouping(ordered) AS orderedlevel,
count(*) AS count, sum(total) AS sum
FROM salesdata
GROUP BY ROLLUP (state,id,ordered)
ORDER BY statelevel, state, idlevel, id, orderedlevel, ordered
;
~ 2077 rows ~
Here, the grouping() function is used in the SELECT clause and then used for sorting.
The id and ordered columns are calculated with a CASE ... END expression to get
around the problem of the NULL strings.
Of course, now you have those three extra columns used for sorting. To hide them,
you can use a CTE:
WITH cte AS (
-- SELECT statement as above
-- don't bother with the ORDER BY clause
)
Incidentally, the previous statement included an ORDER BY clause. You can include
it in the CTE (you can’t in MSSQL), but it’s unnecessary as we’re sorting it anyway in the
main query, so you should leave it out.
• The median is the middle of the values (if they’re all placed in order).
You can use the frequency table to generate a histogram, which is what spreadsheet
programs call a bar chart. For example, you can generate a frequency table and
histogram of the number of customers per height (in centimetres). It looks like
Figure 5-4.
If your values are based on a number of different factors, there is a tendency for
them to be distributed along the well-known “bell curve.” Most values occur around the
middle, and the further you are from the middle, the fewer times the value occurs. This is
more technically referred to as the normal distribution.
Your height is dependent on a number of factors, some of which include genetics,
diet, and other lifestyle factors. As a result, customer heights tend to follow the normal
distribution as you see very roughly in the figure above. Of course, the tendency
becomes stronger if we have a larger collection of data: if you have only a few hundred
samples, then the data won’t be such a tight fit.
For the purpose of this discussion, we’ll focus on customer heights, as they tend to be
easy to analyze this way.
Although the sample data was randomized, it was generated to follow the normal
distribution as well as might be expected in the small sample.
For adults in Australia, the mean height is about 168.7 cm. Actually, there are
two mean heights, one for female and one for male adults, but between them the
average is 168.7 cm. The standard deviation is 7 cm. You can get more information
at https://en.wikipedia.org/wiki/Average_human_height_by_
country.
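The mean below comes from a simple avg() aggregate; a sketch:

```sql
SELECT avg(height) AS mean
FROM customers;
```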
mean
170.844
If you want, you can calculate the mean using sum(height)/count(height), but
there’s no point other than to show they’re the same.
This figure is fairly reasonable. The mean height for adults is about 168.7 cm.
Here, the heights are measured to 0.1 of a centimeter. We’ll prepare it by rounding it
off to whole centimeters:
Height
169
171
153
176
156
176
~ 267 rows ~
We could possibly have used a round() function to do the rounding off, but some
DBMSs prefer to round to the nearest even number, so the preceding method is more
reliable.
Putting the data into a CTE, we can then use a simple GROUP BY query:
WITH heights AS (
SELECT floor(height+0.5) AS height
FROM customers
WHERE height IS NOT NULL
)
SELECT height, count(*) AS frequency
FROM heights
GROUP BY height
ORDER BY height;
Height frequency
153 1
154 3
156 1
157 3
158 1
159 2
~ 36 rows ~
Note that there may be some missing values. That’s natural, especially in a relatively
small sample such as we have. However, with these gaps it’s not quite ready for a
histogram. Later, when we have a closer look at recursive common table expressions,
we’ll see how to fill in the gaps.
WITH
heights AS (
SELECT floor(height+0.5) AS height
FROM customers
WHERE height IS NOT NULL
), -- don't forget to add a comma here
frequency_table AS (
SELECT height, count(*) AS frequency
FROM heights
GROUP BY height
)
...
WITH
heights AS (
...
),
frequency_table AS (
...
), -- don't forget to add a comma here
limits AS (
SELECT max(frequency) AS max FROM frequency_table
)
...
Finally, you can cross join the frequency table to the limits CTE to find the mode(s):
WITH
heights AS (
...
),
frequency_table AS (
...
),
limits AS (
...
)
SELECT height, frequency
FROM frequency_table,limits
WHERE frequency_table.frequency=limits.max
ORDER BY height;
Height frequency
172 22
In a perfect set of normal data, the mode should match the mean exactly. In real life,
it should be close.
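The median below can be fetched with percentile_cont() as an ordered-set aggregate, where supported (PostgreSQL and Oracle); a sketch:

```sql
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY height)
    AS median
FROM customers;
```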
percentile_cont
171.2
What’s the alternative to an aggregate function? It’s one of those window functions
which we’ll be looking at later. A window function is like an aggregate function, except
that it’s calculated for every row, not just as a summary.
With the window function version, we can use
The problem is that you’ll get the same value for multiple rows. To finish the job, you
can use DISTINCT:
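In MSSQL, for example, percentile_cont() is available only as a window function, so every row gets the same median and DISTINCT collapses them into one; a sketch:

```sql
SELECT DISTINCT
    percentile_cont(0.5) WITHIN GROUP (ORDER BY height)
        OVER () AS median
FROM customers;
```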
You might wonder why there are two. If you know that you have all the values, you
could use stddev_pop(height) (or stdevp(height) for MSSQL). However, we don’t, so
we can regard what we do have as a sample. For that, we use stddev_samp(height) (or
stdev(height) for MSSQL):
SELECT
stddev_pop(height) AS sd
-- stdevp(height) AS sd -- MSSQL
FROM customers;
sd
6.979
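For comparison, the sample version is the same query with the other function; a sketch:

```sql
SELECT
    stddev_samp(height) AS sd
    -- stdev(height) AS sd -- MSSQL
FROM customers;
```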
Remember that the standard deviation only has meaning when you believe that the
underlying data follows a normal distribution.
Summary
In this chapter, we had a look at aggregating sets of data.
• min and max, which find the lowest and highest values of any type of
column. In effect, they find the first and last values when sorted by
that column.
• stddev, stddev_samp, stddev_pop (PostgreSQL, MySQL/MariaDB,
Oracle); stdev, stdevp (MSSQL). These calculate a sample or
population standard deviation of a column of numbers; we assume that the data
is normally distributed.
The sum, avg, and standard deviation functions can only be applied to numeric data.
NULLs
Aggregate functions all skip NULLs. This is particularly important when counting values,
but also when calculating averages.
The fact that NULLs are skipped can also be used when calculating selective
aggregates.
Aggregate Filters
It’s possible to filter what data is used for a single aggregate function.
There is a standard FILTER (WHERE ...) clause which allows you to filter a column.
However, it’s not (yet) widely supported.
The common way to filter data is to use the CASE ... END expression on what you’re
aggregating. Set a value of, say, 1 for the values you want, allow the rest to default to NULL,
and let the aggregate functions ignore them for the rest.
You can also aggregate on DISTINCT values. This makes the most sense when you are
counting.
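As a sketch of both approaches, counting customers at or above an arbitrary height:

```sql
-- Standard FILTER clause (PostgreSQL, SQLite):
SELECT count(*) FILTER (WHERE height>=170) AS talls
FROM customers;

-- Portable CASE version; non-matching rows default to NULL and are skipped:
SELECT count(CASE WHEN height>=170 THEN 1 END) AS talls
FROM customers;
```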
GROUP BY
The GROUP BY clause can be used to generate a virtual table of group summaries.
In some DBMSs, you can use GROUP BY () to generate a grand summary. This is the
default behavior whenever SQL sees an aggregate function without a GROUP BY clause,
so it's never truly needed.
You can group by basic values, but also by calculated values.
Grouping by calculated values can get complicated, since the SELECT and ORDER BY
clauses can only use what’s in the GROUP BY. Because of the clause order, you may find
yourself repeating the same calculations in various clauses.
Since the SELECT clause is only evaluated near the end, and selecting and ordering
can only be done on what’s in the GROUP BY clause, you may find the following
techniques helpful:
• Putting aggregate queries in a CTE and joining that with other tables
to get the rest of the results
When grouping by a column, your results may not be in the correct order. Since the
group names are all strings, sorting on the group name will only put them in alphabetical
order, which isn’t always suitable. However, you can also sort them by their position in
another string, which can be in any order you like.
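As a sketch, using hypothetical price bands: the CTE does the grouping, and position() (instr() in SQLite, charindex() in MSSQL) sorts the bands by where they appear in a reference string:

```sql
WITH bands AS (
    SELECT
        CASE
            WHEN price<13 THEN 'cheap'
            WHEN price<17 THEN 'medium'
            ELSE 'expensive'
        END AS band
    FROM books
    WHERE price IS NOT NULL
)
SELECT band, count(*) AS nbooks
FROM bands
GROUP BY band
ORDER BY position(band in 'cheap medium expensive');
```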
Mixing Subtotals
By and large, aggregate queries produce simple aggregates on one level. Sometimes, you
need to combine them with various levels of subtotals.
You can generate subtotals in separate queries and combine them with UNION. You
might need some extra work to get the results sorted in your preferred order.
Most DBMSs include subtotaling operations to create the combined result
automatically. They may include GROUPING SETS, ROLLUP, or CUBE. Most include the
ROLLUP which is the most common variation. There are additional grouping functions to
assist with sorting and labeling.
Statistics
In general, aggregate functions are basically statistical in nature. Although SQL is not
as powerful as dedicated statistical software, you can use aggregates and grouping to
generate some of the basic statistics.
Coming Up
In some cases, we have used a query in a Common Table Expression to prepare data.
However, in one case we created a view instead, so that we could reuse the query.
In the next chapter, we’ll have a closer look at creating and using views to improve
our workflow.
CHAPTER 6
Using Views and Friends
You can spend the rest of your life writing SQL statements, and the job would get
done. However, you might get to the point where writing the same thing over and over
again loses its charm, and so you’ll want to find ways of reusing previous queries.
First, let’s have a look at what we mean by tables and what happens when you use
the SELECT statement.
SQL databases store data in tables. Actually, they don’t—each table is really stored in
some other structure such as a binary tree, which is more efficient. However, by the time
you see it, it will be presented as a table, and that’s what it’s called in the database.
A table is made up of rows and columns. For our purpose, the table doesn’t have to
be a permanent table, and there are operations which generate table structures without
necessarily being permanently stored. We’ll refer to them as virtual tables.
Here is a list of operations which generate (virtual) tables, in increasing order of
longevity:
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1_6
Chapter 6 Using Views and Friends
• A view is a saved SELECT query, which will regenerate the virtual table
on call.
The thing about tables and virtual tables is that they can all be used in the
FROM clause.
You will already know about using joins. You have also used Common Table
Expressions, but we’ll discuss them in more depth in the following chapters.
In this chapter, we’ll look at the rest and how we can improve our workflow
with them.
Note the syntax is exactly the same as for tables. From the perspective of a SELECT
statement, there is no distinction between selecting from a view and from a table.
One important consequence of this is that you cannot have views with the same
names as tables—views and tables share the name space.
This doesn’t mean that there are no differences. The DBMS stores views as separate
types of objects and manages them differently. However, once created, you can treat a
view like a table.
Views can be an important part of your workflow. For example:
• You can hide complex processing with a view which just gives a
virtual table of the results.
• You can restrict access to data in tables by creating a view with only
certain rows and columns.
Note Creating a view requires permissions which you may not already have as a
database user. Do whatever is required to get these permissions—badger, bribe,
blackmail as needed.
There are some limitations as well. The main limitation is that a view is inflexible.
You can't, for example, vary the values that a view might use to calculate results. You'll
see a possible solution to that when we discuss Table-Valued Functions later.
Other limitations vary between DBMSs. Some DBMSs support temporary views.
MSSQL doesn’t allow an ORDER BY in a view without additional trickery. Some DBMSs
support views on temporary tables, while others don’t.
For the most part, however, views will simplify your workflow and allow you to move
on to more complex tasks.
Creating a View
A view starts off as a simple SELECT statement. For example, we can start developing a
pricelist view which will comprise some information about books, their authors, and
the price, including tax:
/* Notes
===================================================
MSSQL: Use + for concatenation
Oracle: No AS for tables:
FROM books b JOIN authors a ON ...
=================================================== */
SELECT
b.id, b.title, b.published,
coalesce(a.givenname||' ','')
|| coalesce(othernames||' ','')
|| a.familyname AS author,
b.price, b.price*0.1 AS tax, b.price*1.1 AS inc
FROM books AS b LEFT JOIN authors AS a ON b.authorid=a.id
WHERE b.price IS NOT NULL;
2078 The Duel 1811 Heinrich von Kleis ... 12.50 1.25 13.75
503 Uncle Silas 1864 J. Sheridan Le Fan ... 17.00 1.70 18.70
2007 North and South 1854 Elizabeth Gaskell 17.50 1.75 19.25
702 Jane Eyre 1847 Charlotte Brontë 17.50 1.75 19.25
1530 Robin Hood, The Pr ... 1862 Alexandre Dumas 12.50 1.25 13.75
1759 La Curée 1872 Émile Zola 16.00 1.60 17.60
~ 1096 rows ~
In other words, the virtual table must conform to the rules of a real table.
To create a view, prepend the SELECT statement with a CREATE VIEW ... AS clause:
/* Notes
===================================================
MSSQL: Use + for concatenation
MSSQL: Surround the CREATE VIEW with GO
Oracle: No AS for tables:
FROM books b JOIN authors a ON ...
=================================================== */
-- GO
CREATE VIEW aupricelist AS
SELECT
b.id, b.title, b.published,
coalesce(a.givenname||' ','')
|| coalesce(othernames||' ','')
|| a.familyname AS author,
b.price, b.price*0.1 AS tax, b.price*1.1 AS inc
FROM books AS b JOIN authors AS a ON b.authorid=a.id;
-- GO
We’ve called the price list aupricelist because the tax is set to 10%, which is the rate
in Australia. Feel free to use any tax rate and name that you like.
SELECT *
FROM aupricelist
WHERE published BETWEEN 1700 AND 1799;
1608 The Autobiography ... 1791 Benjamin Franklin 18.50 1.85 20.35
2303 The Metaphysics of ... 1797 Immanuel Kant 12.00 1.20 13.20
1305 An Essay on Critic ... 1711 Alexander Pope 11.00 1.10 12.10
1963 A Treatise of Huma ... 1740 David Hume 18.50 1.85 20.35
1196 Equiano’s Travels: ... 1789 Olaudah Equiano 12.50 1.25 13.75
1255 Discourse on the O ... 1755 Jean-Jacques Rouss ... 19.00 1.90 20.90
~ 166 rows ~
541 120 Days of Sodom 1904 Marquis de Sade 12.50 1.25 13.75
729 A Cartomante e Out ... 1884 Machado de Assis 16.00 1.60 17.60
2092 A Chaste Maid in C ... 1613 Thomas Middleton 15.00 1.50 16.50
1437 A Child’s Garden o ... 1885 Robert Louis Steve ... 11.00 1.10 12.10
454 A Christmas Carol 1843 Charles Dickens 13.50 1.35 14.85
1094 A Confession 1882 Leo Tolstoy 17.50 1.75 19.25
~ 1096 rows ~
With the exception of MSSQL, you could have included the ORDER BY clause in the
view itself. Although it's convenient, it's probably not a good idea: you're forcing the
DBMS to sort the result whether you need it or not, and you may end up sorting it
again in a different order afterward.
Among other things, this will allow you to create an ordered view without the need to
include extra columns just for sorting.
However, you need to be aware that an ordered view does place an extra burden on
the database, so it should only be used when needed.
Some of the fine points will vary between DBMSs, and the DBMS will do its best to
work as efficiently as possible. Nevertheless, it’s a good idea to keep these ideas in mind.
• Changing the underlying view may make a mess of the view. Some
DBMSs won’t even allow you to change a view if another view
depends on it.
• SQL tries to optimize your queries, but if your views are too deeply
nested, it may not be able to optimize well.
• One of the views may have more than you need for the new view, so
you’re wasting processing time generating what you don’t need.
That doesn’t mean you shouldn’t build on existing views, just that you should do it
judiciously.
• You may want your own column order rather than the default one.
What you should do, however, is make sure that your view includes whatever
columns you’ll be sorting on later.
Table-Valued Functions
Views are a powerful tool, but there’s one shortcoming: you can’t change any of the
values used in a view. For example, the aupricelist view has a hard-coded tax rate of
10%. A more flexible type of view would allow you to input your own tax rate. Such a view
would then be called a parameterized view.
Parameterized views are not generally supported in SQL. Some DBMSs support
functions which generate a virtual table, known as a Table-Valued Function, or TVF if
you’re in a hurry. This will give more or less the same result.
Of our popular DBMSs, only PostgreSQL and Microsoft SQL Server support a
straightforward method of creating a TVF. We’ll explore these two in the following
discussion.
Most DBMSs allow you to create custom functions. The notable exception is SQLite,
which does, however, allow you to create functions externally and hook them in.
A function which generates a single value at a time is called a scalar function.
Built-in functions such as lower() and length() are scalar functions.
When creating a function, there is, in a sense, a contract. The function definition
includes what input data is expected and what sort of data will be returned. If the input
data doesn’t fit, then don’t expect a result.
A TVF works the same way: you define what input is expected, and you promise to
return a table of results. Here, we’ll create a more generic price list which allows you to
tell it what the tax rate is, rather than hard-coding it.
To use the TVF, you use it like any virtual table:
SELECT *
FROM pricelist(15);
Here, the TVF is called pricelist() and the input parameter is 15, meaning 15%.
The code should handle converting that to 0.15:
2078 The Duel 1811 Heinrich von Kleis ... 12.50 1.88 14.38
503 Uncle Silas 1864 J. Sheridan Le Fan ... 17.00 2.55 19.55
2007 North and South 1854 Elizabeth Gaskell 17.50 2.63 20.13
702 Jane Eyre 1847 Charlotte Brontë 17.50 2.63 20.13
1530 Robin Hood, The Pr ... 1862 Alexandre Dumas 12.50 1.88 14.38
1759 La Curée 1872 Émile Zola 16.00 2.40 18.40
~ 1070 rows ~
TVFs in PostgreSQL
The outline of a TVF in PostgreSQL looks like this:
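As a sketch of the shape, with placeholder names and types:

```sql
CREATE OR REPLACE FUNCTION name(parameter type, ...)
RETURNS TABLE (column type, ...) AS $$
BEGIN
    RETURN QUERY SELECT ... ;
END;
$$ LANGUAGE plpgsql;
```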
In this outline
• The actual code is contained in one big string. Because there might
be other strings in the code, the $$ at either end acts as an alternative
delimiter.
• The code is then placed between BEGIN and END; in this case, it will
return the results of a SELECT query.
Filling in the details, we can write
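A sketch, reusing the columns and join from the aupricelist view; the declared types are assumptions based on the sample schema:

```sql
CREATE OR REPLACE FUNCTION pricelist(taxrate DECIMAL)
RETURNS TABLE (
    id INT, title VARCHAR, published INT,
    author VARCHAR,
    price DECIMAL, tax DECIMAL, inc DECIMAL
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        b.id, b.title, b.published,
        coalesce(a.givenname||' ','')
            || coalesce(a.othernames||' ','')
            || a.familyname AS author,
        b.price,
        b.price*taxrate/100 AS tax,
        b.price*(1+taxrate/100) AS inc
    FROM books AS b JOIN authors AS a ON b.authorid=a.id
    WHERE b.price IS NOT NULL;
END;
$$ LANGUAGE plpgsql;
```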
The output table is the most tedious part. In it, we have to list all of the column
names and types we’re expecting to generate.
As for the calculation, we’ve taken a user-friendly approach and allowed the tax rate
to resemble the percentage we might have used in real life. We can’t use %, especially as
that has another meaning, but other than that, we can use the value. However, we then
need to divide by 100 to get its real value.
TVFs in MSSQL
GO
CREATE FUNCTION pricelist(...) RETURNS TABLE AS
RETURN SELECT ...
GO
There are two types of TVF in MSSQL. There is a more complex type, but the simpler,
inline type shown here is very similar to creating a view.
In this outline
• The actual code is almost the same as for the view, except that it will
include the value from the input parameter.
The input parameter is called @taxrate. Actually, it’s really called taxrate, but
MSSQL uses the @ character to prefix all variables.
As with the PostgreSQL version, we’ve taken a user-friendly approach and allowed
the tax rate to resemble the percentage we might have used in real life. We can’t use %,
especially as that has another meaning, but other than that, we can use the value.
However, we then need to divide by 100 to get its real value.
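Filled in, the MSSQL version might look like this; again, a sketch along the lines of the view:

```sql
GO
CREATE FUNCTION pricelist(@taxrate DECIMAL(4,2)) RETURNS TABLE AS
RETURN
    SELECT
        b.id, b.title, b.published,
        coalesce(a.givenname+' ','')
            + coalesce(a.othernames+' ','')
            + a.familyname AS author,
        b.price,
        b.price*@taxrate/100 AS tax,
        b.price*(1+@taxrate/100) AS inc
    FROM books AS b JOIN authors AS a ON b.authorid=a.id
    WHERE b.price IS NOT NULL;
GO
```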
Convenience
The most immediate use of a view is as a convenient way of packaging a useful SELECT
query. For example:
Both of the preceding views include joins, and one includes a number of
calculations. It’s much more convenient to use the saved view when you need it.
As an Interface
A second use of views is to present a consistent interface for existing data.
For example, when we refactored the customers table by referencing another table
and dropping a few columns, we ran the risk of invalidating any other queries which
depended on the old structure. By creating the customerdetails view, you have a new
virtual table which can be read the same way as the old table.
It can also be handy if you’re in the process of renaming or rearranging tables and
columns. Suppose, for example, you’re in the process of developing a new version of the
customers table, with some of the following columns:
/* Notes
========================================================
MSSQL: Use + for concatenation
Oracle, SQLite: Use substr(phone,2) instead of right()
======================================================= */
-- CREATE VIEW newcustomers AS
SELECT
id AS customerid,
givenname AS firstname, familyname AS lastname,
cast(height/2.54 as decimal(3,1))
AS height_in_inches,
'+61' || right(phone,9) AS au_phone
-- etc
FROM customers;
~ 303 rows ~
(The CREATE VIEW clause is commented out, because we’re not really going to go
ahead with this.)
This approach will also be useful if you’re preparing data for an external application.
Such software typically has very limited ability in manipulating data, so it makes
sense to do as much preprocessing as possible. When seen from the external software,
your views will be perceived as single tables (though often they’ll still indicate that they
are actually views).
A materialized view stores its results, which makes it cheaper on processing, since
you're not regenerating the same data so often, but more expensive on storage. As usual,
you may find that extra storage is cheaper than processing power.
Materialized views aren’t widely supported and are sometimes limited in usefulness.
However, you go a long way with temporary tables.
In principle, all SQL tables are temporary, in that it’s always possible to drop a
table—in SQL, as in life, nothing’s truly permanent. However, a temporary table is one
destined to be short-lived and will self-destruct when you close the session.
You can create a temporary table as you might a real table, but using the
TEMPORARY prefix:
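For most DBMSs, a sketch:

```sql
-- PostgreSQL, MySQL/MariaDB, SQLite
CREATE TEMPORARY TABLE somebooks (
    id INT PRIMARY KEY,
    title VARCHAR(255),
    author VARCHAR(255),
    price DECIMAL(4,2)
);
```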
-- Oracle
CREATE GLOBAL TEMPORARY TABLE somebooks (
id INT PRIMARY KEY,
title VARCHAR(255),
author VARCHAR(255),
price DECIMAL(4,2)
);
-- MSSQL
CREATE TABLE #somebooks (
id INT PRIMARY KEY,
title VARCHAR(255),
author VARCHAR(255),
price DECIMAL(4,2)
);
Note
• Oracle distinguishes between GLOBAL and PRIVATE temporary tables.
By “global,” we mean that other users of the database can access the temporary table.
Private ones are, well, private to the session.
If you’re in a desperate hurry, PostgreSQL and SQLite allow you to save time by writing
TEMP instead of TEMPORARY. It probably took you more time to read this paragraph.
The temporary table in this example has a simple integer primary key. If you intend
adding more data as you go, you might also use an autoincremented primary key.
Once you have created your temporary table, you can copy data into it using the
SELECT statement. For example:
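A sketch, copying some of the priced books (the filter is purely illustrative):

```sql
INSERT INTO somebooks (id, title, author, price)
SELECT id, title, author, price
FROM aupricelist
WHERE price<13;
```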
The INSERT ... SELECT ... statement copies data into an existing table, temporary
or permanent.
You can create a new table and populate it in one statement with the following
statement:
-- PostgreSQL
SELECT id,title,author,price
INTO TEMPORARY otherbooks
FROM aupricelist
WHERE price IS NULL;
-- MSSQL
SELECT id,title,author,price
INTO #otherbooks
FROM aupricelist
WHERE price IS NULL;
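The other form creates and fills the table with CREATE TABLE ... AS; a sketch:

```sql
-- PostgreSQL, MySQL/MariaDB, SQLite
CREATE TEMPORARY TABLE otherbooks AS
SELECT id, title, author, price
FROM aupricelist
WHERE price IS NULL;
```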
As you see, this statement takes one of two forms; PostgreSQL supports both.
Note that either form requires that you have permissions to create either a temporary
or permanent table.
Remember, however, that the data is a copy, so it will go stale unless you update it.
Why would you want a temporary table? There’s nothing in our sample database
which could be regarded as in any way heavy-duty. However, in the real world, you might
be working with a query which involves a huge number of rows, complex joins, filters
and calculations, and sorting. This could end up taking a great deal of time and effort,
especially if you’re constantly regenerating the data.
The reasons you would use a temporary table rather than a view include
• Sometimes, you want the data to be out of date, such as when you
need to work with a snapshot of the data from earlier in the day.
If you need to work with the snapshot at some point in the future,
a temporary table may be too fleeting. Everything we’ve done will
also apply to specially created permanent tables.
A database should never keep multiple copies of data. However, there are times
when you need a temporary table for further processing, experimenting, or in transit to
migrating data.
Computed Columns
Modern SQL allows you to add a column to a table which in principle shouldn’t be in
a table. A computed column, or calculated column, is an additional column which is
based on some calculated value. When you think about it, that’s the sort of thing you
would do in a view.
Think of the computed column as embedding a mini-view in the table. It’s
particularly handy if you commonly use one calculation but don’t want the overhead of a
view. It can also be handy if you have the option to cache the results.
A computed column is a read-only virtual column. You can’t write anything into
the column, and, if it saves any data at all, it’s a cached value to save the effort of
recalculating it later. For example, you might store the full name of the customer as a
convenience.
You can create a computed column when you create the table, or you can add it to
the table after the event.
For example, suppose we want to add a shortened form of the ordered datetime
column, with just the date. This will be handy for summarizing by day.
You can add the new column as follows:
-- PostgreSQL >= 12
ALTER TABLE sales
ADD COLUMN ordered_date date
GENERATED ALWAYS AS (cast(ordered as date)) STORED;
-- MSSQL
ALTER TABLE sales
ADD ordered_date AS (cast(ordered as date)) PERSISTED;
-- MariaDB / MySQL
ALTER TABLE sales
ADD ordered_date date
GENERATED ALWAYS AS (cast(ordered as date)) STORED;
-- SQLite>=3.31.0
ALTER TABLE sales
ADD ordered_date date
GENERATED ALWAYS AS (cast(ordered as date)) VIRTUAL;
-- Oracle (STORED)
ALTER TABLE sales
ADD ordered_date date
GENERATED ALWAYS AS (trunc(ordered));
As you see, most DBMSs use the standard GENERATED ALWAYS syntax. MSSQL,
however, uses its own simpler syntax which doesn’t specify the data type but infers it
from the calculation.
You’ll also notice different types of computed column:
• VIRTUAL columns are not stored and are recalculated every time
they’re read. This is the default in MSSQL.
• STORED columns save a copy of the result and will only recalculate if
the underlying value has changed.
MSSQL calls this PERSISTED. In Oracle, it’s the default. SQLite does
support this as well, but only if you create the table that way; if you
add the column later, it can only be VIRTUAL.
You can now fetch the data complete with virtual column:
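A sketch (the column names other than ordered_date are assumptions from the sample schema):

```sql
SELECT id, ordered, ordered_date, total
FROM sales;
```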
~ 5549 rows ~
If you have the option, the better option is STORED or equivalent. It takes a little more
space, but saves on processing later.
Summary
Much of your work will involve not only real tables but generated virtual tables. Virtual
tables include
• A join
Views
A view is a saved SELECT statement. It can be made as complex as you like and then
fetched as a virtual table.
The benefits of views include
• They offer a simple table view of complex data when accessed from
external applications.
Temporary Tables
There are times when it is better to store results rather than regenerate them every time.
You can save them into a caching table.
The benefits include
• It’s more efficient not to have to recalculate what will be the same
results.
Computed Columns
In modern DBMSs, you can create virtual columns in a table which give the results of a
calculation.
A VIRTUAL computed column will regenerate the value every time you fetch from
the table. A STORED computed column, a.k.a. PERSISTED in MSSQL, will cache the results
until other data has changed.
A computed column can be used for convenience. If it’s a STORED column, it also has
the benefit of saving on processing.
Coming Up
A SELECT statement doesn’t have to be the end of the story. In some cases, it can be one
step in a more complex story.
A subquery allows you to embed a SELECT statement inside a query. This can be
used to fetch values from other tables or to use one table to filter another. It’s particularly
handy if you want to incorporate aggregate data in another query.
The next chapter will look at subqueries in more detail.
CHAPTER 7
Working with Subqueries and Common Table Expressions
• One row and one column: You get just one value, though technically
it’s still in a table. We’ll call this a single value.
• One column and multiple rows: When the time comes, we’ll call
this a list.
• Multiple rows and columns: This is, in effect, a virtual table.
Id
392
Chapter 7 Working with Subqueries and Common Table Expressions
Email
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
~ 64 rows ~
~ 64 rows ~
That last category, the virtual table, could also be the result of a very broad query
such as SELECT * FROM customerdetails. It all works the same way.
Any of these results, depending on the context, can be used in a subsequent query
where a single value, a list, or a (virtual) table might have been expected. For example,
using a single value:
SELECT *
FROM saleitems
WHERE bookid=(SELECT id FROM books WHERE title='Frankenstein');
Here, the single value query is wrapped inside parentheses and used the way you
would if you already knew the value of the bookid you’re matching:
~ 14 rows ~
SELECT *
FROM books
WHERE authorid IN (
SELECT id FROM authors WHERE born BETWEEN '1700-01-01' AND '1799-12-31'
);
The IN operator expects a list of values, which we get from the one column in the
nested SELECT statement:
~ 256 rows ~
Some modern DBMSs are now adding support for returning values
for more than one column, but it’s not widely supported for now.
• A subquery in the WHERE clause must return a single value when used with a
comparison operator, or a single column when used with an IN() expression.
The whole thing with subqueries is that with a subquery you can combine multiple
parts to make a more complex query:
~ 165 rows ~
-- Oldest Customers
SELECT *
FROM customers
WHERE dob=(SELECT min(dob) FROM customers);
(You’ll note that there’s more than one oldest customer, because they happen to be
born on the same day. It happens.)
In both cases, the subquery is evaluated once, and the results are used in the main
query. The result may be a list, as in the female authors, or a single value as in the oldest
customer.
A non-correlated subquery is independent of the main query. If you highlight the
subquery alone and run it, you’ll get a result.
Here’s an example of a correlated subquery:
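For DBMSs which use the standard || concatenation, such as PostgreSQL and SQLite:

```sql
-- PostgreSQL, SQLite
SELECT
    id, title, (
        SELECT coalesce(givenname||' ','')
            || coalesce(othernames||' ','')
            || familyname
        FROM authors
        WHERE authors.id=books.authorid
    ) AS author
FROM books;
```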
-- MSSQL
SELECT
id, title, (
SELECT coalesce(givenname+' ','')
+ coalesce(othernames+' ','')
+ familyname
FROM authors
WHERE authors.id=books.authorid
) AS author
FROM books;
-- Oracle
SELECT
id, title, (
SELECT ltrim(givenname||' ')
||ltrim(othernames||' ')
||familyname
FROM authors
WHERE authors.id=books.authorid
) AS author
FROM books;
id title author
~ 1201 rows ~
In this case, the subquery is evaluated once for every row. Look at the subquery in
the first example earlier, spread out to be more readable:
(
SELECT
coalesce(givenname||' ','')
|| coalesce(othernames||' ','')
|| familyname
FROM authors
WHERE authors.id=books.authorid
)
The SELECT clause is expecting a single value for the author column, and so the
subquery should deliver a single value, which it does. You can’t use multiple columns in
this context, so you need to concatenate the names to give the single value.
Just as importantly, you can’t have multiple rows either. Here, the WHERE clause filters
the result to a single row, where the id matches the authorid in the main query: WHERE
authors.id=books.authorid.
For every row in the books table, the subquery runs again to match the next
authorid.
If there’s no match, the subquery comes back with a NULL.
You can recognize a correlated subquery by the fact that the query references
something from the main query. As a result, you can’t highlight the subquery and run it
alone, because it needs that reference to be complete.
Incidentally, note the WHERE clause in the subquery. In a sense, it’s overqualified,
and we could have used this: WHERE id=authorid. This is in spite of the fact that an id
column appears in both the subquery and the main query.
When the subquery is evaluated, column names will be defined from the inside
out. For the id column, there’s one in the inner authors table, so SQL doesn’t bother to
notice that there’s also one in the outer books table. For the authorid column, there isn’t
one in the authors table, so it falls through the one in the books table.
That’s how it works in SQL, but it’s probably better to qualify the columns as we did
in this example to minimize confusion for us humans.
As a rule, a correlated subquery is an expensive operation because it’s reevaluated
so often. That doesn’t mean you shouldn’t use one, just that you should consider the
alternatives, if there are any. You don’t generally get to choose which type of subquery
you will need, but it will help in deciding whether there’s a better alternative.
SELECT
id, title, (
SELECT coalesce(givenname||' ','')
|| coalesce(othernames||' ','')
|| familyname
FROM authors
WHERE authors.id=books.authorid
) AS author,
(SELECT born FROM authors
WHERE authors.id=books.authorid) AS born,
(SELECT died FROM authors
WHERE authors.id=books.authorid) AS died
FROM books;
Apart from being tedious, it’s also expensive, and, of course, there’s a better way to do
it, using a join:
SELECT
    b.id, b.title,
    coalesce(a.givenname||' ','')
        || coalesce(a.othernames||' ','')
        || a.familyname AS author,
    a.born, a.died
FROM books AS b LEFT JOIN authors AS a ON b.authorid=a.id;
In fact, you’ll probably find that a correlated subquery is often best replaced by a
join. There’s also some cost in the join, but after that, the rest of the data is free.
On the other hand, if the subquery is non-correlated, then it’s not so expensive. For
example, here’s the difference between customers’ heights and the average height:
SELECT
id, givenname, familyname,
height,
height-(SELECT avg(height) FROM customers) AS diff
FROM customers;
~ 303 rows ~
Even though the average is involved in a calculation in every row, it’s only calculated
once in the non-correlated subquery.
By the way, there’s an alternative way to do the preceding query involving window
functions, which we’ll look at in Chapter 8. However, in this case, there’s not much
difference in the result.
You’ll have noticed that, in this case, the subquery references the same table as
the main query. That doesn’t make it a correlated subquery, as it doesn’t reference the
actual rows in the main query. You can verify that if you highlight the subquery and run
it by itself—it will work.
The subquery in this example was an aggregate query. You can also use an aggregate
in a correlated query. Here’s a way of generating a running total:
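A sketch of such a correlated aggregate (Oracle omits the AS before the table alias; the plain column names are assumptions from the sample schema):

```sql
SELECT
    id, ordered, total,
    (
        SELECT sum(total)
        FROM sales AS ss
        WHERE ss.ordered<=sales.ordered
    ) AS running_total
FROM sales
ORDER BY ordered;
```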
~ 5549 rows ~
We’ve had to alias the table in the subquery to something like ss (subsales?) to
distinguish it from the same table in the main query. That’s so that the expression
ss.ordered<=sales.ordered can reference the correct tables.
Here, the subquery calculates the sum of the totals up to and including the current
sale, ordered by the ordered column.
You possibly noticed that the query took a little while to run. As we noted, a
correlated subquery is costly, and one which involves aggregates is especially costly.
Fortunately, there’s also a window function for that, as we’ll see in the next chapter.
SELECT *
FROM customers
WHERE dob=(SELECT min(dob) FROM customers);
You can also do the same to find customers shorter than the average:
SELECT *
FROM customers
WHERE height<(SELECT avg(height) FROM customers);
In both cases, the aggregate query was on the same table as the main query. You
might have thought that you could use an expression like WHERE dob=min(dob) or WHERE
height<avg(height), but it wouldn’t work; aggregates are calculated after the WHERE clause.
Big Spenders
Suppose you want to identify your “big spenders”—the customers who have spent the
highest amounts. For that, you will need data from the customers and sales tables.
Here, we’ll use subqueries as part of a multistep process.
To begin with, you’ll want to identify what you regard as large purchases:
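Presumably, "large" means a total over some threshold; a sketch matching the filter used in the following queries:

```sql
SELECT *
FROM sales
WHERE total>160;
```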
80 32 168.00 … 2022-05-22
216 13 160.50 … 2022-06-11
483 59 176.50 … 2022-07-11
726 68 173.00 … 2022-08-02
823 86 165.50 … 2022-08-09
891 140 162.50 … 2022-08-16
~ 35 rows ~
Here, we’re only interested in the customerid, which we’ll use to select from the
customers table:
SELECT *
FROM customers
WHERE id IN(SELECT customerid FROM sales WHERE total>160);
id … familyname givenname …
42 … Knott May …
58 … Ting Jess …
91 … North June …
140 … Byrd Dicky …
40 … Face Cliff …
141 … Rice Jasmin …
~ 32 rows ~
Note that the IN operator requires a list of values. In a subquery, this is a single
column of values.
Note also that you may have fewer results than in the previous query; that would
happen if some of the customer ids appear more than once.
SQL also has an ANY operator which will do the same job:
SELECT *
FROM customers
WHERE id=ANY(SELECT customerid FROM sales WHERE total>=160);
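You can get the same result with a join. A sketch of the join version being described, using the qualified star and DISTINCT mentioned in the next paragraph:

```sql
SELECT DISTINCT customers.*
FROM customers JOIN sales ON customers.id=sales.customerid
WHERE sales.total>=160;
```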
To recreate what we had in the previous query, we’ve qualified the star
(customers.*) and used DISTINCT to remove duplicates of customers who may have
appeared in the list more than once.
The advantage of using a join is that you can also get sales data for the asking, so this
gives a slightly richer result:
SELECT *
FROM customers JOIN sales ON customers.id=sales.customerid
WHERE sales.total>=160;
Here, we’ve removed the DISTINCT and the customers., so you’ll get a lot of data:
~ 35 rows ~
Finding customers with large total sales will require an aggregate subquery:
SELECT *
FROM customers
WHERE id IN(
    SELECT customerid
    FROM sales
    GROUP BY customerid
    HAVING sum(total)>1000    -- threshold is illustrative
);
id … familyname givenname …
42 … Knott May …
58 … Ting Jess …
26 … Twishes Bess …
91 … North June …
69 … Mentary Rudi …
140 … Byrd Dicky …
~ 57 rows ~
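The next example finds each customer’s most recent order. A sketch of the aggregate subquery behind the Max column shown here:

```sql
SELECT max(ordered) AS Max
FROM sales
GROUP BY customerid;
```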
Max
2023-05-15 00:46:00.864446
2023-05-25 00:42:26.783461
2023-05-16 05:27:53.810977
2023-05-06 01:40:02.346894
2023-05-19 07:41:25.104524
2023-05-07 19:01:06.756387
~ 269 rows ~
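Feeding that into an IN() filters the sales table down to each customer’s latest sale. A sketch, matching the joined version further on:

```sql
SELECT *
FROM sales
WHERE ordered IN(SELECT max(ordered) FROM sales GROUP BY customerid);
```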
~ 266 rows ~
If you count the rows, you may find that the main query returned fewer rows than the
subquery. That would happen if there were some NULL ordered datetimes. At some point,
we should learn to ignore these, either by filtering them out or removing them altogether.
The question is, why weren’t those sales included in the full query? And the answer
is that it’s all about the IN() operator.
Remember in Chapter 3, we discussed the NOT IN quirk. The discussion also applies
to a plain IN. The NULL datetimes in the subquery would result in the equivalent of
testing WHERE ordered=NULL, which, as we all know, always fails.
Now that we have sales for each customer, it’s a simple matter to join that to the
customers table to get more details:
SELECT *
FROM sales JOIN customers ON sales.customerid=customers.id
WHERE ordered IN(SELECT max(ordered) FROM sales GROUP BY customerid);
~ 266 rows ~
You can now extract any customer or sales data you might want to work with.
Duplicated Customers
We’ve seen in Chapter 2 how to find duplicates. Suppose, for example, you want to find
duplicate customer names:
SELECT
givenname||' '||familyname AS fullname,
-- MSSQL: givenname+' '+familyname AS fullname,
count(*) as occurrences
FROM customers
GROUP BY familyname, givenname
HAVING count(*)>1;
You get
fullname Occurrences
Judy Free 2
Annie Mate 2
Mary Christmas 2
Ken Tuckey 2
Corey Ander 2
Ida Dunnit 2
Paul Bearer 2
Terry Bell 2
Remember, having the same name doesn’t necessarily mean they’re duplicates. It’s
probably just a coincidence.
We’ve concatenated the name because of what we’re going to do in the next step.
The problem with aggregate queries is that you can only select what you’re grouping,
so we can’t see the rest of the customer details. Any attempt to include them would
destroy the aggregate.
We can, however, use the duplicate query as a subquery to filter the customers table:
/* Note
================================================
MSSQL: Use givenname+' '+familyname
================================================ */
SELECT *
FROM customers
WHERE givenname||' '||familyname IN (
SELECT givenname||' '||familyname FROM customers
GROUP BY familyname, givenname
HAVING count(*)>1
);
This will give us the rest of the customer details. The reason we had to concatenate
the customers’ names is that you can only have a single column in the IN() expression.
For example, suppose you want to look at your books in price groups. You can create
a simple query like this:
SELECT
id, title,
CASE
WHEN price<13 THEN 'cheap'
WHEN price<=17 THEN 'reasonable'
WHEN price>17 THEN 'expensive'
END AS price_group
FROM books;
id title price_group
~ 1201 rows ~
Now, suppose you want to summarize the table. The problem is that you can’t
do this:
SELECT
    -- id, title,
    price_group,
    count(*) AS num_books
FROM books
GROUP BY price_group;
We’ve commented out the columns we’re not grouping, but it still won’t work
because of that pesky clause order thing: the alias price_group is created in the SELECT
clause, which comes after the GROUP BY clause, so it’s not available for grouping. Of
course, you could reproduce the whole CASE calculation in the GROUP BY clause, but it’s
neater to put the calculation in a subquery in the FROM clause and group by its alias:
SELECT price_group, count(*) AS num_books
FROM (
    SELECT
        id, title,
        CASE
            WHEN price<13 THEN 'cheap'
            WHEN price<=17 THEN 'reasonable'
            WHEN price>17 THEN 'expensive'
        END AS price_group
    FROM books
) AS sq -- Oracle: ( ... ) sq
GROUP BY price_group;
price_group num_books
expensive 320
[NULL] 105
reasonable 467
cheap 309
Remember that the default fall through for the CASE expression is NULL. Those books
which are unpriced will end up in the NULL price group. Depending on the DBMS, you’ll
see this somewhere in the result set as a separate group.
Remember that a SELECT statement generates a virtual table. As such, it can be used
in a FROM clause in the form of a subquery.
Note that there’s a special requirement for a FROM subquery: it must have an alias,
even if you’ve no plans to use it. We have no special plans here, so it’s just called sq
(“SubQuery”) for no particular reason. If you want to, say, join the subquery with
another table or virtual table, then the alias will be useful.
Nested Subqueries
A subquery is a SELECT statement with its own FROM clause. In turn, that FROM clause
might be from another subquery. If you have a subquery within a subquery, it’s a nested
subquery.
For example, let’s look at duplicate customer names again. You can find candidates
with the following aggregate query:
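The aggregate query is presumably the same one used inside the join shortly afterward:

```sql
SELECT familyname, givenname
FROM customers
GROUP BY familyname, givenname
HAVING count(*)>1;
```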
They’re just the names. Suppose you want more details. For that, you can join the
customers table with the preceding query:
SELECT
c.id, c.givenname, c.familyname, c.email
FROM customers AS c JOIN (
SELECT familyname, givenname
FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
) AS n ON c.givenname=n.givenname AND c.familyname=n.familyname;
We’ve seen something like this before. You’ll now get the candidate customers:
~ 16 rows ~
We’ve aliased the customers table to c for convenience (don’t forget no AS in Oracle),
and the subquery needs a name anyway, so we’ve called it n. In the SELECT clause, we’ve
just fetched the id, the names, and the email address.
Now, let’s combine this in another aggregate query, which will give us one row per
name, and combine the other details:
SELECT
    givenname, familyname,
    -- PostgreSQL, MSSQL:
    string_agg(email,', ') AS email,
    string_agg(cast(id AS varchar(3)),', ') AS ids
    -- MariaDB/MySQL:
    -- group_concat(email SEPARATOR ', ') AS email,
    -- group_concat(id SEPARATOR ', ') AS ids
FROM (
    SELECT c.id, c.givenname, c.familyname, c.email
    FROM customers AS c JOIN (
        SELECT familyname, givenname FROM customers
        GROUP BY familyname, givenname HAVING count(*)>1
    ) AS n ON c.givenname=n.givenname AND c.familyname=n.familyname
) AS sq
GROUP BY givenname, familyname;
~ 8 rows ~
SELECT ...
FROM ...
WHERE EXISTS(subquery);
The subquery will either return a result or not. If it does, then the WHERE EXISTS is
satisfied, and the row is passed; if it doesn’t, then the WHERE EXISTS isn’t satisfied, and
the row will be filtered.
For example, you can test the idea with the following statement:
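A sketch of such a test; the FROM dual and the 1=1 condition follow the discussion below:

```sql
SELECT *
FROM authors
WHERE EXISTS(SELECT 1 FROM dual WHERE 1=1);
```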
Since 1=1 is always true, you’ll get all of the rows from the authors table.
Although you would normally only use FROM dual with Oracle, MariaDB and MySQL
also support this. In this case, MariaDB and MySQL don’t like the WHERE clause without a
FROM, so we’ve thrown it in to keep them happy.
Similarly, you can return nothing:
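A condition which can never be true would presumably do the trick:

```sql
SELECT *
FROM authors
WHERE EXISTS(SELECT 1 FROM dual WHERE 1=0);
```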
The subquery is in a special position in that it doesn’t matter what columns are
actually being selected: what matters is that there is or isn’t a row. That’s why we’ve
included a dummy SELECT 1.
You can also choose SELECT NULL or even SELECT 1/0. The former would give the
(false) impression that we’re looking for nothing, and the latter would have resulted
in an error if run by itself. It’s tempting to take it more seriously by selecting a more
meaningful value, but there’s no need.
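The statement being discussed presumably runs the test subquery over the books table, with a condition on the price:

```sql
SELECT *
FROM authors
WHERE EXISTS(SELECT 1 FROM books WHERE price>0);
```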
The subquery selects some rows, which is enough to satisfy the WHERE clause, so
you’ll get all the authors. If you had tried WHERE price<0, then you’d get none of the
authors.
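The correlated version described next might be sketched as:

```sql
SELECT *
FROM authors
WHERE EXISTS(
    SELECT 1 FROM books WHERE books.authorid=authors.id
);
```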
~ 443 rows ~
Here, the subquery looks for a row where books.authorid matches authors.id, and
if there is such a row (as there is with most authors), the author row will be returned.
This variation is, of course, simpler. However, it’s quite likely that, on the inside, SQL
does exactly the same thing, so how you write it is really a matter of taste.
On the other hand, if you’re looking for authors without books (in our catalogue),
then it’s a different matter.
This won’t work:
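A sketch of the attempt, using NOT IN:

```sql
SELECT *
FROM authors
WHERE id NOT IN(SELECT authorid FROM books);
```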
Well, technically, it will work, but not the way we would have wanted. Recall again
from Chapter 3 the “NOT IN quirk.” Since there are some NULLs in the authorid column,
the NOT IN operator eventually evaluates something like ... AND id=NULL AND .... The
id=NULL always fails, and the ... AND ... combines that failure with the rest and causes
the whole expression to fail.
Using WHERE NOT EXISTS will, however, work:
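A sketch of the NOT EXISTS version:

```sql
SELECT *
FROM authors
WHERE NOT EXISTS(
    SELECT 1 FROM books WHERE books.authorid=authors.id
);
```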
~ 45 rows ~
You won’t see WHERE EXISTS much in the wild, since you can generally do the same
thing with either a join or the IN operator. However, there are times when it has an
advantage or is more intuitive, particularly when NOT IN doesn’t work and WHERE EXISTS
is the more expressive option.
SELECT
id, title,
price, price*0.1 AS tax, price+tax AS inc
FROM books;
It won’t work. That’s because each column is independent of the rest. You can’t
use an alias as part of another calculation in the SELECT clause. We got around this by
calculating the inc column separately: price*1.1 AS inc.
It gets worse if you try something like this:
SELECT
id, title,
price, price*0.1 AS tax
FROM books
WHERE tax>1.5;
Here, the problem is that the SELECT clause is evaluated after the WHERE clause, so
the aliased calculation for tax isn’t available yet in the WHERE clause. Again, we could
recalculate the value in the WHERE clause: WHERE price*1.1>1.5.
The exception is SQLite: there, you can indeed use aliases in the WHERE clause and
also in the GROUP BY clause.
Finally, if, for example, you want to get multiple columns from a subquery in the
SELECT clause, this won’t work either:
SELECT
id, title,
(SELECT givenname, othernames, familyname
FROM authors WHERE authors.id=books.authorid)
FROM books
WHERE tax>1.5;
A subquery in the SELECT clause can only return one value, which is all right if you
concatenate the names and then return the result. Otherwise, you’re stuck with three
subqueries, which is both costly and tedious.
SQL can solve this by applying a subquery to each row. This is called a LATERAL JOIN
in some DBMSs, or an APPLY in some others.
Adding Columns
In the first two examples earlier, you can use an expression like this:
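A sketch of what that might look like, following the notes below: LATERAL for PostgreSQL and MySQL, CROSS APPLY for MSSQL and Oracle. The t and i aliases are illustrative, and Oracle details may vary:

```sql
-- PostgreSQL, MySQL
SELECT
    id, title, price, tax, inc
FROM books
    JOIN LATERAL (SELECT price*0.1 AS tax) AS t ON true
    JOIN LATERAL (SELECT price+tax AS inc) AS i ON true;

-- MSSQL
SELECT
    id, title, price, tax, inc
FROM books
    CROSS APPLY (SELECT price*0.1 AS tax) AS t
    CROSS APPLY (SELECT price+tax AS inc) AS i;
-- Oracle: add FROM dual inside each subquery
--         and drop the AS before each alias
```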
~ 525 rows ~
Note:
• The subquery must be given an alias, even though it’s not used.
• PostgreSQL, MySQL, and MSSQL allow you to put the column
aliases in the subquery alias instead: (SELECT price*0.1)
AS sq(tax). Oracle doesn’t.
• The example for PostgreSQL and MySQL uses the dummy condition
ON true. MySQL will allow you to leave this out, but PostgreSQL
requires it.
Note in particular that the second subquery will happily calculate the expression
price+tax AS inc. This is because the subqueries are evaluated one after the other, so
the expressions can accumulate.
The LATERAL or CROSS APPLY subquery is applied to every row of the main query.
In principle, that could be pretty expensive, but, as it turns out, it’s not so bad. It’s
particularly useful if you need to include a series of intermediate steps in a more
complex calculation—it’s easy to understand and easy to maintain.
SQL also has a type of join called CROSS JOIN. In a cross join, each row of one
table is joined with each row of the other table. This result is also known as a
Cartesian product. That’s a lot of combinations, and it’s usually not what you want.
A CROSS APPLY is not the same thing, though it is a type of join. It’s closer to an
OUTER JOIN.
You’ll see a use for a cross join later when we cross join with a single row
virtual table.
Multiple Columns
As we noted, SQL won’t let you fetch multiple columns from a single subquery in the SELECT
clause, because everything in the SELECT clause is supposed to be scalar—a single value.
However, you can fetch multiple columns if the context is table-like, such as in the
FROM clause. For example:
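A sketch of a lateral subquery returning several author columns at once (PostgreSQL/MySQL syntax; the a alias is illustrative):

```sql
SELECT
    books.id, title,
    givenname, othernames, familyname, home
FROM books
    LEFT JOIN LATERAL (
        SELECT givenname, othernames, familyname, home
        FROM authors
        WHERE authors.id=books.authorid
    ) AS a ON true;
```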
~ 1201 rows ~
In this case, you can just as readily use a normal outer join to get the same results:
SELECT
books.id, title,
givenname, othernames, familyname,
home
FROM books LEFT JOIN authors ON authors.id=books.authorid;
The latter form is definitely simpler (we’ve left off the table aliases for simplicity and
qualified the books.id column out of necessity).
On the other hand, if the subquery is an aggregate query, the lateral join is
convenient, since you’re going to need a subquery anyway: remember you can’t mix
aggregate and non-aggregate data in a single SELECT statement.
For example, suppose you want a list of customers with the total sales for each
customer. You’ll need an aggregate query to get the totals, joined to the customers table.
You could do this:
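A sketch of the lateral version, with the aggregate tucked into the subquery (PostgreSQL/MySQL syntax; the total_purchases alias is illustrative):

```sql
SELECT
    customers.id, givenname, familyname, total_purchases
FROM customers
    LEFT JOIN LATERAL (
        SELECT sum(total) AS total_purchases
        FROM sales
        WHERE sales.customerid=customers.id
    ) AS s ON true;
```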
~ 303 rows ~
There are alternatives, which you’ll need when working with SQLite or MariaDB, but
the lateral join can sometimes make this sort of query a little more intuitive.
Common Table Expressions (CTEs) are a relatively recent feature of SQL, but they
have been around for some time and are available in almost all modern DBMSs. The
notable laggards are MariaDB, which added support in version 10.2 (released in
2016), and MySQL, which added support in version 8.0 (released in 2018). If you’re
stuck with an older version of MariaDB or MySQL, maybe you can learn to enjoy
nested subqueries.
Complex subqueries may have one subquery referring to another. This involves
nesting subqueries.
• CTEs can reference previous CTEs without the need for nesting.
Both these benefits relate to readability and maintainability. The third benefit is one
which is not available for ordinary subqueries.
• CTEs can refer to themselves; thus, they can be recursive.
Syntax
A Common Table Expression is defined as part of the query, before the main part:
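In outline (cte, subquery, and columns are placeholders, as in the multiple-CTE form that follows):

```sql
WITH cte AS (subquery)
SELECT columns FROM ...;
```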
The CTE is given a name, though not necessarily cte of course. Thereafter, it is used
as a normal table in the main query. You can define multiple CTEs as follows:
WITH
cte AS (subquery),
another AS (subquery)
SELECT columns FROM ...;
-- Prepare Data
WITH sq AS (
SELECT
id, title,
CASE
WHEN price<13 THEN 'cheap'
WHEN price<=17 THEN 'reasonable'
WHEN price>17 THEN 'expensive'
END AS price_group
FROM books
)
-- Use Prepared Data
SELECT price_group, count(*) AS num_books
FROM sq
GROUP BY price_group;
It doesn’t look much different, but the important part is that you now have your
query in two parts: the first part defines the subquery, and the second uses it. It’s a much
better way of organizing your code.
The subquery has been transferred to a CTE at the beginning of the query. From
there on, the main SELECT statement references the CTE as if it were just another table.
The advantage is that the query is written according to the plan: first prepare the
data, and then use the data.
MSSQL currently doesn’t require a semicolon at the end of a statement, but you
should be in the habit of using it anyway.
However, the WITH clause has an alternative meaning at the end of a previous
SELECT statement, so it will be misinterpreted if you don’t end the previous
SELECT statement with the semicolon.
Just use the semicolon at the end of every statement, and all will be fine. Don’t fall
for this nonsense:
;WITH (...)
Here’s another example, which we’ll use further in the next few chapters. If you look
at the sales table:
id customerid total … ordered
39 28 28.00 … 2022-05-15
40 27 34.00 … 2022-05-16
42 1 58.50 … 2022-05-16
43 26 50.00 … 2022-05-16
45 26 17.50 … 2022-05-16
518 50 13.00 … [NULL]
~ 5549 rows ~
If you want to summarize the table, such as to get monthly totals, the data is too
finely detailed. Instead, you can prepare the data by formatting the ordered value as a
year-month string:
WITH salesdata AS (
SELECT
-- PostgreSQL, Oracle
to_char(ordered,'YYYY-MM') AS month,
-- MariaDB/MySQL
-- date_format(ordered,'%Y-%m') AS month,
-- MSSQL
-- format(ordered,'yyyy-MM') AS month,
-- SQLite
-- strftime('%Y-%m',ordered) AS month,
total
FROM sales
)
SELECT month, sum(total) AS monthly_total
FROM salesdata
GROUP BY month
ORDER BY month;
month monthly_total
2022-05 6966.50
2022-06 12733.00
2022-07 17314.00
2022-08 19093.00
2022-09 20295.50
2022-10 27797.50
~ 14 rows ~
In real life, much of what you want to summarize isn’t in the right form, but you can
prepare it in a CTE to get it ready.
We’ll have another look at CTEs in Chapter 9, where we’ll see more techniques we
can apply.
Summary
In this chapter, we’ve had a look at using variations on subqueries in a query. We’ve
already seen some subqueries in previous chapters, but here we had a closer look at how
they work.
Subqueries can be used in any clause, as long as the results of the subquery match
the context of the clause.
You can also use subqueries in the ORDER BY clause, though you’d probably want to
use the expression in the SELECT clause instead.
You can also use subqueries with the WHERE EXISTS expression or in LATERAL joins.
Subqueries in the FROM clause can be nested, though you would probably want to use
a Common Table Expression instead.
A correlated subquery can be expensive, since it’s evaluated multiple times, so there
may be more suitable alternatives.
• You can also use a LATERAL JOIN to add multiple columns from a
subquery.
Coming Up
In Chapter 5, we had a look at aggregating data. Generally, aggregate values can’t be
mixed with non-aggregate values without throwing a few subqueries into the mix.
Window functions are a group of functions which do the job of applying subqueries
to each row. There are two main groups of window functions.
With window functions, you’ll be able to generate datasets which combine plain data
with more analytical data.
CHAPTER 8
Window Functions
So far, you have seen two main groups of calculations:
• Most calculations have been based on table columns: For each row, a
value is calculated from one or more columns.
• Aggregate queries are used to summarize rows: For the whole table,
some or all rows are summarized.
Window functions are a group of functions which add row data as columns. We’ll be
working with three groups of window functions:
Among other things, you’ll see how this can be used to generate
running totals.
• Value functions: You can get data from rows which precede or
follow the current row. You can also get the first and last values in
each group.
This will, for example, get you the difference in values between
this and some other row.
In this chapter, we’ll look at all of these.
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1_8
Window functions are relatively new to SQL, but most modern DBMSs now support
them. Again, the laggards are MariaDB, which introduced them in version 10.2, and
MySQL which introduced them in version 8.
Before we get started, some of the samples will be working with the sales table. That
table includes some NULLs for the ordered date/time. Presumably, those sales never
checked out.
We’ve been pretty forgiving so far and filtered them out from time to time, but the
time has come to deal with them. We can delete all of the NULL sales as follows:
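The delete itself might be as simple as this; as discussed below, the cascading foreign key takes care of the sale items:

```sql
DELETE FROM sales
WHERE ordered IS NULL;
```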
You’ll notice that there’s a foreign key from the saleitems table to the sales table,
which would normally disallow deleting the sales if there are any items attached.
However, if you check the script which generates the sample database, you’ll notice the
ON DELETE CASCADE clause, which will automatically delete the orphaned sale items.
The important part is the OVER() clause which generates the window to be
summarized.
There are three main window clauses:
• PARTITION BY: This calculates the function for the group defined. It is
equivalent to GROUP BY.
• ORDER BY: This processes the rows in the given order, accumulating
as it goes. This order does not need to be the same as the table’s
ORDER BY clause.
In the following samples, there is normally an ORDER BY clause at the end of the
SELECT statement, which is the same as what’s in the OVER() clause. This isn’t necessary,
but it makes the results easier to follow.
For example, you can’t normally mix an aggregate with row data like this:
SELECT
id, givenname, familyname,
count(*)
FROM customerdetails;
Add an empty OVER() clause, however, and it works:
SELECT
id, givenname, familyname,
count(*) OVER ()
FROM customerdetails;
~ 303 rows ~
The OVER() clause changes the aggregate function into a window function. This
aggregate function will now be generated for each row. You’ll see later that the OVER()
clause defines any grouping, known as partitions, the order, and the number of rows to be
considered in the aggregate.
For such a simple case, you can get the same result with a subquery:
SELECT
id, givenname, familyname,
(SELECT count(*) FROM customers)
FROM customerdetails;
The window function becomes more interesting when you apply one of the window
clauses. For example:
SELECT
id, givenname, familyname,
count(*) OVER (ORDER BY id)
FROM customerdetails;
This will give the running count up to and including the current row, in order of id.
The actual table results may or may not be in row order, especially if you include other
expressions, so it’s better to add that to the end:
SELECT
id, givenname, familyname,
count(*) OVER (ORDER BY id) AS running_count
FROM customerdetails
ORDER BY id;
id givenname familyname running_count
1 Pierce Dears 1
2 Arthur Moore 2
5 Ray King 3
6 Gene Poole 4
9 Donna Worry 5
10 Ned Duwell 6
~ 303 rows ~
The running_count column looks very much like a simple row number. We’ll see
later that it’s not necessarily the same if the ORDER BY column isn’t unique.
Aggregate Functions
Normally, you can’t use aggregate functions in a normal query unless you squeeze them
into a subquery. However, they can be repurposed as window functions.
Previously, you saw that you can use the expression count(*) OVER () to give the
total number on every row. You can also do something similar with the sum() or avg()
functions.
For example, suppose you want to compare sales totals with the overall average:
SELECT
id, ordered, total,
total-avg(total) OVER () AS difference
FROM sales;
~ 5549 rows ~
In a more complicated example, suppose you want to see how the sales for each day
of the week compare to the rest of the week.
First, you could extract only the day of the week and total from the sales table. You
can use either the day name or the day number for this, but let’s use the day number:
-- PostgreSQL: Sunday=0
SELECT
EXTRACT(dow FROM ordered) AS weekday_number,
total
FROM sales;
-- MSSQL: Sunday=1
SELECT
datepart(weekday,ordered) AS weekday_number,
total
FROM sales;
-- Oracle: Sunday=1
SELECT
to_char(ordered,'D')+0 AS weekday_number,
total
FROM sales;
-- MariaDB/MySQL: Sunday=1
SELECT
dayofweek(ordered) AS weekday_number,
total
FROM sales;
-- SQLite: Sunday=0
SELECT
strftime('%w',ordered) AS weekday_number,
total
FROM sales;
You’ll see they all have a different way to do it, and they can’t even agree on the day
number. Fortunately, they all agree on the first day of the week:
weekday_number Total
0 28
1 34
1 58.5
1 50
1 17.5
0 13
~ 5549 rows ~
You can start by putting this into a CTE, leaving the summary for later:
WITH
data AS (
    SELECT
        ... AS weekday_number,
        total
    FROM sales
)
-- to be done
;
Next, you can summarize the data by weekday in a second CTE:
WITH
data AS (
SELECT
... AS weekday_number,
total
FROM sales
),
summary AS (
SELECT weekday_number, sum(total) AS total
FROM data
GROUP BY weekday_number
)
-- etc
Finally, you can compare the daily totals to the grand totals using a window aggregate:
WITH
data AS (...),
summary AS (...)
SELECT
weekday_number, total,
total/sum(total) OVER()
FROM summary
ORDER BY weekday_number;
0 48182.22 0.147
1 49304 0.151
2 45156.5 0.138
3 45959.5 0.141
4 47528 0.145
5 42372.5 0.13
6 48415.5 0.148
Note that the expression total/sum(total) OVER() is confusing as the OVER() clause
seems a little uninvolved. You might prefer to write it as total/(sum(total) OVER ())
to make it clearer that it is, in fact, a single expression. We’ll leave that to your preference,
but it isn’t normally written that way.
You can finish off by giving the calculation an alias, displaying it as a percentage, and
sorting by weekday:
WITH
data AS (...),
summary AS (...)
SELECT
weekday_number, total,
100*total/sum(total) OVER() AS proportion
FROM summary
ORDER BY weekday_number;
If you want to display the percentage symbol, that’s up to the DBMS. You can try one
of the following:
-- PostgreSQL
to_char(100*total/sum(total) OVER(),'99.9%')
-- MariaDB/MySQL
format(100*total/sum(total) OVER(),2) || '%'
-- MSSQL
format(100*total/sum(total) OVER(),'0.0%')
-- SQLite: aka printf(...)
format('%.1f%%',100*total/sum(total) OVER())
-- Oracle
to_char(100*total/sum(total) OVER(),'99.9') || '%'
weekday_number total proportion
0 48182.22 14.7%
1 49304 15.1%
2 45156.5 13.8%
3 45959.5 14.1%
4 47528 14.5%
5 42372.5 13.0%
6 48415.5 14.8%
We’ve used OVER() to calculate the grand total for the table. However, we can also
use a sliding window, as we’ll see in the next section.
SELECT
id, givenname, familyname,
count(*) OVER (ORDER BY id) AS running_count
FROM customerdetails
ORDER BY id;
In this example, the id, being the primary key, is unique. That will give us a false idea
of how this works, so let’s look at using the height, which is not unique. We’ll also filter
out the NULL heights to make it more obvious:
SELECT
id, givenname, familyname,
height,
count(*) OVER (ORDER BY height) AS running_count
FROM customerdetails
WHERE height IS NOT NULL
ORDER BY height;
You’ll see some repeated heights and how they affect the window function:
~ 267 rows ~
When using ORDER BY in the OVER clause, it means count the number of rows up to
the current value. That may or may not be what you wanted.
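The ORDER BY here is shorthand for a full framing clause; the implied default is presumably:

```sql
count(*) OVER (ORDER BY height
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    AS running_count
```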
That’s quite a mouthful, but that’s the way the SQL language is developing: Why say
something in two words if you can say it in twenty1?
Here, the word RANGE refers to the value of height. For example, in the fifth row
earlier, the value is the same as the next row, so count(*) includes both.
The obvious alternative is
SELECT
    id, givenname, familyname,
    height,
    count(*) OVER (ORDER BY height
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        AS running_count
FROM customerdetails
WHERE height IS NOT NULL
ORDER BY height;
1 You’ll see this sort of thing in all of the newer features in SQL. You might say that SQL is the
new COBOL.
COBOL was (and still is) an early programming language which was supposed to appeal to less
mathematical business programmers. It is noted for its verbosity.
The subtle change is from RANGE BETWEEN to ROWS BETWEEN. It now counts the
number of rows up to the current row.
~ 267 rows ~
It’s a little bit unfair: two customers on the same height are arbitrarily positioned one
before the other. We’ll see more of this unfairness later.
The framing clause can take the following form:
ROWS|RANGE BETWEEN start AND end
As we saw, the difference between ROWS and RANGE is that RANGE includes all the rows
which match the current value, while ROWS doesn’t.
The start and end expressions, a.k.a. the frame borders, can take one of the
following forms:
Expression           Meaning
UNBOUNDED PRECEDING  From the first row of the partition
n PRECEDING          n rows before the current row
CURRENT ROW          The current row
n FOLLOWING          n rows after the current row
UNBOUNDED FOLLOWING  To the last row of the partition
There’s also a shorter form, which ends at the current row:
ROWS|RANGE start
CREATE VIEW daily_sales AS  -- view name assumed from the later queries
SELECT
    cast(ordered AS date) AS ordered_date,
    -- PostgreSQL, Oracle
to_char(ordered_date,'YYYY-MM') AS ordered_month,
-- MariaDB/MySQL
-- date_format(ordered_date,'%Y-%m')
AS ordered_month,
-- MSSQL
-- format(ordered_date,'yyyy-MM') AS ordered_month,
-- SQLite
-- strftime('%Y-%m',ordered_date) AS ordered_month,
sum(total) AS daily_total
FROM sales
WHERE ordered IS NOT NULL
GROUP BY ordered_date;
ordered_date ordered_month daily_total
2022-05-04 2022-05 43
2022-05-05 2022-05 150.5
2022-05-06 2022-05 110.5
2022-05-07 2022-05 142
2022-05-08 2022-05 214.5
2022-05-09 2022-05 16.5
~ 389 rows ~
A Sliding Window
Here’s an example of using a sliding window with the framing clause. Suppose we want
to generate the daily totals for each day and the week up to the day. We can use
SELECT
ordered_date, daily_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS 6 PRECEDING) AS week_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS UNBOUNDED PRECEDING) AS running_total
FROM daily_sales
ORDER BY ordered_date;
For both framing clauses, we’ve used the shorter form, since we want to go up to the
current row. We could have left off the framing clause altogether for the running total,
but we needed to change from the default RANGE BETWEEN just in case two daily totals
were the same.
ordered_date daily_total week_total running_total
2022-05-04 43 43 43
2022-05-05 150.5 193.5 193.5
2022-05-06 110.5 304 304
2022-05-07 142 446 446
2022-05-08 214.5 660.5 660.5
2022-05-09 16.5 677 677
2022-05-10 160 837 837
2022-05-11 115 909 952
2022-05-12 205 963.5 1157
2022-05-13 164.5 1017.5 1321.5
2022-05-14 46.5 922 1368
2022-05-15 457.5 1165 1825.5
~ 389 rows ~
Note that for the first seven days, the week and running totals are the same, because
there are no totals from before then. However, from there on, the running total keeps
accumulating while the week total is clamped to the current seven days.
If you look hard enough, you may also see some gaps in the dates. That means that
there were no sales on those days and can also mean trouble for interpreting what you
mean, since one row is not necessarily one day. We’ll address that problem in Chapter 9.
Remember, you’re not limited to the count() and sum() functions. For example, you
can create sliding averages as well:
SELECT
ordered_date, daily_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS 6 PRECEDING) AS week_total,
avg(daily_total) OVER(ORDER BY ordered_date
ROWS 6 PRECEDING) AS week_average,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS UNBOUNDED PRECEDING) AS running_total
FROM daily_sales
ORDER BY ordered_date;
The week average is the average over the seven days including the current day:
ordered_date  daily_total  week_total  week_average  running_total
2022-05-04 43 43 43 43
2022-05-05 150.5 193.5 96.75 193.5
2022-05-06 110.5 304 101.333 304
2022-05-07 142 446 111.5 446
2022-05-08 214.5 660.5 132.1 660.5
2022-05-09 16.5 677 112.833 677
2022-05-10 160 837 119.571 837
2022-05-11 115 909 129.857 952
2022-05-12 205 963.5 137.643 1157
2022-05-13 164.5 1017.5 145.357 1321.5
2022-05-14 46.5 922 131.714 1368
2022-05-15 457.5 1165 166.429 1825.5
~ 389 rows ~
You can also select sliding minimums and maximums or averages so far. You’ll have
to decide which of them is useful for your own purposes.
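For example, a sliding minimum and maximum over the same seven-day window might look like this (a sketch along the same lines as the queries above):

```sql
SELECT
    ordered_date, daily_total,
    min(daily_total) OVER(ORDER BY ordered_date
        ROWS 6 PRECEDING) AS week_min,
    max(daily_total) OVER(ORDER BY ordered_date
        ROWS 6 PRECEDING) AS week_max
FROM daily_sales
ORDER BY ordered_date;
```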
The default partition is the whole table. You can partition by anything that can
be grouped. For example, suppose you want to add monthly totals to the previous
examples. You can use
SELECT
ordered_date, daily_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS 6 PRECEDING) AS week_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS UNBOUNDED PRECEDING) AS running_total,
sum(daily_total) OVER(PARTITION BY ordered_month)
AS monthly_total
FROM daily_sales
ORDER BY ordered_date;
ordered_date  daily_total  week_total  running_total  monthly_total
2022-05-04 43 43 43 6966.5
2022-05-05 150.5 193.5 193.5 6966.5
2022-05-06 110.5 304 304 6966.5
2022-05-07 142 446 446 6966.5
2022-05-08 214.5 660.5 660.5 6966.5
2022-05-09 16.5 677 677 6966.5
2022-05-10 160 837 837 6966.5
2022-05-11 115 909 952 6966.5
2022-05-12 205 963.5 1157 6966.5
2022-05-13 164.5 1017.5 1321.5 6966.5
2022-05-14 46.5 922 1368 6966.5
2022-05-15 457.5 1165 1825.5 6966.5
~ 389 rows ~
To get a running total within each month, you can both partition by the month and
order by the date:
sum(daily_total) OVER(
PARTITION BY ordered_month
ORDER BY ordered_date ROWS UNBOUNDED PRECEDING
) AS month_running_total
SELECT
ordered_date, daily_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS UNBOUNDED PRECEDING) AS running_total,
sum(daily_total) OVER(PARTITION BY ordered_month)
AS month_total,
sum(daily_total) OVER(ORDER BY ordered_month)
AS running_month_total,
sum(daily_total) OVER(PARTITION BY ordered_month
ORDER BY ordered_date ROWS UNBOUNDED PRECEDING)
AS month_running_total
FROM daily_sales
ORDER BY ordered_date;
You’ll see something like this (the column names have been abbreviated to fit in
the page):
~ 389 rows ~
The names may be somewhat confusing, so here’s a table of what’s going on:
(Again, the column names have been abbreviated to make it all fit.)
Notice how we’re using the group column ordered_month both to partition and for
a running total. Because its default frame is RANGE ..., it will produce the total for all of
the values so far, which effectively is a total for the whole month. This is the sort of thing
you can expect if you order by a non-unique column.
The hardest part of it all is thinking of good names for the results.
-- customer_sales
SELECT c.id AS customerid, c.state, c.town, total
FROM customerdetails AS c JOIN sales AS s
ON c.id=s.customerid
We’ll then want to summarize the data by grouping by state, town, and customer id.
Again, that will go into another CTE:
-- totals
SELECT state, town, customerid, sum(total) AS total
FROM customer_sales
GROUP BY state, town, customerid
WITH
customer_sales AS (
SELECT c.id AS customerid, c.state, c.town, total
FROM customerdetails AS c JOIN sales AS s
ON c.id=s.customerid
),
totals AS (
SELECT state, town, customerid, sum(total) AS total
FROM customer_sales
GROUP BY state, town, customerid
)
~ 269 rows ~
Now for the window functions. First, to get the group total by state, we can use
sum(total) OVER(PARTITION BY state). To get the group total per town, remember
that the town name can appear in more than one state. To use PARTITION BY town
would be a mistake, as the town names would be conflated. Instead, we use
sum(total) OVER(PARTITION BY state, town), as in the following:
WITH
customer_sales AS (
SELECT c.id AS customerid, c.state, c.town, total
FROM customerdetails AS c JOIN sales AS s
ON c.id=s.customerid
),
totals AS (
SELECT state, town, customerid, sum(total) AS total
FROM customer_sales
GROUP BY state, town, customerid
)
SELECT
state, town, customerid, total AS customer_total,
sum(total) OVER(PARTITION BY state) AS state_total,
sum(total) OVER(PARTITION BY state, town) AS town_total
FROM totals
ORDER BY state, customerid;
~ 269 rows ~
There’s an implied hierarchy between a state and a town: a town is part of a state
(and, for the time being, a customer is in a town). As a result, the PARTITION BY clause
must follow the hierarchy: state,town. You can also use columns which are unrelated,
such as the state and year of birth, in which case the columns can go either way.
Ranking Functions
The window functions used so far are basically aggregate functions given a new context.
The other group of functions are specific to window functions. Generally, they relate to
the position of the current row. Broadly, we can call them ranking functions.
There is one aggregate window function, which we’ve already seen, which also acts
as a ranking function:
SELECT
id, givenname, familyname,
height,
count(*) OVER (ORDER BY height
ROWS UNBOUNDED PRECEDING) AS running_count
FROM customers
WHERE height IS NOT NULL
ORDER BY height;
As long as you use the framing clause ROWS UNBOUNDED PRECEDING (shortened from
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), the count(*) will count the
number of rows up to the current row, which is basically the row number in the result set.
There’s a simpler alternative to that:
SELECT
id, givenname, familyname,
height,
row_number() OVER (ORDER BY height) AS running_count
FROM customers
WHERE height IS NOT NULL
ORDER BY height;
The row_number() function basically generates just that: a number for each row in
the result set.
• rank(): If two values in the ORDER BY clause are the same, they will get the
same rank. The next different value will not get the next rank; it
will catch up with the row number.
• count(*): If you leave the framing clause out and let it default to
RANGE, it will behave like rank() with one difference. We’ll look at the
difference later.
If the partition isn’t specified (there is no PARTITION BY clause), then the preceding
functions apply to the whole table. Otherwise, they will give the position within
the group.
The difference between rank() and dense_rank() is that for equal values, rank()
will pick up from the next row_number(), while dense_rank() won’t.
If the ORDER BY value is not unique
• row_number() is arbitrary.
If the ORDER BY value is unique, these all give the same results.
We can test this with customer heights, where we know some heights are repeated:
SELECT
id, givenname, familyname,
height,
row_number() OVER (ORDER BY height) AS row_number,
count(*) OVER (ORDER BY height) AS count,
rank() OVER (ORDER BY height) AS rank,
dense_rank() OVER (ORDER BY height) AS dense_rank
FROM customers
WHERE height IS NOT NULL
ORDER BY height;
id  …  height  row_number  count  rank  dense_rank
597 … 153 1 2 1 1
283 … 153 2 2 1 1
451 … 153.8 3 3 3 2
194 … 154.3 4 4 4 3
534 … 156.4 5 6 5 4
352 … 156.4 6 6 5 4
~ 267 rows ~
Your actual results may, of course, be different. However, in the preceding example,
we can see
• The rank() is the same for equal values. The next value matches the
row_number().
• The count(*) is also the same for equal values. The next value also
matches the row_number().
• The rank() is the same as the first row_number() for equal values; the
count(*) is the same as the last row_number() for equal values.
• The dense_rank() is also the same for equal values. The next value
gets the next rank. By the time you get to the end of the result set, it
will be very different to the row number.
With most DBMSs, the ranking functions all require an ORDER BY window clause.
That makes sense, since ranking is meaningless without order.
The exceptions include PostgreSQL and SQLite, which will allow an empty
window clause:
-- PostgreSQL, SQLite
SELECT
id, givenname, familyname,
height,
row_number() OVER () AS row_number,
count(*) OVER () AS count,
rank() OVER () AS rank,
dense_rank() OVER () AS dense_rank
FROM customers
WHERE height IS NOT NULL;
However, the results are meaningless. The count(*), rank(), and dense_rank()
expressions all give one value for the whole result set, and the row_number() gives row
numbers in an arbitrary order.
SELECT
id, ordered_date, total,
row_number() OVER (PARTITION BY ordered_date) AS row_number
FROM sales
ORDER BY ordered;
id  ordered_date  total  row_number
1 2022-05-04 43 1
2 2022-05-05 54.5 1
3 2022-05-05 96 2
6 2022-05-06 18 2
7 2022-05-06 92.5 1
4 2022-05-07 17.5 1
~ 5295 rows ~
The row numbers may not be in the expected order, since the order wasn’t specified.
To finish the job, we should also include that:
SELECT
id, ordered_date, total,
row_number() OVER (
PARTITION BY ordered_date ORDER BY ordered
) AS row_number
FROM sales
ORDER BY ordered;
id  ordered_date  total  row_number
1 2022-05-04 43 1
2 2022-05-05 54.5 1
3 2022-05-05 96 2
6 2022-05-06 18 1
7 2022-05-06 92.5 2
4 2022-05-07 17.5 1
~ 5295 rows ~
You can use the group row number in a creative way. For example, you might want to
show the date for only the first sale for the day. You can show the date selectively using a
CASE ... END expression:
CASE
WHEN row_number() OVER
(PARTITION BY ordered_date ORDER BY ordered)=1
THEN CAST(ordered_date AS varchar(16))
ELSE ''
END AS ordered_date,
SELECT
id,
CASE
WHEN row_number() OVER
(PARTITION BY ordered_date ORDER BY ordered)=1
THEN CAST(ordered_date AS varchar(16))
ELSE ''
END AS ordered_date,
row_number() OVER
(PARTITION BY ordered_date ORDER BY ordered)
AS row_number,
total
FROM sales
ORDER BY ordered;

id  ordered_date  row_number  total
1 2022-05-04 1 43
2 2022-05-05 1 54.5
3 2 96
6 2022-05-06 1 18
7 2 92.5
4 2022-05-07 1 17.5
5 2 63
9 3 61.5
10 2022-05-08 1 67.5
11 2 18.5
8 3 54
13 4 74.5
~ 5295 rows ~
Paging Results
One reason why you might want the overall row number is that you might want to break
up your results into pages. For example, suppose you want your results in pages of, say,
twenty, and you now want to display page 3 of that.
We can start with our pricelist view and include the row_number() window function:
SELECT
id, title, published, author,
price, tax, inc,
row_number() OVER(ORDER BY id) AS row_number
FROM aupricelist;
We haven’t yet included an ORDER BY clause, because there’s more to come. Some
DBMSs may decide to produce the results in id order, but that’s not guaranteed,
of course.
We can now put this in a CTE and filter on the row number:
WITH cte AS (
SELECT
id, title, published, author,
price, tax, inc,
row_number() OVER(ORDER BY id) AS row_number
FROM aupricelist
)
SELECT *
FROM cte
WHERE row_number BETWEEN 40 AND 59
ORDER BY id;
~ 20 rows ~
Oracle has a built-in value called rownum. Sadly, you still need to use it from a CTE
or a subquery.
Of course, you don’t have to order by the id. You can use the title, or the price, as long
as you include it in both the window function and in the ORDER BY clause. And, of course,
you can also use DESC.
There is an alternative way to do this. Officially, you can use the OFFSET ...
FETCH ... clause:
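A sketch of what that might look like with the pricelist view follows; the OFFSET ... FETCH clause goes after the ORDER BY clause (and MSSQL requires an ORDER BY for it):

```sql
SELECT id, title, published, author, price, tax, inc
FROM aupricelist
ORDER BY id
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY;
```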
This skips over the first 40 rows and fetches the next 20 rows after that.
Unofficially, some DBMSs support LIMIT ... OFFSET:
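For those DBMSs (PostgreSQL, MariaDB/MySQL, and SQLite), a sketch might be:

```sql
-- PostgreSQL, MariaDB/MySQL, SQLite
SELECT id, title, published, author, price, tax, inc
FROM aupricelist
ORDER BY id
LIMIT 20 OFFSET 40;
```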
MSSQL also supports the simple SELECT TOP syntax, but it’s not so flexible.
Of course, these two alternatives are much simpler than using the window function
technique, but there is an advantage with using the window function.
Suppose you’re sorting by something non-unique, such as the price. The problem
with the normal paging techniques, including the row_number() earlier, is that the page
stops strictly at the number of rows (or less if there are no more).
If you decide to keep the prices together, you can instead use something like
WITH cte AS (
SELECT
id, title, published, author,
price, tax, inc,
rank() OVER(ORDER BY price) AS rank
FROM aupricelist
)
SELECT *
FROM cte
WHERE rank BETWEEN 40 AND 59
ORDER BY price;
As long as the groupings aren’t too big, it should give you nearly the same results, but
with all the books of one price together.
Another ranking function is ntile(), which distributes the rows as evenly as possible
into a given number of groups. For example, you can distribute the customer heights
into ten groups, or deciles:
SELECT
id, givenname, familyname, height,
ntile(10) OVER (order by height) AS decile
FROM customers
WHERE height IS NOT NULL;
~ 267 rows ~
Notice that we’ve filtered out the NULL heights. If we hadn’t, the first or last
decile or so would be filled with NULL heights, depending on your DBMS. This creates a
group that doesn’t really belong, but is included anyway.
That’s just one trap with ntile(). There are two more, one of which might be a deal
breaker.
First, note that the preceding result has 267 rows, which doesn’t evenly divide by 10.
That’s OK, but SQL has to work this one out, and you’ll find that the first seven groups
will have 27 rows, and the rest 26. Of course, your own results may be different, but the
idea is the same: the remainder rows will fill in from the front.
The second trap might take some hunting and may not be apparent in your own
sample database. If you look hard enough, you may find something like this:
…
388 Ron Delay 166.9 3
546 Pat Ella 167.1 3
106 Jay Walker 167.1 3
77 Lyn Seed 167.1 4
403 Will Knott 167.3 4
314 Jack Potts 167.4 4
In this sample, you’ll see that three customers have the same height (167.1), but one
of them didn’t fit in the earlier decile, so was pushed into the next. That’s more of the
unfairness mentioned earlier, and is due to the fact that ntiles are calculated purely on
the row number, regardless of the value.
If you were, for example, awarding prizes or discounts to customers in certain
deciles, it would be unfair to miss out just because the sort order is unpredictable.
This might be a deal breaker if you rely on the ntile. There is, however, a
workaround.
We’ll call this value bin, which is a common statistical name for groups.
We can put that into a CTE and run the following:
AS count_vigintile,
bin
FROM customers, data
WHERE height IS NOT NULL
ORDER BY height;
SQLite doesn’t have a floor() function, but you can use cast(... AS int) instead:
~ 267 rows ~
Note that the vigintile and row_vigintile values should be the same; the
row_vigintile is there to show how the vigintile was calculated from the row number.
More importantly, you’ll see that the rank_vigintile and count_vigintile
columns are calculated from the rank() and count(*) values, and they always put the
rows with the same height in the same group. It’s up to you to decide which is preferable.
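As a sketch of the idea, here is a rank()-based bin, assuming twenty groups (vigintiles) and a data CTE holding the number of rows; the column names are illustrative, and in SQLite you would use cast(... AS int) in place of floor():

```sql
WITH data AS (
    SELECT count(*) AS n FROM customers WHERE height IS NOT NULL
)
SELECT
    id, givenname, familyname, height,
    ntile(20) OVER (ORDER BY height) AS vigintile,
    -- equal heights share a rank, so they always land in the same bin
    floor((rank() OVER (ORDER BY height)-1)*20.0/n)+1
        AS rank_vigintile
FROM customers, data
WHERE height IS NOT NULL
ORDER BY height;
```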
Here, as well as the OVER clause, we need to supply two values. The column value
refers to which data in the other row you want. The number value refers to how many
rows back or forward to get it from. If you want, you can leave it out, in which case it will
default to 1.
For example, suppose you want to look at sales for each day, as well as for the
previous and next days. You can write
SELECT
ordered_date, daily_total,
lag(daily_total) OVER (ORDER BY ordered_date)
AS previous,
lead(daily_total) OVER (ORDER BY ordered_date)
AS next
FROM daily_sales
ORDER BY ordered_date;
You’ll see:
~ 388 rows ~
You’ll notice that the previous for the first row is NULL; so is the next for the last row.
You might think that’s a bit pointless if you can just move your eyes to look up
or down a row. However, you can also incorporate the lag or lead in a calculation.
For example, suppose you want to compare sales for each day to a week before. You
could use
SELECT
ordered_date, daily_total,
lag(daily_total,7) OVER (ORDER BY ordered_date)
AS last_week,
daily_total
- lag(daily_total,7) OVER (ORDER BY ordered_date)
AS difference
FROM daily_sales
ORDER BY ordered_date;
This results in
~ 388 rows ~
Here, the expression lag(daily_total,7) gets the value for seven rows before. As you’d
expect, the first seven rows have NULL for the value.
There are two important conditions if you want to use lag or lead meaningfully:
• There must be only one row for each instance you want to test. For
example, you can’t have two rows with the same date.
• There should be no missing rows; for example, a missing date will
throw the offsets out.
That’s because we’re interpreting each row as one day. If you’re just working with a
sequence of sales regardless of the date, it won’t matter.
If you look carefully (and patiently) through the data, you will find that there are
a few missing dates. That means that the previous row isn’t always “yesterday,” and
the seven rows previous isn’t always “last week.” We’ll see how to plug these gaps in
Chapter 9.
Summary
Window functions are functions which give a row-by-row value based on a “window” or
a group of rows.
Window functions include
• Aggregate functions, such as sum() and avg()
• Ranking functions, such as row_number(), rank(), dense_rank(), and ntile()
• Offset functions, such as lag() and lead()
Window Clauses
A window function features an OVER() clause:
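In general form, it looks something like this sketch, where the square brackets mark the optional parts:

```sql
function(...) OVER (
    [PARTITION BY groupings]
    [ORDER BY orderings [frame clause]]
)
```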
Coming Up
In Chapter 7, we’ve already discussed how Common Table Expressions work. In fact,
we’ve used them pretty extensively throughout the book.
In the next chapter, we’ll have another look at CTEs and examine some of their more
sophisticated features. In particular, we’ll have a look at the dreaded recursive CTE.
CHAPTER 9
CTEs As Variables
In Chapter 4, we tested some calculations with a test value:
WITH vars AS (
SELECT ' abcdefghijklmnop ' AS string
-- FROM dual -- Oracle
)
SELECT
string,
-- sample string functions
FROM vars;
Later in this chapter, we’ll see a more sophisticated version of this technique when
we look at table literals. For now, let’s look at how we can use this.
Some DBMSs, as well as all programming languages, have a concept of variables.
A variable is a temporary named value. Where the DBMS supports it, you declare a
variable name and assign a value which you use in a subsequent step. For example, in
MSSQL, you can write this:
-- MSSQL
DECLARE @taxrate decimal(4,2);
SET @taxrate = 0.10;
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1_9
Chapter 9 More on Common Table Expressions
To run this, you would need to highlight all of the statements and run in one go.
This chapter won’t focus on these variables, but you’ll see more on using variables
in Chapter 10. Instead, we’ll have a look at using a common table expression to do a
similar job.
Strictly speaking, what we’re going to use is not variables but constants, which
means that we will set their value once only. However, we can get away with using the
looser term “variable,” as it’s more generic.
There are two main benefits to defining variables:
• You can specify an arbitrary value once, but use it multiple times.
• You can derive a value from another query and use it later.
In the preceding CTE example, where we’re not working with real data, we simply
selected from the CTE itself. In more realistic examples, we will cross join the CTE with
other tables.
WITH vars AS (
SELECT 0.1 AS taxrate
-- FROM dual -- Oracle
)
We can now combine the CTE with the books table, using a simple cross join:
WITH vars AS (
SELECT 0.1 AS taxrate
-- FROM dual -- Oracle
)
SELECT * FROM books, vars;
~ 1201 rows ~
A cross join combines every row from one table to every row from another. Since
the vars CTE only has one row, the cross join simply has the effect of adding another
column to the books table.
SQL has a more modern syntax for a cross join: books CROSS JOIN vars. Here, we’ll
use the older syntax because it’s simpler and more readable.
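For comparison, the same query with the modern syntax would be:

```sql
WITH vars AS (
    SELECT 0.1 AS taxrate
    -- FROM dual -- Oracle
)
SELECT * FROM books CROSS JOIN vars;
```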
We can now calculate the price list with tax:
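A sketch of what that query might look like follows; the exact output columns here are an assumption, based on the pricelist columns used later in the chapter:

```sql
WITH vars AS (
    SELECT 0.1 AS taxrate
    -- FROM dual -- Oracle
)
SELECT
    id, title, price,
    price*taxrate AS tax,
    price*(1+taxrate) AS inc
FROM books, vars;
```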
This gives us
~ 1201 rows ~
Of course, we could just as readily have used 0.1 instead of the taxrate and
dispensed with the CTE and the cross join. However, the CTE has the benefit of allowing
us to set the tax rate once at the beginning, where it’s easy to maintain and can be used
multiple times later.
Deriving Constants
The values don’t need to be literal values. You can also derive the values from another
query. For example, to get the oldest and youngest customers, first set the minimum and
maximum dates in variables:
-- vars CTE
SELECT min(dob) AS oldest, max(dob) AS youngest
FROM customers
You can then cross join that with the customers table to get the matching customers:
WITH vars AS (
SELECT min(dob) AS oldest, max(dob) AS youngest
FROM customers
)
SELECT *
FROM customers, vars
WHERE dob IN(oldest, youngest);
To get the shorter customers, you can set the average height in a variable:
This is the sort of thing you can’t do otherwise, because the average is an aggregate.
Customerid last_order
~ 269 rows ~
Here, we have two important pieces of data: the customer id and the date and
time of the most recent order. Using this in a subquery, we can join the results with the
customers and sales tables to get more details:
~ 266 rows ~
Note that the CTE was used to join the two tables and act as a filter. We don’t actually
need its results in the output.
-- cte
SELECT familyname, givenname FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
Here, customers are grouped by both names, and the groups are filtered for more
than one instance.
Putting that in a CTE, we can join that to the customers table:
WITH names AS (
SELECT familyname, givenname FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
)
SELECT
c.id, c.givenname, c.familyname,
c.email, c.phone
-- etc
FROM customers AS c
JOIN names ON c.givenname=names.givenname
AND c.familyname=names.familyname
ORDER BY c.familyname, c.givenname;
~ 16 rows ~
We’ve joined the CTE and the customers table using two columns and included their
email addresses and phone numbers (if any) so that we can chase them up.
SELECT *
FROM customers, vars
WHERE dob IN(oldest, youngest);
For the most part, it’s a matter of taste whether you do it this way or add the aliases
inside the CTE. If you do include the names, they will override any aliases in the CTE.
One reason you might prefer CTE parameter names is if you think it’s more readable,
as you have all the names in one place. Later, we’ll be writing more complex CTEs
which involve multiple CTEs and unions, and it will definitely be easier to follow with
parameter names, so you’ll be seeing more of that style from here on.
As you know, a subquery can be used in the FROM clause, where it’s written inline:
SELECT columns
FROM (
SELECT columns FROM table
) AS sq;
A CTE can make this more manageable by putting this subquery at the beginning:
WITH cte AS (
SELECT columns FROM table
)
SELECT columns
FROM cte;
That’s already an improvement, but where the improvement becomes more obvious
is when the subquery also has a subquery:
SELECT columns
FROM (
SELECT columns FROM (
SELECT columns FROM table
) AS sq1
) AS sq2;
That’s called nesting subqueries, and it can become a nightmare if things get too
complex.
Thankfully, CTEs work much more simply:
WITH
sq1 AS (SELECT columns FROM table),
sq2 AS (SELECT columns FROM sq1)
SELECT columns FROM sq2;
You can have multiple CTEs chained this way, as long as you remember to separate
them with a comma. As you see in this example, each subquery can refer to a previous
one in the chain.
We’ll build this up a little more later, and we’ll see that additional CTEs don’t
necessarily have to refer to the previous ones.
WITH names AS (
SELECT familyname, givenname FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
)
SELECT
c.id, c.givenname, c.familyname,
c.email, c.phone
FROM customers AS c
JOIN names ON c.givenname=names.givenname
AND c.familyname=names.familyname
ORDER BY c.familyname, c.givenname;
WITH
names AS (
SELECT familyname, givenname FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
),
duplicates(givenname, familyname, info) AS (
SELECT
c.givenname, c.familyname,
cast(c.id AS varchar(5)) || ': ' || c.email
-- MSSQL: Use +
FROM customers AS c -- Oracle: No AS
JOIN names ON c.givenname=names.givenname
AND c.familyname=names.familyname
)
SELECT * FROM duplicates
ORDER BY familyname, givenname;
~ 16 rows ~
The next step is to consolidate them by combining the info column values:
WITH
names AS ( ),
duplicates(givenname, familyname, info) AS ( )
SELECT
givenname, familyname, count(*),
-- PostgreSQL, MSSQL
string_agg(info,', ') AS info
-- MySQL/MariaDB
-- group_concat(info SEPARATOR ', ') AS info
-- SQLite
-- group_concat(info,', ') AS info
-- Oracle
-- listagg(info,', ') AS info
FROM duplicates
GROUP BY familyname, givenname
ORDER BY familyname, givenname;
Recursive CTEs
As you’ve seen, a feature of using CTEs is that one CTE can refer to a previous
CTE. Another feature is that a CTE can refer to itself.
Anything which refers to itself is said to be recursive. If you’re a programmer,
recursive functions are functions which call themselves and are very risky if not handled
properly. Similarly, a recursive CTE can be very risky if you’re not careful.
A recursive CTE takes one of two forms, depending on your DBMS:
-- PostgreSQL, MariaDB/MySQL, SQLite
WITH RECURSIVE cte AS (
-- Anchor Member
SELECT ...
UNION
-- Recursive Member
SELECT ... FROM cte WHERE ...
)

-- MSSQL, Oracle
WITH cte AS (
-- Anchor Member
SELECT ...
UNION ALL
-- Recursive Member
SELECT ... FROM cte WHERE ...
)
As you see, PostgreSQL, MariaDB/MySQL, and SQLite use the RECURSIVE keyword.
MSSQL and Oracle don’t, but require a UNION ALL instead of a simple UNION.
In both cases, you’ll see that the recursive CTE has two parts:
• The anchor defines the starting point or the first member.
In simple cases, there will be one value, but in other queries there
may be more than one.
• The recursive member defines the next value, derived from the
values so far.
Again, if there’s more than one anchor member, then there will be
multiple recursive members.
Note that the recursive CTE must define when it’s going to end or, more correctly,
when it can continue. Typically, that’s with a WHERE clause, as you’ve seen earlier, but can
use any other method, such as a join.
A simple example of a recursive CTE is one which generates a simple sequence. For
example:
WITH cte(n) AS (  -- WITH RECURSIVE in PostgreSQL, MariaDB/MySQL, SQLite
-- Anchor Member
SELECT 1
UNION ALL
-- Recursive Member
SELECT n+1 FROM cte WHERE n<10
)
SELECT * FROM cte;
The CTE includes a parameter for convenience (cte(n)). Otherwise, you can put the
alias in the SELECT statement.
The single anchor value, in this case, is the number 1. The recursive (next) value is
n+1, so long as n<10. After that, it stops, and you end up with
1
2
3
...
8
9
10
A recursive CTE is typically used to:
• Generate a sequence
• Traverse a hierarchy
We’ll also use a recursive CTE to split a string into smaller parts, just to show you a
little creativity can be added to your queries.
1. Some SQLs, but not all, include additional structures such as DO ... WHILE in an SQL script.
They’re not really a standard part of the SQL language, but can be used in situations where you’re
desperate to do something iteratively.
Generating a Sequence
We’ve already seen how to generate a sequence of numbers:
WITH cte AS (
-- Anchor
SELECT 0 AS n
UNION ALL
-- Recursive
SELECT n+1 FROM cte WHERE n<100
)
SELECT * FROM cte;
The thing to remember is that the recursive member has a WHERE clause to limit the
sequence. Without that, the recursive query would try to run forever, and as you know,
nothing lasts forever.
MSSQL has a built-in safety limit of 100 recursions, which we’ll have to
circumvent later:
-- MSSQL
WITH cte AS (
)
SELECT ... FROM cte OPTION(MAXRECURSION ...);
The others don’t, but for PostgreSQL, MariaDB, and MySQL, you can readily set a
time limit:
-- PostgreSQL
SET statement_timeout TO '5s';
-- MariaDB
SET MAX_STATEMENT_TIME=1; -- seconds
-- MySQL
SET MAX_EXECUTION_TIME=1000; -- milliseconds
If you’re sure about your recursion terminating properly, you don’t need to worry
about this. In MSSQL, you will, however, need to increase or disable the recursion limit
for some queries.
However, it won’t hurt to include a simple number sequence in what follows just to
be safe.
One case where a sequence can be useful is to get a sequence of dates. This will
simply define a start date and add one day in the recursive member.
The CTE starts simply enough:
Note that the first value, d, has been cast to a date, with the exception of SQLite,
which doesn’t have a date type. The n set to 1 is added as a sequence number, but is
really unnecessary. It’s added here to illustrate how you can use it to stop overrunning
your CTE.
The recursive part is also easy enough, but adding one day varies between DBMSs:
-- PostgreSQL
WITH RECURSIVE dates(d, n) AS (
SELECT date'2023-01-01', 1
UNION
SELECT d+1, n+1 FROM dates
WHERE d<'2023-05-01' AND n<10000
)
SELECT * FROM dates;
-- MariaDB / MySQL
WITH RECURSIVE dates(d, n) AS (
SELECT date'2023-01-01', 1
UNION
SELECT date_add(d, interval 1 day), n+1 FROM dates
WHERE d<'2023-05-01' AND n<10000
)
SELECT * FROM dates;
-- MSSQL
WITH dates(d, n) AS (
SELECT cast('2023-01-01' as date), 1
UNION ALL
SELECT dateadd(day,1,d), n+1 FROM dates
WHERE d<'2023-05-01' AND n<10000
)
SELECT * FROM dates OPTION (MAXRECURSION 0);
-- SQLite
WITH RECURSIVE dates(d, n) AS (
SELECT '2023-01-01', 1
UNION
SELECT strftime('%Y-%m-%d',d,'+1 day'), n+1 FROM dates
WHERE d<'2023-05-01' AND n<10000
)
SELECT * FROM dates;
-- Oracle
WITH dates(d, n) AS (
SELECT date '2023-01-01', 1 FROM dual
UNION ALL
SELECT d+1, n+1 FROM dates
WHERE d<date'2023-05-01' AND n<10000
)
SELECT * FROM dates;
D n
2023-01-01 1
2023-01-02 2
2023-01-03 3
2023-01-04 4
2023-01-05 5
2023-01-06 6
~ 121 rows ~
You’ll notice that for MSSQL, we’ve added OPTION (MAXRECURSION 0), which
basically disables the recursion limit.
Note also the AND n<10000 in the WHERE clause. That number is pretty big, and it
amounts to over 27 years, but it’s not infinite. If you make an error in when to stop the
CTE, that expression should limit the recursions.
You might wonder why you would want a sequence of dates between 2023-01-01
and 2023-05-01. The answer would be “why not?”, which isn’t very convincing. However,
we’re going to use this technique to overcome a problem mentioned in Chapter 8: some
of the dates will be missing from our summary.
-- MSSQL, Oracle
WITH
allyears(year) AS (
SELECT 1940
UNION ALL
SELECT year+1 FROM allyears WHERE year<2010
)
Next, get the customer (id) and the year of birth of the customers:
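A sketch of how that might continue, assuming a second CTE which extracts the year of birth; the CTE name customer_years is illustrative, and extract() is the PostgreSQL form — the year extraction function varies between DBMSs:

```sql
WITH
    allyears(year) AS ( ),
    customer_years AS (
        SELECT id, extract(year FROM dob) AS year
        FROM customers
    )
SELECT allyears.year, count(customer_years.id) AS nums
FROM allyears LEFT JOIN customer_years
    ON allyears.year=customer_years.year
GROUP BY allyears.year
ORDER BY allyears.year;
```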
You’ll need the LEFT JOIN to include all of the sequence of years even if it doesn’t
match a customer year; after all, that’s why it’s there.
year nums
1940 1
1941 1
1942 1
1943 1
1944 1
1945 1
~ 71 rows ~
SELECT *
FROM daily_sales
ORDER BY ordered_date;
However, if you look hard enough, you’ll find some dates missing. We’re about to fill
them in.
For this, we’ll need the following:
• A CTE with the first and last dates of the daily sales
• A sequence of dates
You already know how to generate a sequence of dates. This time, instead of starting
and stopping on arbitrary dates, we’ll start and stop on the first and last dates of the
daily_sales view. We can put those values in a CTE for reference:
WITH
vars(first_date, last_date) AS (
SELECT min(ordered_date), max(ordered_date)
FROM daily_sales
)
-- PostgreSQL
WITH RECURSIVE
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION
SELECT d+1 FROM vars, dates WHERE d<last_date
)
-- MariaDB / MySQL
WITH RECURSIVE
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION
SELECT date_add(d, interval 1 day)
FROM vars, dates WHERE d<last_date
)
-- MSSQL
WITH
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION ALL
SELECT dateadd(day,1,d)
FROM vars, dates WHERE d<last_date
)
-- SQLite
WITH RECURSIVE
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION
SELECT strftime('%Y-%m-%d',d,'+1 day')
FROM vars, dates WHERE d<last_date
)
-- Oracle
WITH
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION ALL
SELECT d+1 FROM vars, dates WHERE d<last_date
)
For those DBMSs which use the keyword RECURSIVE, you use it once at the
beginning, even if some of the CTEs aren’t recursive.
Notice that we’ve cross-joined the vars and dates, which is the usual technique of
applying variables to another table. We could have written CROSS JOIN, but it’s not worth
the effort.
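As a sanity check, here's the SQLite variant run end to end through Python's sqlite3 module. Since we don't have the daily_sales view here, the vars CTE is filled with two made-up dates standing in for min(ordered_date) and max(ordered_date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE
    vars(first_date, last_date) AS (
        SELECT '2022-04-08', '2022-04-14'   -- stand-ins for the view's min/max
    ),
    dates(d) AS (
        SELECT first_date FROM vars
        UNION
        SELECT strftime('%Y-%m-%d', d, '+1 day')
        FROM vars, dates WHERE d < last_date
    )
    SELECT d FROM dates ORDER BY d
""").fetchall()

print([r[0] for r in rows])   # seven consecutive dates
```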
We can now complete our query, using a LEFT JOIN to include every date in the sequence:
ordered_date daily_total
2022-04-08 97.5
2022-04-09 96
2022-04-10 191
2022-04-11 201.5
2022-04-12 91
2022-04-13 160
~ 387 rows ~
Traversing a Hierarchy
Another use case for a recursive CTE is to traverse a hierarchy. The hierarchy we’re going
to look at is in the employees table:
Of course, in a real employees table, there would be more details; we’ve only
included enough here to make the point.
In particular, you’ll see that in the employees table, there is a supervisorid column
which is a foreign key to the same table:
employees.supervisorid ➤ employees.id
A more naive approach would be either to include the supervisor’s name, which
is wrong for the same reasons we don’t include the author’s name with the books
table, or to reference another table of supervisors, which is wrong for a different, more
subtle reason.
With books and authors, the point is that an author is not the same as a book. In a
well-designed database, each table has only one type of member. That’s not the case
with employees and supervisors. Put simply, the supervisor is another employee.
We’re going to traverse the employees table to get a list of employees and their
supervisors.
SELECT
e.id AS eid,
e.givenname, e.familyname,
s.id AS sid,
s.givenname||' '||s.familyname AS supervisor
-- s.givenname+' '+s.familyname AS supervisor -- MSSQL
FROM employees AS e LEFT JOIN employees AS s
ON e.supervisorid=s.id -- Oracle: No AS
ORDER BY e.id;
~ 34 rows ~
The trick is, when joining a table to itself, you need to give the table two different
aliases to qualify the join.
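Here's the self-join in miniature, run in SQLite through Python, with a three-row employees table invented for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        givenname TEXT, familyname TEXT,
        supervisorid INTEGER REFERENCES employees(id)
    );
    INSERT INTO employees VALUES
        (1, 'Ada', 'Boss',   NULL),
        (2, 'Ben', 'Worker', 1),
        (3, 'Cat', 'Helper', 2);
""")
rows = conn.execute("""
    SELECT
        e.id, e.givenname,
        s.givenname || ' ' || s.familyname AS supervisor
    FROM employees AS e LEFT JOIN employees AS s
        ON e.supervisorid = s.id
    ORDER BY e.id
""").fetchall()

# Ada has no supervisor, so the LEFT JOIN leaves hers as NULL (None).
print(rows)
```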
The columns include some of the raw details, as well as a string of supervisors.
Obviously, for the anchor member, the supervisor id will be NULL, and the string is
empty. You’ll also notice a sequence number, starting at 1. That’s for a trick we’ll resort to
later on.
There will be more than one row for the anchor. That’s all right and will still work the
same way. There’ll just be more than one sequence going.
The recursive member will be the employees with supervisors (i.e., the rest) with a
growing list of their supervisors:
The join is similar to the self-join earlier. The current employee is referred to in the
e table alias, and this aliased table is joined to the CTE, which will be the supervisor.
The raw data will be from the aliased table, while the supervisor’s details will be
concatenated as the new supervisors parameter.
Normally, you’d want to limit the recursion with a WHERE clause. For this one, the join
will do the job, as it will stop when there are no more to be joined.
The magic is in the expression for the supervisors string. In the recursive member,
the CTE represents inherited values.
~ 34 rows ~
This will work in most DBMSs, but not yet in MSSQL or in MariaDB/
MySQL. However, it will nearly work.
In the case of MariaDB/MySQL, the '' in the anchor causes it to jump to the
conclusion that the string will be zero characters long, so the supervisors column will
be empty.
You will need to cast your empty string in the anchor to a longer one:
SELECT
..., cast('' AS char(255)), 1
FROM employees WHERE supervisorid IS NULL
MSSQL has the same sort of problem, and a similar solution: cast the string in
both the anchor and the recursive member:
SELECT
..., cast('' AS nvarchar(255)), ...
FROM employees WHERE supervisorid IS NULL
UNION ALL
SELECT
...,
cast(cte.givenname+' '+cte.familyname
+' < '+cte.supervisors as nvarchar(255)), ...
FROM cte JOIN employees AS e ON cte.id=e.supervisorid
-- Others
cte.givenname||' '||cte.familyname
|| CASE WHEN n>1 THEN ' < ' ELSE '' END
|| cte.supervisors
-- MSSQL
cte.givenname+' '+cte.familyname
+ CASE WHEN n>1 THEN ' < ' ELSE '' END
+ cte.supervisors
~ 34 rows ~
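Putting the pieces together, here's a complete run of the hierarchy traversal in SQLite (via Python), using the same tiny made-up employees table as before:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        givenname TEXT, familyname TEXT, supervisorid INTEGER
    );
    INSERT INTO employees VALUES
        (1, 'Ada', 'Boss',   NULL),
        (2, 'Ben', 'Worker', 1),
        (3, 'Cat', 'Helper', 2);
""")
rows = conn.execute("""
    WITH RECURSIVE cte(id, givenname, familyname, supervisors, n) AS (
        -- anchor: the employees with no supervisor
        SELECT id, givenname, familyname, '', 1
        FROM employees WHERE supervisorid IS NULL
        UNION ALL
        -- recursive member: prepend the supervisor to the inherited chain
        SELECT e.id, e.givenname, e.familyname,
               cte.givenname || ' ' || cte.familyname
                   || CASE WHEN n > 1 THEN ' < ' ELSE '' END
                   || cte.supervisors,
               n + 1
        FROM cte JOIN employees AS e ON cte.id = e.supervisorid
    )
    SELECT id, givenname, supervisors FROM cte ORDER BY id
""").fetchall()

print(rows)
```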
An INSERT statement inserts from a virtual table, generated by the VALUES clause. That also means that, in
principle, you should be able to use VALUES ... as a virtual table without actually
inserting anything. Unfortunately, it’s not quite so straightforward.
A table literal is an expression which results in a collection of rows and columns—a
virtual table. If things go according to plan, it could look like this:
Not all DBMSs see it that way. Some DBMSs do allow just such an expression, but
others have something a little more complicated.
A little later, we’ll want to work with a virtual table to experiment with, so the first
step will be to put this into a CTE. Using the standard notation, you can use
id value
a apple
b banana
c cherry
Note that we’ve included the column names as part of the CTE definition.
For the other DBMSs, there are various alternatives:
-- MSSQL
WITH cte(id,value) AS (
SELECT * FROM
(VALUES ('a','apple'), ('b','banana'),
('c','cherry')) AS sq(a,b)
)
SELECT * FROM cte;
-- MySQL (not MariaDB)
WITH cte(id,value) AS (
VALUES ROW('a','apple'), ROW('b','banana'),
ROW('c','cherry')
)
SELECT * FROM cte;
-- Oracle
WITH cte(id,value) AS (
SELECT 'a','apple' FROM dual
UNION ALL SELECT 'b','banana' FROM dual
UNION ALL SELECT 'c','cherry' FROM dual
)
SELECT * FROM cte;
As you see, the prize for the most awkward version goes to Oracle, which doesn’t yet
support a proper table literal. Apparently, that’s coming soon.
MSSQL does support a table literal, but, for some unknown reason, it has to be inside
a subquery, complete with a dummy subquery name and dummy column names.
MySQL also supports a table literal, but requires each row inside a ROW() constructor,
because MySQL has a non-standard values() function which conflicts with using it simply
as a table literal. This is one of the cases where MariaDB and MySQL are not the same.
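SQLite, like PostgreSQL and MariaDB, takes the plain form. Run through Python as a sandbox, the table literal CTE comes out as you'd hope:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH cte(id, value) AS (
        VALUES ('a','apple'), ('b','banana'), ('c','cherry')
    )
    SELECT * FROM cte
""").fetchall()

print(sorted(rows))
```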
WITH dates(dob,today) AS (
-- list of dob and today values
)
SELECT
-- today - dob AS age
FROM dates;
The actual code is commented out, because the DBMSs all have their own ways.
It gets further complicated because of the date literals.
We’re going to try this with the following series of dates:
dob today
1940-07-07 2023-01-01
1943-02-25 2023-01-01
1942-06-18 2023-01-01
1940-10-09 2023-01-01
1940-07-07 2022-12-31
1943-02-25 2022-12-31
1942-06-18 2022-12-31
1940-10-09 2022-12-31
1940-07-07 2023-07-07
1943-02-25 2023-02-25
1942-06-18 2023-06-18
1940-10-09 2023-10-09
-- Oracle
WITH dates(dob, today) AS (
SELECT date'1940-07-07',date'2023-01-01' FROM dual
UNION ALL SELECT date'1943-02-25',date'2023-01-01'
FROM dual
UNION ALL SELECT date'1942-06-18',date'2023-01-01'
FROM dual
-- etc
)
You now have a virtual table with a collection of test dates. You can now try out your
age calculation:
-- PostgreSQL
WITH dates(dob, today) AS (
-- etc
)
SELECT
dob, today,
extract(year from age(today,dob)) AS age
FROM dates;
-- MariaDB/MySQL
WITH dates(dob, today) AS (
-- etc
)
SELECT
dob, today,
timestampdiff(year,dob,today) AS age
FROM dates;
-- MSSQL
WITH dates(dob, today) AS (
-- etc
)
SELECT
dob, today,
datediff(year,dob,today) AS age
FROM dates;
-- SQLite
WITH dates(dob, today) AS (
-- etc
)
SELECT
dob, today,
cast(
strftime('%Y.%m%d', today) - strftime('%Y.%m%d', dob)
as int) AS age
FROM dates;
We’ve already noted in Chapter 4 how MSSQL gets the age wrong, and this is one
way you can test this.
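Here's that test rig in SQLite via Python, with a subset of the dates above supplied as string literals in a VALUES table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH dates(dob, today) AS (
        VALUES ('1940-07-07', '2023-01-01'),
               ('1943-02-25', '2023-01-01')
    )
    SELECT
        dob, today,
        -- '%Y.%m%d' turns a date into a number like 2023.0101;
        -- subtracting and truncating gives whole years
        cast(strftime('%Y.%m%d', today)
           - strftime('%Y.%m%d', dob) as int) AS age
    FROM dates
""").fetchall()

print(sorted(rows))
```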
WITH
data AS (...),
summary AS (...)
SELECT
weekday_number, total,
100*total/sum(total) OVER()
FROM summary
ORDER BY weekday_number;
The problem is that we’ve had to get the weekday number in order to sort this
correctly. It would have been nicer to use the weekday name instead. We can then use an
additional virtual table to sort the names.
First, let’s redo the data CTE with the day name:
-- PostgreSQL, Oracle
WITH data AS (
SELECT to_char(ordered,'FMDay') AS weekday, total
FROM sales
)
-- MSSQL
WITH data AS (
SELECT datename(weekday,ordered) AS weekday, total
FROM sales
)
-- MariaDB/MySQL
WITH data AS (
SELECT date_format(ordered,'%W') AS weekday, total
FROM sales
)
You’ll notice that SQLite isn’t included in the list. That’s because it doesn’t have a
method of getting the weekday name. If you need it, you’ll want the reverse technique in
the next section.
The summary CTE will now group by the weekday name:
WITH
data AS (
SELECT
... AS weekday,
total
FROM sales
),
summary AS (
SELECT weekday, sum(total) AS total
FROM data
GROUP BY weekday
)
-- etc
We’ll now need a table literal with the days of the week as well as a sequence number.
sequence weekday
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
Finally, to do the sorting, you can join the summary CTE with the weekdays CTE and
sort by the sequence number:
WITH
data AS ( ),
summary AS ( ),
weekdays(sequence, weekday) AS ( )
SELECT
summary.weekday, summary.total,
100*summary.total/sum(summary.total) OVER()
FROM summary JOIN weekdays
ON summary.weekday=weekdays.weekday
ORDER BY weekdays.sequence;
One advantage of this technique is that you can change the sequence numbering in
the table literal, for example, to start on Wednesday if that suits you better.
By the way, if you’re going to sort by weekday, or anything like it, very often, you
might be better off saving the data in a permanent lookup table.
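The join-and-sort can be sketched in SQLite via Python, with a made-up summary CTE standing in for the real one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH
    summary(weekday, total) AS (
        VALUES ('Friday', 300), ('Monday', 120), ('Sunday', 80)
    ),
    weekdays(sequence, weekday) AS (
        VALUES (1,'Monday'), (2,'Tuesday'), (3,'Wednesday'),
               (4,'Thursday'), (5,'Friday'), (6,'Saturday'), (7,'Sunday')
    )
    SELECT summary.weekday, summary.total
    FROM summary JOIN weekdays
        ON summary.weekday = weekdays.weekday
    ORDER BY weekdays.sequence
""").fetchall()

# The rows come back in weekday order, not alphabetical order.
print(rows)
```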
-- Oracle
WITH statuses(status,name) AS (
SELECT 1,'Gold' FROM DUAL
UNION ALL SELECT 2,'Silver' FROM DUAL
UNION ALL SELECT 3,'Bronze' FROM DUAL
)
WITH statuses(status,name) AS (
-- etc
)
SELECT *
FROM
customers
LEFT JOIN vip ON customers.id=vip.id
LEFT JOIN statuses ON vip.status=statuses.status
;
Again, the benefit is that you can change the status names on the fly.
You can also do the same sort of thing with author and customer genders. Another
thing you can do with this technique is to translate from one set of names to another set
of names.
You may be wondering why we don’t include the full name of the gender or the vip
status in the table itself. Remember that you should only record a piece of data
once, and it should be the simplest version possible. Storing a value as a single
character, as with the gender, or an integer, as with the vip status, reduces the
possibility of data error or variation, and you can spell it out later when you want.
Splitting a String
If you have the courage to look in the script which generated the database, you’ll find two
recursive CTEs near the end:
-- Populate Genres
INSERT INTO genres(genre)
WITH split(bookid,genre,rest,genres) AS (
...
)
SELECT DISTINCT genre
FROM split
WHERE split.genre IS NOT NULL;
-- Populate Book Genres
INSERT INTO bookgenres(bookid,genreid)
WITH split(bookid,genre,rest,genres) AS (
...
)
SELECT split.bookid,genres.id
FROM split JOIN genres ON split.genre=genres.genre
WHERE split.genre IS NOT NULL;
In order to make the code readable, the string has been split over two lines. Don’t
do this in your real code!
Some DBMSs don’t like string literals with a line break inside. For those that will
accept the line break, it will be part of the data, and we won’t want that.
Be sure to write the string on one line, even if it’s very long.
For the recursive CTE, we’ll build two values: the individual item and a string
containing the rest of the original string. The CTE can be called split:
WITH
cte(fruit) AS (),
split(fruit, rest) AS (
)
The anchor member will get the first item from the string, up to the comma, and the
rest, after the comma:
WITH
cte(fruit) AS (),
-- PostgreSQL
split(fruit, rest) AS (
SELECT
substring(fruit,0,position(',' in fruit)),
substring(fruit,position(',' in fruit)+1)||','
FROM cte
)
-- MariaDB, MySQL
split(fruit, rest) AS (
SELECT
substring(fruit,1,position(',' in fruit)-1),
concat(substring(fruit,position(',' in fruit)+1),',')
FROM cte
)
-- MSSQL
split(fruit, rest) AS (
SELECT
cast(substring(fruit,0,charindex(',',fruit)) as varchar(255)),
cast(substring(fruit,charindex(',',fruit)+1,255)+',' as varchar(255))
FROM cte
)
-- SQLite
split(fruit, rest) AS (
SELECT
substring(fruit,0,instr(fruit,',')),
substring(fruit,instr(fruit,',')+1)||','
FROM cte
)
-- Oracle
split(fruit, rest) AS (
SELECT
substr(fruit,1,instr(fruit,',')-1),
substr(fruit,instr(fruit,',')+1)||','
FROM cte
)
Note that for MSSQL we’ve had to cast the calculation to varchar(255) because of a
peculiarity with string compatibility.
For the recursive member, we use the rest value. First, we get the string up to the
first comma, which becomes the fruit value. Then, we get the rest of the string from the
comma, which becomes the new value for rest:
WITH
cte(fruit) AS (),
-- PostgreSQL
split(fruit, rest) AS (
SELECT ...
UNION
SELECT
substring(rest,0,position(',' in rest)),
substring(rest,position(',' in rest)+1)
FROM split WHERE rest<>''
)
-- MariaDB, MySQL
split(fruit, rest) AS (
SELECT ...
UNION
SELECT
substring(rest,1,position(',' in rest)-1),
substring(rest,position(',' in rest)+1)
FROM split WHERE rest<>''
)
-- MSSQL
split(fruit, rest) AS (
SELECT ...
UNION ALL
SELECT
substring(rest,0,charindex(',', rest)),
substring(rest,charindex(',', rest)+1,255)
FROM split WHERE rest<>''
)
-- SQLite
split(fruit, rest) AS (
SELECT ...
UNION
SELECT
substring(rest,0,instr(rest,',')),
substring(rest,instr(rest,',')+1)
FROM split WHERE rest<>''
)
-- Oracle
split(fruit, rest) AS (
SELECT ...
UNION ALL
SELECT
substr(rest,1,instr(rest,',')-1),
substr(rest,instr(rest,',')+1)
FROM split WHERE rest<>''
)
Note that we don’t add a comma to the rest value this time: that was just to get
started.
We have also added WHERE rest<>'' to the FROM clause. This is because we need to
stop recursing when there’s no more of the string to search.
You can now try it out:
WITH
cte(fruit) AS (),
split(fruit,rest) AS ()
SELECT * FROM split;
fruit rest
Apple Banana,Cherry,Date,Elderberry,Fig,
Banana Cherry,Date,Elderberry,Fig,
Cherry Date,Elderberry,Fig,
Date Elderberry,Fig,
Elderberry Fig,
Fig [NULL]
Of course, we don’t need to see the rest value in the output: it’s just there so you can
see its progress.
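Here's the whole split in SQLite via Python. This sketch uses substr() with 1-based positions, which is equivalent to the 0-based substring() form shown above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE
    cte(fruit) AS (
        SELECT 'Apple,Banana,Cherry,Date,Elderberry,Fig'
    ),
    split(fruit, rest) AS (
        -- anchor: the first item, and the rest with a trailing comma
        SELECT
            substr(fruit, 1, instr(fruit, ',') - 1),
            substr(fruit, instr(fruit, ',') + 1) || ','
        FROM cte
        UNION
        -- recursive member: peel one item off the rest each time
        SELECT
            substr(rest, 1, instr(rest, ',') - 1),
            substr(rest, instr(rest, ',') + 1)
        FROM split WHERE rest <> ''
    )
    SELECT fruit FROM split
""").fetchall()

print([r[0] for r in rows])
```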
name list
colours Red,Orange,Yellow,Green,Blue,Indigo,Violet
elements Hydrogen,Helium,Lithium,Beryllium,Boron,Carbon
numbers One,Two,Three,Four,Five,Six,Seven,Eight,Nine
-- Oracle
WITH
cte(name,items) AS (
SELECT 'colours','Red,Orange,...,Indigo,Violet'
FROM dual
UNION ALL SELECT 'elements','Hydrogen,...,Carbon'
FROM dual
UNION ALL SELECT 'numbers','One,Two,...,Eight,Nine'
FROM dual
),
WITH
cte(name, items) AS (),
-- PostgreSQL
split(name, list, rest) AS (
SELECT
name,
substring(items,0,position(',' in items)),
substring(items,position(',' in items)+1)||','
FROM cte
)
-- MariaDB, MySQL
split(name, list, rest) AS (
SELECT
name,
substring(items,1,position(',' in items)-1),
concat(substring(items,position(',' in items)+1),',')
FROM cte
)
-- MSSQL
split(name, list, rest) AS (
SELECT
name,
cast(substring(items,0,charindex(',', items)) as varchar(255)),
substring(items,charindex(',', items)+1,255)+','
FROM cte
)
-- SQLite
split(name, list, rest) AS (
SELECT
name,
substring(items,0,instr(items,',')),
substring(items,instr(items,',')+1)||','
FROM cte
)
-- Oracle
split(name, list, rest) AS (
SELECT
name,
substr(items,1,instr(items,',')-1),
substr(items,instr(items,',')+1)||','
FROM cte
)
As for the recursive member, again it’s the same idea, with the name value included:
WITH
cte(name, items) AS (),
-- PostgreSQL
split(name, list, rest) AS (
SELECT ...
UNION
SELECT
name,
substring(rest,0,position(',' in rest)),
substring(rest,position(',' in rest)+1)
FROM split WHERE rest<>''
)
-- MariaDB, MySQL
split(name, list, rest) AS (
SELECT ...
UNION
SELECT
name,
substring(rest,1,position(',' in rest)-1),
substring(rest,position(',' in rest)+1)
FROM split WHERE rest<>''
)
-- MSSQL
split(name, list, rest) AS (
SELECT ...
UNION ALL
SELECT
name,
cast(substring(rest,0,charindex(',', rest)) as varchar(255)),
substring(rest,charindex(',', rest)+1,255)
FROM split WHERE rest<>''
)
-- SQLite
split(name, list, rest) AS (
SELECT ...
UNION
SELECT
name,
substring(rest,0,instr(rest,',')),
substring(rest,instr(rest,',')+1)
FROM split WHERE rest<>''
)
-- Oracle
split(name, list, rest) AS (
SELECT ...
UNION ALL
SELECT
name,
substr(rest,1,instr(rest,',')-1),
substr(rest,instr(rest,',')+1)
FROM split WHERE rest<>''
)
WITH
cte(name, items) AS (),
split(name, list, rest) AS ()
SELECT *
FROM split
ORDER BY name, list;
When it’s all going, you should see something like the following:
As you can see, the recursive CTE was able to work with multiple rows of data.
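Here's the multirow version in SQLite via Python, with two shortened, made-up lists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE
    cte(name, items) AS (
        VALUES ('colours', 'Red,Orange,Yellow'),
               ('numbers', 'One,Two,Three')
    ),
    split(name, list, rest) AS (
        SELECT
            name,
            substr(items, 1, instr(items, ',') - 1),
            substr(items, instr(items, ',') + 1) || ','
        FROM cte
        UNION
        SELECT
            name,
            substr(rest, 1, instr(rest, ',') - 1),
            substr(rest, instr(rest, ',') + 1)
        FROM split WHERE rest <> ''
    )
    SELECT name, list FROM split ORDER BY name, list
""").fetchall()

# Each name keeps its own sequence going, so both lists get split.
print(rows)
```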
Summary
In this chapter, we had a closer look at using Common Table Expressions. A common
table expression generates a virtual table that you can use later in the main query. In the
past, you would make do with a subquery in the FROM clause.
The reason why you would use a CTE or a FROM subquery is that you might need to
prepare data but you don’t want to go to the trouble of saving it either in a view or a temporary
table. CTEs are more ephemeral than temporary tables in that they are not saved at all.
CTEs have a number of advantages over FROM subqueries:
• You define the CTE before using it, making the query more readable
and more manageable.
• You can chain multiple dependent or independent CTEs simply. If
you wanted to do that with FROM subqueries, you would have to nest
them, which gets unwieldy very quickly.
• CTEs can be recursive, so you can use them to iterate through data.
Simple CTEs
The simplest use of a CTE is to prepare data for further processing. Some uses include
Parameter Names
A CTE is expected to have a name or alias for each column. You can define the names
inside the CTE, or you can define them as part of the CTE definition.
Multiple CTEs
Some queries involve multiple steps. These steps can be implemented by chaining
multiple CTEs.
Recursive CTEs
A recursive CTE is one which references itself. It can be used for iterating through a set
of data.
Some uses of recursive CTEs include
Coming Up
So far, we’ve worked on a number of important major concepts. In the next chapter,
we’ll have a look at a few additional techniques you can use to work smarter with your
database:
CHAPTER 10
More Techniques:
Triggers, Pivot Tables,
and Variables
Throughout the book, we’ve looked at pushing our knowledge and application of SQL a
little further and explored a number of techniques, some new and some not so new.
When looking at some techniques, in particular, those involving aggregates and
common table expressions, we also got a sense of pushing SQL deeper, with multitiered
statements.
In this chapter, we’ll go a little beyond simple SQL and explore a few techniques
which supplement SQL. They’re not directly related to each other, but they all allow you
to do more in working with your data.
SQL triggers are small blocks of code which run automatically in response to a
database event. We’ll look at how these work and how you would write one. In
particular, we’ll look at a trigger to automatically archive data which has been deleted.
Pivot tables are aggregates in two dimensions. They allow you to build summaries in
both row and column data. We’ll look at an example of preparing data to be summarized
and how we produce a pivot table.
Variables are pieces of temporary data which can be used to maintain values
between statements. They allow us to run a group of SQL statements, while they hold
interim values which are passed from one statement to another. In this chapter, we’ll
look at using variables to hold temporary values while we add data to multiple tables.
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1_10
Chapter 10 More Techniques: Triggers, Pivot Tables, and Variables
Understanding Triggers
Sometimes, a simple SQL query isn’t quite enough. Sometimes, what you really want is
for a query to start off one or more additional queries. Sometimes, what you want is a
trigger.
A trigger is a small block of code which will be run automatically when something
happens to the database. There are various types of triggers, including
One reason you might use DDL or Logon triggers is if you want to track activity by
storing this in a logging table.
Here, we’re going to look more at a DML trigger.
Triggers can be used to fill in some shortcomings of standard DBMS behavior. Here
are some examples which might call for a trigger:
• You might have an activity table which wants a date column updated
every time you make a change. You can use a trigger to set the
column for every insert or update.
• Suppose you have a rental table, where you enter a start and a finish
date. You’d like the finish date to default to the start date if it isn’t
entered. SQL defaults aren’t quite so clever, but you can set a trigger
to set the finish date when you insert a new row.
• SQL has no auditing in the normal sense of the word. You can create
a trigger to add some data to a logging table every time a row is
added, updated, or deleted.
In this example, we’re going to create a trigger to keep a copy of data which we’re
going to delete from the sales table.
In some of the preceding chapters, we’ve had to contend with the fact that in the
sales table, some rows have NULLs for the ordered date/time. Presumably, those sales
never checked out.
We’ve been pretty forgiving so far and filtered them out from time to time, but the
time has come to deal with them. We can delete all of the NULL sales as follows:
-- Not Yet!
DELETE FROM sales WHERE ordered IS NULL;
Note that there’s a foreign key from the saleitems table to the sales table, which
would normally disallow deleting the sales if there are any items attached. However, if
you check the script which generates the sample database, you’ll notice the ON DELETE
CASCADE clause, which will automatically delete the orphaned sale items.
When should you delete data? The short answer is never. The longer answer is
more complicated. You would delete data that was entered in error, or you would
delete test data when you’ve finished testing.
In this case, we’re going to delete the sales with a NULL for the ordered date;
we’ll assume that the sale was never checked out and that the customer won’t
ever come back and finish it. However, we’ll keep a copy of it anyway, just in case.
Most DBMSs handle triggers in a very similar way, but there are variations. We’ll go
over the basics first and then the details for individual DBMSs.
None of the DBMSs does it in exactly the same way, but the general form is roughly as follows:
The event is typically one of BEFORE, AFTER, or INSTEAD OF, followed by one of the DML
statements. In this example, we want to do something with the old data before it’s deleted.
For the sample trigger, we’re going to copy the old data into a table called deleted_
sales. This means that we’re going to have to get to the data before it’s vanished. The
appropriate event is
BEFORE DELETE
It’s going to be a little complicated, because we want to copy not only the data from
the sales table but also from the saleitems table. We’ll do that by concatenating those
items into one string. You really shouldn’t keep multiple items that way, but it’s good
enough for an archive, and you can always pull it apart if you ever need to.
The archive table looks something like this:
-- PostgreSQL, MSSQL
WITH cte AS (
...
)
As you see, with some DBMSs you start with the CTE, as you would using a SELECT
statement, while in others you start with the INSERT clause.
As for the CTE itself, we’ll derive that from the data to be deleted.
For most DBMSs, each row to be deleted is represented in a virtual row called old
(:old in Oracle). MSSQL instead has a virtual table called deleted.
If we were simply archiving from one table, we wouldn’t need the CTE, and we could
simply copy the rows with
However, it’s not so simple when there’s another table involved. Here, the plan is to
read the book ids and quantities from the other table and combine them using string_
agg, group_concat, or listagg according to DBMS.
To generate the data, we’ll use a join and aggregate the results:
WITH cte(saleid,customerid,items) AS (
SELECT
s.id, s.customerid,
string_agg(si.bookid||':'||si.quantity,';')
FROM sales AS s JOIN saleitems AS si ON s.id=si.saleid
WHERE s.id=old.id
GROUP BY s.id, s.customerid
)
The preceding sample is for PostgreSQL, but the others are nearly identical—just the
variations in the string_agg() function, concatenation, and table aliases.
The items string will contain something like the following:
123:3;456:1;789:2
That is, one or more bookid:quantity items are joined with a semicolon.
If you do need to pull it apart, you can use the same techniques we used for splitting
strings in Chapter 9. We can now go about creating the trigger.
-- Before
SELECT * FROM sales order by id;
SELECT * FROM saleitems order by id;
SELECT * FROM deleted_sales order by id;
-- Delete with Trigger
DELETE FROM sales WHERE ordered IS NULL;
-- After
SELECT * FROM sales order by id;
SELECT * FROM saleitems order by id;
SELECT * FROM deleted_sales order by id;
PostgreSQL Triggers
PostgreSQL has the least convenient form of trigger, in that you first need to prepare a
function to contain the trigger code. A function is a named block of code, which can be
called later at any time.
To prepare for the function and trigger, we can start with a few DROP statements:
As you see, the function has the code for the CTE and for copying the data into the
deleted_sales table. Here are a few points about the function itself:
Once you have the function in place, creating the trigger is simple:
MySQL/MariaDB Triggers
With MariaDB/MySQL, the trigger can be written in a single block. First, we’ll write the
code to drop the trigger:
DELIMITER $$
DELIMITER ;
Here, the delimiter is changed to $$. It doesn’t have to be that, but it’s a combination
you’re unlikely to use for anything else. The new delimiter is used to mark the end of the
code and switched back to the semicolon after that.
After that, the trigger code is much as described:
DELIMITER $$
CREATE TRIGGER archive_sales_trigger
BEFORE DELETE ON sales
FOR EACH ROW
BEGIN
INSERT INTO deleted_sales(saleid,customerid,items,deleted_date)
WITH cte(saleid,customerid,items) AS (
SELECT
s.id, s.customerid,
group_concat(concat(si.bookid,':',si.quantity) SEPARATOR ';')
FROM sales AS s JOIN saleitems AS si ON s.id=si.saleid
WHERE s.id=old.id
GROUP BY s.id, s.customerid
)
SELECT saleid,customerid,items,current_timestamp
FROM cte;
END; $$
DELIMITER ;
MSSQL Triggers
MSSQL also has a simple, direct way of creating a trigger. However, there’s a complicating
factor, which we’ll need to work around.
Before that, however, we’ll add the code to drop the trigger:
With other DBMSs, you create a BEFORE DELETE trigger to capture the data before it’s
gone. With MSSQL, you don’t have that option: there’s only AFTER DELETE and INSTEAD
OF DELETE. In both cases, there is a virtual table called deleted which has the rows to be
deleted.
The problem with AFTER DELETE is that, even though the deleted virtual table has
the deleted rows from the sales table, it’s too late to get the rows from the saleitems
table, as they have also been deleted, but there’s no virtual table for that.
For that, we’ll take a different approach. We’ll use an INSTEAD OF DELETE event,
which is to say that MSSQL will run the trigger instead of actually deleting the data. The
trick is to finish off the trigger by doing the delete at the end:
The deleted virtual table still has the rows which haven’t actually been deleted, but
were going to be before the trigger stepped in. All we need from that is the id to identify
the sales which should be deleted at the end, together with the cascaded sale items.
The other complication is that MSSQL won’t let you concatenate strings with
numbers, so you’ll have to cast the numbers as strings:
In an MSSQL CAST, varchar is short for varchar(30). That’s much more than we need for
these integers, but the result shrinks to the actual length of the number, and it’s easy to read.
The completed trigger code is
SQLite Triggers
Of all the DBMSs in this book, SQLite has by far the simplest and most direct version of
coding a trigger.
First, we can write the code to drop the trigger:
The code to create the trigger is almost identical to the discussion earlier:
The FOR EACH ROW clause is optional, since in SQLite there’s no alternative currently.
However, it’s included to make the point clear that the trigger applies to each row about
to be deleted.
You can now test the trigger.
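Here's a full round trip in SQLite via Python, with a cut-down schema invented for the demonstration. Note that the trigger body uses a plain INSERT ... SELECT rather than a CTE, since SQLite doesn't allow CTEs inside trigger bodies:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    PRAGMA foreign_keys = ON;

    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        customerid INTEGER,
        ordered TEXT
    );
    CREATE TABLE saleitems (
        id INTEGER PRIMARY KEY,
        saleid INTEGER REFERENCES sales(id) ON DELETE CASCADE,
        bookid INTEGER, quantity INTEGER
    );
    CREATE TABLE deleted_sales (
        saleid INTEGER, customerid INTEGER,
        items TEXT, deleted_date TEXT
    );
    INSERT INTO sales VALUES (1, 10, '2023-01-01'), (2, 20, NULL);
    INSERT INTO saleitems VALUES (1, 2, 123, 3), (2, 2, 456, 1);

    -- Archive each sale, with its items rolled into one string, before it goes.
    CREATE TRIGGER archive_sales_trigger
    BEFORE DELETE ON sales
    FOR EACH ROW
    BEGIN
        INSERT INTO deleted_sales(saleid, customerid, items, deleted_date)
        SELECT s.id, s.customerid,
               group_concat(si.bookid || ':' || si.quantity, ';'),
               current_timestamp
        FROM sales AS s JOIN saleitems AS si ON s.id = si.saleid
        WHERE s.id = old.id
        GROUP BY s.id, s.customerid;
    END;

    DELETE FROM sales WHERE ordered IS NULL;
""")

archived = conn.execute(
    "SELECT saleid, customerid, items FROM deleted_sales").fetchall()
remaining = conn.execute("SELECT id FROM sales").fetchall()
print(archived, remaining)
```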
Oracle Triggers
Writing trigger code in Oracle is similar to the basic code outlined earlier, but there are a
few complicating factors which we’ll need to work around.
/
CREATE TRIGGER archive_sales_trigger
BEFORE DELETE ON sales
FOR EACH ROW
BEGIN
...
END;
/
The forward slash (/) before and after the code defines the block. Everything
between the slashes, including the statements terminated with a semicolon, will be
treated as one block of code.
The second complication is that Oracle doesn’t like making changes to the table
doing the triggering. The solution is to tell Oracle that code is part of a separate
transaction:
/
CREATE TRIGGER archive_sales_trigger
BEFORE DELETE ON sales
FOR EACH ROW
DECLARE
PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
INSERT INTO deleted_sales(saleid, customerid, items, deleted_date)
WITH cte(saleid,customerid,items) AS (
SELECT
s.id, s.customerid,
listagg(si.bookid||':'||si.quantity,';')
FROM sales s JOIN saleitems si ON s.id=si.saleid
WHERE s.id=:old.id
GROUP BY s.id, s.customerid
)
SELECT saleid, customerid, items, current_timestamp
FROM cte;
COMMIT;
END;
/
We’ve already mentioned cases when you might want to provide a default value for
a column which is more complex than the built-in default feature, or you might want
to update a column automatically. Here, the trigger might be able to provide this extra
functionality.
However, it’s possible to get carried away with triggers. If there’s a DML trigger on
a table, then, every time you make any changes to the table data, there’s always a little
extra work, which might add an extra burden.
The other problem is that triggers might add a little more mystery to the database,
especially to other users of the database. Every time you do something, something else
happens. This can make troubleshooting a little trickier and make it a little harder to
check that the data is correct.
Pivoting Data
One of the important principles of good database design is that each column does
a different job. On top of that, each column is independent of the other columns.
That’s one reason why we put so much effort separating out the town details from the
customers table in Chapter 2.
There are some situations, however, where this sort of design doesn’t suit analysis.
Take, for example, a typical ledger type of table:
This is a layout that’s very easy to understand and analyze. If you want to get the
totals for a particular category, just add down the column. If you want to get the totals
for a particular item, just add across. This sort of thing used to be done by hand until
spreadsheets were invented to let the computer do all the hard work.
You may see this sort of design in database tables you come across. However, it’s not
a good design for SQL tables:
• A new category means adding a new column to the table design. You
may end up with a huge number of columns.
• The data is harder to analyze, because now you need to calculate across columns: SQL aggregate functions are designed to aggregate across rows.
Compared with a database, a spreadsheet has some advantages:
• It is more interactive, and you can easily change what is being pivoted and summarized.
• The spreadsheet will more automatically generate the categories; as you will see, this is not so convenient from within the database.
On the other hand, using the database has the following advantages:
• You can create a view to regenerate the pivot table at any time.
There are two main ways you can generate a pivot table in SQL: manually, using filtered aggregates, or, in some DBMSs, using a built-in pivot feature.
Since the purpose of pivoting data is to create summaries, you often need to use a
grouped field or to group the values yourself. For example:
• You can use the existing state column, which is a group of addresses.
We’re going to see how to pivot data from the sales and customers tables to get total
sales by state and VIP categories. The result will look something like this:
In principle, you could transpose the table and have the VIP groups go down, with
the states going across. This version, however, will look neater.
WITH
statuses AS (
...
),
customerinfo AS (
...
),
salesdata AS (
    ...
)
• The vip table has a status number. The statuses CTE will be a table
literal which allocates a name to the number.
• The customerinfo CTE will join the tables together and select the columns we want to summarize.
• The salesdata CTE will then join this customer information with the sales table to get the values to be totaled.
With those CTEs, we’ll run another aggregate query which will result in our
pivot table.
The statuses CTE is simple. We just need to match status numbers with names:
WITH
statuses(status, statusname) AS (
-- PostgreSQL, SQLite, MariaDB (Not MySQL):
VALUES (1,'Gold'), (2,'Silver'), (3,'Bronze')
-- MySQL:
VALUES row(1,'Gold'), row(2,'Silver'),
row(3,'Bronze')
-- MSSQL:
SELECT * FROM (VALUES (1,'Gold'),(2,'Silver'),
(3,'Bronze'))
-- Oracle:
SELECT 1,'Gold' FROM dual
UNION ALL SELECT 2,'Silver' FROM dual
UNION ALL SELECT 3,'Bronze' FROM dual
)
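As a quick sketch, the SQLite form of this table literal can be tried from a host language — here in Python with the built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A CTE built from a VALUES table literal: no real table required.
rows = conn.execute("""
    WITH statuses(status, statusname) AS (
        VALUES (1, 'Gold'), (2, 'Silver'), (3, 'Bronze')
    )
    SELECT status, statusname FROM statuses ORDER BY status
""").fetchall()
print(rows)   # [(1, 'Gold'), (2, 'Silver'), (3, 'Bronze')]
```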
The customerinfo CTE will join this to the customerdetails view and the vip table
to get the id, state, and status name for the customers:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
SELECT customerdetails.id, state, statuses.statusname
FROM
customerdetails
LEFT JOIN vip ON customerdetails.id=vip.id
LEFT JOIN statuses ON vip.status=statuses.status
)
SELECT *
FROM customerinfo;
Id state statusname
~ 303 rows ~
At this point, you can group it by state or status name to see how many of each you
have, but we’re more interested in the total sales.
For that, we’ll need to join the preceding with the sales table in another CTE:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
SELECT state, statusname, total
FROM customerinfo JOIN sales
ON customerinfo.id=sales.customerid
)
SELECT *
FROM salesdata;
state  statusname  total
NSW    [NULL]      56
NSW    Silver      43.5
VIC    [NULL]      70
QLD    [NULL]      28
VIC    Gold        24.5
VIC    [NULL]      133
~ 5294 rows ~
All of this is just to get the data ready. What we’re going to do now is generate our
group rows.
Obviously, you’ll need an aggregate query, grouping by state. Normally, it would
have looked something like this:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
)
SELECT state, sum(total)
FROM salesdata
GROUP BY state;
to give us this:
State sum
WA 20274
ACT 6781.5
TAS 28193
VIC 79199.5
NSW 101889
NT 6151
QLD 53331.5
SA 30977.5
However, to get that ledger table appearance, we’ll use aggregate filters to generate
three separate totals:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
)
SELECT
state,
sum(CASE WHEN statusname='Gold' THEN total END) AS gold,
sum(CASE WHEN statusname='Silver' THEN total END)
AS silver,
sum(CASE WHEN statusname='Bronze' THEN total END)
AS bronze
FROM salesdata
GROUP BY state;
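Here's the same technique on a tiny made-up salesdata set, run in SQLite through Python's sqlite3 module — the figures below are invented purely for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesdata(state TEXT, statusname TEXT, total REAL);
    INSERT INTO salesdata VALUES
        ('NSW', 'Gold', 10), ('NSW', 'Silver', 20), ('NSW', 'Gold', 2.5),
        ('VIC', 'Gold', 5),  ('VIC', 'Bronze', 7);
""")

# One filtered aggregate per status; grouping rows come from state.
rows = conn.execute("""
    SELECT
        state,
        sum(CASE WHEN statusname='Gold'   THEN total END) AS gold,
        sum(CASE WHEN statusname='Silver' THEN total END) AS silver,
        sum(CASE WHEN statusname='Bronze' THEN total END) AS bronze
    FROM salesdata
    GROUP BY state
    ORDER BY state
""").fetchall()
print(rows)   # [('NSW', 12.5, 20.0, None), ('VIC', 5.0, None, 7.0)]
```

Each CASE expression passes only one status's totals to its sum(), so each group row ends up with one column per status — the ledger layout, generated on demand.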
You might be tempted to ask whether there’s an easier way to do it. The answer is not
really. The hard part was always going to be the preparation of the data for pivoting.
However, for a few DBMSs, the final step can be achieved with a built-in feature.
SELECT ...
FROM ...
PIVOT (aggregate FOR column IN(columnnames)) AS alias
• The aggregate is the aggregate expression that summarizes the values. In this case, it's sum(total).
• The column is the column whose values you want across the table. In this case, it's statusname.
• The columnnames is a list of values which will be the columns across
the pivot table. In this case, it’s Gold, Silver, Bronze.
• The alias is any alias you want to give. It’s not used here, but it’s
required. The pivot table is, after all, a virtual table.
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
)
SELECT *
FROM salesdata
-- MSSQL:
PIVOT (sum(total) FOR statusname IN (Gold, Silver, Bronze))
AS whatever
-- Oracle:
PIVOT (sum(total) FOR statusname IN ('Gold' AS Gold, 'Silver'
AS Silver, 'Bronze' AS Bronze))
;
This is a little bit simpler than the filtered aggregates we used previously. However,
note that there are some quirks with this technique.
The syntax for MSSQL and Oracle is not identical:
• In MSSQL, the column names list is a plain list of names. Also, note that the PIVOT clause requires an alias.
• In Oracle, the column names list is a list of values, each aliased to provide a column name. Oracle doesn't use an alias after the PIVOT clause.
You’ll notice that the state doesn’t make an appearance in the PIVOT clause; only
the statusname and total. Any column not mentioned in the PIVOT clause will appear
as grouping rows. You can have more complex pivot tables if there’s more than one such
column, but you need to make sure that the (virtual) table you want to pivot doesn’t have
any stray unwanted columns.
You’ll also notice that the IN expression isn’t a normal IN expression. To begin with,
it’s not a list of values, but a list of column names.
On top of that, you can’t use a subquery to get the list of column names. You have
to know ahead of time what the column names are going to be, and you’ll have to type
them in yourself.
Using the pivot feature is not quite as convenient as it might have been, but, if it's available, it's still simpler than the filtered aggregates. However, you will still need to put some effort into preparing your data first.
Unpivoting
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
), -- extra comma
pivottable AS (
SELECT *
FROM salesdata
PIVOT ...
)
SELECT *
FROM pivottable
;
If you run this, you’ll get the same result as before; we’ve just put the result into the
pivottable CTE.
The next step is to add the UNPIVOT clause at the end of the SELECT statement:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
),
pivottable AS (
...
)
SELECT *
FROM pivottable
-- MSSQL:
UNPIVOT (
total FOR statuses IN (Gold,Silver,Bronze)
) AS w
-- Oracle:
UNPIVOT (
total FOR statuses IN (Gold,Silver,Bronze)
)
The UNPIVOT clause is even more mysterious than the PIVOT clause. The only column
that’s specifically mentioned is the statuses column, and, again, you need to list the
possible values. From there, the DBMS magically works out that there is a state column,
and whatever’s left will appear in another column, which we have called total.
SQL Variables
Many DBMSs supply information about the current database environment in the
form of special functions or system or global variables. Sometimes, these system
variables can be set to new values using a SET command. That’s not what we’re
looking at in this section. In this section, we’re looking at variables that you create
and set for your own use.
A variable is a temporary piece of data. Generally, you declare it before you use it
and define its data type. You may set it then or, more typically, in a later step.
Typically, a variable is associated with a stored block of code called a function or a
procedure, depending on the DBMS and what you’re attempting to do in the code. In
this section, we’ll be working without storing the code.
The various DBMSs have slightly different processes for working with variables:
SQLite is missing from this list, and that’s because it doesn’t support variables.
SQLite is typically embedded in a host application. The assumption is that you’re
writing programming code for the host application. You can have all the additional
variables and functionality you like there.
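For instance, in Python with the sqlite3 module, an ordinary Python variable plays the role that a SQL variable plays in the other DBMSs. In this sketch, the book ids and prices are invented, and the tables are simplified versions of the chapter's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books(id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE sales(id INTEGER PRIMARY KEY, customerid INT,
                       ordered TEXT, total REAL);
    CREATE TABLE saleitems(saleid INT, bookid INT, quantity INT, price REAL);
    INSERT INTO books(id, price) VALUES (1, 10.0), (2, 24.5);
""")

cur = conn.cursor()

# Insert the sale, then capture the new id in a Python variable.
cur.execute(
    "INSERT INTO sales(customerid, ordered) VALUES (42, datetime('now'))")
sid = cur.lastrowid

# Insert the sale items, using the captured id.
cur.executemany(
    "INSERT INTO saleitems(saleid, bookid, quantity) VALUES (?, ?, ?)",
    [(sid, 1, 2), (sid, 2, 1)])

# Fill in the item prices, then the sale total.
cur.execute("""
    UPDATE saleitems
    SET price = (SELECT price FROM books WHERE books.id = saleitems.bookid)
    WHERE saleid = ?""", (sid,))
cur.execute("""
    UPDATE sales
    SET total = (SELECT sum(price * quantity) FROM saleitems
                 WHERE saleid = ?)
    WHERE id = ?""", (sid, sid))
conn.commit()

print(conn.execute("SELECT total FROM sales WHERE id = ?", (sid,)).fetchone())
```

The cursor's lastrowid attribute does here what RETURNING, last_insert_id(), or scope_identity() do in the DBMS-specific versions that follow.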
Code Blocks
If you’re using a client which makes it easy to run one statement at a time, you may find
it gets a little confused when working with blocks of multiple statements. It will be easier
to work with if you surround your block with delimiters.
For the various DBMSs, the delimiters look like this:
-- PostgreSQL
DO $$
...
END $$;
-- MariaDB/MySQL
DELIMITER $$
...
$$
DELIMITER ;
-- MSSQL
GO
...
GO
-- Oracle
/
...
/
In the end, you will probably just highlight all of the lines of code and run them together. That's what we recommend when trying the following code. Don't try running just one line at a time.
In the following code, we'll do what we did in Chapter 3 in adding a new sale. There, we made a point of recording the new sale id, so that we could use it in subsequent statements. This time, however, we'll use variables to store interim values, so we can run the code in a single batch.
The code will broadly follow these steps:
1. Set variables for the customer id and the ordered date.
2. Insert a new sale, using those variables.
3. Save the new sale id in another variable.
4. Insert the new sale items, using the sale id.
5. Update the sale items with their prices, using the sale id.
6. Update the new sale with the total, using the sale id, of course.
It would be nice to have another variable with the sale items. However, most DBMSs
aren’t adept at defining multivalued variables without a lot of extra fuss in defining
custom data types to do the job. Here, we’re trying to keep things simple.
What follows will be four similar versions of how to write the code block.
PostgreSQL
DO $$
...
END $$ ;
The $$ delimiters allow multiple statements to be treated as a single block. That way, a semicolon doesn't end up terminating the block prematurely.
Variables are declared inside a DECLARE section:
DO $$
DECLARE
cid INT := 42;
od TIMESTAMP := current_timestamp;
sid INT;
END $$ ;
The variable names can be anything you like, but you run the risk of competing
with column names in the following code. Some developers prefix the names with an
underscore (such as _cid).
The sid variable is an integer which will be assigned later. The cid and od variables
are for the customer id and ordered date/time. They are assigned from the beginning
with the special operator :=.
The code proper is inside a BEGIN ... END block. It will be all of the code you used in
Chapter 3, but run together. The important part is that the variable sid is used to manage
the new sale id:
DO $$
DECLARE
cid INT := 42;
od TIMESTAMP := current_timestamp;
sid INT;
BEGIN
INSERT INTO sales(customerid, ordered)
VALUES(cid, od)
RETURNING id INTO sid;
-- INSERT the sale items here, using sid, as in Chapter 3
UPDATE saleitems AS si
SET price=(SELECT price FROM books AS b
WHERE b.id=si.bookid)
WHERE saleid=sid;
UPDATE sales
SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=sid)
WHERE id=sid;
END $$;
The sid variable gets its value from the RETURNING clause in the first INSERT
statement. From there on, it’s used in the remaining statements.
You can test the results using
You should see the new sale and sale items at the top.
MariaDB/MySQL
DELIMITER $$
BEGIN
END; $$
DELIMITER ;
DELIMITER $$
BEGIN
SET @cid = 42;
SET @od = current_timestamp;
SET @sid = NULL;
END; $$
DELIMITER ;
Variables are prefixed with the @ character. This makes them a little more obvious
and avoids possible conflict with column names.
The statement SET @sid = NULL; is unnecessary. Since you don’t declare variables,
we’ve included the statement just to make it clear that we’ll be using the @sid variable a
little later.
The whole code looks like this:
DELIMITER $$
BEGIN
SET @cid = 42;
SET @od = current_timestamp;
SET @sid = NULL; -- unnecessary; just to make clear
INSERT INTO sales(customerid, ordered)
VALUES(@cid, @od);
SET @sid = last_insert_id();
-- INSERT the sale items here, using @sid, as in Chapter 3
UPDATE saleitems
SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid)
WHERE saleid=@sid;
UPDATE sales
SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=@sid)
WHERE id=@sid;
END;
$$
DELIMITER ;
When you add a new row with an autogenerated primary key, you need to get
the new value to use later. The last_insert_id() function fetches the most recent
autogenerated value in the current session. You’ll notice that it doesn’t specify which
table: that’s why you need to call it immediately after the INSERT statement.
As you see, the rest of the code is generally the same as in Chapter 3, with the @sid
variable used to manage the new sale id.
You can test the results using
You should see the new sale and sale items at the top.
MSSQL
GO
...
GO
The GO keyword isn’t actually a part of Microsoft’s SQL language (or any other SQL,
for that matter). It’s actually an instruction to the client software to treat what’s inside as
a single batch and to run it as such. Some clients allow you to indent the keyword, and
some allow you to add semicolons and comments on the same line, but the safest thing
is not to indent it and not add anything else to the line.
Microsoft doesn’t have a block to declare variables, but it does have a statement. To
declare three variables, you can use three statements:
GO
DECLARE @cid INT = 42;
DECLARE @od datetime2 = current_timestamp;
DECLARE @sid INT;
GO
or you can use a single statement with the variables separated by commas:
GO
DECLARE
@cid INT = 42,
@od datetime2 = current_timestamp,
@sid INT;
GO
Variables are prefixed with the @ character, which makes them easy to spot and easy
to distinguish from column names.
The @sid variable is an integer which will be assigned later.
The rest of the code is similar to what we did in Chapter 3, but the new sale id will be
managed in the @sid variable:
GO
DECLARE @cid INT = 42;
DECLARE @od datetime2 = current_timestamp;
DECLARE @sid INT;
INSERT INTO sales(customerid, ordered)
VALUES(@cid, @od);
SET @sid = scope_identity();
-- INSERT the sale items here, using @sid, as in Chapter 3
UPDATE saleitems
SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid)
WHERE saleid=@sid;
UPDATE sales
SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=@sid)
WHERE id=@sid;
GO
The @sid variable gets its value from the scope_identity() function. You’ll notice
that it doesn’t specify which table: that’s why you need to call it immediately after the
INSERT statement. From there on, it’s used in the remaining statements.
You can test the results using
You should see the new sale and sale items at the top.
When the time comes, the whole block will be run as a single batch.
Oracle
/
DECLARE
cid INT := 42;
od TIMESTAMP := current_timestamp;
sid INT;
/
The variable names can be anything you like, but you run the risk of competing
with column names in the following code. Some developers prefix the names with an
underscore (such as _cid).
The sid variable is an integer which will be assigned later. The cid and od variables
are for the customer id and ordered date/time. They are assigned from the beginning
with the special operator :=.
The code proper is inside a BEGIN ... END block. It will be all of the code you used in
Chapter 3, but run together. The important part is that the variable sid is used to manage
the new sale id:
/
DECLARE
cid INT := 42;
od TIMESTAMP := current_timestamp;
sid INT;
BEGIN
INSERT INTO sales(customerid,ordered)
VALUES(cid, od)
RETURNING id INTO sid;
-- INSERT the sale items here, using sid, as in Chapter 3
UPDATE saleitems
SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid)
WHERE saleid=sid;
UPDATE sales
SET total=(SELECT sum(price*quantity) FROM saleitems
WHERE saleid=sid)
WHERE id=sid;
END;
/
The sid variable gets its value from the RETURNING clause in the first INSERT
statement. From there on, it’s used in the remaining statements.
You can test the results using
You should see the new sale and sale items at the top.
Review
In this chapter, we’ve looked at a few additional techniques that can be used to get more
out of our database.
Triggers
Triggers are code scripts which run in response to something happening in the database.
Typically, these include INSERT, UPDATE, and DELETE events. Using a trigger, you can
intercept the event and make your own additional changes to the affected table or
another table. Some triggers can go further and work more closely with the DBMS or
operating system.
We explored the concept by creating a trigger which responds to deleting from the sales table. In this case, we copied data from the sale and its matching sale items into an archive table.
Different DBMSs vary in detail, but generally they follow the same principles:
• A trigger is attached to a table and runs in response to an event such as INSERT, UPDATE, or DELETE.
• The trigger code has access to the old and new versions of the affected row.
• Using this data, the trigger code can go ahead and perform additional SQL operations.
Pivot Tables
A pivot table is a virtual table which summarizes data in both rows and columns. It’s a
sort of two-dimensional aggregate.
For the most part, raw table data isn’t ready to be summarized this way. You would
put some effort into preparing the data in the right form and making it available in one
or more CTEs.
You can create a pivot table manually using a combination of two techniques: preparing the data in one or more CTEs and then using filtered aggregates to generate one column per group.
MSSQL and Oracle both have a non-standard PIVOT clause which will, to some extent, automate the second of these techniques. However, it still requires some input from the SQL developer to finish the job.
SQL Variables
In this chapter, we used variables to streamline the code, first introduced in Chapter 3,
which adds a sale by inserting into multiple tables and updating them.
Most of the SQL we’ve worked with involved single statements. Some of those
statements were effectively multipart statements with the use of CTEs to generate
interim data.
In the case where you need more complex code to run in multiple statements,
you may need to store interim values. These values are held in variables, which are
temporary pieces of data.
In most DBMSs, variables are declared and used within a block of code. In most
cases, the variables and their values will vaporize after the code block is run. MariaDB/
MySQL, however, will retain variables beyond the run.
SQLite doesn’t support variables. It is expected that the hosting application will
handle the temporary data that variables are supposed to manage.
Summary
Although you can go a long way with straightforward SQL statements and features, you can often get more out of your DBMS with some additional features:
• Triggers are code scripts which run automatically in response to changes in your data, letting you intercept those changes and make further ones.
• Pivot tables are virtual tables which provide a compact view of your
summaries. You can generate a pivot table using a combination of
aggregate queries, but some DBMSs offer a pivot feature to simplify
the process.
• SQL variables are used to store temporary values between other SQL
statements. They can be used to store interim values that can be used
in subsequent statements.
Using what you’ve learned here and in previous chapters, you can build more
complex queries to work with and analyze your database.
APPENDIX A
Cultural Notes
The sample database was based on the way we do things in Australia. This is pretty
similar to the rest of the world, of course, but there are some details that might need
clearing up.
Australian addresses don’t make much use of cities, which have a pretty broad
definition in Australia.
Towns
Depending on how you define a town, there are about 15,000–20,000 towns in Australia.
In the sample database, town names have been deliberately selected as those
occurring at least three times in Australia, though not necessarily in the sample.
States
Australia has eight geographical states. Technically, two of them are territories, since
they don’t have the same political features.
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1
Name Code
Northern Territory NT
New South Wales NSW
Australian Capital Territory ACT
Victoria VIC
Queensland QLD
South Australia SA
Western Australia WA
Tasmania TAS
Postcodes
A postcode is a four-digit code typically, though not exclusively, associated with a town.
The postcode is closely associated with the state, though some towns close to the
border may have a postcode from the neighboring state.
Phone Numbers
In Australia, a normal phone number has ten digits. For nonmobile numbers, the first
two digits are an area code, starting with 0, which indicates one of four major regions.
Mobile phones have a region code of 04.
There are also special types of phone numbers. Numbers beginning with 1800
are toll free, while numbers starting with 1300 are used for large businesses that are
prepared to pay for them.
Shorter numbers starting with 13 are for very large organizations. Other shorter
numbers are for special purposes, such as emergency numbers.
Australia maintains a group of fake phone numbers, and all of the phone numbers
used in the database are, of course, fake. Don’t waste your time trying to phone one.
Email Addresses
There are a number of special domains reserved for testing or teaching. These include
example.com and example.net, which is why all of the email addresses use them.
This is true all over the world.
Dates
Short dates in Australia are in the day/month/year format, which can get particularly
confusing when mixed with American and Canadian dates. It is for this reason that we
recommend using the month name instead of the month number or, better still, the
ISO8601 format.
APPENDIX B
DBMS Differences
This book covers writing code for the following popular DBMSs:
• PostgreSQL
• MySQL/MariaDB
• SQLite
• MSSQL (Microsoft SQL Server)
• Oracle
Although there is an SQL standard, there will be variations in how well these DBMSs support it. For the most part, the SQL is 80–90% the same, with the most obvious differences discussed as follows.
As a rule, if there’s a standard and non-standard way of doing the same thing, it’s
always better to follow the standard. That way, you can easily work with the other
dialects. More importantly, you’re future-proofing your code, as all vendors move toward
implementing standards.
Writing SQL
In general, all DBMSs write the actual SQL in the same way. There are a few differences
in syntax and in some of the data types.
Semicolons
MSSQL does not require the semicolon between statements. However, apart from being
best practice to use it, Microsoft has stated that it will be required in a future version,1 so
you should always use one.
Data Types
All DBMSs have their own variations on data types, but they have a lot in common:
• SQLite doesn’t enforce data types, but has general type affinities.
Dates
• Oracle doesn’t like ISO8601 date literals (yyyy-mm-dd). However, it
is easy enough to get this to work. You can also use the to_date()
function or the to_timestamp() function to accept different date
formats.
• MariaDB/MySQL only accepts ISO8601 date literals. If you want to
feed it a different format, you can use the str_to_date() function.
• SQLite doesn’t actually have a date data type, so it’s a bit more
complicated. Generally, it’s simplest to use a TEXT type to store
ISO8601 strings, with appropriate functions to process it.
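A small sketch of that approach, using Python's sqlite3 module — the table and values are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No DATE type: store ISO8601 strings in a TEXT column, and let the
# built-in date/time functions interpret them.
conn.execute("CREATE TABLE events(name TEXT, happened TEXT)")
conn.execute("INSERT INTO events VALUES ('launch', '2023-06-15')")

row = conn.execute("""
    SELECT strftime('%Y', happened),    -- extract the year
           date(happened, '+1 month')   -- simple date arithmetic
    FROM events
""").fetchone()
print(row)   # ('2023', '2023-07-15')
```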
Case Sensitivity
Generally, the SQL language is case insensitive. However
1. Microsoft's comment on semicolons: https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql#transact-sql-syntax-conventions-transact-sql. TLDR: Semicolons are recommended and will be required in the future.
Quote Marks
In standard SQL
However
Limiting Results
This is a feature omitted in the original SQL standards, so DBMSs have followed their
own paths. However
• PostgreSQL, Oracle, and MSSQL all now use the OFFSET ... FETCH
... standard, with some minor variations.
• PostgreSQL, MySQL/MariaDB, and SQLite all support the non-
standard LIMIT ... OFFSET ... clause. (That’s right, PostgreSQL
has both.)
• MSSQL also has its own non-standard TOP clause.
Filtering (WHERE)
DBMSs also vary in how values are matched for filtering.
Unlike most DBMSs, SQLite will allow you to use an alias from the SELECT clause in
the WHERE clause, which contradicts the standard clause order.
Case Sensitivity
This is discussed earlier.
String Comparisons
In standard SQL, trailing spaces are ignored for string comparisons, presumably to
accommodate CHAR padding. More technically, shorter strings are right-padded to longer
strings with spaces.
PostgreSQL, SQLite, and Oracle ignore this standard, so trailing spaces are
significant. MSSQL and MySQL/MariaDB follow the standard.
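You can check SQLite's behavior from a host language; by the text above, MSSQL and MySQL/MariaDB would report the strings as equal instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trailing spaces are significant in SQLite, so these strings differ.
equal = conn.execute("SELECT 'abc' = 'abc '").fetchone()[0]
print(equal)   # 0 (false)
```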
Dates
Oracle’s date handling is mentioned earlier. This will affect how you express a date
comparison.
There is also the issue of how the ??/??/???? is interpreted. It may be the US m/d/y format, but it may not. It is always better to avoid this format.
Wildcard Matching
All DBMSs support the basic wildcard matches with the LIKE operator.
Calculations
Basic calculations are the same, with the exceptions as follows. Functions, on the other
hand, are very different.
Of the DBMSs listed earlier, SQLite has the fewest built-in functions, assuming that
the work would be done mostly in the host application.
Arithmetic
Arithmetic is mostly the same, but working with integers varies slightly:
• PostgreSQL, SQLite, and MSSQL will truncate integer division; Oracle
and MySQL/MariaDB will return a decimal.
• Oracle doesn’t support the remainder operator (%), but uses the
mod() function.
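A quick check of the SQLite behavior through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite truncates integer division; Oracle and MySQL/MariaDB would
# return 3.5 for the equivalent query. SQLite supports % directly.
quotient, remainder = conn.execute("SELECT 7 / 2, 7 % 2").fetchone()
print(quotient, remainder)   # 3 1
```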
Formatting Functions
Generally, they’re all different. However
• PostgreSQL and Oracle both have the to_char() function.
• Microsoft has the format() function.
• SQLite only has a format() function, a.k.a. printf(), and is the most
limited.
• MySQL/MariaDB has various specialized functions.
Date Functions
Again, all of the DBMSs have different sets of functions. However, for simple offsetting
• PostgreSQL and Oracle have the interval, which makes adding to and subtracting from a date simple.
• MySQL/MariaDB has something similar, but less flexible.
• MSSQL relies on the dateadd() function.
Concatenation
This is a basic operation for strings:
String Functions
Suffice to say that although there are some SQL standards
This means that these examples will all require special attention.
Generally, the DBMSs support the popular string functions, such as lower() and
upper() but sometimes in different ways. There is, however, a good deal of overlap
between DBMSs.
Joining Tables
Everything is mostly the same. However
Aggregate Functions
The basic aggregate functions are generally the same between DBMSs. Some of the more
esoteric functions are not so well supported by some.
Manipulating Data
All DBMSs support the same basic operations. However
• Oracle doesn’t support INSERT multiple values without a messy
workaround, though there is talk of supporting it soon. MSSQL
supports them, but only to a limit of 1000 rows, but there is also a less
messy workaround for this limit. The rest are OK.
Manipulating Tables
All DBMSs support the same basic operations, but each one has its own variation on
actual data type and autogenerated numbers.
Among other things, this means that the create table scripts are not cross-DBMS
compatible.
• For PostgreSQL, you reset the underlying sequence after inserting the
data. For example:
SELECT setval(pg_get_serial_sequence('customers',
'id'), max(id))
FROM customers;
• For Oracle, alter the table you’ve just added data to. For example:
GO
CREATE something AS
...
;
GO
That doesn’t include CREATE TABLE, which will happily mix in with the rest of the
statements.
SELECT
id, customers.*
FROM customers;
For the OFFSET ... LIMIT ... clause, which fetches a limited number of rows, the
OFFSET value cannot be calculated.
As you know, in a GROUP BY query, you can only select aggregates or what’s in the
GROUP BY clause. With MariaDB/MySQL, that won’t work if the GROUP BY column is
calculated. You really should be using CTEs anyway.
Don’t forget to set your session to ANSI mode to have MariaDB/MySQL behave like
the rest in the use of double quotes and concatenation:
APPENDIX C
Using SQL with Python
If you’re reading this, we’ll assume that you’re familiar with programming in
Python, though not necessarily an expert.
In particular, we’ll assume that, apart from the basics, you know about collections
such as tuples, lists, and dictionaries. Of course, you’ll be familiar with creating a
function. You’ll also need to know about installing and importing modules.
Before any of this can happen, however, you will probably have to install the
appropriate module.
Once you’ve done that, we’ll go through the following steps:
1. Import the database module.
2. Make a connection to the database and store the connection
object and a corresponding cursor object.
3. Run your SQL and process the results.
A connection object represents a connection to the database, and you can use it to
manage your database session.
More importantly, a cursor object is what you’ll use to send SQL to the database
and to send and receive the data involved. The connection object also has some data
manipulation methods, but what they really do is create a cursor and pass on the rest of
the work to a cursor.
© Mark Simon 2023
M. Simon, Leveling Up with SQL, https://doi.org/10.1007/978-1-4842-9685-1
Appendix C Using SQL with Python
# MariaDB/MySQL
pip3 install mysql-connector-python
# PostgreSQL
pip3 install psycopg2-binary
# Oracle
pip3 install oracledb
The preceding module is the same for both MariaDB and MySQL. However, there is a
dedicated MariaDB module if you need more specialized features.
# MSSQL (Windows)
pip3 install pyodbc
The pyodbc module requires ODBC (Open Database Connectivity) drivers to do the
job. On Windows, this will already be installed, especially if you’ve also installed SQL
Server. However, you will need to get the name of the driver.
In Python, run the following:
import pyodbc
print(pyodbc.drivers())
You'll see a collection of one or more drivers. The one you want will be
something like ODBC Driver 18 for SQL Server.
On macOS, you'll first need to install Homebrew. The installation command is too
long to fit on this page. You should enter the command on one line, with no break
or spaces in the URL.
Once you’ve got Homebrew installed, you can use it to install the correct driver
for MSSQL:
Again, the command is too long to fit. You can write it on two lines as long as the first
line ends with a backslash; otherwise, write it on one line without the backslash.
But wait, there’s more. You then need to install the next part, at the same time
accepting the license agreement:
Now, you can install the module. You may have trouble installing it simply, especially
if you’re using an M1 Macintosh, so it’s safer to run this:
After this, you will need to get the name of the driver.
In Python, run the following:
import pyodbc
print(pyodbc.drivers())
You’ll see a collection of one or more drivers. The one you want will be
something like
Creating a Connection
Overall, to make a connection and cursor to the database, your code will look something
like this:
import dbmodule
connection = dbmodule.connect(...)
cursor = connection.cursor()
connection.close()
where dbmodule is the relevant module for the DBMS. Specifically, for the various
DBMSs, the code will be as follows.
Connecting to SQLite
The relevant module for SQLite is called sqlite3. After importing the module, you need
to make the connection to the database.
SQLite databases are in simple files. You’ll find there are no further credentials to
worry about, since that’s supposed to be handled in the host application. All you need to
do is to reference the file.
To connect to SQLite
import sqlite3
connection = sqlite3.connect(file) # path name of the file
cursor = connection.cursor()
The file string is the full or relative path name of the SQLite file.
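As a minimal sketch, here is the whole round trip with an in-memory SQLite database; substitute a file path such as 'bookshop.db' for a real database, and note that the throwaway customers table here is just for demonstration:

```python
import sqlite3

# ':memory:' gives a temporary in-memory database; use a file path
# such as 'bookshop.db' for a persistent one.
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# A throwaway table, just to show the round trip:
cursor.execute('CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT)')
cursor.execute("INSERT INTO customers(name) VALUES('Alice')")
cursor.execute('SELECT id, name FROM customers')
rows = cursor.fetchall()
print(rows)   # [(1, 'Alice')]
connection.close()
```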
Connecting to MSSQL
The module for MSSQL is called pyodbc. In principle, it can be used for any database
which supports ODBC.
A connection in MSSQL can be a string with all of the connection details. This string
is called a DSN—a Data Source Name. However, for readability and maintainability, it’s
easier to add the details as separate function parameters. In general, it looks like this:
import pyodbc
connection = pyodbc.connect(
driver='ODBC Driver 18 for SQL Server',
TrustServerCertificate='yes',
server='...',
database='bookshop',
uid='...',
pwd='...'
)
cursor = connection.cursor()
If your server uses a nonstandard port, you can add it to the server value after
a comma:
server='...,1432'
Connecting to MariaDB/MySQL
The relevant module to connect to MariaDB/MySQL is called mysql.connector. To
connect to the database, you will need to indicate which server and database, as well as
your username and password:
import mysql.connector
connection = mysql.connector.connect(
user='...',
password='...',
host='...',
database='bookshop'
)
cursor = connection.cursor()
The host is typically the IP address of the database server. The standard port number
is 3306. If you need to change the port number, you can add it as another parameter:
port=3305.
Connecting to PostgreSQL
The module to connect to PostgreSQL is called psycopg2. To connect to the database,
you will need to indicate which server and database, as well as your username and
password:
import psycopg2
connection = psycopg2.connect(
database='...',
user='...',
password='...',
host='...'
)
cursor = connection.cursor()
The host is typically the IP address of the database server. The standard port number
is 5432. If you need to change the port number, you can add it as another parameter:
port=5433.
Connecting to Oracle
The module to connect to Oracle is called oracledb. To connect to the database, you will
need to indicate which server and database, as well as your username and password:
import oracledb
connection = oracledb.connect(
user='...',
password='...',
host='...',
service_name='...'
)
cursor = connection.cursor()
The host is typically the IP address of the database server. The standard port number
is 1521. If you need to change the port number, you can add it as another parameter:
port=1522.
To run your SQL statement, use the execute() method of the cursor object:
cursor.execute(sql)
Before we process the data, we’ll want to get a list of column names. This information
is available in the cursor.description object. The cursor.description object is a
tuple of tuples, one for each column. The data inside each of the tuples may include
information about the type of data, but that’s not available for all DBMS connections.
The column names will be the first item of each tuple. We can gather the names
using a list comprehension:
columns = [description[0] for description in cursor.description]
This adds the first member of each tuple to the columns list.
The data from the SELECT statement will be available from the cursor object. The
object includes methods to fetch one or more rows, but can also be iterated to fetch
the rows.
You can iterate through the cursor as follows:
for row in cursor:
    ...   # process each row
Each row will be a tuple of values. You’ll recall that a tuple is a simple immutable
collection of values, so, among other things, the values don’t have a name.
You can combine the column names with each tuple using Python’s zip function,
which has nothing to do with zipping a file.
The zip function will take two collections and return a collection of tuples, each with
an element from the first collection and an element from the second collection:
zip(columns,row)
Here, the result will be a collection of tuples with the first member being a column
name and the second member being a corresponding value from the row. Technically,
it’s not a collection, but an iterator which is close enough for the next step.
Our next step will be to turn that into a dictionary object, using the first member of
each tuple as keys for the second member of the tuple.
This will produce a set of dictionary objects:
data = []
for row in cursor:
    data.append(dict(zip(columns, row)))
print(data)
connection.close()
import ...
connection = ... . connect(...)
cursor = connection.cursor()

cursor.execute(sql)
columns = [description[0] for description in cursor.description]

data = []
for row in cursor:
    data.append(dict(zip(columns, row)))
print(data)

connection.close()
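The skeleton can be sketched end to end with SQLite; the customers table and its email column here are hypothetical, but the description/zip/dict pattern is exactly as described in the preceding text:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# A hypothetical customers table for demonstration:
cursor.execute('CREATE TABLE customers(id INTEGER PRIMARY KEY, email TEXT)')
cursor.execute("INSERT INTO customers(email) VALUES('someone@example.com')")

cursor.execute('SELECT id, email FROM customers')

# First member of each description tuple is the column name:
columns = [description[0] for description in cursor.description]

# Zip each row with the column names and build a dictionary:
data = []
for row in cursor:
    data.append(dict(zip(columns, row)))
print(data)   # [{'id': 1, 'email': 'someone@example.com'}]
connection.close()
```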
That will work, but it's too hard-coded to be useful. Instead, we'll get the customer id
from the user:
customerid = input('Customer id: ')
To put the customer id into the sql string, we could try something like this:
sql = f'SELECT * FROM customers WHERE id={customerid}'
Modern Python supports the so-called f-string used earlier. Alternatively, you could use
the more traditional format() string method.
The problem is that now you've opened up the query to user input. If, instead of
entering the number 42, the user had entered
42 OR 1=1
the WHERE clause would be true for every row, and the query would return all of the
customers. This is called SQL injection, and it's why you should never paste user input
directly into an SQL string.
Different DBMSs have different placeholders. Here is how you can create your SQL
strings:
# SQLite, MSSQL (question mark style)
sql = 'SELECT * FROM customers WHERE id=?'
# MariaDB/MySQL, PostgreSQL (format style)
sql = 'SELECT * FROM customers WHERE id=%s'
Don't even think about storing passwords in plain text in a database table. This isn't
the place to discuss how to manage user data safely, but storing plain passwords is very
dangerous and irresponsible.
Some DBMSs also allow variations on the preceding steps, such as using placeholder
names. However, these simple placeholders will do well enough.
You can then add your data in the form of a tuple:
(customerid,)
Remember that a tuple with a single value requires a comma at the end.
The code should now look like
sql = 'SELECT * FROM customers WHERE id=?'   # or %s, depending on the DBMS
cursor.execute(sql, (customerid,))
You can see that the tuple with values is added as a second parameter to the
execute() method.
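Here is the placeholder technique in a runnable SQLite sketch; the customers table and its rows are hypothetical. Note how the malicious input from before is now harmless, because the placeholder treats it as a single value rather than as SQL:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT)')
cursor.executemany('INSERT INTO customers(id, name) VALUES(?,?)',
                   [(41, 'Alice'), (42, 'Bob')])

# A malicious "customer id" — with a placeholder, the whole string is
# compared as one value, so it matches nothing:
customerid = '42 OR 1=1'
cursor.execute('SELECT name FROM customers WHERE id=?', (customerid,))
injected = cursor.fetchall()
print(injected)   # []

# A genuine customer id works as expected:
cursor.execute('SELECT name FROM customers WHERE id=?', (42,))
safe = cursor.fetchall()
print(safe)   # [('Bob',)]
```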
For simplicity, we can create separate SQL strings for the main steps:
We’ll get to those strings in a moment. Before we do, we need to look out for the new
sale id.
In SQL, there are two main methods of getting a newly generated id:
• Return it from the INSERT statement.
• Fetch it after running the INSERT statement, using something like the
cursor's lastrowid property.
The first method is better, but isn't supported by all DBMSs at this stage. We'll need
to take that into account with the first SQL string.
The other thing is that we’ll include placeholders in these strings. That’s not strictly
necessary at this point, since we’re not including user input. However, it’s safer and
makes adding the values easier.
To make the code a little more reusable, we'll wrap it inside a function:
def addsale(customerid, items, date):
    ...
    return saleid
The customerid will be a simple integer. The items will be a list of dictionaries,
which we’ll describe later. The date will be a date object.
We don’t really need to return the saleid, but it doesn’t hurt, and it might come in
handy later.
Some of the strings are long; we’ve used multiline strings for readability. In Python,
multiline strings have triple quote characters:
multiline = '''
Multi
Line
String
'''
The other thing is whether you use single or double quotes. Many developers use
double quotes both for single-line strings and multiline strings. In this appendix, we’re
using single quotes. It doesn’t matter, as long as you’re consistent.
For PostgreSQL:
insertsale = '''
INSERT INTO sales(customerid, ordered) VALUES(%s,%s) RETURNING id;
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity) VALUES(%s,%s,%s);
'''
updateitems = '''
UPDATE saleitems SET price=(SELECT price FROM books WHERE
books.id=saleitems.bookid) WHERE saleid=%s;
'''
updatesale = '''
UPDATE sales SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=%s) WHERE id=%s;
'''
For SQLite:
insertsale = '''
INSERT INTO sales(customerid, ordered) VALUES(?,?)
RETURNING id;
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES(?,?,?);
'''
updateitems = '''
UPDATE saleitems SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid) WHERE saleid=?;
'''
updatesale = '''
UPDATE sales SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=?) WHERE id=?;
'''
For MSSQL:
insertsale = '''
INSERT INTO sales(customerid, ordered)
OUTPUT inserted.id VALUES(?,?);
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES(?,?,?);
'''
updateitems = '''
UPDATE saleitems SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid) WHERE saleid=?;
'''
updatesale = '''
UPDATE sales SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=?) WHERE id=?;
'''
Apart from the OUTPUT clause, these are basically the statements we used earlier.
For MariaDB/MySQL:
insertsale = '''
INSERT INTO sales(customerid, ordered) VALUES(%s,%s);
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES(%s,%s,%s);
'''
updateitems = '''
UPDATE saleitems SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid) WHERE saleid=%s;
'''
updatesale = '''
UPDATE sales SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=%s) WHERE id=%s;
'''
Oracle can use named or numbered placeholders. We'll use numbered placeholders
because they're simpler:
insertsale = '''
INSERT INTO sales(customerid, ordered)
VALUES(:1, to_timestamp(:2,'YYYY-MM-DD HH24:MI:SS'))
RETURNING id INTO :3
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES(:1,:2,:3)
'''
updateitems = '''
UPDATE saleitems SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid) WHERE saleid=:1
'''
updatesale = '''
UPDATE sales SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=:1) WHERE id=:2
'''
Watch out for this quirk: you cannot end the statements with a semicolon! If you do,
you’ll get an error message: SQL command not properly ended, which is somewhat
counterintuitive.
Note that the insertsale string includes an expression with single quotes. That's OK
if the string is delimited with triple quotes. If you're writing it on one line, you might
need to use double quotes for the string.
# Not Oracle
cursor.execute(insertsale, (customerid, date))
For Oracle, you need to define an additional variable to capture the new id:
# Oracle
id = cursor.var(oracledb.NUMBER)
cursor.execute(insertsale, (customerid, date, id))
How you retrieve the new sale id depends on whether the id is returned from the
INSERT statement or not.
For PostgreSQL and MSSQL, which return the value, you can fetch that
value using
# PostgreSQL, MSSQL
saleid = cursor.fetchone()[0]
The fetchone() method returns the next row from the result set as a tuple. Here, we
want the first and only item.
For SQLite and MariaDB/MySQL, which don’t return a value, there is a special
lastrowid property:
# SQLite, MariaDB/MySQL
saleid = cursor.lastrowid
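The lastrowid property can be sketched with SQLite alone; the sales table here is a hypothetical stand-in, and the point is that the cursor remembers the autogenerated id of the most recent INSERT:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('''CREATE TABLE sales(id INTEGER PRIMARY KEY AUTOINCREMENT,
                                     customerid INTEGER)''')

cursor.execute('INSERT INTO sales(customerid) VALUES(?)', (42,))
saleid = cursor.lastrowid   # the autogenerated id of the new row
print(saleid)   # 1
```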
For Oracle, the new sale id is sort of in the id variable, but you still need to extract it
completely:
# Oracle
saleid = int(id.getvalue()[0])
# PostgreSQL, MSSQL
cursor.execute(insertsale, (customerid, date))
saleid = cursor.fetchone()[0]
# SQLite, MariaDB/MySQL
cursor.execute(insertsale, (customerid, date))
saleid = cursor.lastrowid
# Oracle
id = cursor.var(oracledb.NUMBER)
cursor.execute(insertsale, (customerid, date, id))
saleid = int(id.getvalue()[0])
return saleid
Remember not to mess around with indentation. All of the code should be one level
in to be part of the addsale() function.
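Putting the pieces together, here is a runnable sketch of the whole function for SQLite. The table definitions, book ids, and prices are hypothetical stand-ins for the book's sample database, but the SQL strings match the ones above and the function body follows the steps just described:

```python
import sqlite3
from datetime import datetime

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# Hypothetical bookshop tables, just enough for the demonstration:
cursor.executescript('''
CREATE TABLE books(id INTEGER PRIMARY KEY, title TEXT, price REAL);
CREATE TABLE sales(id INTEGER PRIMARY KEY AUTOINCREMENT,
    customerid INTEGER, ordered TEXT, total REAL);
CREATE TABLE saleitems(id INTEGER PRIMARY KEY AUTOINCREMENT,
    saleid INTEGER, bookid INTEGER, quantity INTEGER, price REAL);
INSERT INTO books(id, title, price) VALUES
    (123, 'First', 10.0), (456, 'Second', 20.0), (789, 'Third', 5.0);
''')

insertsale = 'INSERT INTO sales(customerid, ordered) VALUES(?,?);'
insertitems = 'INSERT INTO saleitems(saleid, bookid, quantity) VALUES(?,?,?);'
updateitems = '''UPDATE saleitems SET price=(SELECT price FROM books
    WHERE books.id=saleitems.bookid) WHERE saleid=?;'''
updatesale = '''UPDATE sales SET total=(SELECT sum(price*quantity)
    FROM saleitems WHERE saleid=?) WHERE id=?;'''

def addsale(customerid, items, date):
    # Insert the sale and fetch the new id:
    cursor.execute(insertsale, (customerid, date.isoformat()))
    saleid = cursor.lastrowid
    # Insert the sale items one at a time:
    for item in items:
        cursor.execute(insertitems,
                       (saleid, item['bookid'], item['quantity']))
    # Copy the prices across, then total the sale:
    cursor.execute(updateitems, (saleid,))
    cursor.execute(updatesale, (saleid, saleid))
    connection.commit()
    return saleid

saleid = addsale(42, (
    {'bookid': 123, 'quantity': 3},
    {'bookid': 456, 'quantity': 1},
    {'bookid': 789, 'quantity': 2},
), datetime.now())

cursor.execute('SELECT total FROM sales WHERE id=?', (saleid,))
total = cursor.fetchone()[0]
print(total)   # 3*10.0 + 1*20.0 + 2*5.0 = 60.0
```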
The items will be passed in as a tuple of dictionary objects, something like this:
(
{ 'bookid': 123, 'quantity': 3},
{ 'bookid': 456, 'quantity': 1},
{ 'bookid': 789, 'quantity': 2},
)
Within the function, the tuple will appear in the items variable. We can iterate
through the tuple using the for loop.
In each iteration, we’ll execute the insertitems statement, which inserts one item
at a time. The data will be a tuple with the sale id from the previous step, as well as the
bookid and quantity members of the dictionary object.
The code will look like this:
# cursor.execute(insertsale, ...)
# saleid = ...
for item in items:
    cursor.execute(insertitems, (saleid, item['bookid'], item['quantity']))
return saleid
cursor.execute(updateitems, (saleid,))
cursor.execute(updatesale, (saleid, saleid))
connection.commit()
The updateitems query needs only the sale id. Even though it’s only one value, it still
needs to be in a tuple, which is why there’s the extra comma at the end. The updatesale
query needs the sale id twice, once for the main query and once for its subquery.
At the end of the job, you need to commit the transaction, which saves the
changes permanently in the database. Otherwise, the whole process is a waste of time.
The function now looks like this:
# cursor.execute
# saleid
cursor.execute(updateitems, (saleid,))
cursor.execute(updatesale, (saleid, saleid))
connection.commit()
return saleid
To get the current date and time, you'll need to import from the datetime module;
you can then use the .now() method:
from datetime import datetime
addsale(
42, # customer id
( # items
{ 'bookid': 123, 'quantity': 3},
{ 'bookid': 456, 'quantity': 1},
{ 'bookid': 789, 'quantity': 2},
),
datetime.now() # current date/time
)