Leveling Up With SQL Advanced Techniques For Transforming Data Into Insights 9781484296851 9781484296844
This work is subject to copyright. All rights are solely and exclusively licensed
by the Publisher, whether the whole or part of the material is concerned,
specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or
hereafter developed.
The publisher, the authors, and the editors are safe to assume that the advice and
information in this book are believed to be true and accurate at the date of
publication. Neither the publisher nor the authors or the editors give a warranty,
expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral
with regard to jurisdictional claims in published maps and institutional
affiliations.
This Apress imprint is published by the registered company APress Media, LLC,
part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY 10004,
U.S.A.
To Brian. You’re part of what I am today.
Introduction
In the early 1970s, a new design for managing databases was being developed
based on the original work of E. F. Codd. The underlying model was known as
the relational model and described a way of collecting data and accessing and
manipulating data using mathematical principles.
Over the decade, the SQL language was developed, and, though it doesn’t
follow the relational model completely, it attempts to make the database
accessible using a simple language.
The SQL language has been improved, enhanced, and further developed over
the years, and in the late 1980s it was adopted as a standard by both ANSI
(the American National Standards Institute) and ISO (the International
Organization for Standardization; and, that’s right, that doesn’t spell ISO).
The takeaways from this very brief history are
SQL has been around for some time.
SQL is based on some solid mathematical principles.
There is an official standard, even if nobody quite sticks to it.
SQL is a developing language, and there are new features and new techniques
being added all the time.
The second half of the third point is worth stressing. Nobody quite sticks to
the SQL standards. There are many reasons for this, some good, some bad. But
you’ll probably find that the various dialects of SQL are about 80–90%
compatible, and the rest we’ll fill you in on as we go.
In this book, you’ll learn about using SQL to a level which goes beyond the
basics. Some things you’ll learn about are newer features in SQL; some are older
features that you may not have known about. We’ll look at a few non-standard
features, and we’ll also look at using features that you already know about, but
in more powerful ways.
This book is not for the raw beginner—we assume you have some
knowledge and experience in SQL. If you are a raw beginner, then you will get
more from my previous book, Getting Started with SQL and Databases;1 you
can then return to this book full of confidence and enthusiasm with a good solid
grounding in SQL.
If you have the knowledge and experience, the first chapter will give you a
quick overview of the sort of knowledge you should have.
The Sample Database
To work through the exercises, you’ll need the following:
A database server and a suitable database client.
Permissions to do anything you like on the database. If you’ve installed the
software locally, you probably have all the permissions you need, but if you’re
doing this on somebody else’s system, you need to check.
The script which produces the sample database.
The first chapter will go into the details of getting your DBMS software and
sample database ready. It will also give you an overview of the story behind the
sample database.
Notes
While you’re writing SQL to work with the data, there’s a piece of software at
the other end responding to the SQL. That software is referred to generically as a
database server, and, more specifically, as a DataBase Management System, or
DBMS to its friends. We’ll be using that term throughout the book.
The DBMSs we’ll be covering are PostgreSQL, MariaDB, MySQL,
Microsoft SQL Server, SQLite, and Oracle. We’ll assume that you’re working
with reasonably current versions of the DBMSs.
Chapter 1 will go into more details on setting up your DBMS, as well as
downloading and installing the sample database.
Source Code
All source code used in this book can be downloaded from
github.com/apress/leveling-up-sql.
Any source code or other supplementary material referenced by the author in this
book is available to readers on GitHub (github.com/apress). For more detailed
information, please visit https://www.apress.com/gp/services/source-code.
Acknowledgments
The sample data includes information about books and authors from Goodreads
(www.goodreads.com/), particularly from their lists of classical literature
over the past centuries. Additional author information was obtained, of course,
from Wikipedia (www.wikipedia.org/).
The author makes no guarantees about whether the information was correct
or even copied correctly. Certainly, the list of books should not in any way be
interpreted as an endorsement or even an indication of personal taste. After all,
it’s just sample data.
Table of Contents
Chapter 1: Getting Ready
About the Sample Database
Setting Up
Database Management Software
Database Client
The Sample Database
What You Probably Know Already
Some Philosophical Concepts
Writing SQL
Basic SQL
Data Types
SQL Clauses
Calculating Columns
Joins
Aggregates
Working with Tables
Manipulating Data
Set Operations
Coming Up
Chapter 2: Working with Table Design
Understanding Normalized Tables
Columns Should Be Independent
Adding the Towns Table
Adding a Foreign Key to the Town
Update the Customers Table
Remove the Old Address Columns
Changing the Town
Adding the Country
Additional Comments
Improving Database Integrity
Fixing Issues with a Nullable Column
Other Adjustments
Adding Indexes
Adding an Index to the Books and Authors Tables
Creating a Unique Index
Review
Normal Form
Multiple Values
Altering Tables
Views
Indexes
The Final Product
Summary
Coming Up
Chapter 3: Table Relationships and Joins
An Overview of Relationships
One-to-Many Relationship
Counting One-to-Many Joins
The NOT IN Quirk
Creating a Books and Authors View
One-to-One Relationships
One-to-Maybe Relationships
Multiple Values
Many-to-Many Relationships
Joining Many-to-Many Tables
Summarizing Multiple Values
Combining the Joins
Many-to-Many Relationships Happen All the Time
Another Many-to-Many Example
Inserting into Related Tables
Adding a Book and an Author
Adding a New Sale
Review
Types of Relationships
Joining Tables
Views
Inserting into Related Tables
Summary
Coming Up
Chapter 4: Working with Calculated Data
Calculation Basics
Using Aliases
Dealing with NULLs
Using Calculations in Other Clauses
More Details on Calculations
Casting
Numeric Calculations
String Calculations
Date Operations
The CASE Expression
Various Uses of CASE
Coalesce Is like a Special Case of CASE
Nested CASE Expression
Summary
Aliases
NULLs
Casting Types
Calculating with Numbers
Calculating with Strings
Calculating with Dates
The CASE Expression
Coming Up
Chapter 5: Aggregating Data
The Basic Aggregate Functions
NULL
Understanding Aggregates
Aggregating Some of the Values
Distinct Values
Aggregate Filter
Grouping by Calculated Values
Grouping with CASE Statements
Revisiting the Delivery Status
Ordering by Arbitrary Strings
Group Concatenation
Summarizing the Summary with Grouping Sets
Preparing Data for Summarizing
Combining Summaries with the UNION Clause
Using GROUPING SETS, CUBE, and ROLLUP
Histograms, Mean, Mode, and Median
Calculating the Mean
Generating a Frequency Table
Calculating the Mode
Calculating the Median
The Standard Deviation
Summary
Basic Aggregate Functions
NULLs
The Aggregating Process
Aggregate Filters
GROUP BY
Mixing Subtotals
Statistics
Coming Up
Chapter 6: Using Views and Friends
Working with Views
Creating a View
Using ORDER BY in MSSQL
Tips for Working with Views
Table-Valued Functions
What Can You Do with a View?
Caching Data and Temporary Tables
Computed Columns
Summary
Views
Table-Valued Functions
Temporary Tables
Coming Up
Chapter 7: Working with Subqueries and Common Table Expressions
Correlated and Non-correlated Subqueries
Subqueries in the SELECT Clause
Subqueries in the WHERE Clause
Subqueries with Simple Aggregates
Big Spenders
Last Orders, Please
Duplicated Customers
Subqueries in the FROM Clause
Nested Subqueries
Using WHERE EXISTS (Subquery)
WHERE EXISTS with Non-correlated Subqueries
WHERE EXISTS with Correlated Subqueries
WHERE EXISTS vs. the IN() Expression
LATERAL JOINS (a.k.a. CROSS APPLY) and Friends
Adding Columns
Multiple Columns
Working with Common Table Expressions
Syntax
Using a CTE to Prepare Calculations
Summary
Correlated and Non-correlated Subqueries
The WHERE EXISTS Expression
LATERAL JOINS (a.k.a. CROSS APPLY)
Common Table Expressions
Coming Up
Chapter 8: Window Functions
Writing Window Functions
Simple Aggregate Windows
Aggregate Functions
Aggregate Window Functions and ORDER BY
The Framing Clause
Creating a Daily Sales View
A Sliding Window
Window Function Subtotals
PARTITION BY Multiple Columns
Ranking Functions
Basic Ranking Functions
Ranking with PARTITION BY
Paging Results
Working with ntile
A Workaround for ntile
Working with Previous and Next Rows
Summary
Window Clauses
Coming Up
Chapter 9: More on Common Table Expressions
CTEs As Variables
Setting Hard-Coded Constants
Deriving Constants
Using Aggregates in the CTE
Finding the Most Recent Sales per Customer
Finding Customers with Duplicate Names
CTE Parameter Names
Using Multiple Common Table Expressions
Summarizing Duplicate Names with Multiple CTEs
Recursive CTEs
Generating a Sequence
Joining a Sequence CTE to Get Missing Values
Daily Comparison Including Missing Days
Traversing a Hierarchy
Working with Table Literals
Using a Table Literal for Testing
Using a Table Literal for Sorting
Using a Table Literal As a Lookup
Splitting a String
Summary
Simple CTEs
Parameter Names
Multiple CTEs
Recursive CTEs
Coming Up
Chapter 10: More Techniques: Triggers, Pivot Tables, and Variables
Understanding Triggers
Some Trigger Basics
Preparing the Data to Be Archived
Creating the Trigger
Pros and Cons of Triggers
Pivoting Data
Pivoting the Data
Manually Pivoting Data
Using the Pivot Feature (MSSQL, Oracle)
Working with SQL Variables
Code Blocks
Updated Code to Add a Sale
Review
Triggers
Pivot Tables
SQL Variables
Summary
Appendix A: Cultural Notes
Appendix B: DBMS Differences
Appendix C: Using SQL with Python
Index
About the Author
Mark Simon
has been involved in training and education
since the beginning of his career. He started as a
teacher of mathematics, but quickly pivoted into
IT consultancy and training because computers
are much easier to work with than high school
students. He has worked with and trained in
several programming and coding languages and
currently focuses mainly on web development
and database languages. When not involved in
work, you will generally find him listening to or
playing music, reading, or just wandering about.
About the Technical Reviewer
Aaditya Pokkunuri
is an experienced senior cloud database engineer
with a demonstrated history of working in the
information technology and services industry
with 13 years of experience.
He is skilled in performance tuning, MS
SQL Database Server Administration, SSIS,
SSRS, PowerBI, and SQL development.
He possesses in-depth knowledge of
replication, clustering, SQL Server high
availability options, and ITIL processes.
His expertise lies in Windows administration
tasks, Active Directory, and Microsoft Azure
technologies.
He also has extensive knowledge of
MySQL, MariaDB, and MySQL Aurora database engines.
He has expertise in AWS Cloud and is an AWS Solution Architect Associate
and AWS Database Specialty.
Aaditya is a strong information technology professional with a Bachelor of
Technology in Computer Science and Engineering from Sastra University, Tamil
Nadu.
Footnotes
1 https://link.springer.com/book/978148429494.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
M. Simon, Leveling Up with SQL
https://doi.org/10.1007/978-1-4842-9685-1_1
1. Getting Ready
Mark Simon1
(1) Ivanhoe, VIC, Australia
If you’re reading this book, you’ll already know some SQL, either through
previous study or through bitter experience, or, more likely, a little of both. In the
process, there may be a few bits that you’ve missed, or forgotten, or couldn’t see
the point of.
We’ll assume that you’re comfortable enough with SQL to get the basic
things done, which mostly involves fetching data from one or more tables. You
may even have manipulated some of that data or even the tables themselves.
We won’t assume that you consider yourself an expert in all of this. Have a
look in the section “What You Probably Know Already” to check the sort of
experience we think you already have. If there are some areas you’re not
completely sure about, don’t panic. Each chapter will include some of the
background concepts which should take you to the next level.
If all of this is a bit new to you, perhaps we can recommend an introductory
book. It’s called Getting Started with SQL and Databases by Mark Simon, and
you can learn more about it at
https://link.springer.com/book/10.1007/978-1-4842-9493-2.
In real life, there’s more to the story. For example, we haven’t included
payment or shipping methods, and we haven’t included login credentials.
There’s no stock either, although we’ll presume that the books are ordered on
demand.
But there’s enough in this database for us to work with as we develop and
improve our SQL skills.
Setting Up
You can sit in a comfortable chair with a glass of your favorite refreshment and a
box of nice chocolates and read this book from cover to cover. However, you’ll
get more from this book if you join in on the samples.
You’ll probably see this message a few times throughout the book. The
Appendix will tell you why.
It’s possible—even likely—that you already have the DBMS installed. Just
make sure that
It’s a fairly recent version.
Some of the features you’ll learn about aren’t available in some older
versions of some DBMSs. In particular, watch out for MySQL: you’ll need
version 8 which was released in 2018 for some of the more sophisticated
features.
You have enough privileges to create a database and to create and modify
tables. Most of the book won’t require that, but Chapter 2 definitely will.
At the very least, you’ll need to be able to install the sample database.
If you can’t make changes to the database, you can still work with most of
the book, and you’ll just have to nod your head politely as you’re reading
Chapter 2, in which we make a few changes to the database. You might also
have some difficulty in creating views, which we cover in Chapter 6 and in other
chapters.
Database Client
You’ll also need a database client. All the major DBMS vendors have their own
free client, and there are plenty of free and paid third-party alternatives.
Database Tables
SQL databases store data in one or more tables. In turn, a table presents the data
in rows and columns. You get the picture in Figure 1-2.
Writing SQL
SQL is a simple language which has a few rules and a few recommendations
for readability:
SQL is relaxed about using extra spacing. You should use as much
spacing as required to make your SQL more readable.
Each SQL statement ends with a semicolon (;).
The SQL language is case insensitive, as are the column names. Table
names may be case sensitive, depending on the operating system.
Microsoft SQL is relaxed about the use of semicolons, and many MSSQL
developers have got in the bad habit of forgetting about them. However,
Microsoft strongly encourages you to use them, and some SQL may not
work properly if you get too sloppy. See
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql#transact-sql-syntax-conventions-transact-sql.
Remember, some parts of the language are flexible, but there is still a strict
syntax to be followed.
Basic SQL
The basic statement used to fetch data from a table is the SELECT
statement. In its simplest form, it looks like this:
SELECT ...
FROM ...;
The SELECT statement will select one or more columns of data from a table.
You can select columns in any order.
The SELECT * expression is used to select all columns.
Columns may be calculated.
Calculated columns should be named with an alias; noncalculated columns
can also be aliased.
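To make this concrete, here’s a quick sketch using SQLite (through Python’s sqlite3 module); the books table and its values are invented for illustration:

```python
import sqlite3

# An invented books table in an in-memory database, just for illustration
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, price REAL)")
db.executemany("INSERT INTO books (title, price) VALUES (?, ?)",
               [("Emma", 15.0), ("Dracula", 12.5)])

# Columns in any order, a calculated column, and an alias for the calculation
rows = db.execute("""
    SELECT title, price, price * 2 AS double_price
    FROM books
    ORDER BY title;
""").fetchall()
```

The calculated column has no natural name of its own, which is why the AS double_price alias matters.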
A comment is additional text for the human reader which is ignored by SQL:
SQL has a standard single-line comment: -- etc
Most DBMSs also support the non-standard block comment: /* ... */
Comments can be used to explain something or to act as section headers. They
can also be used to disable some code as you might when troubleshooting or
testing.
Data Types
Broadly, there are three main data types:
Numbers
Strings
Dates and times
Number literals are represented bare: they do not have any form of quotes.
Numbers are compared in number line order and can be filtered using the
basic comparison operators.
String literals are written in single quotes. Some DBMSs also allow double
quotes, but double quotes are more correctly used for column names rather than
values.
In some DBMSs and databases, upper and lower case may not match.
Trailing spaces should be ignored, but aren’t always.
Date literals are also in single quotes.
The preferred date format is ISO8601 (yyyy-mm-dd), though Oracle doesn’t
like it so much.
Most DBMSs allow alternative formats, but avoid the ??/??/yyyy format,
since it doesn’t mean the same thing everywhere.
Dates are compared in historical order.
SQL Clauses
For the most part, we use up to six clauses in a typical SELECT statement. SQL
clauses are written in a specific order. However, they are processed in a slightly
different order, as in Figure 1-4.
The important thing to remember is that the SELECT clause is the last to be
evaluated before the ORDER BY clause. That means that only the ORDER BY
clause can use values and aliases produced in the SELECT clause.1
As we’ll see later in the book, there are additional clauses which are
extensions to the one we have here.
SELECT columns
FROM table
WHERE conditions;
The conditions are one or more assertions, expressions which evaluate to true
or not true. If an assertion is not true, it’s not necessarily false either. Typically, if
the expression involves NULL, the result will be unknown, which is also not true.
NULL represents a missing value, so testing it is tricky.
NULLs will always fail a comparison, such as the equality operator (=).
Testing for NULL requires the special expression IS NULL or IS NOT
NULL.
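A small sketch of that behavior, again in SQLite with invented data; note how = NULL quietly matches nothing:

```python
import sqlite3

# Invented customers table; Bob's phone number is missing (NULL)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
db.executemany("INSERT INTO customers (name, phone) VALUES (?, ?)",
               [("Alice", "555-0100"), ("Bob", None)])

# NULL fails every ordinary comparison, so this matches nobody
equals_null = db.execute(
    "SELECT name FROM customers WHERE phone = NULL;").fetchall()

# IS NULL is the correct test
is_null = db.execute(
    "SELECT name FROM customers WHERE phone IS NULL;").fetchall()
```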
Multiple Assertions
You can combine multiple assertions with the logical AND and OR operators. If
you combine them, AND takes precedence over OR.
The IN operator will match from a list. It is the equivalent of multiple OR
expressions. It can also be used with a subquery which generates a single column
of values.
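Here’s a sketch of both points, with an invented books table in SQLite. Because AND binds tighter, the first WHERE clause reads as (horror and cheap) or romance:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (title TEXT, genre TEXT, price REAL)")
db.executemany("INSERT INTO books VALUES (?, ?, ?)", [
    ("Emma", "romance", 15.0),
    ("Dracula", "horror", 8.0),
    ("Carmilla", "horror", 20.0),
])

# AND takes precedence: (horror AND under 10) OR romance
mixed = db.execute("""
    SELECT title FROM books
    WHERE genre = 'horror' AND price < 10 OR genre = 'romance'
    ORDER BY title;
""").fetchall()

# IN matches from a list, the equivalent of multiple OR expressions
in_list = db.execute("""
    SELECT title FROM books
    WHERE genre IN ('romance', 'horror') AND price < 10;
""").fetchall()
```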
Wildcard Matches
Strings can be compared more loosely using wildcard patterns and the LIKE
operator.
Wildcards include special pattern characters.
Some DBMSs allow you to use LIKE with non-string data, implicitly
converting them to strings for comparison.
Some DBMSs supplement the standard wildcard characters with additional
patterns.
Some DBMSs support regular expressions, which are more sophisticated than
regular wildcard pattern matching.
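As a rough illustration of the standard wildcards in SQLite, with invented author names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE authors (name TEXT)")
db.executemany("INSERT INTO authors VALUES (?)",
               [("Jane Austen",), ("John Keats",), ("Bram Stoker",)])

# % matches any run of characters (including none)
j_names = db.execute(
    "SELECT name FROM authors WHERE name LIKE 'J%' ORDER BY name;").fetchall()

# _ matches exactly one character
one_char = db.execute(
    "SELECT name FROM authors WHERE name LIKE 'J_ne %';").fetchall()
```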
SELECT columns
FROM table
-- WHERE ...
ORDER BY ...;
The ORDER BY clause is both the last to be written and the last to be
evaluated.
Sorting does not change the actual table, just the order of the results for the
present query.
You can sort using original columns or calculated values.
You can sort using multiple columns, which will effectively group the rows;
the column order is up to you, but it determines how the rows are grouped.
By default, each sorting column is sorted in increasing (ascending) order.
Each sorting column can be qualified by the DESC clause which will sort in
decreasing (descending) order. You can also add ASC which changes nothing
as it’s the default anyway.
Different DBMSs will have their own approach as to where to place sorted
NULLs, but they will all be grouped either at the beginning or the end.
Data types will affect the sort order.
Some DBMSs will sort upper and lower case values separately.
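The multi-column and DESC behavior looks like this in SQLite, with invented data: the first sorting column effectively groups the rows, and the second sorts within each group:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (title TEXT, genre TEXT, price REAL)")
db.executemany("INSERT INTO books VALUES (?, ?, ?)", [
    ("Emma", "romance", 15.0),
    ("Dracula", "horror", 8.0),
    ("Carmilla", "horror", 20.0),
])

# Group by genre, then most expensive first within each genre
rows = db.execute("""
    SELECT title, genre, price
    FROM books
    ORDER BY genre, price DESC;
""").fetchall()
```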
Limiting Results
A SELECT statement can also include a limit on the number of rows. This
feature has been available unofficially for a long time, but is now an official
feature.
The official form is something like

SELECT ...
FROM ...
ORDER BY ...
OFFSET ... ROWS FETCH FIRST ... ROWS ONLY;

Most DBMSs also support the older, non-standard form:

SELECT ...
FROM ...
ORDER BY ...
LIMIT ... OFFSET ...;
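SQLite uses the LIMIT ... OFFSET form, which makes it easy to sketch paging with invented data:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (title TEXT)")
db.executemany("INSERT INTO books VALUES (?)",
               [("Emma",), ("Dracula",), ("Carmilla",), ("Persuasion",)])

# Skip the first row of the sorted results, then take the next two
page = db.execute("""
    SELECT title FROM books
    ORDER BY title
    LIMIT 2 OFFSET 1;
""").fetchall()
```

A limit without an ORDER BY is rarely meaningful, since the row order would be unpredictable.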
Sorting Strings
Sorting alphabetically is, by and large, meaningless. However, there are
techniques to sort strings in a more meaningful order.
Calculating Columns
In SQL, there are three main data types: numbers, strings, and dates. Each data
type has its own methods and functions to calculate values:
For numbers, you can do simple arithmetic and calculate with more complex
functions. There are also functions which approximate numbers.
For dates, you can calculate an age between dates or offset a date. You can
also extract various parts of the date.
For strings, you can concatenate them, change parts of the string, or extract
parts of the string.
For numbers and dates, you can generate a formatted string which gives you a
possibly more friendly version.
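One small example per type, using SQLite’s own functions (other DBMSs have their own equivalents, often under different names):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Numbers: arithmetic and an approximating function
arithmetic = db.execute("SELECT round(7.0 / 2, 1);").fetchone()[0]

# Strings: concatenation with the standard || operator
concat = db.execute("SELECT 'Jane' || ' ' || 'Austen';").fetchone()[0]

# Dates: extracting part of a date (strftime is SQLite-specific)
year = db.execute("SELECT strftime('%Y', '1897-05-26');").fetchone()[0]
```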
Casting a Value
You may be able to change the data type of a value, using cast():
You can change within a main type to a type with more or less detail.
You can sometimes change between major types if the value sufficiently
resembles the other type.
Sometimes, casting is performed automatically, but sometimes you need to
do it yourself.
One case where you might need to cast from a string is when you need a date
literal. Since both string and date literals use single quotes, SQL might
misinterpret the date for a string.
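A couple of quick casts in SQLite; note that casting to an integer here drops detail by truncating rather than rounding:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# A string that sufficiently resembles a number can become one
as_int = db.execute("SELECT CAST('42' AS INTEGER);").fetchone()[0]

# Less detail: the fractional part is simply dropped
truncated = db.execute("SELECT CAST(3.9 AS INTEGER);").fetchone()[0]
```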
Views
You can save a SELECT statement into the database by creating a view. A view
allows you to save a complex statement as a virtual table, which you can use
later in a simpler form.
Views are a good way of building a collection of useful statements.
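Creating and using a view, sketched in SQLite with an invented table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (title TEXT, price REAL)")
db.executemany("INSERT INTO books VALUES (?, ?)",
               [("Emma", 15.0), ("Dracula", 8.0)])

# Save the SELECT as a view ...
db.execute("""
    CREATE VIEW cheap_books AS
    SELECT title FROM books WHERE price < 10;
""")

# ... then use the view as a virtual table
rows = db.execute("SELECT title FROM cheap_books;").fetchall()
```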
Joins
Very often, you will create a query which involves data from multiple tables.
Joins effectively widen tables by attaching corresponding rows from the other
tables.
The basic syntax for a join is
SELECT columns
FROM table JOIN table;
There is an older syntax using the WHERE clause, but it’s not as useful for
most joins.
Although tables are joined pairwise, you can join any number of tables to get
results from any related tables.
When joining tables, it is best to distinguish the columns. This is especially
important if the tables have column names in common:
You should fully qualify all column names.
It is helpful to use table aliases to simplify the names. These aliases can then
be used to qualify the columns.
The ON Clause
The ON clause is used to describe which rows from one table are joined to which
rows from the other, by declaring which columns from each should match.
The most obvious join is from the child table’s foreign key to the parent
table’s primary key. More complex joins are possible.
You can also create ad hoc joins which match columns which are not in a
fixed relationship.
Join Types
The default join type is the INNER JOIN. The INNER is presumed when no
join type is specified:
An INNER JOIN results only in child rows for which there is a parent. Rows
with a NULL foreign key are omitted.
An OUTER JOIN is an INNER JOIN combined with unmatched rows.
There are three types of OUTER JOIN:
A LEFT or RIGHT join includes unmatched rows from one of the joined
tables.
A FULL join includes unmatched rows from both tables.
A NATURAL join matches columns with identical names in both tables and
doesn’t require an ON clause. It is particularly useful in joining one-to-one
tables. Not all DBMSs support this.
There is also a CROSS JOIN, which combines every row in one table with
every row in the other. It’s not generally useful, but can be handy when you
cross join with a single row of variables.
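The difference between an INNER and a LEFT OUTER join shows up as soon as a child row has a NULL foreign key; here’s a sketch with invented books and authors tables in SQLite:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("""CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT,
    authorid INTEGER REFERENCES authors(id)
)""")
db.executemany("INSERT INTO authors VALUES (?, ?)",
               [(1, "Jane Austen"), (2, "Bram Stoker")])
db.executemany("INSERT INTO books VALUES (?, ?, ?)",
               [(1, "Emma", 1), (2, "Dracula", 2), (3, "Anonymous Tale", None)])

# INNER JOIN: the book with a NULL foreign key is omitted
inner = db.execute("""
    SELECT b.title, a.name
    FROM books AS b JOIN authors AS a ON b.authorid = a.id
    ORDER BY b.id;
""").fetchall()

# LEFT OUTER JOIN: the unmatched book is kept, with NULL for the author
left = db.execute("""
    SELECT b.title, a.name
    FROM books AS b LEFT JOIN authors AS a ON b.authorid = a.id
    ORDER BY b.id;
""").fetchall()
```

Note the table aliases (b and a) used to qualify every column.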
Aggregates
Instead of just fetching simple data from the database tables, you can generate
various summaries using aggregate queries. Aggregate queries use one or more
aggregate functions and imply some groupings of the data.
Aggregate queries effectively transform the data into a secondary summary
table. With grand total aggregates, you can only select summaries. You cannot
also select non-aggregate values.
The main aggregate functions include
count(), which counts the number of rows or values in a column
min() and max() which fetch the first or last of the values in sort order
For numbers, you also have
sum(), avg(), and stdev() (or stddev()) which perform the sum,
average, and standard deviation on a column of numbers
When it comes to working with numbers, not all numbers are used in the
same way, so not all numbers should be summarized.
For strings, you also have
string_agg(), group_concat(), or listagg(), depending on the
DBMS, which concatenates strings in a column
In all cases, aggregate functions only work with values: they all skip over
NULL.
You can control which values in a column are included:
You can use DISTINCT to count only one instance of each value.
You can use CASE ... END to work as a filter for certain values.
Without a GROUP BY clause, or using GROUP BY (), the aggregates are
grand totals: you will get one row of summaries.
You can also use GROUP BY to generate summaries in multiple groups.
Each group is distinct. When you do, you get summaries for each group, as well
as additional columns with the group values themselves.
Aggregates are not limited to single tables:
You can join multiple tables and aggregate the result.
You can join an aggregate to one or more other tables.
In many cases, it makes sense to work with your aggregates in more than one
step. For that, it’s convenient to put your first step into a common table
expression, which is a virtual table which can be used with the next step.
When grouping your data, sometimes you want to filter some of the groups.
This is done with a HAVING clause, which you add after the GROUP BY clause.
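Grouping, aggregating, and filtering the groups, sketched in SQLite with an invented sales table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (customer TEXT, total REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("Alice", 20.0), ("Alice", 30.0), ("Bob", 5.0)])

# One summary row per customer; HAVING filters out small-spending groups
rows = db.execute("""
    SELECT customer, count(*) AS orders, sum(total) AS spent
    FROM sales
    GROUP BY customer
    HAVING sum(total) > 10
    ORDER BY customer;
""").fetchall()
```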
Data Types
There are three main types of data:
Numbers
Strings
Dates
There are many variations of the preceding types which make data storage
and processing more efficient and help to validate the data values.
There are also additional types such as boolean or binary data, which you
won’t see so much in a typical database.
Constraints
Constraints define what values are considered valid. Standard constraints include
NOT NULL
UNIQUE
DEFAULT
Foreign keys (REFERENCES)
You can construct your own additional constraints with the generic CHECK
constraint. Here, you add a condition similar to a WHERE clause which defines
your own particular validation rule.
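A CHECK constraint in action, sketched in SQLite; the price >= 0 rule is an invented example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE books (
        title TEXT NOT NULL,
        price REAL CHECK (price >= 0)  -- our own validation rule
    )
""")
db.execute("INSERT INTO books VALUES ('Emma', 15.0)")  # passes the check

# A negative price violates the constraint, and the insert is rejected
failed = None
try:
    db.execute("INSERT INTO books VALUES ('Dracula', -5.0)")
except sqlite3.IntegrityError as e:
    failed = str(e)
```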
Foreign Keys
A foreign key is a reference to another table and is also regarded as a constraint,
in that it limits values to those which match the other table.
The foreign key is defined in the child table.
A foreign key also affects any attempt to delete a row from the parent table.
By default, the parent row cannot be deleted if there are matching child rows.
However, this can be changed to either (a) setting the foreign key to NULL or (b)
cascading the delete to all of the children.
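The default, restrictive delete behavior can be sketched in SQLite, which needs foreign key enforcement switched on explicitly:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
db.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("""
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        authorid INTEGER REFERENCES authors(id)  -- default: restrict deletes
    )
""")
db.execute("INSERT INTO authors VALUES (1, 'Jane Austen')")
db.execute("INSERT INTO books VALUES (1, 'Emma', 1)")

# Deleting a parent that still has children fails by default
try:
    db.execute("DELETE FROM authors WHERE id = 1")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True
```

The alternatives mentioned above would be declared on the foreign key as ON DELETE SET NULL or ON DELETE CASCADE.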
Indexes
Since tables are not stored in any particular order, they can be time consuming to
search. An optional index can be added for any column you routinely search,
which makes searching much quicker.
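You can see an index being picked up in SQLite with EXPLAIN QUERY PLAN; the table and index names here are invented:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE INDEX ix_books_title ON books (title)")

# The plan reports a SEARCH using the index rather than a full table scan
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM books WHERE title = 'Emma';"
).fetchall()
```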
Manipulating Data
Data manipulation statements are used to add or change data. In addition to the
SELECT statement, there are
INSERT: Add new rows to the table
UPDATE: Change the data in one or more rows in the table
DELETE: Delete one or more rows of the table
Like SELECT, the UPDATE and DELETE statements can be qualified with
a WHERE clause to determine which rows will be affected.
Unlike SELECT, these have the potential to make a mess of a database,
especially since SQL doesn’t have an undo.
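A sketch of UPDATE and DELETE with a WHERE clause, using an invented customers table in SQLite:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, town TEXT)")
db.executemany("INSERT INTO customers (name, town) VALUES (?, ?)",
               [("Alice", "Ivanhoe"), ("Bob", "Preston")])

# WHERE limits which rows are affected; without it, every row would change
db.execute("UPDATE customers SET town = 'Northcote' WHERE name = 'Bob'")
db.execute("DELETE FROM customers WHERE name = 'Alice'")

rows = db.execute("SELECT name, town FROM customers;").fetchall()
```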
Set Operations
In SQL, tables are mathematical sets of rows. This means that they contain no
duplicates and are unordered. It also means that you can combine tables and
virtual tables with set operations.
There are three main set operations:
UNION combines two or more tables and results in all of the rows, with any
duplicates filtered out. If you want to keep the duplicates, you use the UNION
ALL clause.
INTERSECT returns only the rows which appear in all of the participating
tables.
EXCEPT (a.k.a. MINUS in Oracle) returns the rows in the first table which are
not also present in the second.
When applying a set operation, there are some rules regarding the columns in
each SELECT statement:
The columns must match in number and type.
Only the names and aliases from the first SELECT are used.
Only the values are matched, which means that if your various SELECTs
change the column order or select different columns, they will be matched if
they are compatible.
A SELECT can include any of the standard clauses, such as WHERE and
GROUP BY, but not the ORDER BY clause. You can, however, sort the final
results with an ORDER BY at the end.
Set operations can also be used for special techniques, such as creating
sample data, comparing result sets, and combining aggregates.
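All three set operations, sketched in SQLite with two invented customer lists:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE old_customers (name TEXT)")
db.execute("CREATE TABLE new_customers (name TEXT)")
db.executemany("INSERT INTO old_customers VALUES (?)", [("Alice",), ("Bob",)])
db.executemany("INSERT INTO new_customers VALUES (?)", [("Bob",), ("Carol",)])

# UNION: all rows, duplicates removed; ORDER BY applies to the final result
union = db.execute("""
    SELECT name FROM old_customers
    UNION
    SELECT name FROM new_customers
    ORDER BY name;
""").fetchall()

# INTERSECT: only rows present in both tables
both = db.execute("""
    SELECT name FROM old_customers
    INTERSECT
    SELECT name FROM new_customers;
""").fetchall()

# EXCEPT: rows in the first table but not the second
only_old = db.execute("""
    SELECT name FROM old_customers
    EXCEPT
    SELECT name FROM new_customers;
""").fetchall()
```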
Coming Up
As we said, we won’t presume that you’re an expert in all of this. As we
introduce the following chapters, we’ll also recap some of the basic principles to
help you steady your feet.
In the chapters that follow, we’ll have a good look at working with the
following ideas:
How to improve the reliability and efficiency of the database tables (Chapter
2)
How the tables are related to each other and how to work with multiple tables
(Chapter 3)
How to manipulate values to get more out of your data (Chapter 4)
How to summarize and analyze data (Chapter 5)
How we can save queries and interim results (Chapter 6)
How to mix data from multiple tables and aggregates (Chapter 7)
How to build more complex queries on top of other queries
(Chapters 6, 7, and 9)
How to add running aggregates and ranking data to our datasets (Chapter 8)
In Chapter 2, we’ll make a few changes to the database tables and even add a
few more tables to improve its overall design. It won’t be perfect, but it will
show how a database can be further developed.
Footnotes
1 SQLite is the exception here. You can indeed use aliases in the other clauses.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
M. Simon, Leveling Up with SQL
https://doi.org/10.1007/978-1-4842-9685-1_2
SQL databases are, or at least should be, built on some pretty strong principles.
Although these principles are sometimes relaxed in real life, the database will be
more reliable and more efficient if they’re followed as far as possible.
In this chapter, we’re going to look at parts of the existing database and how
it can be improved using some of those principles. We’ll look at
The basic understanding of normal tables—tables which have been
constructed or reconstructed along the core principles
Modifying tables to follow the principles more closely
Improving the reliability and integrity of the database by adding additional
checks and constraints
Improving the performance of the database by adding indexes to help find
data more efficiently
Of course, we won’t be able to make the table perfect: that would take a long
time and a lot of experience with the database. You mightn’t even be in a
position to do this with your database. However, we’ll be able to get a better
understanding of what makes a database work better.
SELECT
id, givenname, familyname,
street, town, state, postcode
FROM customers;
you will see that there is indeed a relationship between some of the address
columns.
For example, if you change your address from one town to another, you will
probably also need to change the postcode and possibly the state. On top of that,
people living in the same town probably also have the same postcode; certainly,
they will be in the same state.
This creates a maintenance problem:
Changing address requires changes in three columns for a single change.
It is possible to make a mistake and change only some of the data; this creates
an inconsistency, making the data useless.
The correct solution would be to move this data to another table.
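Such a towns table might be sketched like this (a minimal sketch; the exact types and lengths are assumptions, following the four-digit postcodes used elsewhere in the book):

```sql
-- Hypothetical sketch: one row per town, with its state and postcode
CREATE TABLE towns (
    id INT PRIMARY KEY,
    name VARCHAR(48),
    state VARCHAR(3),
    postcode CHAR(4)
);
```

Each combination of town, state, and postcode then lives in exactly one row, and the customers table only needs to refer to it.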
Note that the townid column must match the data type of the id column in
the towns table, which, in this case, is an integer.
You’ll notice that it doesn’t actually use the term FOREIGN KEY. It’s the
keyword REFERENCES that makes it a foreign key: in this case, it references an
id in the towns table.
You’ll also notice the naming of the foreign key using CONSTRAINT
fk_customers_town. Every constraint actually has a name, but you don’t
have to name it yourself if you’re prepared to allow the DBMS to make one up.
If so, you can use a shorter form:
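The unnamed form might be sketched like this (the column definition itself is an assumption, since the original statement isn't shown here):

```sql
-- Shorter form: no CONSTRAINT name, so the DBMS generates one
ALTER TABLE customers
ADD townid INT REFERENCES towns(id);
```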
If you already had the column, you could have added the foreign key
constraint retroactively with
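A sketch of the retroactive form, using the fk_customers_town name from above:

```sql
-- Add a foreign key constraint to an existing townid column
ALTER TABLE customers
ADD CONSTRAINT fk_customers_town
FOREIGN KEY (townid) REFERENCES towns(id);
```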
By default, when you create a new column, it will be filled with NULLs. You
could have added a default value instead, but that would be pointless in this case,
since every customer lives in a different place; in some cases, we don't have the
customer's address at all.
SELECT
id, givenname, familyname,
town, state, postcode, -- existing data
(SELECT id FROM towns AS t WHERE -- new data
t.name=customers.town
AND t.postcode=customers.postcode
AND t.state=customers.state
) AS reference
FROM customers;
The UPDATE statement is used to change values in an existing table. You can
set the value to a constant value, a calculated value, or, as in this case, a value
from another table.
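Such an UPDATE might look like this, reusing the subquery from the SELECT above (Oracle would drop the AS before the table alias):

```sql
UPDATE customers
SET townid=(
    SELECT id FROM towns AS t
    WHERE t.name=customers.town
      AND t.postcode=customers.postcode
      AND t.state=customers.state
);
```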
Here, the same subquery is used to fetch the id that will be copied into the
townid column.
Some DBMSs allow you to alias the customers table, which would make
the UPDATE statement a little simpler.
A correlated subquery can be expensive, and it’s normally preferable to
use a join if you can. We could have used a join for the SELECT statement,
but not all DBMSs cooperate so well with UPDATE statements. Here, the
subquery is intuitive and works well, and, since you’re only running this
once, not too expensive.
SELECT
c.id, c.email, c.familyname, c.givenname,
c.street,
-- original values
c.town, c.state, c.postcode,
c.townid,
-- from towns table
t.name AS town, t.state, t.postcode,
c.dob, c.phone, c.spam, c.height
FROM customers AS c LEFT JOIN towns AS t ON
c.townid=t.id;
If you’re doing this in Oracle, remember that you can’t use AS for the table
aliases:
SELECT
...
FROM customers c LEFT JOIN towns t ON c.townid=t.id;
Note that
We use the LEFT JOIN to include customers without an address.
We alias the customers and towns tables for convenience.
The towns table has a name column, instead of the town column. However,
in the context of the query, it makes sense to alias it to town.
We’ve also included the c.townid column, which, though it’s redundant,
might make it easier to maintain.
Once you have checked that the SELECT statement does the job, you can
create a view. Of course, you should leave out the old town data, since the whole
point is to use the data from the joined data:
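For most DBMSs, the view might be created like this (Oracle would drop the AS before the table aliases, and Microsoft SQL additionally needs the GO wrapping described next):

```sql
CREATE VIEW customerdetails AS
SELECT
    c.id, c.email, c.familyname, c.givenname,
    c.street,
    c.townid, t.name AS town, t.state, t.postcode,
    c.dob, c.phone, c.spam, c.height
FROM customers AS c LEFT JOIN towns AS t ON c.townid=t.id;
```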
In Microsoft SQL, you need to wrap the CREATE VIEW statement between
a pair of GO keywords:
-- MSSQL:
GO
CREATE VIEW customerdetails AS
SELECT
c.id, c.email, c.familyname, c.givenname,
c.street,
c.townid, t.name as town, t.state, t.postcode,
c.dob, c.phone, c.spam, c.height
FROM customers AS c LEFT JOIN towns AS t ON
c.townid=t.id;
GO
Here, we use DROP COLUMN, which removes one or more columns and, of
course, all of their data, so you would want to be sure that you don't need them
anymore. As you've seen earlier, there are some variations in the syntax between
DBMSs.
In Microsoft SQL, you will get an error that you can’t drop the postcode
column because there is an existing constraint. A constraint is an additional rule
for a valid value.
In this case, there is a constraint called ck_customers_postcode which
requires that postcodes comprise four digits only. You won’t need that constraint
now, especially since you’re going to remove the column.
To remove the constraint, run
-- MSSQL
ALTER TABLE customers
DROP CONSTRAINT ck_customers_postcode;
Once you have successfully removed the constraint, you can now remove the
columns:
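One statement per column is the most portable form (some DBMSs also accept a comma-separated list in a single statement):

```sql
ALTER TABLE customers DROP COLUMN town;
ALTER TABLE customers DROP COLUMN state;
ALTER TABLE customers DROP COLUMN postcode;
```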
Note that we’re reading from the customerdetails view, because the
town data is no longer in the customers table, though the townid is.
Now, change the customer’s townid to anything you like (as long as it’s no
more than the highest id in the towns table):
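For example (the customer and town ids here are invented values for illustration):

```sql
-- Hypothetical: move customer 42 to town 12
UPDATE customers SET townid=12 WHERE id=42;
```

If you then read the customer back through the customerdetails view, the town, state, and postcode all change together.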
Note that the primary key is a two-character string. Every country has a
predefined two-character code, generally based on the country’s name,
either in English or in the country’s language. It makes sense to use this as
its primary key, rather than making one up. This is an example of a natural
key: a primary key based on actual data rather than an arbitrary code.
Run the script to install the table.
2.
Add a countryid column to the towns table, similar to the way you
added townid to the customers table. Remember, the data type must
match the preceding primary key:
-- MySQL / MariaDB
ALTER TABLE towns
ADD countryid CHAR(2) REFERENCES countries(id);
3. Update the towns table to set the value of countryid to 'au' for
Australia or whichever country you choose. This is much simpler than
setting it from a subquery:
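Since every town gets the same value, a simple UPDATE with a constant will do:

```sql
UPDATE towns SET countryid='au';
```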
4.
You will have to modify your view. First, drop the old version:
-- Not Oracle:
DROP VIEW IF EXISTS customerdetails;
-- Oracle:
DROP VIEW customerdetails;
5.
Next, you will have to recreate it with the country name:
-- Not Oracle
CREATE VIEW customerdetails AS
SELECT
...
c.townid, t.name AS town, t.state,
t.postcode,
n.name AS country
...
FROM
customers AS c
LEFT JOIN towns AS t ON c.townid=t.id
LEFT JOIN countries AS n ON
t.countryid=n.id;
-- Oracle
CREATE VIEW customerdetails AS
SELECT
...
c.townid, t.name AS town, t.state,
t.postcode,
n.name AS country
...
FROM
customers c
LEFT JOIN towns t ON c.townid=t.id
LEFT JOIN countries n ON t.countryid=n.id;
Note
This includes an additional JOIN to the countries table; to
accommodate the longer clause, we have split the JOIN over multiple
lines.
The alias for the countries table has been set to n (for Nation); this is
simply because we can’t use c as it is already in use.
Additional Comments
You may have noticed that we didn’t do anything about the street address
column. Strictly speaking, this is also subject to the same issues as the rest of the
address, so it would have been better if we did something similar.
However, street addresses are much more complicated, and we don’t have so
many customers, so we have left them as they are. This leaves us with an
imperfect but much improved design.
However, through an oversight, the column allows NULL, which, if you look
far enough, you’ll find in a number of rows. That doesn’t make sense: you can’t
have a sale item if you don’t know how many copies it’s for.
It’s reasonable to guess that a missing quantity suggests a quantity of 1.
You can implement this guess using coalesce():
SELECT
id, saleid, bookid,
coalesce(quantity,1) AS quantity, price
FROM saleitems
ORDER BY saleid, id;
Now we’ll get the same results, except that the NULLs have been replaced
with 1:
We certainly don’t want to keep doing this every time, so we’re going to fix
the old values and prevent the NULLs in the future.
What follows won’t work with SQLite. However, there is a section after
this which is what you might do to make the same changes in SQLite.
UPDATE saleitems
SET quantity=1
WHERE quantity IS NULL;
-- PostgreSQL
ALTER TABLE saleitems ALTER COLUMN quantity SET
NOT NULL;
-- MySQL/MariaDB
ALTER TABLE saleitems MODIFY quantity INT NOT
NULL;
-- MSSQL
ALTER TABLE saleitems ALTER COLUMN quantity INT
NOT NULL;
-- Oracle
ALTER TABLE saleitems MODIFY quantity NOT NULL;
-- Not Possible in SQLite
Earlier, the ALTER TABLE statement was used to add or remove a column.
You can also use it to make changes to an existing column. Here, we use it to
add a NOT NULL constraint.
As you’ve seen earlier, each DBMS has its own subtle variation on the
ALTER TABLE statement.
Setting a DEFAULT for Quantity
In principle, whatever caused the NULLs to appear may happen again, only now
it will generate an error. Better still, we should supply a default of 1 in case the
quantity is missing in a future transaction:
-- PostgreSQL
ALTER TABLE saleitems
ALTER COLUMN quantity SET DEFAULT 1;
-- MySQL/MariaDB
ALTER TABLE saleitems
MODIFY quantity INT DEFAULT 1;
-- MSSQL
ALTER TABLE saleitems
ADD DEFAULT 1 FOR quantity;
-- Oracle
ALTER TABLE saleitems
MODIFY quantity DEFAULT 1;
-- Not Possible in SQLite
The DEFAULT value is the value used if you don’t supply a value of your
own. The column doesn’t have to be NOT NULL, and NOT NULL columns
don’t have to have a DEFAULT. However, in this case, it’s a reasonable
combination.
Again, note that each DBMS has its own subtle variation on the syntax.
CHECK (quantity>0)
You could also impose an upper limit by using the BETWEEN expression:
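A sketch of such a range check (the upper bound of 100 is an invented value for illustration):

```sql
CHECK (quantity BETWEEN 1 AND 100)
```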
-- PostgreSQL
ALTER TABLE saleitems
ADD CHECK (quantity>0);
-- MySQL/MariaDB
ALTER TABLE saleitems
MODIFY quantity INT CHECK(quantity>0);
-- MSSQL
ALTER TABLE saleitems
ADD CHECK(quantity>0);
-- Oracle
ALTER TABLE saleitems
MODIFY quantity CHECK(quantity>0);
-- Not Possible in SQLite
-- PostgreSQL
ALTER TABLE saleitems
ALTER COLUMN quantity SET NOT NULL,
ALTER COLUMN quantity SET DEFAULT 1,
ADD CHECK (quantity>0);
-- MySQL/MariaDB
ALTER TABLE saleitems MODIFY quantity INT
NOT NULL
DEFAULT 1
CHECK(quantity>0);
-- Oracle
ALTER TABLE saleitems MODIFY quantity
DEFAULT 1
NOT NULL
CHECK(quantity>0);
-- Not Possible in MSSQL
-- Not Possible in SQLite
Since you don’t actually make this sort of change terribly often, you lose
nothing if you keep the steps separate.
2.
Add a new quantity column with the required properties:
3.
Copy the data from the old column to the new one:
UPDATE saleitems
SET quantity=oldquantity;
4.
Drop the old column:
The new column will be at the end, which is not where the original was, but
that’s not really a problem.
Other Adjustments
As is often the case in the development process, it's not hard to get something
working, but the main effort goes into making it work just right. Here are some
suggestions to improve both the integrity and the performance of the database.
We’ll talk about indexes in the next section: they help in making the data
easier to search or sort.
You’ll notice that some of the CHECK constraints aren’t associated with a
single column. Some constraints are more concerned with how one column
relates to another column.
We certainly won’t address all of these suggestions here. After all, this isn’t a
real working database, and it’s quite possibly not your job anyway. We’ll just
look at two more.
-- PostgreSQL
ALTER TABLE books ADD CHECK (price>=0);
-- MySQL/MariaDB
ALTER TABLE books MODIFY price INT
CHECK(price>=0);
-- MSSQL
ALTER TABLE books ADD CHECK(price>=0);
-- Oracle
ALTER TABLE books MODIFY price CHECK(price>=0);
Again, to do this with SQLite, you can follow the steps for the quantity
in saleitems earlier.
Unlike adding a column constraint, the various DBMSs all use the same
syntax—except, of course, for SQLite. There is no simple method for adding a
table constraint in SQLite. Complex methods include dropping and recreating
the whole table similar to dropping and recreating a column or tampering with
the internals of the database, which is definitely not for the fainthearted.
Adding Indexes
SQL doesn’t define what order a table should be in. That leaves it up to the
DBMS to store the table in any way it deems most efficient.
The problem is that when searching for a particular row, it could be
anywhere, and the only way to find it is to look through the whole table and
hope that it doesn’t take too long.
If, on the other hand, the table were in order, it would be much easier to find
what you're looking for. However, even if it's in order, it's just as likely to be
ordered by the wrong column.
For example, even if the customers table is in, say, id order, it doesn’t
help when searching by familyname. If it’s in familyname order, it doesn’t
help when searching by phone.
The solution is to leave the table alone and then supplement the table with
one or more indexes. An index is an additional listing which is in search order,
together with a pointer to the matching row in the table.
For example, the customers table has an index for the familyname.
When the time comes to search on the familyname, the DBMS automatically
looks up the index instead, finds what it wants, and goes back to the real table to
fetch the rest of the data.
There are two costs to having an index:
Each index takes up a little more space in the database.
Every time you add or change a row in the table, each index will also need to
be updated.
For this reason, you will only find an index on a column if it has been
specifically requested in the table design. And you would only include an index
if you considered the improvement in search ability to be worth the cost in
storage and management.
There are two additional indexes which are automatically included:
Any UNIQUE column is always indexed; the best way to prevent duplicated
values is to keep an ordered list of existing values.
The primary key is always indexed; by definition, it is a unique identifier,
which you would presumably search often.
Another type of column which might be worth considering is a foreign key.
That’s because it will, of course, be heavily involved in searching and sorting.
There is some discussion in learned circles as to the merits of indexing
foreign keys. Overall, it appears to be a good idea, and you would probably
do well to consider adding an index to each of them.
Any other column would be a matter of judgment. At least it’s not hard to
change your mind about adding or removing an index at some point in the
future.
Some DBMSs do include the ability to store the table in order of one
column or the other. This is called a clustered index or an index organized
table. In some DBMSs, such as Microsoft SQL, the clustering is permanent
(the DBMS ensures that the table is maintained in that order); in some others,
it is temporary (the DBMS sorts the table once, but you’ll have to do it again
in the future).
Here, we’re ignoring clustering. In any case, you still can’t keep the table
in multiple orders, so you’ll need indexes anyway.
The ON clause identifies the table and the columns you want listed.
It is possible to index multiple columns in a single statement, but that doesn’t
create multiple separate indexes. Instead, you create an index on the combined
value. For example:
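A combined index on family and given names might be sketched like this (the index name is an assumption, following a common ix_table_columns pattern):

```sql
-- One index over the combined (familyname, givenname) value
CREATE INDEX ix_customers_name ON customers(familyname, givenname);
```

This helps when searching or sorting by family name, then given name, but not when searching by given name alone.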
ix_table_columns
This isn’t a rigid rule, but it makes things easier to work with.
Why does the index need a name anyway? Most of the time, you don’t really
care. However, there are two reasons:
Everything stored in the database, including maintenance objects, must have a
unique name for internal management.
If you ever need to drop an index, you need to use its name to identify it.
Even if you succeed in creating an anonymous index, the DBMS will
automatically assign its own name, which isn’t always a very pretty name.
Another index you might consider is on the foreign key authorid in the
books table. You can add it with
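A sketch of that index (the name ix_books_authorid is an assumption, following the same pattern):

```sql
CREATE INDEX ix_books_authorid ON books(authorid);
```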
By grouping the names, you can count how many times they appear. Of
course, since you’re only interested in those that appear more than once, you can
filter the results with a HAVING clause:
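Such a query on the phone column might be sketched like this (the alias number is chosen to match the output shown next; Oracle users may need a different alias, since NUMBER is a reserved word there):

```sql
SELECT phone, count(*) AS number
FROM customers
GROUP BY phone
HAVING count(*)>1;
```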
phone     number
[NULL]    17
In this case, there are no duplicates. What appear to be duplicates are NULLs,
because there are multiple NULLs in the table. They don’t count.
If you do find duplicates, then you have your work cut out for you in trying
to work out whether these duplicates are legitimate. You might even conclude
that duplicate phone numbers are OK, so you wouldn’t go ahead with the next
step.
Assuming that duplicates are not OK, to protect against duplicates, you add a
UNIQUE INDEX:
-- Not MSSQL
CREATE UNIQUE INDEX uq_customers_phone
ON customers(phone);
-- MSSQL
CREATE UNIQUE INDEX uq_customers_phone
ON customers(phone)
WHERE phone IS NOT NULL;
Note that this time the index name begins with uq as a reminder that this is a
unique index. Again, there are no rules for how to name the index, but this one
follows a common and understandable pattern.
Whether or not you really want to disallow duplicate phone numbers is
another question. Two customers from the same household or organization
may well share the same phone number, so disallowing them would be
problematic. This is an exercise in how to disallow duplicates, but not
necessarily on whether to disallow duplicates. That’s something best left to
the needs of the individual database.
Review
A well-designed SQL database needs to follow a few rules to ensure that the data
can be relied upon. There is no guarantee that the data is true, but the data will at
least be valid.
Normal Form
A table which follows certain design principles is said to be in a normal form.
This doesn’t mean that it’s commonplace, but rather that it is in a definitive
form.
Normalized tables include the following properties:
Data is atomic.
Rows are unordered.
Rows are unique.
Rows are independent of each other.
Columns are independent of each other.
Columns are of a single type.
Column names are unique.
Columns are unordered.
Multiple Values
One issue in developing tables is how to handle multiple values and recurring
values. In general, the solution is to have additional tables and to link them using
foreign keys.
Altering Tables
When restructuring or hardening a database, you need to make changes to
existing tables and columns. The ALTER TABLE statement can be used to
Add extra columns, including foreign keys
Drop columns
Add or drop constraints
Add or drop indexes
Constraints include adding NOT NULL, defaults, and additional CHECK
constraints.
Views
A view is a saved SELECT statement. One reason to create a view is for the
convenience of having data from one or more tables in one place.
Sometimes, when you create a view with combined data, you end up with a
result which no longer follows all the rules of normalization. In the trade, this
would be referred to as denormalization.
Denormalized data is generally a bad way to maintain data, but very often a
convenient way to extract data. In this sense, it is the best of both worlds: the
original data is still intact in the original tables.
Some DBMSs include the ability to update data in a view. In fact, the update
doesn't affect the view itself at all, but is rather passed on to the underlying tables.
Indexes
An index is a supplement to a table which stores the selected data in order,
together with a reference to the data in the original table. Using the index, the
DBMS can search for data more quickly.
Indexes are automatically created for primary keys and unique columns. You
can add an index on any other column.
Indexes have some costs, so they shouldn’t be added for no reason. Costs
include storage and maintenance.
Unique indexes can be added to ensure that values in a particular column, or
combination of columns, are unique.
Summary
In this chapter, we focused on the properties of individual tables and looked for
ways to make the database more reliable and more efficient.
We looked at
The principles of normalized SQL tables
How multiple values are handled in normalized tables
Altering tables to improve their reliability and to better fit the principles of
normal tables
Creating views to improve access to multiple tables
Adding an index to improve efficiency
The process of improving the database was, of course, incomplete, but it
gives us a better understanding of what makes a database more reliable and more
efficient.
Coming Up
In this chapter, we’ve been focused on properties of individual tables, which help
to improve the integrity and efficiency of the tables.
In the next chapter, we’ll look more at how multiple tables interact.
Footnotes
1 This is odd, since constraints normally ignore NULLs, and NULL doesn’t match NULL anyway.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
M. Simon, Leveling Up with SQL
https://doi.org/10.1007/978-1-4842-9685-1_3
A database is not just one table. Well, it can be of course, but any sophisticated
database, such as one which you would use to manage an online bookshop, will
comprise a number of tables, each handling a different collection of data.
While you can get some useful information from examining individual
tables, you will get so much more from combining tables.
In this chapter, we will look at working with multiple tables, how they are
related to each other, and how to combine them when the time comes.
Specifically, we’ll look at
What we mean by table relationships and their main types
How one-to-many relationships are used to manage multiple values
How one-to-one relationships are used to extend one table with another
How many-to-many relationships are used to manage more complex multiple
values
How to work with inserting and updating data in multiple tables
We’ll look at why the database is structured this way with multiple tables and
how we can use joins to combine them into virtual tables.
An Overview of Relationships
A well-structured database adheres to a number of important design principles.
Two of these are as follows:
Each table has one type of data only and doesn’t include data which rightly
belongs in another table.
For example, the customers table doesn’t include book details, and the
books table doesn’t include customer details. This would also apply to the
books and authors tables.
That isn’t to say that the books table isn’t aware of the author at all.
We’ll look at that in a moment.
Data is never repeated. The same item of data is not to be found in a different
table, nor will it be repeated in the same table.
For example, if you were to include the author’s name and other details in
the books table, you would find yourself repeating the same details for other
books by the same author.
These two principles are related: if you mix author details with the books,
violating the first principle, you will end up repeating the details for multiple
books, violating the second principle.
The correct way to manage books and authors is to put author details in a
separate table and for the books table to include a reference to one of the
authors. In this way, we say that there is a relationship between the two tables.
The same would apply to the books and customers tables. Since the goal
is for customers to buy books, there should be a relationship between these
tables as well. However, this relationship is a little more complex, as we shall
see later.
There are three main types of relationships:
A one-to-many relationship is between the primary key of one table and a
foreign key of another.
For example, there is a one-to-many relationship between authors and
books: one author can have many books, and many books can have the same
author.
A one-to-one relationship is between the primary key of one table and the
primary key of another. Generally, this is rare, and you are more likely to see
a variation of this.
For example, there is a vip table of additional features for customers. For
each customer, there can only be one vip entry, and there is a (modified) one-
to-one relationship between the two tables.
A many-to-many relationship is not a direct relationship, but one that
involves a joining table between the two main tables.
For example, there is a genres table which contains possible genre
labels for books. Since one book could have many genres, and one genre
could apply to many books, there is a many-to-many relationship between the
two tables.
You will see that this is implemented with an additional table.
These relationships might be described as planned relationships. They’re
usually enforced with foreign key constraints, usually involve a primary key, and
define a tight structure for the database.
There can also be unplanned relationships. For example, you might consider
a relationship between birthdays of customers and authors. That sort of
relationship would probably be a coincidence, but might be worth exploring in
some situations—maybe Scorpios feel an affinity with other Scorpios.
We’ll refer to unplanned relationships as ad hoc relationships and look at a
few later.
If you have multiple tables in a planned or unplanned relationship, you can
examine the combination using a JOIN.
One-to-Many Relationship
This is the most common type of relationship between two tables. The
relationship is between the primary key of one table and a foreign key in another.
However, it’s actually implemented as a reference from a foreign key to the
primary key.
The relationship is used to indicate a number of possible scenarios. For
example:
One Author has written many Books.
One Customer has many Sales.
One Sale contains many Items.
Note that the use of the word many can imply any number from 0 to ∞.
In the preceding cases, one table is referred to as the one table, while the
other is referred to as the many table, which is not very informative. Sometimes,
it is helpful to think of the one table as the parent table, while the many table is
the child table.
A one-to-many relationship is implemented as a reference from the child
table to the parent table, for example, for books and authors:
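The relationship might be sketched like this (a minimal sketch; the types and lengths are assumptions, but the column names follow the book's schema):

```sql
-- Parent table: one row per author
CREATE TABLE authors (
    id INT PRIMARY KEY,
    givenname VARCHAR(48),
    familyname VARCHAR(48)
);

-- Child table: each book references (at most) one author
CREATE TABLE books (
    id INT PRIMARY KEY,
    title VARCHAR(255),
    authorid INT REFERENCES authors(id)
);
```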
Note that while the child table has a reference to the parent table, the parent
table does not have a reference to the child table.
You can combine parent and child tables using a JOIN:
-- Not Oracle
SELECT
b.id, b.title, -- etc
a.givenname, a.familyname -- etc
FROM books AS b JOIN authors AS a ON
b.authorid=a.id;
This will give you the books with their matching authors:
Note that Oracle has a quirk which disallows using AS for table aliases. If
you’re using Oracle, you’ll need to remember that in the following examples
which may include AS.
Remember, if there are anonymous books (books with a NULL for
authorid), you will need an outer join:
-- Not Oracle
SELECT
b.id, b.title, -- etc
a.givenname, a.familyname -- etc
FROM books AS b LEFT JOIN authors AS a ON
b.authorid=a.id;
This will give you all of the books with or without their authors:
In the previous example, we opted for a LEFT JOIN. When you join a child
table to a parent table, you generally have four options:
Only the matching rows
Include all of the unmatched children
Include all of the unmatched parents
Include all of the unmatched children and parents
The first option is, of course, an INNER JOIN, or, more simply, JOIN.
The result would look like Figure 3-3.
Figure 3-3 An Inner Join
You’ll notice that the join doesn’t include unmatched books or authors.
The second and third options are LEFT OUTER JOIN or RIGHT OUTER
JOIN, depending on whether the unmatched rows are on the left or the right;
again, we can simply write LEFT JOIN or RIGHT JOIN. In this case, a LEFT
JOIN would include unmatched books, as in Figure 3-4.
Figure 3-4 An Outer Join
In this case, we went for the LEFT JOIN because the child table was on the
left, and we wanted all of them with or without matches.
Despite the apparent symmetry, all joins are not equal. When you join a child
to a parent table, the number of results will generally reflect the child table.
That’s because many of the children would share the same parent.
To get a fair estimate of how many results you might expect, therefore, you
should start by counting the rows.
To get the number of results in an INNER JOIN, you’ll need to count the
number of children which match a parent—that is, where the foreign key is NOT
NULL:
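A sketch of that count:

```sql
-- Matched children: books that have an author
SELECT count(*) FROM books WHERE authorid IS NOT NULL;
```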
That should get you the number of rows for the INNER JOIN previously:
Count
1172
To count the number of unmatched child rows, you just need to count the
ones where the foreign key is NULL:
-- Unmatched Children
SELECT count(*) FROM books WHERE authorid IS NULL;
That will give you the number of rows missing in the INNER JOIN:
Count
29
If you add this to the number for the INNER JOIN, you’ll get the total
number of books, which is the number of rows in the child OUTER JOIN
earlier.
To get the number of unmatched parent rows is trickier. You’ll need to count
the number of rows in the parent table whose primary key is not one of the
foreign keys in the child table:
-- Unmatched Parents
SELECT count(*) FROM authors
WHERE id NOT IN(SELECT authorid FROM books WHERE
authorid IS NOT NULL);
Count
45
The subquery selects for the authors whose id does make an appearance in
the books table. The NOT IN expression selects for the others. The reason that
the subquery includes the WHERE authorid IS NOT NULL clause is due to
a quirk in the behavior of NOT IN with NULLs. This is explained later.
Now, you have all the numbers you need to estimate the number of rows in
your join. You can use the following combinations:
JOIN                Calculation
INNER JOIN          INNER JOIN
Child OUTER JOIN    INNER JOIN + Unmatched Children = Children
Parent OUTER JOIN   INNER JOIN + Unmatched Parents
Full OUTER JOIN     INNER JOIN + Unmatched Children + Unmatched Parents
That’s the number of rows you can expect from a child outer join: LEFT
JOIN or RIGHT JOIN, depending on where you put the child table.
Of course, that’s not necessarily the end of it. If you have an inner join, and
there are some NULL foreign keys, then you’ll end up with fewer than the
estimate. If you opt for a parent outer join, then there’ll be more rows if you
have parents without matching children.
However, this is a good starting point.
To find customers in the other states, you can use NOT IN:
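For example (the state list here is hypothetical):

```sql
-- Customers in every state except the listed ones
SELECT * FROM customers
WHERE state NOT IN ('NSW', 'VIC');
```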
That’s as expected. However, if you include NULL in your list, things get
messy. You need to remember how IN(...) is interpreted. For example:
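Sketching this with a hypothetical list of states that includes a NULL:

```sql
WHERE state IN ('NSW', 'VIC', NULL)
-- ... is interpreted as a chain of ORs:
WHERE state='NSW' OR state='VIC' OR state=NULL
```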
That last term state=NULL will always fail, since NULL always fails a
comparison, but that’s OK if it matches one of the others.
However, the NOT IN version:
is equivalent to
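Sketching both forms with the same hypothetical list of states:

```sql
WHERE state NOT IN ('NSW', 'VIC', NULL)
-- ... is equivalent to a chain of ANDs:
WHERE state<>'NSW' AND state<>'VIC' AND state<>NULL
```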
When you negate a logical expression, you not only negate the individual
terms, but you also negate the operators between them.
Once again, the term state<>NULL always fails, but, since this is now
ANDed with the rest, it fails the whole expression.
The moral of this story is that you can’t use NOT IN if the list contains
NULLs.
You can drop an existing view using DROP VIEW. For most DBMSs, you
can use DROP VIEW IF EXISTS if you’re not sure that it exists (yet). Not
with Oracle, however.
Microsoft SQL has an additional quirk: CREATE VIEW must be the only
statement in its batch, so you need to put the statement between the GO keyword,
which marks the end of one batch and the beginning of another:
We’ve included the authorid in case you want to use it to get more author
details.
Once you have saved a view, you can pretend it’s another table:
SELECT * FROM bookdetails;
You’ll get the same results as before with a little less effort.
One-to-One Relationships
A one-to-one relationship associates a single row of one table with a single row
of another. It is normally between two primary keys.
If every row in one table is associated with a row in another table, then you
can consider the second table as an extension of the first table. If that’s the case,
why not just put all of the columns in the same table? Reasons include the
following:
You want to add more details, but you don’t want to change the original table.
You want to add more details, but you can’t change the original table (possibly
because of permissions).
The additional table contains details that may be optional: not all rows in the
original table require the additional columns.
You want to keep some of the details in a separate table so that you can add
another layer of security to the additional details.
One-to-Maybe Relationships
Technically, a true one-to-one relationship requires a reference from both tables
to each other. Among other things, it is hard to implement as it might require
adding both rows at the same time.
Since a row from table A must reference a row from table B, you would
need to have the table B row in place before you add to table A. However, if a
row from table B must also reference a row from table A, then you need to
add to table A first. That’s clearly a contradiction.
One way to do this would be to defer the foreign key constraint until after
you’ve added to both tables in either order. Unfortunately, most DBMSs don’t
let you do this, so you’re stuck with this impossible situation.
Here, the secondary table contains additional data for some of the rows in the
main table.
Note that this relationship is implemented by making the id in the secondary
table both a primary key and a foreign key.
For example, the vip table includes additional features for some customers:
You can see all of the customers, some of whom also have VIP data:
You’ll notice that there aren’t as many rows in the vip table as in the
customers table. There might have been, if every customer were a VIP, but
not in this case.
You can see how they relate using a join:
This gives all of the customers, with either their VIP data or NULLs in the
extra columns.
Note
We need the LEFT JOIN to include non-VIP customers. If you wanted
VIP customers only, a simple (inner) JOIN would be better.
We could have used SELECT *, but using c.*, v.* allows you to
decide which tables you are most interested in.
As a special case, you can also select VIP customers only, without additional
VIP columns, using
SELECT c.*
FROM customers AS c JOIN vip AS v ON c.id=v.id;
Here, the inner join selects only VIP customers, and the c.* selects only the
customer columns.
Why you would want to do this is, of course, up to you.
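The one-to-maybe pattern can be sketched in SQLite like this (hypothetical rows; the real vip table has more columns):

```python
import sqlite3

# vip.id is both a primary key and a foreign key, so there can be
# at most one vip row per customer, and only for some customers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE vip (
        id INTEGER PRIMARY KEY REFERENCES customers(id),
        discount REAL
    );
    INSERT INTO customers VALUES (1, 'May Knott'), (2, 'Rick Shaw');
    INSERT INTO vip VALUES (1, 0.1);    -- only customer 1 is a VIP
""")

# LEFT JOIN keeps everybody; non-VIPs get NULL in the extra column.
everyone = conn.execute("""
    SELECT c.name, v.discount
    FROM customers AS c LEFT JOIN vip AS v ON c.id = v.id
    ORDER BY c.id
""").fetchall()

# Inner join keeps VIP customers only.
vips = conn.execute("""
    SELECT c.name
    FROM customers AS c JOIN vip AS v ON c.id = v.id
    ORDER BY c.id
""").fetchall()

print(everyone)  # [('May Knott', 0.1), ('Rick Shaw', None)]
print(vips)      # [('May Knott',)]
```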
Multiple Values
One major question in database design is how to handle multiple values. The
principles of properly normalized tables preclude multiple values in a row:
A single column cannot contain multiple values.
For example, you shouldn’t have multiple phone numbers in a single
column, such as 0270101234,0355505678. This would be impossible to
sort or search properly and be a real nightmare to maintain.
You shouldn’t have multiple columns with the same role.
For example, you shouldn’t have multiple columns for phone numbers
such as phone1, phone2, phone3, etc. It would make searching
problematic as you can’t be sure which column to search. You would also
have the problem of too many or not enough columns.
Note that you can do this if there is a clear distinction in the type of phone
number. For example, you could legitimately have separate columns for fax
(does anybody remember these?), mobile, and landline numbers.
For example, suppose we wish to record multiple genres for a book. Here are
two attempted solutions which are not correct:
A column with multiple (delimited) values
The idea is that the genre column would have multiple genres or genre ids,
delimited possibly by a comma. The problem is that the data is not atomic,
and this becomes very difficult to sort, search, and update. You will also need
extra work to use the data.
Multiple columns for genres
You cannot have multiple columns with the same name, so these columns
might be called genre1, genre2, etc. Here, the problems are (a) you will either
have too many or not enough columns, (b) there is no “correct” column for a
particular value, and (c) searching and sorting are impractical.
The problem of recording genres is more complicated, because not only can
one book have multiple genres, one genre can apply to multiple books. This is an
example of a many-to-many relationship.
This cannot be achieved directly between the two tables; rather, it involves
an additional table between them.
If you have the courage to look at the script which created and populated
the sample database, you’ll find a table called booksgenres (not to be
confused with the bookgenres table, which is, of course, completely
different) which does indeed have the genres combined in a single column.
This is, of course, cheating.
This is one case where you might break the rules for the purpose of
transferring or backing up the data only. However, the data should never stay
in this format.
Many-to-Many Relationships
To represent a many-to-many relationship between tables, you will need another
table which links the two others.
Such a table is called an associative table or a bridging table.
It looks like Figure 3-7.
Figure 3-7 A Many-to-Many Relationship
-- Book Table
CREATE TABLE books (
id int PRIMARY KEY,
title varchar,
-- etc
);
-- Genre Table
CREATE TABLE genres (
id int PRIMARY KEY,
name varchar,
description varchar
-- etc
);
The genres table includes a surrogate primary key. It also contains the
actual genre name and a description so that the use of the particular genre is
clear.
You can see what’s in the two tables with simple SELECT statements:
Neither table refers to the other. Instead, you need an additional table.
The associative table will then link books with genres:
-- Associative Table
CREATE TABLE book_genres (
bookid int REFERENCES books(id),
genreid int REFERENCES genres(id)
);
bookid genreid
456 8
789 8
123 52
456 38
789 38
123 80
456 94
356 1
789 113
123 9
1914 1
936 1
1198 1
918 1
456 35
789 68
456 146
789 80
456 101
456 145
1618 2
844 3
~ 8011 rows ~
This table is a simple table which has one job only: record which books are
related to which genres.
Each column must be a foreign key to its respective table; otherwise, the whole
point of the association is lost. This association allows a book to be associated
with multiple genres and a genre to be associated with multiple books.
In the preceding table, for example, book 123 has multiple genres. Book
456 also has several genres. Some of those genres appear for both books and, for
all we know, other books later on. That is, one book can have many genres, and
one genre can associate with many books.
There is one more requirement. The combination should be unique. There is
no point in associating a book with the same genre more than once. Since there
is no other data in the table, it would be appropriate to make the combination a
compound primary key:
-- Associative Table
CREATE TABLE book_genres (
bookid int REFERENCES books(id),
genreid int REFERENCES genres(id),
PRIMARY KEY (bookid,genreid)
);
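A small sketch in SQLite shows the compound primary key doing its job of rejecting duplicate pairings (ids are made up):

```python
import sqlite3

# The compound primary key on the associative table means each
# book/genre pairing can be recorded only once.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE book_genres (
        bookid INTEGER,
        genreid INTEGER,
        PRIMARY KEY (bookid, genreid)
    )
""")
conn.execute("INSERT INTO book_genres VALUES (123, 52)")
conn.execute("INSERT INTO book_genres VALUES (123, 80)")      # fine: a new pair

try:
    conn.execute("INSERT INTO book_genres VALUES (123, 52)")  # duplicate pair
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False: the duplicate was rejected
```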
Count
8011
SELECT *
FROM
bookgenres AS bg
JOIN books AS b ON bg.bookid=b.id
JOIN genres AS g ON bg.genreid=g.id
;
This gives a very long list, because the bookgenres table is very long:
Here, we started the join from the middle, since we’re focusing on the
associative table. You could just as readily have started on one end:
SELECT *
FROM
books AS b
JOIN bookgenres AS bg ON b.id=bg.bookid
JOIN genres AS g ON bg.genreid=g.id
;
You’ll get the same data, but, since the tables are in a different order, the
columns will also be in a different order.
In reality, you end up with too many columns, two of which are duplicated
by the join. You can simplify the result as
id title genre
1732 In His Steps Fiction
414 Poesies Fiction
241 Researches in Teutonic Mythology Fantasy
247 The King in Yellow Gothic
1914 Voyage of the Beagle Classics
936 The Origin of Species Classics
~ 8011 rows ~
WITH cte AS (
SELECT b.id, b.title, g.genre
FROM bookgenres AS bg
JOIN books AS b ON bg.bookid=b.id
JOIN genres AS g ON bg.genreid=g.id
)
--etc
;
-- Not SQLite
SELECT
b.id, b.title, b.published, b.price,
g.genre,
a.givenname, a.othernames, a.familyname,
a.born, a.died, a.gender, a.home
FROM
authors AS a
RIGHT JOIN books AS b ON a.id=b.authorid
LEFT JOIN bookgenres AS bg ON b.id=bg.bookid
JOIN genres AS g ON bg.genreid=g.id
;
In the preceding example, we have included most of the columns from the
four tables, omitting the foreign keys and most of the other primary keys.
The tables are joined in a line from the authors table to the genres table.
Since we want all of the books, regardless of whether they have associated
authors or genres, we use two outer joins. As it turns out, we see examples of
each of the three main join types.
SQLite doesn’t support the RIGHT JOIN, so this won’t work.
You can write the joins starting from the books table if you like:
SELECT
b.id, b.title, b.published, b.price,
g.genre,
a.givenname, a.othernames, a.familyname,
a.born, a.died, a.gender, a.home
FROM
books AS b
LEFT JOIN bookgenres AS bg ON b.id=bg.bookid
JOIN genres AS g ON bg.genreid=g.id
LEFT JOIN authors AS a ON b.authorid=a.id
;
Visually, this appears to put the emphasis on the books table, but it will
give exactly the same results as before.
This time, SQLite will be happy.
However, there is an alternative, which takes advantage of the view
previously created.
First, you can replace the references to the individual books and authors
tables with the bookdetails view:
SELECT
bd.id, bd.title, bd.published, bd.price,
g.genre,
bd.givenname, bd.othernames, bd.familyname,
bd.born, bd.died, bd.gender, bd.home
FROM
bookdetails AS bd
LEFT JOIN bookgenres AS bg ON bd.id=bg.bookid
JOIN genres AS g ON bg.genreid=g.id;
WITH cte AS (
    SELECT bg.bookid, string_agg(g.genre, ', ') AS genres
    FROM bookgenres AS bg JOIN genres AS g ON bg.genreid=g.id
    GROUP BY bg.bookid
)
SELECT *
FROM bookdetails AS b JOIN cte ON b.id=cte.bookid;
You might want to filter the genres. You can do that inside the CTE:
WITH cte AS (
    SELECT bg.bookid, string_agg(g.genre, ', ') AS genres
    FROM bookgenres AS bg JOIN genres AS g ON bg.genreid=g.id
    WHERE g.genre IN('Fantasy','Science Fiction')
    GROUP BY bg.bookid
)
SELECT *
FROM bookdetails AS b JOIN cte ON b.id=cte.bookid;
You’ll then get a filtered list:
Note that the concatenated genres column has been aliased to genres.
That’s the same name as the genres table, so you might get confused. The
good news is that SQL doesn’t, so you can get away with it. On the other hand, if
you’re worried about that you can always use double quotes: AS "genres".
Of course, you can also choose a better alias.
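For SQLite, the same aggregation can be sketched with group_concat(), which plays the role of string_agg() (recent SQLite versions also accept string_agg as an alias; the sample data here is made up):

```python
import sqlite3

# Concatenating the genres for each book with group_concat(),
# SQLite's long-standing equivalent of string_agg().
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookgenres (bookid INTEGER, genreid INTEGER);
    CREATE TABLE genres (id INTEGER PRIMARY KEY, genre TEXT);
    INSERT INTO genres VALUES (1, 'Classics'), (2, 'Fantasy');
    INSERT INTO bookgenres VALUES (123, 1), (123, 2), (456, 1);
""")

rows = conn.execute("""
    SELECT bg.bookid, group_concat(g.genre, ', ') AS genres
    FROM bookgenres AS bg JOIN genres AS g ON bg.genreid = g.id
    GROUP BY bg.bookid
    ORDER BY bg.bookid
""").fetchall()
print(rows)
```

Note that SQLite doesn't guarantee the order of the concatenated items, so the genres within each string may come out in any order.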
A word of warning, however. When you start using views inside queries, you
will have to consider some possible side effects:
Since the view is not part of the original database design, it may cause some
confusion for other users: views look like tables, but they aren’t stored with the
rest of the tables.
If you have too many views in a query, the DBMS optimizer may not be able
to work out the most efficient plan for running the query. This can be because
some views produce more data than the outer query needs, and the optimizer
may not be able to work out what you really want.
If there are any changes to the view, they will, of course, affect the outcome of
the query.
These side effects will be more pronounced if you start to create views using
other views. It is often safer to create the new view from scratch.
This doesn’t mean that you shouldn’t use views in your queries—that’s the
whole point of creating a view. It does mean, however, that you should be
careful when piling them up.
You won’t need to run the preceding example, as the data is already in the
sample tables.
You can fetch the associated data using something like the following:
SELECT *
FROM
multibooks AS b
JOIN authorship AS ba ON b.id=ba.bookid
JOIN multiauthors AS a ON ba.authorid=a.id;
You can also combine the authors for each book using a CTE and an
aggregate query:
WITH cte AS (
    SELECT
        ba.bookid,
        string_agg(a.givenname||' '||a.familyname, ' & ') AS authors
    FROM authorship AS ba JOIN multiauthors AS a
        ON ba.authorid=a.id
    GROUP BY ba.bookid
)
SELECT b.id, b.title, cte.authors
FROM multibooks AS b JOIN cte ON b.id=cte.bookid
ORDER BY b.id;
id title authors
1 Man Plus Frederik Pohl
2 Proxima Stephen Baxter
3 The Long Mars Stephen Baxter & Terry Pratchett
4 The Shining Stephen King
5 The Talisman Peter Straub & Stephen King
6 The Long Earth Stephen Baxter & Terry Pratchett
~ 23 rows ~
The main sample database doesn’t include multiple authors simply because it
doesn’t happen often enough with classic literature to make it worth
complicating the sample further.
However, the point is that whenever you have multiple values, you will need
additional tables rather than additional columns or compound columns. Multiple
values should appear in rows, not columns.
Adding an Author
In principle, you would add the new author with the following statement:
As the comment says, don’t run this statement yet. Because the author’s id
is autogenerated, we’ll need to get the new id after inserting the row. You can
do a search for it after adding the row, but it may be possible to have the DBMS
tell you what the new id is.
Different DBMSs have different methods of getting this id.
For PostgreSQL, you can simply use a RETURNING clause at the end of the
INSERT statement:
-- PostgreSQL
INSERT INTO authors(givenname, othernames, familyname,
    born, died, gender, home)
VALUES('Agatha','Mary Clarissa','Christie',
    '1890-09-15','1976-01-12','f',
    'Torquay, Devon, England')
RETURNING id; -- Take note of this!
The additional SELECT statements earlier all fetch the newly generated id.
Oracle, on the other hand, makes it pretty tricky. It does support a
RETURNING clause, but only into variables. You can get the newly generated
id, but that involves some extra trickery in hunting for sequences. The simplest
method really is to select the row you’ve just inserted using data you’ve just
entered:
-- Oracle
INSERT INTO authors(givenname, othernames, familyname,
    born, died, gender, home)
VALUES('Agatha','Mary Clarissa','Christie',
    date '1890-09-15', date '1976-01-12','f',
    'Torquay, Devon, England');
SELECT id FROM authors
WHERE givenname='Agatha' AND othernames='Mary Clarissa'
    AND familyname='Christie';
Of course, you don’t necessarily need to filter all of the new values: just
enough to be sure you’ve got the right one.
Adding a Book
After that, the rest is easy.
Whether or not you have just added the new author, you can simply search
the authors table to get the author id:
Taking note of the id in particular, you can insert the book with the
following statement:
Of course, you will need to supply the correct id in the preceding statement,
either from the INSERT statements in the previous section or from the SELECT
statement earlier.
Note that we’ve picked an arbitrary value of 16.00 for the price. It didn’t
need the decimal part, of course, but it makes the purpose clearer.
Data Value
Customer ID 42
Book IDs 123, 456, 789
Quantities 3, 2, 1
-- PostgreSQL
INSERT INTO sales(customerid, ordered)
VALUES (42,current_timestamp)
RETURNING id;
-- MSSQL
INSERT INTO sales(customerid, ordered)
VALUES (42,current_timestamp);
SELECT scope_identity();
-- MySQL / MariaDB
INSERT INTO sales(customerid, ordered)
VALUES (42,current_timestamp);
SELECT last_insert_id();
-- SQLite
INSERT INTO sales(customerid, ordered)
VALUES (42,current_timestamp);
SELECT last_insert_rowid();
-- Oracle
INSERT INTO sales(customerid, ordered, total)
VALUES (42, current_timestamp, 0);
SELECT id FROM sales WHERE customerid=42 AND total=0;
-- Not Oracle
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES
    ( ... , 123, 3),
    ( ... , 456, 2),
    ( ... , 789, 1);
-- Oracle
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES ( ... , 123, 3);
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES ( ... , 456, 2);
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES ( ... , 789, 1);
Remember to use your new sale id in the preceding statements.
Also, remember that Oracle doesn’t like multiple values in a single INSERT
statement, which is why there are multiple statements. You can use that for the
other DBMSs if you prefer, but it’s not necessary.
The prices come from another table. You can fetch those prices into the new
sale items using a subquery:
UPDATE saleitems
SET price=(
    SELECT price FROM books
    WHERE books.id=saleitems.bookid
)
WHERE saleid = ... ;
The correlated subquery fetches the price from the books table for the
matching book (WHERE books.id=saleitems.bookid).
The WHERE clause in the main query ensures that only the new sale items get
the prices. This is important because you don’t want to copy the prices into the
old sale items: there might have been a price change since the older sales were
completed, and you shouldn’t let that change affect old transactions.
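Here is a minimal sketch of that correlated-subquery UPDATE, using SQLite from Python with cut-down versions of the tables:

```python
import sqlite3

# Copying current book prices into the new sale items only,
# using a correlated subquery in the SET clause.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE saleitems (saleid INTEGER, bookid INTEGER,
                            quantity INTEGER, price REAL);
    INSERT INTO books VALUES (123, 11.0), (456, 16.5);
    INSERT INTO saleitems(saleid, bookid, quantity)
        VALUES (1, 123, 3), (1, 456, 1);
""")

conn.execute("""
    UPDATE saleitems
    SET price = (SELECT price FROM books
                 WHERE books.id = saleitems.bookid)
    WHERE saleid = 1
""")

rows = conn.execute(
    "SELECT bookid, price FROM saleitems ORDER BY bookid").fetchall()
print(rows)  # [(123, 11.0), (456, 16.5)]
```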
SELECT sum(quantity*price)
FROM saleitems
WHERE saleid = ... ;
The result would be correct, but it would also be incomplete. What’s missing
is the tax and any VIP discount applicable.
Let’s assume a tax of 10%—it varies from country to country, of course, so
you might want to make an adjustment. That means you’ll end up paying (1 +
10%) times the total:
In real life, of course, you would simply write 1.1, but the preceding
expression is a reminder of where the value came from and how you might adapt
it for different tax rates.
The VIP discount depends on the customer. You can read that from the VIP
table:
The reason you subtract it from 1 is that it’s a discount: it comes off the full
price.
You can use that in a subquery with the calculated total:
SELECT
sum(quantity*price)
* (1 + 0.1)
* (SELECT 1 - discount FROM vip WHERE id = 42)
FROM saleitems
WHERE saleid = ... ;
except not necessarily. Some customers aren’t VIPs, so the subquery might
return a NULL. That would destroy the whole calculation. Since a missing VIP
value means no discount, we should coalesce the subquery to 1:
SELECT
sum(quantity*price)
* (1 + 0.1)
* coalesce((SELECT 1 - discount FROM vip
WHERE id = 42),1)
FROM saleitems
WHERE saleid = ... ;
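You can verify the coalesced calculation with a small SQLite sketch (tables trimmed to the essentials; the customer ids are made up):

```python
import sqlite3

# A non-VIP customer has no vip row, so the subquery returns NULL;
# coalesce() falls back to a multiplier of 1 and the total survives.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vip (id INTEGER PRIMARY KEY, discount REAL);
    CREATE TABLE saleitems (saleid INTEGER, quantity INTEGER, price REAL);
    INSERT INTO vip VALUES (42, 0.1);            -- customer 42 is a VIP
    INSERT INTO saleitems VALUES (1, 2, 10.0);   -- subtotal 20.00
""")

def total(customerid):
    return conn.execute("""
        SELECT sum(quantity * price)
               * (1 + 0.1)
               * coalesce((SELECT 1 - discount FROM vip WHERE id = ?), 1)
        FROM saleitems
        WHERE saleid = 1
    """, (customerid,)).fetchone()[0]

print(total(42))   # VIP: 10% tax, then 10% discount
print(total(99))   # non-VIP: 10% tax only, thanks to coalesce()
```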
UPDATE sales
SET total = (
SELECT
sum(quantity*price)
* (1 + 0.1)
* coalesce((SELECT 1 - discount FROM vip
WHERE id = 42),1)
FROM saleitems
WHERE saleid = ...
)
WHERE id = ... ;
There’s a lot going on here. First, the UPDATE query sets a value to a
subquery, which, in turn, uses a subquery to fetch a value. You’ll also find that
the query uses the sale id twice, once to filter the sale items and once to select
the sale.
Review
A main feature of SQL databases is that there are multiple tables and that these
tables are related to each other.
Relationships are generally established through primary keys and foreign
keys which reference the primary keys in related tables. The foreign key is
normally in the form of a constraint, which guarantees that the foreign key
references a valid primary key value in the other table, if not necessarily the
correct one.
There may also be ad hoc relationships which are not planned or enforced.
Types of Relationships
There are three main relationship types:
One-to-many relationships are between a foreign key in one table, often
called a child table, and a primary key in another, the parent. Generally, one
parent can have many children.
This is the most common type of relationship.
One-to-one relationships are between primary keys in both tables. The
primary key in one table doubles up as a foreign key in the other.
A true one-to-one relationship requires both primary keys to be foreign
keys to the other table. In practice, this is difficult to implement, and the
foreign key is normally on one table only. This may informally be called a
one-to-maybe relationship.
Many-to-many relationships allow a row in one table to relate to many rows
in the other, as well as the other way around. Since columns can only have
single values, this relationship is created through another table, often called an
associative table, with a pair of one-to-many relationships.
In any reasonably sized database, the fact that there are many tables in
one-to-many relationships results in many-to-many relationships.
It’s a basic principle in a database that a column shouldn’t have multiple
values and that you shouldn’t have multiple columns doing the same job. The
way to handle multiple values is with additional tables, either in one-to-many or
many-to-many relationships.
Joining Tables
When there is an established relationship between tables, you can combine their
contents using joins.
Sometimes, you may want to count the number of expected results to check
whether your join type matches what you want.
When you do join tables, you often end up with several rows with the same
repeated data coming from the parent table. You may be able to simplify this by
grouping on parent data and aggregating on the child data. Because you can only
select what you summarise, you may need to join the results again to get more
details.
Views
Selecting what you want from multiple related tables can be inconvenient. You
can save your complex joined query in a view for future use and use it as you
might a simple table afterward.
Summary
In this chapter, we looked at how multiple tables are related through foreign keys
matching with primary keys. We also looked at different types of relationships
and why tables were designed this way.
Using this, we were able to combine tables using one or more joins to match
rows from one table to another. We looked at different types of joins and when
you might choose between them.
Coming Up
Most of the data we’ve worked with have been simple values, though in a few
cases we calculated values such as tax and discounts.
In the next chapter, we’re going to take a small detour and concentrate on
performing calculations in SQL.
Footnotes
1 One to maybe: My term. Others call it a one-to-zero-or-one, which is less snappy.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
M. Simon, Leveling Up with SQL
https://doi.org/10.1007/978-1-4842-9685-1_4
No doubt, you will have seen calculations before now. SQL allows you to
include calculated data in your queries.
In this chapter, we’ll look at some important ideas about the different types
of data and how they can be calculated.
Don’t get too carried away with calculations in your SQL. The database is
more concerned with maintaining and accessing raw data. However, it’s useful to
be able to take your raw data and make it more useful to the task at hand.
DBMSs vary widely in their ability to perform calculations. This is
especially the case with functions, which vary not only in scope but even in
what the DBMSs call them.
SELECT
    height/2.54,                 -- single column
    givenname||' '||familyname   -- multiple columns
    -- givenname+' '+familyname  -- MSSQL
FROM customers;
?column? ?column?
66.339 May Knott
67.283 Rick Shaw
60.236 Ike Andy
69.291 Pat Downe
61.575 Basil Isk
69.409 Pearl Divers
~ 303 rows ~
SELECT
    'active',                                  -- hard-coded
    (SELECT name FROM towns WHERE id=townid)   -- subquery
FROM customers;
You get
?column? ?column?
active Kings Park
active Richmond
active Hillcrest
active Guildford
active Wallaroo
active Broadwater
~ 303 rows ~
SELECT
upper(familyname) -- upper case function
FROM customers;
?column?
KNOTT
SHAW
ANDY
DOWNE
ISK
DIVERS
~ 303 rows ~
In all cases, you’ll notice that a calculated value doesn’t have a proper name.
Using Aliases
Calculated columns cause a minor inconvenience for SQL. Generally, each
column should have a distinct name, but SQL has no clear idea what to call the
newly generated column.
Some SQL clients will leave the calculated column unnamed, while some
will generate a dummy name. When experimenting with simple SELECT
statements, this is OK, but when taking the statement seriously, such as when
you plan to use the results later, you will need to give each column a better
name.
An alias is a new name for a column, whether it’s a calculated column or an
original. You create an alias using the AS keyword. For example:
SELECT
    id AS customer,
    height/2.54 AS height,
    givenname||' '||familyname AS fullname,
    -- givenname+' '+familyname AS fullname -- MSSQL
    'active' AS status,
    (SELECT name FROM towns WHERE id=townid) AS town,
    length(email) AS length
    -- len(email) AS length -- MSSQL
FROM customers;
Note
The id column has been aliased even though it hasn’t been calculated.
The height calculation has been aliased to height; this is fine, since it
still means the same thing, but in different units.
Apart from the fact that each calculated column must have a distinct name,
other reasons to include aliases are as follows:
Sometimes, you simply need to rename columns either for better meaning or
to suit later use.
Sometimes, you need to format or convert a column to something more
suitable, but still retain its original name.
At this point, we’re not worried about whether the preceding aliases are the
best possible names for their columns; we’re just looking at how they work.
Alias Names
By and large, the rules for alias names are the same as those for the names of
columns. That means
Aliases and original column names must be unique.
Aliases should not contain spaces, can’t start with a number, and can’t contain
other special characters.
Aliases should not be SQL keywords.
If you really need to work around the preceding second and third rules, you
can enclose the alias in double quotes. For example:
SELECT
ordered AS "order",
shipped AS "shipped date"
FROM sales;
You should resist the urge to do this. Aliases, as with column names, are for
technical rather than aesthetic use. A SELECT statement is not actually a report.
Some DBMSs offer alternatives to double quotes for special names:
Microsoft SQL offers square brackets as an alternative: [shipped date].
There is no reason to prefer this to double quotes.
MySQL/MariaDB uses “backticks” as an alternative: `shipped date`. In
ANSI mode, this is unnecessary, but in traditional mode, it’s all you’ve got.
Whatever names you choose, remember that they are meant to be purely
functional. Don’t get carried away trying to use upper and lower case, or spaces,
or anything else that might look better. That’s up to the software handling the
output of your queries. In SQL, you just need a suitable name to refer to the data.
AS Is Optional
You will discover soon enough that AS is optional:
SELECT
id customer,
height/2.54 height,
givenname||' '||familyname fullname,
-- givenname+' '+familyname fullname -- MSSQL
'active' status,
(SELECT name FROM towns WHERE id=townid) town,
length(email) length
-- len(email) length -- MSSQL
FROM customers;
Some developers justify leaving out the AS as it saves time or makes them
look more professional. However, you will also make this kind of mistake soon
enough:
SELECT
id,
email
givenname, familyname,
height,
dob
FROM customers;
~ 303 rows ~
Here, the missing comma after email means that email is silently aliased
to givenname, so the real givenname column disappears from the results.
SELECT
id, title,
price*1.1 AS price -- adjust to include tax
FROM books
WHERE price<15;
This will work. Here, the price has been increased to include tax and
aliased to the original name, which is legitimate.
id title price
2078 The Duel 13.75
1530 Robin Hood, The Prince of Thieves 13.75
982 Struwwelpeter: Fearful Stories and Vile Pictures … 12.65
573 The Nose 11
1573 Rachel Ray 11
532 Elective Affinities 12.65
~ 521 rows ~
However, the WHERE clause will filter on the original price column, not
the adjusted version.
Again, there’s not much you can do about this directly, as you don’t have the
option to write the SELECT clause further down, and you can’t create aliases in
any other clause.
Later, we will see how using Common Table Expressions can help
preprocess calculated columns.
It’s probably not a good idea to alias a calculation to an original column
name if you’re planning to use it later.
SQL has a clear idea of what it’s going to do with the aliased name, but the
human reader may well get confused.
SELECT
id, givenname, familyname,
height/2.54 AS height -- sometimes NULL
FROM customers;
SELECT
    id, givenname, othernames, familyname,
    givenname||' '||othernames||' '||familyname AS fullname
    -- MSSQL:
    -- givenname+' '+othernames+' '+familyname AS fullname
FROM authors;
Coalesce
SQL has a function called coalesce() which can replace NULL with a
preferred alternative. The word “coalesce” actually means to combine, but how
it came to be the name of this operation is one of those mysteries lost in the
depths of ancient history.
The function is used this way:
coalesce(expression,planB)
SELECT
id, givenname, familyname,
phone
FROM employees;
SELECT
    id, givenname, familyname,
    coalesce(phone,'1300975711')   -- coalesce to main number
FROM employees;
The thing about coalesce() is that you can’t always get away with it.
You need to be sure that your substitute makes sense and that your guess is a
good one. There are many times when it wouldn’t make sense, such as a missing
price for a book or an author’s date of birth; NULL is often the best thing you can
do.
In Chapter 2, you guessed at a missing quantity using coalesce() and
then fixed it so that the quantity can’t be NULL in the future. Sometimes, that’s
the best solution.
-- MSSQL
SELECT
id, givenname, othernames, familyname,
coalesce(givenname+' ','')
+coalesce(othernames+' ','')
+familyname AS fullname
FROM authors;
This gives us
-- Oracle
SELECT
id, givenname, othernames, familyname,
ltrim(givenname||' ')||ltrim(othernames||' ')
||familyname AS fullname
FROM authors;
SELECT *
FROM books
WHERE length(title)<24; -- MSSQL: len(title)
giving
You may need this if your database is case sensitive and you need to match a
string in an unknown case:
SELECT *
FROM books
WHERE lower(title) LIKE '%journey%';
giving
SELECT *
FROM customers
WHERE height<(SELECT avg(height) FROM customers);
giving
You can also use calculations in the ORDER BY clause, such as when you
want to sort by the length of a title:
SELECT *
FROM books
ORDER BY length(title); -- MSSQL: len(title)
which gives
However, you’re likely to want to select what you’re sorting by, so it would
make more sense to calculate the value in the SELECT clause and sort by the
result:
SELECT *
FROM customers
ORDER BY coalesce(height,0); -- NULLS FIRST
SELECT *
FROM customers
ORDER BY coalesce(height,1000); -- NULLS LAST
By coalescing all of the NULLs to an extreme value, SQL will sort them to
one end or the other accordingly.
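A quick sketch in SQLite (with hypothetical data) shows the effect of both coalesce() extremes:

```python
import sqlite3

# Coalescing NULL heights to an extreme value pushes the unknowns
# to one end of the sort: 0 for first, an impossibly large 1000 for last.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, height REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("May", 168.5), ("Rick", None), ("Ike", 153.0)])

nulls_first = conn.execute(
    "SELECT name FROM customers ORDER BY coalesce(height, 0)").fetchall()
nulls_last = conn.execute(
    "SELECT name FROM customers ORDER BY coalesce(height, 1000)").fetchall()

print(nulls_first)  # [('Rick',), ('Ike',), ('May',)]
print(nulls_last)   # [('Ike',), ('May',), ('Rick',)]
```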
As for the FROM clause, you’ll need a calculation which generates a virtual
table. That’s usually going to be a view, a join, or even a subquery. A Common
Table Expression, in this context, is like a subquery. We’ll do more of that sort of
thing later.
Casting
The cast() function is used to interpret a value as a different data type. Recall
that SQL has three main data types: numbers, strings, and dates. You can use cast
to do one of two things:
You can try to cast from one main type to another.
Casting to a string should be easy enough, but casting to another type
requires that SQL know how to interpret the value. Different DBMSs react
differently to failure.
You can cast within a main type. For example, you can cast between integer
and decimal numbers or between dates and datetimes.
If you cast a decimal to an integer, a datetime to a date, or a string to a
shorter string, you’ll naturally lose precision. If you cast in the opposite
direction, the extra precision will be filled with the equivalent of “nothing.”
If you do cast to a narrower type, it will probably work, but don’t push
your luck too hard. For example, casting the number 123.45 to a
decimal(4,2) will fail because you haven’t allowed enough digits; you’ll
get an overflow error.
For what follows, remember that SQLite doesn’t have a date type, so that’s
one cast you won’t have to worry about. Later, we’ll have a quick look at the
equivalent in SQLite.
Here are some examples of casting within types:
If you cast a string to a longer type, one of two things will happen. If you
cast it to a CHAR (fixed length) type, the extra length will be padded with spaces.
If you cast it to a VARCHAR type, the string itself will be unchanged, but it
will be permitted to grow up to the new maximum length.
Casting between types is a different matter. Most DBMSs will automatically
cast to a string if necessary. For example:
-- Not MSSQL
SELECT id || ': ' || email
FROM customers;
?column?
42: [email protected]
459: [email protected]
597: [email protected]
186: [email protected]
352: [email protected]
576: [email protected]
~ 303 rows ~
-- MSSQL
SELECT cast(id as varchar(5)) + ': ' + email
FROM customers;
You can do the same with dates, too. We’ll do that with the customers’ dates
of birth, but we’ll run into the complication of the fact that some dates of birth
are missing. Using coalesce should do the job:
For SQLite, it wasn’t much effort as we’ve stored the dates as a string
anyway.
Here, we’ve coalesced the entire concatenated value ' Born: ' || dob.
That’s because we want to replace the whole expression with the empty string if
the dob is missing. Concatenating with a NULL should result in a NULL.
For Oracle, you run again into the quirk of treating NULL strings as empty
strings, so they won’t coalesce. We can work around it using CASE:
-- Oracle
SELECT
id || ': ' || email
|| CASE
WHEN dob IS NOT NULL THEN ' Born: ' ||
dob
END
FROM customers;
-- Integers
SELECT * FROM sorting
ORDER BY numberstring;
SELECT * FROM sorting
ORDER BY cast(numberstring as int); -- not MySQL
-- ORDER BY cast(numberstring as signed); -- MySQL
In the sorting table, there are some values stored as strings which
represent numbers or dates. The only way to sort them properly is to cast them
first.
Note that MySQL won’t let you cast to an integer directly. You have to use
SIGNED (which means the same thing) or UNSIGNED. MariaDB is OK with
integers.
Not all casts from strings are successful, since the string may not resemble
the correct type. For example:
-- This works:
SELECT cast('23' as int) -- MySQL: as signed
-- FROM dual -- Oracle
;
-- This doesn’t:
SELECT cast('hello' as int) -- MySQL: as signed
-- FROM dual -- Oracle
;
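As it happens, SQLite is one of the more forgiving DBMSs: rather than raising an error, it quietly turns an unconvertible string into 0. A quick sketch of this documented behavior, using Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A string that looks like a number casts cleanly:
ok = conn.execute("SELECT cast('23' AS int)").fetchone()[0]

# SQLite quietly turns an unconvertible string into 0, and uses a
# numeric prefix where there is one. Stricter DBMSs (PostgreSQL,
# MSSQL, ...) raise an error for these instead.
bad = conn.execute("SELECT cast('hello' AS int)").fetchone()[0]
prefix = conn.execute("SELECT cast('23 skidoo' AS int)").fetchone()[0]

print(ok, bad, prefix)  # 23 0 23
```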
Numeric Calculations
A number is normally used to count something—it’s the answer to the question
“how many.” For example, how many centimeters in the customer’s height, or
how many dollars were paid for this item?
Numbers aren’t always used that way. Sometimes, they’re used as tokens or
as codes. The calculations you might perform on a number would depend on
how the number is being used.
Basic Arithmetic
You can always perform the basic operations on numbers:
SELECT
3*5 AS multiplication,
4+7 AS addition,
8-11 AS subtraction,
20/3 AS division,
20%3 AS remainder, -- Oracle: mod(20,3),
24/3*5 AS associativity,
1+2*3 AS precedence,
2*(3+4) + 5*(8-5) AS distributive
-- FROM dual -- Oracle
;
Note that you’ll need to add FROM dual if you’re testing this in Oracle.
Also note that different DBMSs have different attitudes to dividing integers.
In some cases, 20/3 would give you a result of 6, discarding the fraction. In
other cases, you’d get something like 6.66...7 as a decimal.
The % operator calculates the remainder after integer division. Oracle uses
the mod() function.
When mixing operations, SQL follows the rules you would have learned in
school regarding precedence (which operators come first) and associativity
(calculating from left to right). SQL also allows you to use parentheses to
calculate expressions first.
If you know someone who’s forgotten the basic rules of arithmetic, you can
tell them
1.
Do what’s inside parentheses first.
2.
Do multiplication | division before addition | subtraction (precedence).
3.
Do operations of the same precedence from left to right (associativity).
Of course, these expressions work just the same whether the value is a literal
or some stored or calculated value.
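You can see the integer division and precedence rules in action with SQLite, again using Python's sqlite3 module as a test bench:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite discards the fraction when both operands are integers,
# but produces a decimal as soon as one operand is decimal:
int_div, dec_div = conn.execute("SELECT 20/3, 20/3.0").fetchone()
print(int_div)            # 6
print(round(dec_div, 3))  # 6.667

# Precedence and left-to-right associativity follow the school rules:
precedence, associativity = conn.execute(
    "SELECT 1+2*3, 24/3*5").fetchone()
print(precedence, associativity)  # 7 40
```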
Mathematical Functions
There are some mathematical functions as well. For the most part, the
mathematical functions won’t get a lot of use unless you’re doing something
fairly specialized.
SELECT
pi() AS pi, -- Not Oracle
sin(radians(45)) AS sin45, -- Not Oracle
sqrt(2) AS root2, -- √2
log10(3) AS log3,
ln(10) AS ln10, -- Natural Logarithm
power(4,3) AS four_cubed -- 4³
-- FROM dual -- Oracle
;
So, now you can use SQL to find the length of a ladder leaning against a wall
or the distance between two ships lost at sea.
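One caveat: some SQLite builds are compiled without the mathematical functions. A portable workaround, if you're accessing SQLite through a host language, is to register the host's own functions; in Python's sqlite3 module, that is done with create_function():

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# Register Python's math functions as SQL functions. These override
# (or stand in for) SQLite's built-ins, so the query below works even
# on builds compiled without SQLITE_ENABLE_MATH_FUNCTIONS.
conn.create_function("sqrt", 1, math.sqrt)
conn.create_function("power", 2, math.pow)
conn.create_function("ln", 1, math.log)

root2, four_cubed, ln10 = conn.execute(
    "SELECT sqrt(2), power(4,3), ln(10)").fetchone()
print(round(root2, 4))  # 1.4142
print(four_cubed)       # 64.0
print(round(ln10, 4))   # 2.3026
```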
Approximation Functions
There are also functions which give an approximate value of a decimal number.
Here is a sample with variations between DBMSs:
SELECT
ceiling(200/7.0) AS ceiling,
-- SQLite: round(200/7.0 + 0.5),
-- Oracle: ceil(200/7.0),
floor(200/7.0) AS floor,
-- SQLite: round(200/7.0 - 0.5),
round(200/7.0,0) AS rounded_integer,
-- or round(200/7), -- not MSSQL
round(200/7.0,2) AS rounded_decimal
-- FROM dual -- Oracle
;
If you use the cast() function to cast to another, narrower number type, you’ll
also lose precision. However, what happens next depends on the DBMS:
SELECT
cast(234.567 AS int) AS castint,
-- cast(234.567 AS unsigned), -- MySQL
cast(234.567 AS decimal(5,2)) AS castdec
-- FROM dual -- Oracle
;
Formatting Numbers
Formatting functions change the appearance of a number. Unlike approximation
and other functions, the result of a formatting function is not a number but is a
string; that’s the only way you can change the way a number appears.
For numbers, most of what you want to do is change the number of decimal
places, display the thousands separator, and possibly currency symbols.
Again, the different DBMSs have wildly different functions. As an example,
here are some ways of formatting a number as currency with thousands
separators:
-- PostgreSQL, Oracle
SELECT
to_char(total,'FM999G999G999D00') AS
local_number,
to_char(total,'FML999G999G999D00') AS
local_currency
FROM sales;
SELECT to_char(total,'FM$999,999,999.00') FROM
sales;
-- MariaDB/MySQL
SELECT
format(total,2) AS local_number,
format(total,2,'de_DE') AS specific_number
FROM sales;
-- MSSQL
SELECT
format(total,'n') AS local_number,
format(total,'c') AS local_currency
FROM sales;
-- SQLite
SELECT
printf('$%,d.%02d',total,round(total*100)%100)
FROM sales;
local_number local_currency
28.00 $28.00
34.00 $34.00
58.50 $58.50
50.00 $50.00
17.50 $17.50
13.00 $13.00
~ 5549 rows ~
Note .
MSSQL has its own format() function with its more intuitive formatting
codes; it also adjusts for locales and can be used to format a date.
String Calculations
A string is a string of characters, hence the name. In SQL, this is referred to as
character data.
Traditionally, SQL has two main data types for strings:
Character: CHAR(length) is a fixed-length string. If you enter fewer
characters than the length, then the string will be right-padded with spaces.
This probably explains why standard SQL ignores trailing spaces for string
comparison.
Character varying: VARCHAR(length) is a limited length string. If you
enter a shorter string, it will not be padded.
In principle, CHAR() is more efficient for processing since it’s always the
same length, and the DBMS doesn’t need to work out the size of each value.
VARCHAR() is supposed to be more efficient for storage.
In reality, modern DBMSs are much cleverer than their ancestors, and the
difference between the two types is not very important anymore. For example,
PostgreSQL recommends always using VARCHAR since it actually handles that
type more efficiently.
Most DBMSs offer a third type, TEXT, which is, in principle, unlimited in
length. Again, modern DBMSs allow longer standard strings than they used to,
so again this is not so important. Microsoft has deprecated TEXT in favor of
VARCHAR(MAX) which does the same job.
A string literal is written between single quotes, such as 'hello'.
When working with strings, you normally simply want to save them and
fetch them. However, you can process the strings themselves. This is usually one
of the following operations:
Concatenation means joining strings together.
Concatenation is the only direct operation on strings. All other operations
make use of functions.
Some functions will make changes to a string. They don’t actually change the
string, but return a changed version of the string.
Some functions can be used to extract parts of a string.
Some functions are more concerned with individual characters of the string.
Case Sensitivity
SQL will store the upper/lower case characters as expected, but you may have a
hard time searching for them. That’s because some databases ignore case, while
others don’t.
How a database handles case is a question of collation. Collation refers to
how it interprets variations of letters. In English, the only variation to worry
about is upper or lower case, but other languages may have more variations,
such as accented letters in French or German.
Collation will have an impact on how strings are sorted and how they
compare. In English, you’re mainly worried about whether upper case strings
match lower case strings and possibly whether upper and lower case strings are
sorted together or sorted separately. In some other languages, the same questions
might apply to whether accented and nonaccented characters match and how
they, too, are sorted.
You can set a collation when you create the database or a table, but if you
don’t worry about it, the DBMS will have a default collation for new databases.
In PostgreSQL, Oracle, and SQLite, the default collation is case sensitive, so
upper and lower case won’t match. With MySQL/MariaDB and MSSQL, the
default collation is case insensitive, so they will match.
If you’re not sure whether your particular database is case sensitive or not,
you can try this simple test:
If the database is case sensitive, you won’t get any rows, since a won’t
match A; if it’s not, you will get the whole table.
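Here is the idea as a quick SQLite sketch, using Python's sqlite3 module; the table is invented for illustration. SQLite's default collation is case sensitive, but you can attach NOCASE to a comparison to get the MySQL/MSSQL behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")
conn.execute("INSERT INTO t VALUES ('Alpha')")

# The default BINARY collation is case sensitive, so no match here:
strict = conn.execute(
    "SELECT count(*) FROM t WHERE s = 'alpha'").fetchone()[0]

# NOCASE folds ASCII letters, giving the case-insensitive behavior
# that MySQL/MariaDB and MSSQL have by default:
loose = conn.execute(
    "SELECT count(*) FROM t WHERE s = 'alpha' COLLATE NOCASE"
).fetchone()[0]

print(strict, loose)  # 0 1
```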
Concatenation
Concatenation means joining strings together. This is the simplest string
operation and the only one which can be done without a function.
The concatenation operator is usually ||. Microsoft SQL Server uses +
instead. For example:
SELECT
id,
givenname||' '||familyname AS fullname
-- givenname+' '+familyname AS fullname -- MSSQL
FROM customers;
-- Not SQLite
SELECT
id,
concat(givenname,' ',familyname) AS fullname
FROM customers;
String Functions
Other operations with strings require functions. Here are some examples.
For the following examples, we’ve included SELECT * for context—
except that in Oracle you need to write SELECT table.* if you’re mixing
it with other data, so we’ve done that with all of the examples which include
Oracle.
The length of a string is the number of characters in the string. To find the
length, you can use
To replace part of a string, you can use the following:
-- replace(original,search,replace)
SELECT books.*, replace(title,' ','-') AS hyphens
FROM books;
To capitalize the first letter of each word, PostgreSQL and Oracle have
initcap():
-- PostgreSQL, Oracle
SELECT books.*, initcap(title) AS titlecase FROM books;
To remove extra spaces at the beginning or the end of a string, you can use
trim() to remove from both ends, or ltrim() or rtrim() to remove from
the beginning or end of the string:
WITH vars AS (
SELECT ' abcdefghijklmnop ' AS string
-- FROM dual -- Oracle
)
SELECT
string,
ltrim(string) AS ltrim,
rtrim(string) AS rtrim,
trim(string) AS trim,
ltrim(rtrim(string)) AS same
FROM vars;
All modern DBMSs support trim(), but MSSQL didn’t until version 2017.
PostgreSQL also calls it btrim(). You may not notice when the spaces on the
right are trimmed.
You can get a substring with substring() or substr(), depending on
your DBMS:
WITH vars AS (
SELECT 'abcdefghijklmnop' AS string
-- FROM dual -- Oracle
)
SELECT
-- PostgreSQL, MariaDB/MySQL, Oracle, SQLite
substr(string,3,5) AS substr,
-- PostgreSQL, MariaDB/MySQL, MSSQL, SQLite
substring('abcdefghijklmnop',3,5) AS substring
FROM vars;
Some DBMSs include specialized functions to get the first or last part of a
string. In some cases, you can use a negative start to get the last part of a string:
WITH vars AS (
SELECT 'abcdefghijklmnop' AS string
-- FROM dual -- Oracle
)
SELECT
-- Left
-- PostgreSQL, MariaDB/MySQL, MSSQL:
left('abcdefghijklmnop',4) AS lstring,
-- All DBMSs including SQLite and Oracle:
-- substr(string,1,4) AS lstring,
-- Right
-- PostgreSQL, MariaDB/MySQL, MSSQL:
right('abcdefghijklmnop',4) AS rstring
-- MariaDB/MySQL, Oracle, SQLite
-- substr('abcdefghijklmnop',-4) AS rstring
FROM vars;
Just note that if you spend a lot of time extracting substrings from your data,
it’s possible that you’re trying to store too much in a single value.
On the other hand, you can often use substrings to reformat raw data into
something more friendly.
Date Operations
From an SQL point of view, dates are problematic. That’s because, despite their
overwhelming presence in daily life, measuring dates is a mess.
One problem is that we measure dates using a number of incompatible cycles
all at the same time: the day, week, month, and year. To make things worse, we
all live in different time zones, so we can’t even agree on what time it is.
Most DBMSs have a number of related data types to manage dates,
specifically date, which is for dates without times, and datetime, which
includes the time. Generally, you can expect variations on these types, as well as
the ability to include time zones.
The exception is SQLite, which expects you to use numbers or strings and
run the values through a few functions to do the date arithmetic.
There are a number of things you would expect to do with dates and times:
1.
Enter and store a date/time
2.
Get the current date/time
3.
Group and sort by date/time
4.
Extract parts of the date/time
5.
Add to a date/time
6.
Calculate the difference between two dates/times
7.
Format a date/time
SQLite has a completely different approach to working with dates.
That’s partly because it doesn’t actually support dates. As a result,
SQLite will be missing from much of the following discussion. The
Appendix has some information on handling dates in SQLite.
The normal way to enter a date or datetime literal is to use one of the
following:
date: '2013-02-15'
datetime: '2013-02-15 09:20:00'
You can also omit the seconds or include decimal parts of a second.
The format is a variation of the ISO8601 format. In pure ISO8601 format,
the time would be written after a T instead of a space.
Note that with Oracle, datetime literals generally use a different format. To
use the preceding formats, prefix the literal with date or timestamp,
respectively:
date: date '2013-02-15'
datetime: timestamp '2013-02-15 09:20:00'
In PostgreSQL, MSSQL, and MySQL/MariaDB, you can often enter another
readable date format such as '15 Feb 2013'. However, you should never
use the format '2/3/2013' which has different meanings internationally.
In practical terms, just stick to the standard format:
SELECT *
FROM customers
WHERE dob<'1980-01-01'; -- Oracle dob<date '1980-01-
01';
SELECT
current_timestamp AS now,
current_date AS today -- Not MSSQL
-- FROM dual -- Oracle
;
-- Not Oracle
SELECT
current_timestamp AS now,
cast(current_timestamp as date) AS today
-- FROM dual -- Oracle
;
This won’t quite work with Oracle; it will let you do the cast all right, but it
doesn’t change anything. Instead, you should use the trunc() function:
-- Oracle
SELECT
current_timestamp AS now,
trunc(current_timestamp) AS today
FROM dual -- Oracle
;
This will still have a time component, but it’s set to 00:00.
SELECT *
FROM sales
ORDER BY ordered;
WITH cte AS (
SELECT
cast(ordered as date) AS ordered, total -- Not Oracle
-- trunc(ordered) AS ordered, total -- Oracle
FROM sales
)
SELECT ordered, sum(total)
FROM cte
GROUP BY ordered
ORDER BY ordered;
ordered sum
2022-05-04 43.00
2022-05-05 150.50
2022-05-06 110.50
2022-05-07 142.00
2022-05-08 214.50
2022-05-09 16.50
~ 389 rows ~
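For SQLite, where datetimes are stored as strings, the date() function plays the role of the cast: it trims off the time part, and you can group by the result. A minimal sketch with invented sales data:

```python
import sqlite3

# Invented sales data: two orders on one day, one on the next.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ordered TEXT, total REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2022-05-04 09:15:00", 20.00),
    ("2022-05-04 14:30:00", 23.00),
    ("2022-05-05 11:00:00", 150.50),
])

# date() trims the time part, standing in for cast(... as date):
rows = conn.execute("""
    WITH cte AS (SELECT date(ordered) AS ordered, total FROM sales)
    SELECT ordered, sum(total) FROM cte
    GROUP BY ordered
    ORDER BY ordered
""").fetchall()
print(rows)  # [('2022-05-04', 43.0), ('2022-05-05', 150.5)]
```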
WITH chelyabinsk AS (
SELECT
timestamp '2013-02-15 09:20:00' AS datetime
-- FROM dual -- Oracle
)
SELECT
datetime,
EXTRACT(year FROM datetime) AS year,
EXTRACT(month FROM datetime) AS month,
EXTRACT(day FROM datetime) AS day,
-- not Oracle or MariaDB/MySQL:
EXTRACT(dow FROM datetime) AS weekday,
EXTRACT(hour FROM datetime) AS hour,
EXTRACT(minute FROM datetime) AS minute,
EXTRACT(second FROM datetime) AS second
FROM chelyabinsk;
Note that Oracle and MariaDB/MySQL don’t have a direct way of extracting
the day of the week, which can be a problem if, say, you want to use it for
grouping. However, as you will see later, you can use a formatting function to
get the day of the week, as well as the preceding values.
PostgreSQL also includes a function called
date_part('part',datetime) as an alternative to the preceding
function.
Date Extracting in Microsoft SQL
Microsoft SQL has two main functions to extract part of a date:
datepart(part,datetime) extracts the part of a date/time as a
number.
datename(part,datetime) extracts the part of a date/time as a string.
For most parts, such as the year, it’s simply a string version of the datepart
number. However, for the weekday and the month, it’s actually the human-
friendly name.
You can see these two functions in action:
WITH chelyabinsk AS (
SELECT cast('2013-02-15 09:20' as datetime) AS
datetime
)
SELECT
datepart(year, datetime) AS year, -- aka year()
datename(year, datetime) AS yearstring,
datepart(month, datetime) AS month, -- aka month()
datename(month, datetime) AS monthname,
datepart(day, datetime) AS day, -- aka day()
datepart(weekday, datetime) AS weekday, -- Sunday=1
datename(weekday, datetime) AS weekdayname,
datepart(hour, datetime) AS hour,
datepart(minute, datetime) AS minute,
datepart(second, datetime) AS second
FROM chelyabinsk;
Formatting a Date
As with numbers, formatting a date generates a string.
For both PostgreSQL and Oracle, you can use the to_char function. Here
are two useful formats:
-- PostgreSQL
WITH vars AS (SELECT timestamp '1969-07-20 20:17:40' AS moonshot)
SELECT
moonshot,
to_char(moonshot,'FMDay, DDth FMMonth YYYY')
AS fulldate,
to_char(moonshot,'Dy DD Mon YYYY') AS
shortdate
FROM vars;
-- Oracle
WITH vars AS (
SELECT timestamp '1969-07-20 20:17:40' AS
moonshot FROM dual
)
SELECT
moonshot,
to_char(moonshot,'FMDay, ddth Month YYYY') AS
fulldate,
to_char(moonshot,'Dy DD Mon YYYY') AS
shortdate
FROM vars;
You’ll notice that there is a slight difference in the format codes between
PostgreSQL and Oracle.
For MariaDB/MySQL, there is the date_format() function:
For Microsoft SQL, the format() function can also be used for dates:
SQLite has very limited formatting functionality, and you certainly can’t get
month or weekday names without some additional trickery. It’s usually better to
leave the date alone and let the host application do what is needed.
You can learn more about the format codes at
PostgreSQL: www.postgresql.org/docs/current/functions-
formatting.html#FUNCTIONS-FORMATTING-DATETIME-TABLE
Oracle:
https://docs.oracle.com/en/database/oracle/oracle-
database/21/sqlrf/Format-Models.html
MariaDB: https://mariadb.com/kb/en/date_format/
MySQL: https://dev.mysql.com/doc/refman/8.0/en/date-
and-time-functions.html
Microsoft SQL: https://learn.microsoft.com/en-
us/dotnet/standard/base-types/custom-date-and-time-
format-strings
Date Arithmetic
Generally, the two things you want to do with dates are
Modify a date by adding or subtracting an interval
Find the difference between two dates
To modify a date, you can add or subtract an interval. Some DBMSs define a
type of data called interval for the purpose. For example, to add four months
to now, you can use
-- PostgreSQL
SELECT
date '2015-10-31' + interval '4 months' AS
afterthen,
current_timestamp + interval '4 months' AS
afternow,
current_timestamp + interval '4' month -- also OK
;
-- Oracle
SELECT
add_months('31 Oct 2015',4) AS afterthen,
current_timestamp + interval '4' month AS
afternow,
add_months(current_timestamp,4) -- also OK
FROM dual;
-- MariaDB/MySQL
SELECT
date_add('2015-10-31',interval 4 month) AS
afterthen,
date_add(current_timestamp,interval 4 month)
AS afternow,
current_timestamp + interval '4' month -- also OK
;
afterthen afternow
2016-02-29 00:00:00 2023-10-01 16:01:13.691447+11
You’ll notice that PostgreSQL and Oracle use the addition operator, while
MariaDB/MySQL uses a special function. Oracle also has a special function to
add months.
For Microsoft SQL, you use dateadd, specifying the units and number of
units:
-- MSSQL
SELECT
dateadd(month,4,'2015-10-31') AS afterthen,
dateadd(month,4,current_timestamp) AS afternow
;
-- SQLite
SELECT
strftime('%Y-%m-%d','2015-10-31','+4 month')
AS afterthen,
strftime('%Y-%m-%d','now','+4 month') AS
afternow
;
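A caveat with adding months: SQLite normalizes overflowing days rather than clamping them. Adding four months to 31 October nominally lands on “31 February,” which rolls over into March. A quick check via Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 2015-10-31 + 4 months is nominally 2016-02-31; February 2016 has
# 29 days (leap year), so the extra two days spill into March.
after = conn.execute(
    "SELECT date('2015-10-31', '+4 month')").fetchone()[0]
print(after)  # 2016-03-02
```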
The other thing you’ll want to do is calculate the difference between two
dates. Here again, every DBMS does it differently. For example, to find the age
of your customers, you can use
-- PostgreSQL
SELECT
dob,
age(dob) AS interval,
date_part('year',age(dob)) AS years,
extract(year from age(dob)) AS samething
FROM customers;
-- MariaDB/MySQL
SELECT
dob,
timestampdiff(year,dob,current_timestamp) AS
age
FROM customers;
-- Oracle
SELECT
dob,
trunc(months_between(current_timestamp,dob)/12)
AS age
FROM customers;
-- SQLite
SELECT
dob,
cast(
strftime('%Y.%m%d', 'now')
- strftime('%Y.%m%d', dob)
as int) AS age
FROM customers;
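The SQLite trick here is worth unpacking: strftime('%Y.%m%d', ...) turns a date into a number like 1969.0720, so subtracting two of them and truncating yields whole years. A quick check with a fixed pair of dates (invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# '%Y.%m%d' renders 1969-07-20 as the number-like string '1969.0720'.
# Subtraction coerces both strings to numbers, and casting to int
# truncates the fraction, leaving completed years only.
age = conn.execute("""
    SELECT cast(
        strftime('%Y.%m%d', '2023-06-01')
        - strftime('%Y.%m%d', '1969-07-20')
    AS int)
""").fetchone()[0]
print(age)  # 53 -- the July birthday hasn't come around yet in June
```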
For PostgreSQL, you’ll get the following results. The other DBMSs won’t
have the age column:
SELECT
id,title,
CASE
WHEN price<13 THEN 'cheap'
WHEN price<=17 THEN 'reasonable'
WHEN price>17 THEN 'expensive'
-- ELSE NULL
END AS price
FROM books;
id title price
2094 The Manuscript Found in Saragossa expensive
336 The Story of My Life reasonable
1868 The Tenant of Wildfell Hall [NULL]
375 Dead Souls reasonable
1180 Fables cheap
990 The History of Pendennis: His Fortun … cheap
~ 1200 rows ~
Note that if all conditions fail, then the result will be NULL, which is
commented out earlier. If you want an alternative to NULL, use the ELSE
expression:
SELECT
id,title,
CASE
WHEN price<13 THEN 'cheap'
WHEN price<=17 THEN 'reasonable'
WHEN price>17 THEN 'expensive'
ELSE ''
END AS price
FROM books;
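The CASE logic above can be verified with a small SQLite sketch via Python's sqlite3 module; the titles and prices here are invented. Note how the NULL price falls through every WHEN and lands in the ELSE:

```python
import sqlite3

# Invented books; one has no price at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany("INSERT INTO books VALUES (?, ?)", [
    ("Fables", 12.00),
    ("Dead Souls", 16.50),
    ("Saragossa", 21.00),
    ("Wildfell Hall", None),
])

rows = conn.execute("""
    SELECT title,
        CASE
            WHEN price<13 THEN 'cheap'
            WHEN price<=17 THEN 'reasonable'
            WHEN price>17 THEN 'expensive'
            ELSE ''
        END AS price
    FROM books
""").fetchall()
print(rows)
# [('Fables', 'cheap'), ('Dead Souls', 'reasonable'),
#  ('Saragossa', 'expensive'), ('Wildfell Hall', '')]
```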
SELECT
c.id,
givenname||' '||familyname AS name,
-- givenname+' '+familyname AS name, -- MSSQL
CASE status
WHEN 1 THEN 'Gold'
WHEN 2 THEN 'Silver'
WHEN 3 THEN 'Bronze'
END AS status
FROM customers AS c LEFT JOIN VIP ON c.id=vip.id;
-- Oracle:
-- FROM customers c LEFT JOIN VIP ON c.id=vip.id;
id name status
69 Rudi Mentary [NULL]
182 June Hills Bronze
43 Annie Day [NULL]
263 Mark Time Bronze
266 Vic Tory Silver
68 Phyllis Stein [NULL]
442 Herb Garden Gold
33 Eileen Dover [NULL]
~ 303 rows ~
This form isn’t much shorter, but it makes the intention clear.
You can also use the IN expression:
SELECT
id, givenname, familyname,
CASE
WHEN state IN('QLD','NSW','VIC','TAS') THEN
'East'
WHEN state IN ('NT','SA') THEN 'Central'
ELSE 'Elsewhere'
END AS region
FROM customerdetails;
SELECT
id, givenname, familyname,
coalesce(phone,'-') AS coalesced,
CASE
WHEN phone IS NOT NULL THEN phone
ELSE '-'
END AS cased
FROM customers;
That might be, for example, if the customer never checked out the order. We
probably should get rid of them, but, for now, we’ll just filter them out:
The first thing you’ll have to do is to calculate the difference between dates.
This varies between DBMSs:
Note that with SQLite, the simplest way to get an age is to convert dates to a
Julian date, which is the number of days since Noon, 24 November 4714 BC.
Long story.
You know by now that you can’t use the calculated values in other parts of
the SELECT clause, so that’s awkward if you need them. You can, however, do
the query in two steps.
If you put the preceding query in a Common Table Expression, you can then
use the results in the main query.
First, you need to distinguish between those which have been shipped and
those which haven’t:
WITH salesdata AS (
-- one of the above queries WITHOUT the semicolon
)
SELECT
salesdata.*,
CASE
WHEN shipped IS NOT NULL THEN
-- One of two statuses
ELSE
-- One of three statuses
END AS status
FROM salesdata;
The statuses in each case are additional CASE expressions:
WITH salesdata AS (
-- one of the above queries WITHOUT the semicolon
)
SELECT
salesdata.*,
CASE
WHEN shipped IS NOT NULL THEN
CASE
WHEN shipped_age>14 THEN 'Shipped
Late'
ELSE 'Shipped'
END
ELSE
CASE
WHEN ordered_age<7 THEN 'Current'
WHEN ordered_age<14 THEN 'Due'
ELSE 'Overdue'
END
END AS status
FROM salesdata;
Summary
Data in an SQL table should be stored in its purest, simplest form. However, this
data can be recalculated to increase its usefulness.
Calculations can take a number of forms:
Based on single columns
Based on multiple columns
Hard-coded literal values
Results of a subquery
Calculated from a function
Calculations can also be used in the WHERE and ORDER BY clauses.
Aliases
All calculated values should be renamed with an alias. The word AS is optional,
but is recommended to reduce confusion.
You can also alias noncalculated columns if the new name makes more
sense.
Aliases are given in the SELECT clause, which is evaluated last before
ORDER BY. For most DBMSs, this means that you can’t use the alias in any
other clause but the ORDER BY.
NULLs
A table may, of course, include NULLs in various places. As a rule, a NULL will
wipe out any calculation, leaving NULL in its wake.
You can bypass NULLs with the coalesce() function which replaces
NULL with an alternative value. You might also use a CASE ... END
expression.
Casting Types
SQL works with three main data types:
Numbers
Dates and times
Strings
You may need to change the data type. This is done with the cast()
function.
When you cast within a major type, the effect is to change the precision or
size of the type.
When you cast between major types, it is usually for compatibility. While
casting to a string is usually possible and often automatic, casting from a string
may not always succeed. Different DBMSs have different reactions to an
unsuccessful cast.
5. Aggregating Data
Mark Simon1
(1) Ivanhoe VIC, VIC, Australia
Databases store data. That’s obvious, but the data itself is pretty inert—you save
it, you retrieve it, and you sometimes change it. That’s OK for some things, but
sometimes you want the data to work a little harder.
You can put the data to work when you start to summarize it. You can then
see trends, see where it’s going, or just get an overview of the data.
Aggregate functions are used to calculate summaries of data. They have
three contexts:
Summarize the whole table.
Summarize in groups, using GROUP BY.
Include summaries row by row. This is done with window functions, using
the OVER clause.
You’ll learn about window functions in Chapter 8. In this chapter, we look at
how to calculate summaries, either wholly or in groups, using SQL’s built-in
aggregate functions.
-- Book Data
SELECT
-- Count Rows:
count(*) AS nbooks,
-- Count Values in a column:
count(price) AS prices,
-- Cheapest & Most Expensive
min(price) AS cheapest, max(price) AS priciest
FROM books;
-- Customer Data
SELECT
-- Count Rows:
count(*) AS ncustomers,
-- Count Values in a column:
count(phone) AS phones,
-- Height Statistics
stddev_samp(height) AS sd -- MSSQL:
stdev(height)
FROM customers;
ncustomers phones sd
303 286 6.992
All of these functions are applicable to numbers, but only the following may
be used for other data, such as strings and dates:
count
max and min
For example:
SELECT
-- Count Values in a column:
count(dob) AS dobs,
-- Earliest & Latest
min(dob) AS earliest, max(dob) AS latest
FROM customers;
This gives you the number of known dates of birth, together with the earliest
and the latest.
NULL
Aggregate functions do not include NULLs. The only time this is not obvious is
when using the sum function. However, it is significant to note that
count(column) will only count the non-NULL values in the column, so
you may get fewer than the total number of rows.
avg(column) will also ignore the NULL values, so the average is divided
only by the number of values, not necessarily the number of rows.
To put it another way, there is a world of difference between NULL on one
hand and 0 or '' on the other.
We’ll take advantage of this fact when we look at aggregate filters later.
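A small SQLite sketch (using Python's sqlite3 module, with invented data) makes the difference concrete: count(*) counts rows, count(column) skips NULLs, and avg() divides by the number of values, not the number of rows:

```python
import sqlite3

# Invented customers: one missing phone, one missing height.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, phone TEXT, height REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    ("Ann", "123", 170.0),
    ("Ben", None,  180.0),
    ("Cal", "456", None),
])

rows, phones, avg_height = conn.execute(
    "SELECT count(*), count(phone), avg(height) FROM customers"
).fetchone()
# count(*) counts rows; count(phone) skips the NULL phone;
# avg(height) divides by the 2 known heights, not by 3 rows.
print(rows, phones, avg_height)  # 3 2 175.0
```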
Understanding Aggregates
Using aggregates sometimes runs into a few problems and seems to have a few
quirky rules. It all makes more sense if you understand how aggregates really
work.
When you aggregate data, the original data is effectively transformed into a
new virtual table, with summaries for one or more groups.
For example, the query
SELECT
count(*) AS rows,
count(phone) AS phones
FROM customers;
can be regarded as
SELECT
count(*) AS rows,
count(phone) AS phones
FROM customers
GROUP BY () -- PostgreSQL, MSSQL, Oracle only
;
Note that the clause GROUP BY () doesn’t work for all DBMSs, such as
MariaDB/MySQL or SQLite. That doesn’t matter, since the grouping is
happening anyway.
The thing is, with or without the GROUP BY () clause, SQL will generate
the virtual summary table as soon as it finds an aggregate function in the query.
In the preceding example, the data is summarized into a single virtual
summary table of one row. In turn, this virtual table has grand totals for every
column as in Figure 5-1.
This is why you can’t include individual row data with an aggregate query.
For example, this won’t work:
SELECT
id, -- oops
count(*) AS rows,
count(phone) AS phones
FROM customers;
You’ll get an error message basically telling you that you can’t use the id in
the query.
Note that in MariaDB/MySQL, if the ONLY_FULL_GROUP_BY mode is
disabled, you can indeed run this statement successfully. However, the DBMS
will pick the first id it can find, and that really has no meaningful value. It’s
mainly useful if you can be sure that all of the non-aggregate values are the
same.
For example:
SELECT
town, state, -- grouping columns
count(phone) AS phones, -- summaries for each
group:
min(dob) AS oldest
FROM customerdetails
GROUP BY town, state;
(You may get a group of NULLs either at the beginning or the end, because
we haven’t filtered out the NULL addresses.)
In the overall scheme of things, the (virtual) GROUP BY clause appears after
the FROM and possibly WHERE clauses and is evaluated at that point:
SELECT ...
FROM ...
WHERE ...
GROUP BY ...
-- SELECT
ORDER BY ...
SQL neither knows nor cares about the actual meaning of the data, so
there are no checks over whether you should apply these aggregate functions
to particular columns.
Distinct Values
Most aggregate functions can be applied to distinct values, but it is probably
statistically invalid. However, it can be meaningful if you count distinct values,
such as in the following example:
SELECT
count(state) AS addresses,
count(DISTINCT state) AS states
FROM customerdetails;
This will count how many distinct states are in the customer details. That’s
not to say that you can’t count the state column anyway, as it indicates the
number of rows which have any address information at all:
addresses states
278 8
Be careful, though. It’s possible that the column doesn’t give the whole
picture. For example, if you try
Aggregate Filter
Normally, aggregate functions apply to the whole table or to the whole group.
For example, count(*) will count all the rows in the table or group.
A relatively new feature allows you to apply an aggregate function to some
of the rows. This can be applied multiple times in the query.
For example, the following will count all the customers in the customers
table:
Suppose you want to separate the customers into the younger and older
customers.
You might instinctively try something like this:
-- PostgreSQL:
SELECT
count(*) FILTER (WHERE dob<'1980-01-01') AS
older,
count(*) FILTER (WHERE dob>='1980-01-01') AS
younger
FROM customers;
older younger
133 106
SELECT
count(CASE WHEN dob<'1980-01-01' THEN 1 END) AS
old,
count(CASE WHEN dob>='1980-01-01' THEN 1 END) AS
young
FROM customers;
This uses the CASE expression to separate the dob values. They will either
be 1 or NULL, and the count() function counts only the 1s.
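Here is that technique in a runnable SQLite sketch (via Python's sqlite3 module, with invented dates of birth). The NULL dob is counted in neither bucket:

```python
import sqlite3

# Invented dates of birth, including one missing value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (dob TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)", [
    ("1975-04-01",), ("1982-09-09",), ("1990-12-25",), (None,),
])

# Each CASE yields 1 or NULL, and count() counts only the 1s,
# so the NULL dob lands in neither bucket.
old, young = conn.execute("""
    SELECT
        count(CASE WHEN dob<'1980-01-01' THEN 1 END),
        count(CASE WHEN dob>='1980-01-01' THEN 1 END)
    FROM customers
""").fetchone()
print(old, young)  # 1 2
```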
You can also use this technique with other aggregate functions. For example:
-- New Standard
SELECT
sum(total),
sum(total) FILTER (WHERE ordered <'...') AS
older,
sum(total) FILTER (WHERE ordered>='...') AS
newer
FROM sales;
-- Alternative
SELECT
sum(total),
sum(CASE WHEN ordered<'...' THEN total END) AS
older,
sum(CASE WHEN ordered>='...' THEN total END)
AS newer
FROM sales;
Here, the value is either total or NULL, and sum() politely ignores the
NULLs.
You can also group by a derived value. For example, you can group your
customers by their month of birth:
-- PostgreSQL, Oracle
SELECT EXTRACT(month FROM dob) AS monthnumber, count(*) AS howmany
FROM customerdetails
GROUP BY EXTRACT(month FROM dob)
ORDER BY monthnumber;
-- MSSQL
SELECT month(dob) AS monthnumber, count(*) AS howmany
FROM customerdetails
GROUP BY month(dob)
ORDER BY monthnumber;
-- MySQL / MariaDB
SELECT month(dob) AS monthnumber, count(*) AS howmany
FROM customerdetails
GROUP BY month(dob)
ORDER BY monthnumber;
-- SQLite
SELECT strftime('%m',dob) AS monthnumber, count(*) AS howmany
FROM customerdetails
GROUP BY strftime('%m',dob)
ORDER BY monthnumber;
Monthnumber howmany
1 19
2 14
3 17
4 23
5 24
6 15
7 27
8 18
9 18
10 24
11 17
12 23
[NULL] 64
Note that the calculation appears twice, once in the SELECT clause and once
in the GROUP BY clause. This is because the SELECT is evaluated after GROUP
BY, so, alas, its alias is not yet available to GROUP BY.
This is not a real problem, as the SQL optimizer will happily reuse the
calculation, so it’s not really doing it twice.
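For instance, the SQLite version of the month grouping runs like this as a sketch (via Python, with a handful of invented dates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customerdetails (dob TEXT)")
conn.executemany(
    "INSERT INTO customerdetails (dob) VALUES (?)",
    [("1975-06-01",), ("1982-06-15",), ("1990-11-30",), ("1961-11-02",), ("1988-01-09",)],
)

# The strftime() calculation appears in both SELECT and GROUP BY;
# the optimizer reuses it rather than evaluating it twice.
rows = conn.execute("""
    SELECT strftime('%m', dob) AS monthnumber, count(*) AS howmany
    FROM customerdetails
    GROUP BY strftime('%m', dob)
    ORDER BY monthnumber
""").fetchall()

print(rows)
```

With this sample, the months group to January (1), June (2), and November (2).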
Unfortunately, the month number isn’t very friendly, so we could use the
month name. However, inconveniently, the month name is in the wrong sort
order, so we will need both:
-- Not SQLite
-- PostgreSQL, Oracle
SELECT EXTRACT(month FROM dob) AS monthnumber,
    to_char(dob,'Month') AS monthname,
    count(*) AS howmany
FROM customerdetails
GROUP BY EXTRACT(month FROM dob), to_char(dob,'Month')
ORDER BY monthnumber;
-- MSSQL
SELECT month(dob) AS monthnumber,
    datename(month,dob) AS monthname,
    count(*) AS howmany
FROM customerdetails
GROUP BY month(dob), datename(month,dob)
ORDER BY monthnumber;
-- MySQL / MariaDB
SELECT month(dob) AS monthnumber,
    monthname(dob) AS monthname,
    count(*) AS howmany
FROM customerdetails
GROUP BY month(dob), monthname(dob)
ORDER BY monthnumber;
As you see, you can’t quite do this in SQLite since it doesn’t have a function
to get the month name.
Technically, grouping by both is redundant, since there is only one month
name per month. However, we need both so that we can display one, but order
by the other.
Although repeating the calculations is not a problem, it does make the query
less readable and harder to maintain. We can take advantage of using a Common
Table Expression:
WITH cte AS (
...
)
SELECT monthname, count(*)
FROM cte
GROUP BY monthnumber, monthname
ORDER BY monthnumber;
You can use GROUP BY with any calculated field, but note that:

- Simple calculations don’t always result in something worth grouping, so there is a limit on what you can do with them.
- As noted before, the calculation needs to appear in both the SELECT clause and the GROUP BY clause, making the process tedious.
The second point earlier can be alleviated with the use of Common Table
Expressions. The first point can be addressed by the use of CASE statements.
CASE
WHEN dob<'1980-01-01' THEN 'older'
WHEN dob IS NOT NULL THEN 'younger'
-- ELSE NULL
END
Remember that some dobs may be NULL, so you need to filter them to get
the younger ones. Remember, too, that the default ELSE is NULL, so we don’t
need to include it.
To count them, we could include this in the GROUP BY clause as follows:
SELECT count(*)
FROM customers
GROUP BY CASE
WHEN dob<'1980-01-01' THEN 'older'
WHEN dob IS NOT NULL THEN 'younger'
END;
Count
64
133
106
but it’s useless without some sort of labels. We can do this by repeating the
calculation in the SELECT clause:
SELECT
CASE
WHEN dob<'1980-01-01' THEN 'older'
WHEN dob IS NOT NULL THEN 'younger'
END AS agegroup,
count(*)
FROM customers
GROUP BY CASE
WHEN dob<'1980-01-01' THEN 'older'
WHEN dob IS NOT NULL THEN 'younger'
END;
Agegroup count
[NULL] 64
Older 133
Younger 106
but from the point of view of coding, it’s worse than the calculated columns
in the previous section, so this would definitely benefit from the use of a
Common Table Expression:
WITH cte AS (
SELECT
*,
        CASE
            WHEN dob<'1980-01-01' THEN 'older'
            WHEN dob IS NOT NULL THEN 'younger'
        END AS agegroup
    FROM customers
)
SELECT agegroup,count(*)
FROM cte
GROUP BY agegroup;
This will now give you a more manageable result.
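Here is the whole age-group CTE as a runnable sketch (SQLite via Python, with invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (dob TEXT)")
conn.executemany(
    "INSERT INTO customers (dob) VALUES (?)",
    [("1975-06-01",), ("1960-01-20",), ("1982-03-15",), ("1990-11-30",), (None,)],
)

rows = conn.execute("""
    WITH cte AS (
        SELECT
            *,
            CASE
                WHEN dob < '1980-01-01' THEN 'older'
                WHEN dob IS NOT NULL THEN 'younger'
            END AS agegroup
        FROM customers
    )
    SELECT agegroup, count(*)
    FROM cte
    GROUP BY agegroup
    ORDER BY agegroup
""").fetchall()

# SQLite sorts NULL first, so the unknown group comes before the labels.
print(rows)
```

The CASE expression is written once, in the CTE, and the outer query simply groups by its alias.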
As a larger example, let’s give each sale a status based on when it was ordered and shipped, starting with a CTE to precalculate the ages of the orders:
WITH salesdata AS (
    -- PostgreSQL, MariaDB / MySQL, Oracle
    SELECT
        ordered, shipped, total,
        current_date - cast(ordered AS date) AS ordered_age,
        shipped - cast(ordered AS date) AS shipped_age
    FROM sales
    -- MSSQL
    SELECT
        ordered, shipped, total,
        datediff(day,ordered,current_timestamp) AS ordered_age,
        datediff(day,ordered,shipped) AS shipped_age
    FROM sales
    -- SQLite
    SELECT
        ordered, shipped, total,
        julianday('now')-julianday(ordered) AS ordered_age,
        julianday(shipped)-julianday(ordered) AS shipped_age
    FROM sales
)
SELECT
ordered, shipped, total,
    CASE
        WHEN shipped IS NOT NULL THEN
            CASE
                WHEN shipped_age>14 THEN 'Shipped Late'
                ELSE 'Shipped'
            END
        ELSE
            CASE
                WHEN ordered_age<7 THEN 'Current'
                WHEN ordered_age<14 THEN 'Due'
                ELSE 'Overdue'
            END
    END AS status
FROM salesdata;
If you want to summarize this into status groups, you can again put the
whole statement into a CTE and then summarize the CTE. You already have one
CTE to precalculate the age, so we’ll need another to hold the preceding results:
WITH
salesdata AS (
-- as above
),
statuses AS (
SELECT
ordered, shipped, total,
        CASE
            WHEN shipped IS NOT NULL THEN
                CASE
                    WHEN shipped_age>14 THEN 'Shipped Late'
                    ELSE 'Shipped'
                END
            ELSE
                CASE
                    WHEN ordered_age<7 THEN 'Current'
                    WHEN ordered_age<14 THEN 'Due'
                    ELSE 'Overdue'
                END
        END AS status
FROM salesdata
)
SELECT status, count(*) AS number
FROM statuses
GROUP BY status;
Status Number
Due 94
Current 78
Shipped 3808
Overdue 1273
Shipped Late 296
To put the statuses in a more meaningful order, you can find the position of each status inside another string. The function varies between DBMSs:
-- PostgreSQL
POSITION(substring IN string)
-- MariaDB / MySQL & SQLite
INSTR(string,substring)
-- Oracle
INSTR(string,substring)
-- MSSQL
CHARINDEX(substring,string)
In this case, we can find the position of the status string inside a longer
string with the status values in order:
'Shipped,Shipped Late,Current,Due,Overdue'
The commas aren’t necessary, but they make the string more readable.
What’s more important is that the status strings are in your preferred order, and
the position function will return a lower value for strings it finds earlier. The rest
is up to the ORDER BY clause.
We can order the preceding query using the positioning function like this:
WITH
salesdata AS (
-- as above
),
statuses AS (
-- as above
)
SELECT status, count(*) AS number
FROM statuses
GROUP BY status
-- PostgreSQL
ORDER BY POSITION(status IN
    'Shipped,Shipped Late,Current,Due,Overdue')
-- MariaDB / MySQL & SQLite
ORDER BY INSTR(
    'Shipped,Shipped Late,Current,Due,Overdue', status)
-- Oracle
ORDER BY INSTR(
    'Shipped,Shipped Late,Current,Due,Overdue', status)
-- MSSQL
ORDER BY CHARINDEX(status,
    'Shipped,Shipped Late,Current,Due,Overdue')
;
Status number
Shipped 3808
Shipped Late 296
Current 78
Due 94
Overdue 1273
You can use this technique for any nonalphabetical string order, such as days
of the week or colors in the rainbow.
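A sketch of the technique with weekday names (SQLite via Python, made-up log data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (day TEXT)")
conn.executemany(
    "INSERT INTO log (day) VALUES (?)",
    [("Wed",), ("Mon",), ("Fri",), ("Mon",), ("Tue",)],
)

# instr(string, substring) returns a lower value for names that appear
# earlier in the reference string, so ORDER BY gets weekday order.
rows = conn.execute("""
    SELECT day, count(*) AS howmany
    FROM log
    GROUP BY day
    ORDER BY instr('Mon,Tue,Wed,Thu,Fri,Sat,Sun', day)
""").fetchall()

print(rows)
```

The groups come back Monday first, not Friday first as alphabetical order would give.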
Group Concatenation
There is an additional function which can be used to aggregate string data. This
function will concatenate strings with an optional delimiter.
This function has a few different names:
DBMS              Function
PostgreSQL        string_agg(column, delimiter)
SQL Server 2017+  string_agg(column, delimiter)
SQLite            group_concat(column, delimiter)
MySQL / MariaDB   group_concat(column /* ORDER BY column */ SEPARATOR delimiter)
Oracle            listagg(column, delimiter)
For example, you can get a list of all the books for each author this way:
SELECT
    a.id, a.givenname, a.familyname,
    -- PostgreSQL, MSSQL
    string_agg(b.title, '; ') AS works
    -- SQLite
    -- group_concat(b.title, '; ') AS works
    -- Oracle
    -- listagg(b.title, '; ') AS works
    -- MariaDB / MySQL
    -- group_concat(b.title SEPARATOR '; ') AS works
FROM authors AS a LEFT JOIN books AS b ON a.id=b.authorid
GROUP BY a.id, a.givenname, a.familyname;
The works column has all of the book titles concatenated with a ; between
them. Note that the GROUP BY clause uses the author id but includes the
redundant author names to allow them to be selected.
Be careful, though. It’s easy to get carried away with this function: an author’s list of books can be long, and the concatenated string can be very, very long.
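A small runnable sketch of group concatenation (SQLite via Python, with a couple of hypothetical authors and books):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, familyname TEXT)")
conn.execute("CREATE TABLE books (authorid INT, title TEXT)")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(1, "Austen"), (2, "Dickens")])
conn.executemany("INSERT INTO books VALUES (?, ?)",
                 [(1, "Emma"), (1, "Persuasion"), (2, "Bleak House")])

# Note that the order of titles within each group is not guaranteed
# by SQLite's group_concat().
rows = conn.execute("""
    SELECT a.familyname, group_concat(b.title, '; ') AS works
    FROM authors AS a LEFT JOIN books AS b ON a.id = b.authorid
    GROUP BY a.id, a.familyname
    ORDER BY a.id
""").fetchall()

print(rows)
```

Each author’s row carries a single string with all of that author’s titles joined by the delimiter.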
In the preceding example, there are four possible totals that you could get:

- The count() of each state/town combination; this is what you normally get.
- The count() of each state group.
- The count() of each town group. In this example, it’s not so useful, since some town names are duplicated across states, so you’d be combining values which shouldn’t be. However, in other examples, this would be useful.
- The count() of the whole lot: the grand total.
Apart from the last one, the others would all be considered subtotals at some
level.
When we work with the example shortly, we’ll aggregate by three columns,
and there’ll be eight combinations, so eight totals and subtotals we can calculate.
Modern SQL allows you to generate a result set which is a combination of
totals and subtotals of table data and aggregate data. Depending on the DBMS,
this might include a modification of the GROUP BY clause:
GROUPING SETS allow you to specify which additional summaries to
include. So, for example, you can decide which of the four possibilities earlier
you want to include.
This is supported by PostgreSQL, Microsoft SQL, and Oracle.
ROLLUP is a simplified version of GROUPING SETS which produces some
of the possible subtotals, treating the columns as a hierarchy. In the preceding
example, you would get the state/town, state, and grand totals.
This is supported by PostgreSQL, Microsoft SQL, Oracle, and
MariaDB/MySQL.
CUBE is also a specialized version of GROUPING SETS which produces all of the possible subtotals. In the preceding example, it’s all four of the possible totals.
This is supported by PostgreSQL, Microsoft SQL, and Oracle.
Here, we’ll have a look at generating such a summary. However, rather than
work with customers’ addresses, we’ll have a look at sales data.
SELECT
-- PostgreSQL, Oracle
to_char(s.ordered,'YYYY-MM') AS ordered,
-- MariaDB / MySQL
-- date_format(s.ordered,'%Y-%m') AS ordered,
-- MSSQL
-- format(s.ordered,'yyyy-MM') AS ordered,
-- SQLite
-- strftime('%Y-%m',s.ordered) AS ordered,
s.total, c.id, c.state
FROM sales AS s JOIN customerdetails AS c
ON s.customerid=c.id
WHERE s.ordered IS NOT NULL;
To begin with, we’ll generate the summaries separately and combine them
with a UNION clause.
The next step is to generate summaries for the state and customer ids:
-- state summaries
SELECT
state, NULL, NULL, count(*) AS nsales,
sum(total) AS total
FROM salesdata
GROUP BY state
ORDER BY state;
and then for each state and customer id combination:
-- customer summaries
SELECT
    state, id, NULL, count(*) AS nsales,
    sum(total) AS total
FROM salesdata
GROUP BY state, id
ORDER BY state, id;
Don’t worry about the missing column names, as we’ll get them from the
UNION.
The reason to include all those NULLs is to line up the columns when you
combine them in a UNION.
Finally, get the grand total:
-- grand total
SELECT
NULL, NULL, NULL, count(*) AS nsales,
sum(total) AS total
FROM salesdata
-- GROUP BY ()
;
? ? ? nsales total
[NULL] [NULL] [NULL] 5295 326918.22
Note that this includes the commented out GROUP BY () clause, just as a
reminder that this is a grand total; of course, you don’t need it.
The UNION clause can be used to combine the results of multiple SELECT
statements. The only requirement is that they match in the number and types of
columns.
Note that only the first query has aliases for the number of sales and the
total; in a UNION, the column names for the first query apply to the whole result.
You can alias the rest if it makes you feel better, but it won’t make any
difference.
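The whole UNION approach, wrapped in a subquery so we can sort on a contrived key that pushes the grand total last, can be sketched like this (SQLite via Python, invented sales):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salesdata (state TEXT, total REAL)")
conn.executemany("INSERT INTO salesdata VALUES (?, ?)",
                 [("NSW", 10.0), ("NSW", 20.0), ("VIC", 5.0)])

# state IS NULL is 0 for the subtotals and 1 for the grand total, so the
# grand total sorts last regardless of how NULLs would normally sort.
rows = conn.execute("""
    SELECT state, nsales, total FROM (
        SELECT state, count(*) AS nsales, sum(total) AS total
        FROM salesdata
        GROUP BY state
        UNION ALL
        SELECT NULL, count(*), sum(total)
        FROM salesdata
    )
    ORDER BY state IS NULL, state
""").fetchall()

print(rows)
```

The state subtotals come first, in state order, with the NULL-keyed grand total at the bottom.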
When combining different levels of summaries, the higher-level summaries
will have NULL instead of actual values. This is correct, but inconvenient:
When sorted, NULL may appear at the beginning or the end of the list. The
SQL standard is ambivalent on this, and different DBMSs have different
opinions, while some give you a choice.
In any case, NULL in the result set is unclear and unhelpful.
To resolve the sorting problem, we can add a contrived value to force a
sorting order:
To get the results in the right order, we have introduced contrived values, such as state_level and id_level, so that we can push the totals below the other values.
To eliminate the sorting columns from the result set, you can turn this into a
Common Table Expression:
WITH cte AS (
-- UNION query above
)
SELECT state, id, ordered, nsales, total
FROM cte
ORDER BY
state_level,state,id_level,id,ordered_level,ordered;
This isn’t so much work to get the results, but there may be a simpler
method.
SELECT columns
FROM table
GROUP BY GROUPING SETS ((set),(set));
Recall that the previous example had SELECT statements, grouped by state,
customer id, ordered date, and a grand total. This can be generated as follows:
SELECT state, id, ordered, count(*) AS nsales, sum(total) AS total
FROM salesdata
GROUP BY GROUPING SETS ((state,id,ordered),(state,id),(state),());
The CUBE variation works best when you don’t have too many grouping
columns and when they’re all unrelated to each other. Remember, three columns would give you eight possible combinations. You can calculate the number of possibilities as 2^n, where n is the number of columns. In this case, it’s 2^3 = 8. If you had even four columns, you would have 16 possible totals and subtotals, which might start to get overwhelming.
Both forms will give you the same result. Note that MSSQL gives you the
choice to use either form.
ROLLUP makes an important assumption that the columns form some sort of
hierarchy. In the case of the customer state and the customer id, that’s obvious.
Whether you consider the ordered date as the end of the hierarchy is up to you.
You can see the hierarchy in the results and in the fact that this matches the
GROUPING SETS example earlier. You will get results for
1. (state, id, ordered) combinations
2. (state, id) combinations
3. (state) values
4. () – grand totals
Clearly, using ROLLUP is a much simpler way to get these results, and you
probably won’t miss the flexibility of GROUPING SETS very much.
-- PostgreSQL, MSSQL
-- NOT Oracle
SELECT
    coalesce(state,'National Total') AS state,
    coalesce(cast(id AS varchar),state||' Total') AS id,
    coalesce(ordered,'Total for '||cast(id AS varchar)) AS ordered,
    count(*), sum(total)
FROM salesdata
GROUP BY ROLLUP (state,id,ordered)
ORDER BY grouping(state), state,
    grouping(id), id, grouping(ordered), ordered;
This will give you something meaningful for the summary rows.
SELECT
    coalesce(state,'National Total') AS state,
    grouping(state) AS statelevel,
    CASE
        WHEN state IS NULL THEN NULL
        WHEN id IS NULL THEN 'Total for '||state
        ELSE cast(id AS varchar(3))
    END AS id,
    grouping(id) AS idlevel,
    CASE
        WHEN id IS NULL THEN NULL
        WHEN ordered IS NULL THEN 'Total for '||cast(id AS varchar(3))
        ELSE ordered
    END AS ordered,
    grouping(ordered) AS orderedlevel,
    count(*) AS count, sum(total) AS sum
FROM salesdata
GROUP BY ROLLUP (state,id,ordered)
ORDER BY statelevel, state, idlevel, id, orderedlevel, ordered;
Here, the grouping() function is used in the SELECT clause and then
used for sorting. The id and ordered columns are calculated with a CASE
... END expression to get around the problem of the NULL strings.
Of course, now you have those three extra columns used for sorting. To hide
them, you can use a CTE:
WITH cte AS (
-- SELECT statement as above
-- don't bother with the ORDER BY clause
)
SELECT state, id, ordered, count, sum
FROM cte
ORDER BY statelevel, state, idlevel, id, orderedlevel,
ordered
;
For adults in Australia, the mean height is about 168.7 cm. Actually, there
are two mean heights, one for female and one for male adults, but between
them the average is 168.7 cm. The standard deviation is 7 cm. You can get
more information at
https://en.wikipedia.org/wiki/Average_human_height_by_country
SELECT avg(height) AS mean FROM customers;
mean
170.844
Here is a sample of the raw height data:
Height
169
171
153
176
156
176
~ 267 rows ~
WITH heights AS (
SELECT floor(height+0.5) AS height
FROM customers
WHERE height IS NOT NULL
)
SELECT height, count(*) AS frequency
FROM heights
GROUP BY height
ORDER BY height;
Height frequency
153 1
154 3
156 1
157 3
158 1
159 2
~ 36 rows ~
Note that there may be some missing values. That’s natural, especially in a
relatively small sample such as we have. However, with these gaps it’s not quite
ready for a histogram. Later, when we have a closer look at recursive common
table expressions, we’ll see how to fill in the gaps.
WITH
heights AS (
SELECT floor(height+0.5) AS height
FROM customers
WHERE height IS NOT NULL
), -- don't forget to add a comma here
frequency_table AS (
SELECT height, count(*) AS frequency
FROM heights
GROUP BY height
)
...
Finally, you can cross join the frequency table to the limits CTE to find
the mode(s):
WITH
heights AS (
...
),
frequency_table AS (
...
),
limits AS (
...
)
SELECT height, frequency
FROM frequency_table,limits
WHERE frequency_table.frequency=limits.max
ORDER BY height;
Height frequency
172 22
In a perfect set of normal data, the mode should match the mean exactly. In
real life, it should be close.
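The whole chain of CTEs for the mode can be run as a sketch (SQLite via Python; the heights are invented, and cast(height + 0.5 AS INTEGER) stands in for floor(height + 0.5), since SQLite’s floor() is an optional math function that isn’t in every build):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (height REAL)")
conn.executemany("INSERT INTO customers (height) VALUES (?)",
                 [(168.6,), (172.2,), (171.8,), (172.4,), (169.1,), (None,)])

rows = conn.execute("""
    WITH heights AS (
        SELECT cast(height + 0.5 AS INTEGER) AS height
        FROM customers
        WHERE height IS NOT NULL
    ),
    frequency_table AS (
        SELECT height, count(*) AS frequency
        FROM heights
        GROUP BY height
    ),
    limits AS (
        SELECT max(frequency) AS max FROM frequency_table
    )
    SELECT height, frequency
    FROM frequency_table, limits
    WHERE frequency_table.frequency = limits.max
    ORDER BY height
""").fetchall()

# rows holds the mode(s): the most frequent rounded height(s).
print(rows)
```

With this tiny sample, the rounded heights are 169, 172, 172, 172, 169, so the mode is 172 with a frequency of 3.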
Calculating the Median
The median is the middle value. What that means is that half of the values
should be below the median and half should be above.
To find the median involves putting all of the values in order and finding the
midpoint. You can do this if you like, but that involves some skills we haven’t
developed yet, in particular getting the row number. We’ll look at that later in the
chapter on window functions.
Fortunately, modern SQL includes a function called percentile_cont.
Unfortunately, not all DBMSs use it the same way, and SQLite doesn’t support it
at all.
The percentile_cont() function finds a value at a given percentile. Percentiles divide the ordered values into one hundred groups; the 50th percentile is in the middle.
To find the median in PostgreSQL:
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY height)
FROM customers;
percentile_cont
171.2
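SQLite has no percentile_cont(), but you can get the median by ordering the values and taking the middle one or two. This is a sketch (via Python, with invented heights); averaging the middle pair matches what percentile_cont(0.5) does for an even count:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (height REAL)")
conn.executemany("INSERT INTO customers (height) VALUES (?)",
                 [(169,), (171,), (153,), (176,), (156,)])

(n,) = conn.execute("SELECT count(height) FROM customers").fetchone()

# One middle value for an odd count, the average of two for an even count.
limit = 2 - n % 2
offset = (n - 1) // 2
(median,) = conn.execute("""
    SELECT avg(height) FROM (
        SELECT height FROM customers
        WHERE height IS NOT NULL
        ORDER BY height
        LIMIT ? OFFSET ?
    )
""", (limit, offset)).fetchone()

print(median)
```

For the five sample values, the sorted middle value is 169.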
SELECT
stddev_pop(height) AS sd
-- stdevp(height) AS sd -- MSSQL
FROM customers;
Remember that the standard deviation only has meaning when you believe
that the underlying data follows a normal distribution.
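If your DBMS has no standard deviation function (SQLite, for instance), you can compute the population value from two averages, since sd^2 = mean(x^2) - mean(x)^2. A sketch with a small made-up data set:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (height REAL)")
conn.executemany("INSERT INTO customers (height) VALUES (?)",
                 [(2.0,), (4.0,), (4.0,), (4.0,), (5.0,), (5.0,), (7.0,), (9.0,)])

mean_sq, mean = conn.execute(
    "SELECT avg(height * height), avg(height) FROM customers"
).fetchone()

# Population standard deviation: sqrt(mean of squares - square of mean).
sd = math.sqrt(mean_sq - mean * mean)
print(sd)
```

This sample has mean 5 and population standard deviation exactly 2.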
Summary
In this chapter, we had a look at aggregating sets of data.
NULLs
Aggregate functions all skip NULLs. This is particularly important when
counting values, but also when calculating averages.
The fact that NULLs are skipped can also be used when calculating selective
aggregates.
Aggregate Filters
It’s possible to filter what data is used for a single aggregate function.
There is a standard FILTER (WHERE ...) clause which allows you to
filter a column. However, it’s not (yet) widely supported.
The common way to filter data is to use the CASE ... END expression on
what you’re aggregating. Set a value of, say, 1 for the values you want, allow the
rest to default to NULL, and let the aggregate functions ignore them for the rest.
You can also aggregate on DISTINCT values. This makes the most sense
when you are counting.
GROUP BY
The GROUP BY clause can be used to generate a virtual table of group
summaries.
In some DBMSs, you can use GROUP BY () to generate a grand summary.
This is the default without the GROUP BY () clause and is automatically done
whenever SQL sees an aggregate function. It’s never truly needed.
You can group by basic values, but also by calculated values.
Grouping by calculated values can get complicated, since the SELECT and
ORDER BY clauses can only use what’s in the GROUP BY. Because of the
clause order, you may find yourself repeating the same calculations in various
clauses.
Since the SELECT clause is only evaluated near the end, and selecting and
ordering can only be done on what’s in the GROUP BY clause, you may find the
following techniques helpful:
- Using redundant groups to select one thing and sort by another
- Putting aggregate queries in a CTE and joining that with other tables to get the rest of the results
When grouping by a column, your results may not be in the correct order.
Since the group names are all strings, sorting on the group name will only put
them in alphabetical order, which isn’t always suitable. However, you can also
sort them by their position in another string, which can be in any order you like.
Mixing Subtotals
By and large, aggregate queries produce simple aggregates on one level.
Sometimes, you need to combine them with various levels of subtotals.
You can generate subtotals in separate queries and combine them with
UNION. You might need some extra work to get the results sorted in your
preferred order.
Most DBMSs include subtotaling operations to create the combined result
automatically. They may include GROUPING SETS, ROLLUP, or CUBE. Most
include the ROLLUP which is the most common variation. There are additional
grouping functions to assist with sorting and labeling.
Statistics
In general, aggregate functions are basically statistical in nature. Although SQL
is not as powerful as dedicated statistical software, you can use aggregates and
grouping to generate some of the basic statistics.
Coming Up
In some cases, we have used a query in a Common Table Expression to prepare
data. However, in one case we created a view instead, so that we could reuse the
query.
In the next chapter, we’ll have a closer look at creating and using views to
improve our workflow.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
M. Simon, Leveling Up with SQL
https://doi.org/10.1007/978-1-4842-9685-1_6
This isn’t the first time you’ve looked at views in this book, and it won’t be
the last time.
This chapter consolidates what you may have already picked up about
views and related concepts and gives a few ideas about working with them in
general.
You can spend the rest of your life writing SQL statements, and the job
would get done. However, you might get to the point where writing the same
thing over and over again loses its charm, and so you’ll want to find ways of
reusing previous queries.
First, let’s have a look at what we mean by tables and what happens when
you use the SELECT statement.
SQL databases store data in tables. Actually, they don’t—each table is really
stored in some other structure such as a binary tree, which is more efficient.
However, by the time you see it, it will be presented as a table, and that’s what
it’s called in the database.
A table is made up of rows and columns. For our purpose, the table doesn’t
have to be a permanent table, and there are operations which generate table
structures without necessarily being permanently stored. We’ll refer to them as
virtual tables.
Here is a list of operations which generate (virtual) tables, in increasing order
of longevity:
The result of a SELECT statement is a virtual table.
A join is the combination of two or more (virtual) tables to produce an
expanded virtual table.
A Common Table Expression generates a virtual table which you can use
later in the query. A table subquery amounts to the same thing.
A view is a saved SELECT query, which will regenerate the virtual table on
call.
A materialized view is a view which stores its results so it doesn’t always
have to regenerate everything.
Some DBMSs support Table-Valued Functions which are functions
which generate virtual tables.
A temporary table is like a real table. It may or may not be stored on the
disk. It will self-destruct when the session is finished.
The thing about tables and virtual tables is that they can all be used in the
FROM clause.
You will already know about using joins. You have also used Common Table
Expressions, but we’ll discuss them in more depth in the following chapters.
In this chapter, we’ll look at the rest and how we can improve our workflow
with them.
Note the syntax is exactly the same as for tables. From the perspective of a
SELECT statement, there is no distinction between selecting from a view and
from a table.
One important consequence of this is that you cannot have views with the
same names as tables—views and tables share the name space.
This doesn’t mean that there are no differences. The DBMS stores views as
separate types of objects and manages them differently. However, once created,
you can treat a view like a table.
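Here is the idea in miniature (SQLite via Python; the table and view names are hypothetical). Once the view exists, selecting from it is indistinguishable from selecting from a table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany("INSERT INTO books VALUES (?, ?)",
                 [("Emma", 12.5), ("Bleak House", None)])

# A view is just a saved SELECT statement.
conn.execute("""
    CREATE VIEW pricedbooks AS
    SELECT title, price, price * 1.1 AS inc
    FROM books
    WHERE price IS NOT NULL
""")

rows = conn.execute("SELECT * FROM pricedbooks").fetchall()
print(rows)
```

The view filters out the unpriced book and adds the calculated inc column, all regenerated each time it’s queried.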
Views can be an important part of your workflow. For example:
Saving a complex query to be used simply
Exposing complex queries to external applications
Creating an interface between existing data and an application
As a substitute for some queries
Creating a view requires permissions which you may not already have
as a database user.
Creating a View
A view starts off as a simple SELECT statement. For example, we can start
developing a pricelist view which will comprise some information about
books, their authors, and the price, including tax:
/* Notes
===================================================
MSSQL: Use + for concatenation
Oracle: No AS for tables:
FROM books b JOIN authors a ON ...
===================================================
*/
SELECT
b.id, b.title, b.published,
coalesce(a.givenname||' ','')
|| coalesce(othernames||' ','')
|| a.familyname AS author,
b.price, b.price*0.1 AS tax, b.price*1.1 AS inc
FROM books AS b LEFT JOIN authors AS a ON b.authorid=a.id
WHERE b.price IS NOT NULL;
Once you save this query as a view called aupricelist, you can use it like any other table:
SELECT *
FROM aupricelist
WHERE published BETWEEN 1700 AND 1799;
With the exception of MSSQL, you could have included the ORDER BY clause in the view itself. Although it’s convenient, it’s probably not a good idea: you’re forcing the DBMS to sort the result whether you need it or not, and you may end up sorting it again in a different order afterward.
Among other things, this will allow you to create an ordered view without
the need to include extra columns just for sorting.
However, you need to be aware that an ordered view does place an extra
burden on the database, so it should only be used when needed.
Table-Valued Functions
Views are a powerful tool, but there’s one shortcoming: you can’t change any of
the values used in a view. For example, the aupricelist view has a hard-
coded tax rate of 10%. A more flexible type of view would allow you to input
your own tax rate. Such a view would then be called a parameterized view.
Parameterized views are not generally supported in SQL. Some DBMSs
support functions which generate a virtual table, known as a Table-Valued
Function, or TVF if you’re in a hurry. This will give more or less the same
result.
Of our popular DBMSs, only PostgreSQL and Microsoft SQL Server support
a straightforward method of creating a TVF. We’ll explore these two in the
following discussion.
Most DBMSs allow you to create custom functions. The notable exception is
SQLite, which does, however, allow you to create functions externally and hook
them in.
A function which generates a single value at a time is called a scalar
function. Built-in functions such as lower() and length() are scalar
functions.
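For SQLite, “hooking in” an external function looks like this from Python, using the sqlite3 module’s create_function(); the tax() function here is a hypothetical example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Register a Python callable as a scalar SQL function taking two arguments.
conn.create_function("tax", 2, lambda price, rate: price * rate / 100)

conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.execute("INSERT INTO books VALUES ('Emma', 12.5)")

row = conn.execute(
    "SELECT title, price, tax(price, 15) FROM books"
).fetchone()
print(row)
```

After registration, tax() can be used in SQL exactly like a built-in scalar function.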
When creating a function, there is, in a sense, a contract. The function
definition includes what input data is expected and what sort of data will be
returned. If the input data doesn’t fit, then don’t expect a result.
A TVF works the same way: you define what input is expected, and you
promise to return a table of results. Here, we’ll create a more generic price list
which allows you to tell it what the tax rate is, rather than hard-coding it.
To use the TVF, you use it like any virtual table:
SELECT *
FROM pricelist(15);
Here, the TVF is called pricelist() and the input parameter is 15,
meaning 15%. The code should handle converting that to 0.15:
id Title pub author price tax inc
2078 The Duel 1811 Heinrich von Kleis ... 12.50 1.88 14.38
503 Uncle Silas 1864 J. Sheridan Le Fan ... 17.00 2.55 19.55
2007 North and South 1854 Elizabeth Gaskell 17.50 2.63 20.13
702 Jane Eyre 1847 Charlotte Brontë 17.50 2.63 20.13
1530 Robin Hood, The Pr ... 1862 Alexandre Dumas 12.50 1.88 14.38
1759 La Curée 1872 Émile Zola 16.00 2.40 18.40
~ 1070 rows ~
TVFs in PostgreSQL
The outline of a TVF in PostgreSQL looks like this:
In this outline
The function name pricelist includes the input parameter names and
types.
The function will return a TABLE structure with column names and types.
The coding language is plpgsql which is PostgreSQL’s standard coding
language.
The actual code is contained in one big string. Because there might be other
strings in the code, the $$ at either end acts as an alternative delimiter.
The code is then placed between BEGIN and END; in this case, it will return
the results of a SELECT query.
Filling in the details, we can write
The output table is the most tedious part. In it, we have to list all of the
column names and types we’re expecting to generate.
As for the calculation, we’ve taken a user-friendly approach and allowed the
tax rate to resemble the percentage we might have used in real life. We can’t use
%, especially as that has another meaning, but other than that, we can use the
value. However, we then need to divide by 100 to get its real value.
GO
CREATE FUNCTION pricelist(...) RETURNS TABLE AS
RETURN SELECT ...
GO
There are two types of TVF in MSSQL. There is a more complex type, but
the simpler type earlier is very similar to creating a view.
In this outline
The function name pricelist includes the input parameter names and
types.
The function will return a TABLE structure.
In the simple TVF, there is only a single SELECT statement, which is
immediately returned as the result.
The actual code is almost the same as for the view, except that it will include
the value from the input parameter.
Filling in the details, we can write
Convenience
The most immediate use of a view is as a convenient way of packaging a useful
SELECT query. For example:
Both of the preceding views include joins, and one includes a number of
calculations. It’s much more convenient to use the saved view when you need it.
As an Interface
A second use of views is to present a consistent interface for existing data.
For example, when we refactored the customers table by referencing
another table and dropping a few columns, we ran the risk of invalidating any
other queries which depended on the old structure. By creating the
customerdetails view, you have a new virtual table which can be read the
same way as the old table.
It can also be handy if you’re in the process of renaming or rearranging
tables and columns. Suppose, for example, you’re in the process of developing a
new version of the customers table, with some of the following columns:
/* Notes
========================================================
MSSQL: Use + for concatenation
Oracle, SQLite: Use substr(phone,2) instead of right()
=======================================================
*/
-- CREATE VIEW newcustomers AS
SELECT
id AS customerid,
givenname AS firstname, familyname AS lastname,
cast(height/2.54 as decimal(3,1))
AS height_in_inches,
'+61' || right(phone,9) AS au_phone
-- etc
FROM customers;
(The CREATE VIEW clause is commented out, because we’re not really
going to go ahead with this.)
This approach will also be useful if you’re preparing data for an external
application.
-- Oracle
CREATE GLOBAL TEMPORARY TABLE somebooks (
id INT PRIMARY KEY,
title VARCHAR(255),
author VARCHAR(255),
price DECIMAL(4,2)
);
-- MSSQL
CREATE TABLE #somebooks (
    id INT PRIMARY KEY,
    title VARCHAR(255),
    author VARCHAR(255),
    price DECIMAL(4,2)
);
Note:
PostgreSQL allows you to use the GLOBAL and LOCAL keywords for the same purpose, but then ignores them; they recommend leaving them out.
MSSQL uses hashes for global and private temporary tables: one hash (#) for private and two hashes (##) for global.
By “global,” we mean that other users of the database can access the temporary table. Private ones are, well, private to the session.
If you’re in a desperate hurry, PostgreSQL and SQLite allow you to save
time by writing TEMP instead of TEMPORARY. It probably took you more time
to read this paragraph.
The temporary table in this example has a simple integer primary key. If you
intend to add more data as you go, you might instead use an autoincremented
primary key.
Once you have created your temporary table, you can copy data into it using
the SELECT statement. For example:
The INSERT ... SELECT ... statement copies data into an existing
table, temporary or permanent.
You can create a new table and populate it in one statement with the
following statement:
-- PostgreSQL, SQLite
SELECT id,title,author,price
INTO TEMPORARY otherbooks
FROM aupricelist
WHERE price IS NULL;
-- MSSQL
SELECT id,title,author,price
INTO #otherbooks
FROM aupricelist
WHERE price IS NULL;
As you see, this statement takes one of two forms; PostgreSQL supports
both.
Note that either form requires that you have permissions to create either a
temporary or permanent table.
Remember, however, that the data is a copy, so it will go stale unless you
update it.
Why would you want a temporary table? There’s nothing in our sample
database which could be regarded as in any way heavy-duty. However, in the
real world, you might be working with a query which involves a huge number of
rows, complex joins, filters and calculations, and sorting. This could end up
taking a great deal of time and effort, especially if you’re constantly regenerating
the data.
The reasons you would use a temporary table rather than a view include
It’s more efficient to save previously generated results than it is to regenerate
them. This is called caching the results.
Sometimes, you want the data to be out of date, such as when you need to
work with a snapshot of the data from earlier in the day.
If you need to work with the snapshot at some point in the future, a
temporary table may be too fleeting. Everything we’ve done will also apply to
specially created permanent tables.
A database should never keep multiple copies of data. However, there are
times when you need a temporary table for further processing, for
experimenting, or as a staging point when migrating data.
Computed Columns
Modern SQL allows you to add a column to a table which in principle shouldn’t
be in a table. A computed column, or calculated column, is an additional
column which is based on some calculated value. When you think about it, that’s
the sort of thing you would do in a view.
Think of the computed column as embedding a mini-view in the table. It’s
particularly handy if you commonly use one calculation but don’t want the
overhead of a view. It can also be handy if you have the option to cache the
results.
A computed column is a read-only virtual column. You can’t write anything
into the column, and, if it saves any data at all, it’s a cached value to save the
effort of recalculating it later. For example, you might store the full name of the
customer as a convenience.
You can create a computed column when you create the table, or you can add
it to the table after the event.
For example, suppose we want to add a shortened form of the ordered
datetime column, with just the date. This will be handy for summarizing by day.
You can add the new column as follows:
-- PostgreSQL >= 12
ALTER TABLE sales
ADD COLUMN ordered_date date
GENERATED ALWAYS AS (cast(ordered as date))
STORED;
-- MSSQL
ALTER TABLE sales
ADD ordered_date AS (cast(ordered as date))
PERSISTED;
-- MariaDB / MySQL
ALTER TABLE sales
ADD ordered_date date
GENERATED ALWAYS AS (cast(ordered as date))
STORED;
-- SQLite>=3.31.0
ALTER TABLE sales
ADD ordered_date date
GENERATED ALWAYS AS (cast(ordered as date))
VIRTUAL;
-- Oracle (STORED)
ALTER TABLE sales
ADD ordered_date date
GENERATED ALWAYS AS (trunc(ordered));
As you see, most DBMSs use the standard GENERATED ALWAYS syntax.
MSSQL, however, uses its own simpler syntax, which doesn’t specify the data
type but infers it from the calculation.
You’ll also notice different types of computed column:
VIRTUAL columns are not stored and are recalculated. This is the default in
MSSQL.
STORED columns save a copy of the result and will only recalculate if the
underlying value has changed.
MSSQL calls this PERSISTED. In Oracle, it’s the default. SQLite does
support this as well, but only if you create the table that way; if you add the
column later, it can only be VIRTUAL.
You can now fetch the data, complete with the virtual column.
If you have the choice, the better option is STORED or its equivalent. It
takes a little more space but saves on processing later.
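Here's a small self-contained sketch of a virtual computed column, using Python's `sqlite3` module. The `sales` rows are made up, and `date(ordered)` stands in for the `cast(ordered as date)` used above, since SQLite stores dates as text. It assumes an SQLite library of at least 3.31.0:

```python
import sqlite3

# Generated columns need SQLite 3.31.0 or later
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        ordered TEXT,      -- ISO datetime string
        total REAL
    );
    INSERT INTO sales VALUES
        (1, '2023-05-15 10:30:00', 25.00),
        (2, '2023-05-15 14:45:00', 40.00);

    -- Added after the event, so it can only be VIRTUAL in SQLite
    ALTER TABLE sales
    ADD ordered_date TEXT
    GENERATED ALWAYS AS (date(ordered)) VIRTUAL;
""")

# The virtual column behaves like any other column,
# here used to summarize by day
rows = conn.execute("""
    SELECT ordered_date, sum(total)
    FROM sales
    GROUP BY ordered_date
""").fetchall()
print(rows)
```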
Summary
Much of your work will involve not only real tables but generated virtual tables.
Virtual tables include
A join
A Common Table Expression or a table subquery
A view or, in some cases, a Table Valued Function
A temporary table.
Views
A view is a saved SELECT statement. It can be made as complex as you like and
then fetched as a virtual table.
The benefits of views include
They can be a convenient way of working with data.
They can act as an interface to your data, particularly where the original or
modified form doesn’t match your requirements.
They offer a simple table view of complex data when accessed from external
applications.
Temporary Tables
There are times when it is better to store results rather than regenerate them
every time. You can save them into a caching table.
The benefits include
It’s more efficient not to have to recalculate what will be the same results.
You might want to work with a dated snapshot of your data.
If your cache is intended to be particularly short-lived, you might use a
temporary table. A temporary table is one which will self-destruct at the end of
the session.
Whether the caching table is temporary or permanent, you can copy data into
it using a SELECT statement. You can also create a new table and copy into it in
a single statement.
Computed Columns
In modern DBMSs, you can create virtual columns in a table which give the
results of a calculation.
A VIRTUAL computed column will regenerate the value every time you
fetch from the table. A STORED computed column, a.k.a. PERSISTED in
MSSQL, will cache the results until other data has changed.
A computed column can be used for convenience. If it’s a STORED column,
it also has the benefit of saving on processing.
Coming Up
A SELECT statement doesn’t have to be the end of the story. In some cases, it
can be one step in a more complex story.
A subquery allows you to embed a SELECT statement inside a query. This
can be used to fetch values from other tables or to use one table to filter another.
It’s particularly handy if you want to incorporate aggregate data in another query.
The next chapter will look at subqueries in more detail.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
M. Simon, Leveling Up with SQL
https://doi.org/10.1007/978-1-4842-9685-1_7
Running a SELECT statement, assuming that there’s no error, gives you a result.
That result is a virtual table, and it will have rows and columns.
For our purposes, we are interested in three possible virtual tables:
One row and one column: You get just one value, though technically it’s still
in a table. We’ll call this a single value.
One column and multiple rows: When the time comes, we’ll call this a list.
Multiple rows and multiple columns: In this context, a single row with
multiple columns counts as the same sort of thing. This is more like the sort of
thing we think about when talking about virtual tables.
Of course, that result may be empty, but that’s treated as NULLs.
You’ll get these types of results from the following examples.
For example, one row and one column:
Id
392
And one column and multiple rows:
Email
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
~ 64 rows ~
That last category, the virtual table, could also be the result of a very broad
query such as SELECT * FROM customerdetails. It all works the same
way.
Any of these results, depending on the context, can be used in a subsequent
query where a single value, a list, or a (virtual) table might have been expected.
For example, using a single value:
SELECT *
FROM saleitems
WHERE bookid=(SELECT id FROM books WHERE
title='Frankenstein');
Here, the single-value query is wrapped inside parentheses and used the way
you would if you already knew the value of the bookid you’re matching.
Here’s another, this time using a list:
SELECT *
FROM books
WHERE authorid IN (
    SELECT id FROM authors
    WHERE born BETWEEN '1700-01-01' AND '1799-12-31'
);
The IN operator expects a list of values, which we get from the one column
in the nested SELECT statement:
-- Oldest Customers
SELECT *
FROM customers
WHERE dob=(SELECT min(dob) FROM customers);
(You’ll note that there’s more than one oldest customer, because they happen
to be born on the same day. It happens.)
In both cases, the subquery is evaluated once, and the results are used in the
main query. The result may be a list, as with the eighteenth-century authors, or
a single value, as with the oldest customer.
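As an illustration — with a tiny invented catalogue rather than the book's sample database — here are both shapes of non-correlated subquery, run through Python's `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (
        id INTEGER PRIMARY KEY, name TEXT, born TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY, title TEXT, authorid INT);
    INSERT INTO authors VALUES
        (1, 'Mary Shelley',     '1797-08-30'),
        (2, 'Charlotte Bronte', '1816-04-21');
    INSERT INTO books VALUES
        (1, 'Frankenstein', 1),
        (2, 'Jane Eyre',    2),
        (3, 'The Last Man', 1);
""")

# Single value: the subquery returns one row, one column
single = conn.execute("""
    SELECT title FROM books
    WHERE authorid = (SELECT id FROM authors
                      WHERE name = 'Mary Shelley')
    ORDER BY id
""").fetchall()

# List: IN expects one column, any number of rows
in_list = conn.execute("""
    SELECT title FROM books
    WHERE authorid IN (SELECT id FROM authors
                       WHERE born < '1800-01-01')
    ORDER BY id
""").fetchall()

print(single)   # Mary Shelley's two books
print(in_list)  # the same here, since only she was born before 1800
```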
A non-correlated subquery is independent of the main query. If you highlight
the subquery alone and run it, you’ll get a result.
Here’s an example of a correlated subquery:
-- MSSQL
SELECT
id, title, (
SELECT coalesce(givenname+' ','')
+ coalesce(othernames+' ','')
+ familyname
FROM authors
WHERE authors.id=books.authorid
) AS author
FROM books;
-- Oracle
SELECT
id, title, (
SELECT ltrim(givenname||' ')
||ltrim(othernames||' ')
||familyname
FROM authors
WHERE authors.id=books.authorid
) AS author
FROM books;
In this case, the subquery is evaluated once for every row. Look at the
subquery in the first example earlier, spread out to be more readable:
(
SELECT
coalesce(givenname||' ','')
|| coalesce(othernames||' ','')
|| familyname
FROM authors
WHERE authors.id=books.authorid
)
The SELECT clause is expecting a single value for the author column, and
so the subquery should deliver a single value, which it does. You can’t use
multiple columns in this context, so you need to concatenate the names to give
the single value.
Just as importantly, you can’t have multiple rows either. Here, the WHERE
clause filters the result to a single row, where the id matches the authorid in
the main query: WHERE authors.id=books.authorid.
For every row in the books table, the subquery runs again to match the next
authorid.
If there’s no match, the subquery comes back with a NULL.
You can recognize a correlated subquery by the fact that the query references
something from the main query. As a result, you can’t highlight the subquery and
run it alone, because it needs that reference to be complete.
Incidentally, note the WHERE clause in the subquery. In a sense, it’s
overqualified, and we could have used this: WHERE id=authorid. This is in
spite of the fact that an id column appears in both the subquery and the main
query.
When the subquery is evaluated, column names will be defined from the
inside out. For the id column, there’s one in the inner authors table, so SQL
doesn’t bother to notice that there’s also one in the outer books table. For the
authorid column, there isn’t one in the authors table, so it falls through to
the one in the books table.
That’s how it works in SQL, but it’s probably better to qualify the columns
as we did in this example to minimize confusion for us humans.
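Here's a minimal runnable sketch of a correlated scalar subquery, again with invented data in Python's `sqlite3`. It also shows what happens when there's no match:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY,
                          givenname TEXT, familyname TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY,
                        title TEXT, authorid INT);
    INSERT INTO authors VALUES (1, 'Mary', 'Shelley');
    INSERT INTO books VALUES
        (1, 'Frankenstein', 1),
        (2, 'Dracula',      NULL);   -- no matching author
""")

# The subquery is re-evaluated for each row of books,
# matching on the outer row's authorid
rows = conn.execute("""
    SELECT
        title,
        (SELECT coalesce(givenname||' ','') || familyname
         FROM authors
         WHERE authors.id = books.authorid) AS author
    FROM books
    ORDER BY id
""").fetchall()
print(rows)  # no match for Dracula, so its author is NULL (None)
```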
As a rule, a correlated subquery is an expensive operation because it’s
reevaluated so often. That doesn’t mean you shouldn’t use one, just that you
should consider the alternatives, if there are any. You don’t generally get to
choose which type of subquery you will need, but it will help in deciding
whether there’s a better alternative.
Suppose, for example, you also want each author’s dates. With scalar
subqueries, you need one subquery for each extra column:
SELECT
id, title, (
SELECT coalesce(givenname||' ','')
|| coalesce(othernames||' ','')
|| familyname
FROM authors
WHERE authors.id=books.authorid
) AS author,
(SELECT born FROM authors
WHERE authors.id=books.authorid) AS born,
(SELECT died FROM authors
WHERE authors.id=books.authorid) AS died
FROM books;
Apart from being tedious, it’s also expensive, and, of course, there’s a better
way to do it, using a join:
SELECT
id, title,
coalesce(givenname||' ','')
|| coalesce(othernames||' ','')
|| familyname AS author,
born, died
FROM books AS b LEFT JOIN authors AS a ON
b.authorid=a.id;
-- Oracle
-- FROM books b LEFT JOIN authors a ON b.authorid=a.id;
In fact, you’ll probably find that a correlated subquery is often best replaced
by a join. There’s also some cost in the join, but after that, the rest of the data is
free.
On the other hand, if the subquery is non-correlated, then it’s not so
expensive. For example, here’s the difference between customers’ heights and
the average height:
SELECT
id, givenname, familyname,
height,
height-(SELECT avg(height) FROM customers) AS diff
FROM customers;
Even though the average is involved in a calculation in every row, it’s only
calculated once in the non-correlated subquery.
By the way, there’s an alternative way to do the preceding query involving
window functions, which we’ll look at in Chapter 8. However, in this case,
there’s not much difference in the result.
You’ll have noticed that, in this case, the subquery references the same table
as the main query. That doesn’t make it a correlated subquery, as it doesn’t
reference the actual rows in the main query. You can verify that if you highlight
the subquery and run it by itself—it will work.
The subquery in this example was an aggregate query. You can also use an
aggregate in a correlated query. Here’s a way of generating a running total:
SELECT
    id, ordered, total,
    (
        SELECT sum(total) FROM sales AS ss
        WHERE ss.ordered<=sales.ordered
    ) AS running_total
FROM sales
ORDER BY ordered;
We’ve had to alias the table in the subquery to something like ss (subsales?)
to distinguish it from the same table in the main query. That’s so that the
expression ss.ordered<=sales.ordered can reference the correct tables.
Here, the subquery calculates the sum of the totals up to and including
the current sale, ordered by the ordered column.
You possibly noticed that the query took a little while to run. As we noted, a
correlated subquery is costly, and one which involves aggregates is especially
costly. Fortunately, there’s also a window function for that, as we’ll see in the
next chapter.
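Here's the running-total pattern in miniature, using invented sales rows in Python's `sqlite3`. The correlated subquery re-sums everything up to the current row, which is exactly what makes it expensive on a real table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY,
                        ordered TEXT, total REAL);
    INSERT INTO sales VALUES
        (1, '2023-05-01', 10.0),
        (2, '2023-05-02', 20.0),
        (3, '2023-05-03',  5.0);
""")

# Correlated aggregate: for each row, sum everything
# up to and including that row's ordered date
rows = conn.execute("""
    SELECT
        ordered, total,
        (SELECT sum(total) FROM sales AS ss
         WHERE ss.ordered <= sales.ordered) AS running_total
    FROM sales
    ORDER BY ordered
""").fetchall()
print(rows)  # running_total accumulates: 10, 30, 35
```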
Subqueries in the WHERE Clause
A subquery can also be used to filter your data.
Again, you may find an alternative to subqueries, such as JOINs, but one
compelling use case is when the subquery is an aggregate query.
Here are some cases where the subquery makes the point clearly and simply.
SELECT *
FROM customers
WHERE dob=(SELECT min(dob) FROM customers);
You can also do the same to find customers shorter than the average:
SELECT *
FROM customers
WHERE height<(SELECT avg(height) FROM customers);
In both cases, the aggregate query was on the same table as the main query.
You might have thought that you could use an expression like WHERE
dob=min(dob) or WHERE height<avg(height), but it wouldn’t work;
aggregates are calculated after the WHERE clause.
Big Spenders
Suppose you want to identify your “big spenders”—the customers who have
spent the highest amounts. For that, you will need data from the customers
and sales tables.
Here, we’ll use subqueries as part of a multistep process.
To begin with, you’ll want to identify what you regard as large purchases:
SELECT * FROM sales WHERE total>160;
Here, we’re only interested in the customerid, which we’ll use to select
from the customers table:
SELECT *
FROM customers
WHERE id IN(SELECT customerid FROM sales WHERE
total>160);
id … familyname givenname …
42 … Knott May …
58 … Ting Jess …
91 … North June …
140 … Byrd Dicky …
40 … Face Cliff …
141 … Rice Jasmin …
~ 32 rows ~
You can get the same result with a join:
SELECT DISTINCT customers.*
FROM customers JOIN sales ON
customers.id=sales.customerid
WHERE sales.total>160;
To recreate what we had in the previous query, we’ve qualified the star
(customers.*) and used DISTINCT to remove duplicates of customers who
may have appeared in the list more than once.
The advantage of using a join is that you can also get sales data for the
asking, so this gives a slightly richer result:
SELECT *
FROM customers JOIN sales ON
customers.id=sales.customerid
WHERE sales.total>=160;
Here, we’ve removed the DISTINCT and the customers., so you’ll get a
lot of data:
To find customers with large total sales will require an aggregate subquery:
SELECT *
FROM customers
WHERE id IN(
SELECT customerid FROM sales
GROUP BY customerid HAVING sum(total)>=2000
);
id … familyname givenname …
42 … Knott May …
58 … Ting Jess …
26 … Twishes Bess …
91 … North June …
69 … Mentary Rudi …
140 … Byrd Dicky …
~ 57 rows ~
Suppose you want to find each customer’s most recent purchase. You can get
the latest ordered values with an aggregate subquery:
SELECT max(ordered) FROM sales GROUP BY customerid;
max
2023-05-15 00:46:00.864446
2023-05-25 00:42:26.783461
2023-05-16 05:27:53.810977
2023-05-06 01:40:02.346894
2023-05-19 07:41:25.104524
2023-05-07 19:01:06.756387
~ 269 rows ~
If you count the rows, you may find that the main query returned fewer rows
than the subquery. That would happen if there were some NULL ordered
datetimes. At some point, we should learn to ignore these, either by filtering
them out or removing them altogether.
The question is, why weren’t those sales included in the full query? And the
answer is that it’s all about the IN() operator.
Remember in Chapter 3, we discussed the NOT IN quirk. The discussion
also applies to a plain IN. The NULL datetimes in the subquery would result in
the equivalent of testing WHERE ordered=NULL, which, as we all know,
always fails.
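You can demonstrate the quirk with a toy table in Python's `sqlite3` — a single NULL is enough to change the behavior of both `IN` and `NOT IN`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (v INT);
    INSERT INTO t VALUES (1), (2), (NULL);
""")

# v IN (...): the row with a NULL v can never match,
# so it's quietly filtered out
in_rows = conn.execute(
    "SELECT v FROM t WHERE v IN (SELECT v FROM t) ORDER BY v"
).fetchall()

# v NOT IN (...): a NULL in the list poisons every
# comparison, so NO row passes at all
not_in = conn.execute(
    "SELECT v FROM t WHERE v NOT IN (SELECT v FROM t WHERE v IS NULL)"
).fetchall()

print(in_rows)  # only 1 and 2 survive
print(not_in)   # empty
```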
Now that we have sales for each customer, it’s a simple matter to join that to
the customers table to get more details:
SELECT *
FROM sales JOIN customers ON
    sales.customerid=customers.id
WHERE ordered IN(
    SELECT max(ordered) FROM sales GROUP BY customerid
);
You can now extract any customer or sales data you might want to work
with.
Duplicated Customers
We’ve seen in Chapter 2 how to find duplicates. Suppose, for example, you want
to find duplicate customer names:
SELECT
givenname||' '||familyname AS fullname,
-- MSSQL: givenname+' '+familyname AS fullname,
count(*) as occurrences
FROM customers
GROUP BY familyname, givenname
HAVING count(*)>1;
You get
fullname occurrences
Judy Free 2
Annie Mate 2
Mary Christmas 2
Ken Tuckey 2
Corey Ander 2
Ida Dunnit 2
Paul Bearer 2
Terry Bell 2
/* Note
================================================
MSSQL: Use givenname+' '+familyname
================================================
*/
SELECT *
FROM customers
WHERE givenname||' '||familyname IN (
SELECT givenname||' '||familyname FROM
customers
GROUP BY familyname, givenname
HAVING count(*)>1
);
This will give us the rest of the customer details. The reason we had to
concatenate the customers’ names is that you can only have a single column in
the IN() expression.
SELECT
id, title,
CASE
WHEN price<13 THEN 'cheap'
WHEN price<=17 THEN 'reasonable'
WHEN price>17 THEN 'expensive'
END AS price_group
FROM books;
You’ll get something like this:
id title price_group
2078 The Duel Cheap
503 Uncle Silas Reasonable
2007 North and South Expensive
702 Jane Eyre Expensive
1530 Robin Hood, The Prince of Thieves Cheap
1759 La Curée Reasonable
~ 1201 rows ~
Now, suppose you want to summarize the table. The problem is that you
can’t do this:
We’ve commented out the columns we’re not grouping, but it still won’t
work because of that pesky clause order thing: the alias price_group is
created in the SELECT clause which comes after the GROUP BY clause, so it’s
not available for grouping. Of course, you can then reproduce the calculation in
the GROUP BY clause:
price_group num_books
expensive 320
[NULL] 105
reasonable 467
cheap 309
Remember that the default fall through for the CASE expression is NULL.
Those books which are unpriced will end up in the NULL price group.
Depending on the DBMS, you’ll see this somewhere in the result set as a
separate group.
Remember that a SELECT statement generates a virtual table. As such, it can
be used in a FROM clause in the form of a subquery.
Note that there’s a special requirement for a FROM subquery: it must have an
alias, even if you’ve no plans to use it. We have no special plans here, so it’s just
called sq (“SubQuery”) for no particular reason. If you want to, say, join the
subquery with another table or virtual table, then the alias will be useful.
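Here's the price-group summary as a FROM subquery, runnable end to end with invented prices in Python's `sqlite3`. Note the mandatory `sq` alias:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY,
                        title TEXT, price REAL);
    INSERT INTO books VALUES
        (1, 'A', 10.0), (2, 'B', 15.0),
        (3, 'C', 20.0), (4, 'D', NULL);
""")

# Prepare the data in the subquery, summarize in the outer
# query. The FROM subquery must be aliased (here: sq).
rows = conn.execute("""
    SELECT price_group, count(*) AS num_books
    FROM (
        SELECT CASE
                 WHEN price<13  THEN 'cheap'
                 WHEN price<=17 THEN 'reasonable'
                 WHEN price>17  THEN 'expensive'
               END AS price_group
        FROM books
    ) AS sq
    GROUP BY price_group
""").fetchall()
print(rows)  # the unpriced book falls into the NULL group
```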
Nested Subqueries
A subquery is a SELECT statement with its own FROM clause. In turn, that
FROM clause might be from another subquery. If you have a subquery within a
subquery, it’s a nested subquery.
For example, let’s look at duplicate customer names again. You can find
candidates with the following aggregate query:
They’re just the names. Suppose you want more details. For that, you can
join the customers table with the preceding query:
SELECT
c.id, c.givenname, c.familyname, c.email
FROM customers AS c JOIN (
SELECT familyname, givenname
FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
) AS n ON c.givenname=n.givenname AND
c.familyname=n.familyname;
We’ve seen something like this before. You’ll now get the candidate
customers:
SELECT
givenname, familyname,
-- PostgreSQL, MSSQL:
string_agg(email,', ') AS email,
string_agg(cast(id AS varchar(3)),', ') AS ids
-- MariaDB/MySQL:
group_concat(email SEPARATOR ', ') AS email,
group_concat(cast(id AS varchar(3)) SEPARATOR ', ') AS ids
-- SQLite:
group_concat(email,', ') AS email,
group_concat(cast(id AS varchar(3)),', ') AS ids
-- Oracle:
listagg(email,', ') AS email,
listagg(cast(id AS varchar(3)),', ') AS ids
FROM ( -- previous SELECT as subquery
SELECT c.id, c.givenname, c.familyname, c.email
FROM customers AS c JOIN (
SELECT familyname, givenname
FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
) AS n ON c.givenname=n.givenname AND
c.familyname=n.familyname
) AS sq
GROUP BY familyname, givenname;
Now each duplicated name appears on a single row, with the matching emails
and ids gathered into lists.
Another way to use a subquery in the WHERE clause is with EXISTS:
SELECT ...
FROM ...
WHERE EXISTS(subquery);
The subquery will either return a result or not. If it does, then the WHERE
EXISTS is satisfied, and the row is passed; if it doesn’t, then the WHERE
EXISTS isn’t satisfied, and the row will be filtered.
For example, you can test the idea with the following statement:
Since 1=1 is always true, you’ll get all of the rows from the authors table.
Although you would normally only use FROM dual with Oracle, MariaDB
and MySQL also support this. In this case, MariaDB and MySQL don’t like the
WHERE clause without a FROM, so we’ve thrown it in to keep them happy.
Similarly, you can return nothing:
The subquery selects some rows, which is enough to satisfy the WHERE
clause, so you’ll get all the authors. If you had tried WHERE price<0, then
you’d get none of the authors.
This variation is, of course, simpler. However, it’s quite likely that, on the
inside, SQL does exactly the same thing, so how you write it is really a matter of
taste.
On the other hand, if you’re looking for authors without books (in our
catalogue), then it’s a different matter.
This won’t work:
SELECT * FROM authors
WHERE id NOT IN(SELECT authorid FROM books);
Well, technically, it will work, but not the way we would have wanted. Recall
again from Chapter 3 the “NOT IN quirk.” Since there are some NULLs in the
authorid column, the NOT IN operator eventually evaluates something like
... AND id=NULL AND .... The id=NULL always fails, and the ...
AND ... combines that failure with the rest and causes the whole expression to
fail.
Using WHERE NOT EXISTS will, however, work:
SELECT * FROM authors
WHERE NOT EXISTS(
    SELECT * FROM books WHERE books.authorid=authors.id
);
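A miniature demonstration — with invented authors and books in Python's `sqlite3` — of `NOT IN` failing and `NOT EXISTS` succeeding on the same data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY,
                        title TEXT, authorid INT);
    INSERT INTO authors VALUES (1,'Shelley'), (2,'Unpublished');
    INSERT INTO books VALUES
        (1,'Frankenstein',1), (2,'Anonymous Work',NULL);
""")

# NOT IN: the NULL authorid poisons the list -> no rows at all
not_in = conn.execute("""
    SELECT name FROM authors
    WHERE id NOT IN (SELECT authorid FROM books)
""").fetchall()

# NOT EXISTS: a NULL simply never matches -> works as intended
not_exists = conn.execute("""
    SELECT name FROM authors
    WHERE NOT EXISTS (SELECT 1 FROM books
                      WHERE books.authorid = authors.id)
""").fetchall()

print(not_in)      # empty, thanks to the NOT IN quirk
print(not_exists)  # the author without books
```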
You won’t see WHERE EXISTS much in the wild, since you can generally
do the same thing with either a join or the IN operator. However, there are
times when it has an advantage or is more intuitive: WHERE EXISTS can be
more expressive, and it’s particularly useful when NOT IN doesn’t work.
Still, if you’re working with one of these DBMSs, you might want to see
what it’s all about. In any case, there’s an alternative for most things,
especially with Common Table Expressions, in the next section.
SQLite does, however, have an interesting quirk with the WHERE clause,
which you’ll see if you hang around.
SELECT
id, title,
price, price*0.1 AS tax, price+tax AS inc
FROM books;
It won’t work. That’s because each column is independent of the rest. You
can’t use an alias as part of another calculation in the SELECT clause. We got
around this by calculating the inc column separately: price*1.1 AS inc.
It gets worse if you try something like this:
SELECT
id, title,
price, price*0.1 AS tax
FROM books
WHERE tax>1.5;
Here, the problem is that the SELECT clause is evaluated after the WHERE
clause, so the aliased calculation for tax isn’t available yet in the WHERE
clause. Again, we could recalculate the value in the WHERE clause: WHERE
price*1.1>1.5.
Except with SQLite. You can indeed use aliases in the WHERE clause and
also in the GROUP BY clause.
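You can verify the SQLite quirk directly from Python's `sqlite3` module (the books here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY,
                        title TEXT, price REAL);
    INSERT INTO books VALUES (1,'A',10.0), (2,'B',20.0);
""")

# SQLite lets the WHERE clause reuse the SELECT alias,
# which most other DBMSs would reject:
rows = conn.execute("""
    SELECT title, price*0.1 AS tax
    FROM books
    WHERE tax > 1.5
""").fetchall()
print(rows)  # only the book whose tax exceeds 1.5
```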
Finally, if, for example, you want to get multiple columns from a subquery in
the SELECT clause, this won’t work either:
SELECT
id, title,
(SELECT givenname, othernames, familyname
FROM authors WHERE authors.id=books.authorid)
FROM books;
A subquery in the SELECT clause can only return one value, which is all
right if you concatenate the names and then return the result. Otherwise, you’re
stuck with three subqueries, which is both costly and tedious.
SQL can solve this by applying a subquery to each row. This is called a
LATERAL JOIN in some DBMSs, or an APPLY in some others.
Adding Columns
In the first two examples earlier, you can use an expression like this:
Note .
The subquery must be given an alias, even though it’s not used.
PostgreSQL, MySQL, and MSSQL allow you to put the column aliases in
the subquery aliases instead: (SELECT price*0.1) AS sq(tax).
Not Oracle.
The example for PostgreSQL and MySQL uses the dummy condition ON
true. MySQL will allow you to leave this out, but PostgreSQL requires it.
Note in particular that the second subquery will happily calculate the
expression price+tax AS inc. This is because the subqueries are evaluated
one after the other, so the expressions can accumulate.
The LATERAL or CROSS APPLY subquery is applied to every row of the
main query. In principle, that could be pretty expensive, but, as it turns out, it’s
not so bad. It’s particularly useful if you need to include a series of intermediate
steps in a more complex calculation—it’s easy to understand and easy to
maintain.
SQL also has a type of join called CROSS JOIN. In a cross join, each
row of one table is joined with each row of the other table. This result is also
known as a Cartesian product. That’s a lot of combinations, and it’s usually
not what you want.
A CROSS APPLY is not the same thing, though it is a type of join. It’s
closer to an OUTER JOIN.
You’ll see a use for a cross join later when we cross join with a single row
virtual table.
Multiple Columns
As we noted, SQL won’t let you fetch multiple columns from a single subquery
in the SELECT clause, because everything in the SELECT clause is supposed to
be scalar—a single value.
However, you can fetch multiple columns if the context is table-like, such as
in the FROM clause. For example:
In this case, you can just as readily use a normal outer join to get the same
results:
SELECT
books.id, title,
givenname, othernames, familyname,
home
FROM books LEFT JOIN authors ON
authors.id=books.authorid;
The latter form is definitely simpler (we’ve left off the table aliases for
simplicity and qualified the books.id column out of necessity).
On the other hand, if the subquery is an aggregate query, the lateral join is
convenient, since you’re going to need a subquery anyway: remember you can’t
mix aggregate and non-aggregate data in a single SELECT statement.
For example, suppose you want a list of customers with the total sales for
each customer. You’ll need an aggregate query to get the totals, joined to the
customers table. You could do this:
Syntax
A Common Table Expression is defined as part of the query, before the main
part:
WITH cte AS (subquery)
SELECT columns FROM ...;
The CTE is given a name, though not necessarily cte, of course. Thereafter, it
is used as a normal table in the main query. You can define multiple CTEs as
follows:
WITH
cte AS (subquery),
another AS (subquery)
SELECT columns FROM ...;
-- Prepare Data
WITH sq AS (
SELECT
id, title,
CASE
WHEN price<13 THEN 'cheap'
WHEN price<=17 THEN 'reasonable'
WHEN price>17 THEN 'expensive'
END AS price_group
FROM books
)
-- Use Prepared Data
SELECT price_group, count(*) AS num_books
FROM sq
GROUP BY price_group;
It doesn’t look much different, but the important part is that you now have
your query in two parts: the first part defines the subquery, and the second uses
it. It’s a much better way of organizing your code.
The subquery has been transferred to a CTE at the beginning of the query.
From there on, the main SELECT statement references the CTE as if it were just
another table.
The advantage is that the query is written according to the plan: first prepare
the data, and then use the data.
MSSQL currently doesn’t require a semicolon at the end of a statement,
but you should be in the habit of using it anyway.
Just use the semicolon at the end of every statement, and all will be fine.
Don’t fall for this nonsense:
;WITH (...)
Here’s another example, which we’ll use further in the next few chapters. If you
look at the sales table:
If you want to summarize the table, such as to get monthly totals, the data is
too fine-detailed. Instead, you can prepare the data by formatting the ordered
as a year-month value:
WITH salesdata AS (
SELECT
-- PostgreSQL, Oracle
to_char(ordered,'YYYY-MM') AS month,
-- MariaDB/MySQL
-- date_format(ordered,'%Y-%m') AS month,
-- MSSQL
-- format(ordered,'yyyy-MM') AS month,
-- SQLite
-- strftime('%Y-%m',ordered) AS month,
total
FROM sales
)
SELECT month, sum(total) AS monthly_total
FROM salesdata
GROUP BY month
ORDER BY month;
month monthly_total
2022-05 6966.50
2022-06 12733.00
2022-07 17314.00
2022-08 19093.00
2022-09 20295.50
2022-10 27797.50
~ 14 rows ~
In real life, much of what you want to summarize isn’t in the right form, but
you can prepare it in a CTE to get it ready.
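Here's the same prepare-then-summarize pattern as a runnable miniature, using invented sales and SQLite's `strftime()` via Python's `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (ordered TEXT, total REAL);
    INSERT INTO sales VALUES
        ('2022-05-03 10:00:00',  5.0),
        ('2022-05-20 12:00:00', 10.0),
        ('2022-06-01 09:00:00', 20.0);
""")

# First prepare the data (year-month), then use it
rows = conn.execute("""
    WITH salesdata AS (
        SELECT strftime('%Y-%m', ordered) AS month, total
        FROM sales
    )
    SELECT month, sum(total) AS monthly_total
    FROM salesdata
    GROUP BY month
    ORDER BY month
""").fetchall()
print(rows)  # one total per month
```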
We’ll have another look at CTEs in Chapter 9, where we’ll see more
techniques we can apply.
Summary
In this chapter, we’ve had a look at using variations on subqueries in a query.
We’ve already seen some subqueries in previous chapters, but here we had a
closer look at how they work.
Subqueries can be used in any clause. The results of the subquery must
match the context of the clause:
Subqueries in the SELECT clause or in simple WHERE expressions need to
return a single value.
Subqueries used in an IN() expression need to return a single column.
Subqueries used in the FROM clause need to return a virtual table.
You can also use subqueries in the ORDER BY clause, though you’d
probably want to use the expression in the SELECT clause instead.
You can also use subqueries with the WHERE EXISTS expression or in
LATERAL joins.
Subqueries in the FROM clause can be nested, though you would probably
want to use a Common Table Expression instead.
Coming Up
In Chapter 5, we had a look at aggregating data. Generally, aggregate values
can’t be mixed with non-aggregate values without throwing a few subqueries
into the mix.
Window functions are a group of functions which do the job of applying
subqueries to each row. There are two main groups of window functions:
The aggregate functions can be used to apply an aggregate to each row of a
non-aggregate query. They can also be used to accumulate or aggregate in
groups.
The sequencing functions can be used to generate a value based on the
position of the row in the dataset. They can be used to indicate the row
position or some grouping. They can also be used to fetch values from other
rows.
With window functions, you’ll be able to generate datasets which combine
plain data with more analytical data.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
M. Simon, Leveling Up with SQL
https://doi.org/10.1007/978-1-4842-9685-1_8
8. Window Functions
Mark Simon
(1) Ivanhoe, VIC, Australia
You’ll notice that there’s a foreign key from the saleitems table to the
sales table, which would normally disallow deleting the sales if there are any
items attached. However, if you check the script which generates the sample
database, you’ll notice the ON DELETE CASCADE clause, which will
automatically delete the orphaned sale items.
The important part is the OVER() clause which generates the window to be
summarized.
There are three main window clauses:
PARTITION BY: This calculates the function for the group defined. It is
equivalent to GROUP BY.
The default partition is the whole table.
ORDER BY: This calculates the function cumulatively, in the order defined. In
other words, it generates running totals.
This order does not need to be the same as the table’s ORDER BY clause.
There is also an optional framing clause. This creates a sliding window within
the partition.
The framing clause requires an ORDER BY window clause. By default,
the frame is the rows from the beginning to the current row, but that needs to
be qualified when we get to that.
In the following samples, there is normally an ORDER BY clause at the end
of the SELECT statement, which is the same as what’s in the OVER() clause.
This isn’t necessary, but it makes the results easier to follow.
SELECT
id, givenname, familyname,
count(*)
FROM customerdetails;
SELECT
id, givenname, familyname,
count(*) OVER ()
FROM customerdetails;
The OVER() clause changes the aggregate function into a window function.
This aggregate function will now be generated for each row. You’ll see later
that the OVER() clause defines any grouping, known as partitions, the order,
and the number of rows to be considered in the aggregate.
For such a simple case, you can get the same result with a subquery:
SELECT
id, givenname, familyname,
(SELECT count(*) FROM customers)
FROM customerdetails;
The window function becomes more interesting when you apply one of the
window clauses. For example:
SELECT
id, givenname, familyname,
count(*) OVER (ORDER BY id)
FROM customerdetails;
This will give the running count up to and including the current row, in order
of id. The actual table results may or may not be in row order, especially if you
include other expressions, so it’s better to add that to the end:
SELECT
id, givenname, familyname,
count(*) OVER (ORDER BY id) AS running_count
FROM customerdetails
ORDER BY id;
The running_count column looks very much like a simple row number.
We’ll see later that it’s not necessarily the same if the ORDER BY column isn’t
unique.
Aggregate Functions
Normally, you can’t use aggregate functions in a normal query unless you
squeeze them into a subquery. However, they can be repurposed as window
functions.
Previously, you saw that you can use the expression count(*) OVER ()
to give the total number on every row. You can also do something similar with
the sum() or avg() functions.
For example, suppose you want to compare sales totals with the overall
average:
SELECT
id, ordered, total,
total-avg(total) OVER () AS difference
FROM sales;
-- PostgreSQL: Sunday=0
SELECT
EXTRACT(dow FROM ordered) AS weekday_number,
total
FROM sales;
-- MSSQL: Sunday=1
SELECT
datepart(weekday,ordered) AS weekday_number,
total
FROM sales;
-- Oracle: Sunday=1
SELECT
to_char(ordered,'D')+0 AS weekday_number,
total
FROM sales;
-- MariaDB/MySQL: Sunday=1
SELECT
dayofweek(ordered) AS weekday_number,
total
FROM sales;
-- SQLite: Sunday=0
SELECT
strftime('%w',ordered) AS weekday_number,
total
FROM sales;
You’ll see they all have a different way to do it, and they can’t even agree on
the day number. Fortunately, they all agree on the first day of the week:
weekday_number Total
0 28
1 34
1 58.5
1 50
1 17.5
0 13
~ 5549 rows ~
WITH
data AS (
SELECT
... AS weekday_number,
total
FROM sales
)
-- to be done
;
WITH
data AS (
SELECT
... AS weekday_number,
total
FROM sales
),
summary AS (
SELECT weekday_number, sum(total) AS total
FROM data
GROUP BY weekday_number
)
-- etc
Finally, you can compare the daily totals to the grand totals using a window
aggregate:
WITH
data AS (...),
summary AS (...)
SELECT
weekday_number, total,
total/sum(total) OVER()
FROM summary
ORDER BY weekday_number;
3 45959.5 0.141
4 47528 0.145
5 42372.5 0.13
6 48415.5 0.148
WITH
data AS (...),
summary AS (...)
SELECT
weekday_number, total,
100*total/sum(total) OVER() AS proportion
FROM summary
;
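To make this concrete, here is the whole pipeline as a self-contained sketch, run through SQLite from Python. The sales table and its values are invented stand-ins for the book’s sample database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (ordered TEXT, total REAL);
INSERT INTO sales VALUES
    ('2023-01-01', 28),    -- Sunday
    ('2023-01-02', 34),    -- Monday
    ('2023-01-02', 58.5),  -- Monday
    ('2023-01-08', 13);    -- Sunday
""")

# data extracts the weekday number, summary groups by it,
# and the window aggregate compares each group to the grand total.
rows = conn.execute("""
    WITH
    data AS (
        SELECT strftime('%w', ordered) AS weekday_number, total
        FROM sales
    ),
    summary AS (
        SELECT weekday_number, sum(total) AS total
        FROM data
        GROUP BY weekday_number
    )
    SELECT
        weekday_number, total,
        round(100.0 * total / sum(total) OVER (), 1) AS proportion
    FROM summary
    ORDER BY weekday_number
""").fetchall()
print(rows)
```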
If you want to display the percentage symbol, that’s up to the DBMS. You
can try one of the following:
-- PostgreSQL
to_char(100*total/sum(total) OVER(),'99.9%')
-- MariaDB/MySQL
format(100*total/sum(total) OVER(),2) || '%'
-- MSSQL
format(100*total/sum(total) OVER(),'0.0%')
-- SQLite: aka printf(...)
select format('%.1f%%',100*total/sum(total)
OVER())
-- Oracle
to_char(100*total/sum(total) OVER(),'99.9') || '%'
This looks more convincing:
We’ve used OVER() to calculate the grand total for the table. However, we
can also use a sliding window, as we’ll see in the next section.
SELECT
id, givenname, familyname,
count(*) OVER (ORDER BY id) AS running_count
FROM customerdetails
ORDER BY id;
In this example, the id, being the primary key, is unique. That will give us a
false idea of how this works, so let’s look at using the height, which is not
unique. We’ll also filter out the NULL heights to make it more obvious:
SELECT
id, givenname, familyname,
height,
count(*) OVER (ORDER BY height) AS running_count
FROM customerdetails
WHERE height IS NOT NULL
ORDER BY height;
You’ll see some repeated heights and how they affect the window function:
id givenname familyname Height running_count
597 Ike Andy 153 2
283 Ethel Glycol 153 2
451 Fred Knott 153.8 3
194 Rod Fishing 154.3 4
534 Minnie Bus 156.4 6
352 Basil Isk 156.4 6
~ 267 rows ~
When using ORDER BY in the OVER clause, it means count the number of
rows up to the current value. That may or may not be what you wanted.
That’s quite a mouthful, but that’s the way the SQL language is developing:
Why say something in two words if you can say it in twenty1?
Here, the word RANGE refers to the value of height. For example, in the
fifth row earlier, the value is the same as the next row, so count(*) includes
both.
The obvious alternative is
SELECT
id, givenname, familyname,
height,
count(*) OVER (ORDER BY height
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_count
FROM customerdetails
WHERE height IS NOT NULL
ORDER BY height;
It’s a little bit unfair: two customers on the same height are arbitrarily
positioned one before the other. We’ll see more of this unfairness later.
The framing clause can take one of the following forms:
ROWS|RANGE start
ROWS|RANGE BETWEEN start AND end
As we saw, the difference between ROWS and RANGE is that RANGE includes
all the rows which match the current value, while ROWS doesn’t.
The start and end expressions, a.k.a. the frame borders, can take one of
the following forms:
Expression Meaning
UNBOUNDED PRECEDING The beginning of the partition
n PRECEDING n rows before the current row
CURRENT ROW The current row
n FOLLOWING n rows after the current row
UNBOUNDED FOLLOWING The end of the partition
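Here is a runnable sketch of the ROWS/RANGE difference, using SQLite from Python with a tiny invented customers table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, height REAL);
INSERT INTO customers VALUES
    (1, 153), (2, 153), (3, 153.8), (4, 156.4), (5, 156.4);
""")

# The default frame is RANGE, so tied heights share a count;
# a ROWS frame counts strictly row by row.
rows = conn.execute("""
    SELECT
        id, height,
        count(*) OVER (ORDER BY height) AS range_count,
        count(*) OVER (ORDER BY height
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_count
    FROM customers
    ORDER BY height, id
""").fetchall()
print(rows)
```

Within a tie, which of the two rows gets the lower `rows_count` is arbitrary, which is exactly the unfairness the text describes.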
SELECT
ordered_date,
-- PostgreSQL, Oracle
to_char(ordered_date,'YYYY-MM') AS ordered_month,
-- MariaDB/MySQL
-- date_format(ordered_date,'%Y-%m') AS ordered_month,
-- MSSQL
-- format(ordered_date,'yyyy-MM') AS ordered_month,
-- SQLite
-- strftime('%Y-%m',ordered_date) AS ordered_month,
sum(total) AS daily_total
FROM sales
WHERE ordered IS NOT NULL
GROUP BY ordered_date;
SELECT
ordered_date, daily_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS 6 PRECEDING) AS week_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS UNBOUNDED PRECEDING) AS running_total
FROM daily_sales
ORDER BY ordered_date;
For both framing clauses, we’ve used the shorter form, since we want to go
up to the current row. We could have left off the framing clause altogether for the
running total, but we needed to change from the default RANGE BETWEEN just
in case two daily totals were the same.
You’ll get something like the following:
Note that for the first seven days, the week and running totals are the same,
because there are no totals from before then. However, from there on, the
running total keeps accumulating while the week total is clamped to the current
seven days.
If you look hard enough, you may also see some gaps in the dates. That
means that there were no sales on those days, which can also make the results
harder to interpret, since one row is not necessarily one day. We’ll address
that problem in Chapter 9.
Remember, you’re not limited to the count() and sum() functions. For
example, you can create sliding averages as well:
SELECT
ordered_date, daily_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS 6 PRECEDING) AS week_total,
avg(daily_total) OVER(ORDER BY ordered_date
ROWS 6 PRECEDING) AS week_average,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS UNBOUNDED PRECEDING) AS running_total
FROM daily_sales
ORDER BY ordered_date;
The week average is the average over the seven days including the current
day:
You can also select sliding minimums and maximums or averages so far.
You’ll have to decide which of them is useful for your own purposes.
SELECT
ordered_date, daily_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS 6 PRECEDING) AS week_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS UNBOUNDED PRECEDING) AS running_total,
sum(daily_total) OVER(PARTITION BY ordered_month)
AS monthly_total
FROM daily_sales
ORDER BY ordered_date;
sum(daily_total) OVER(
PARTITION BY ordered_month
ORDER BY ordered_date ROWS UNBOUNDED PRECEDING
) AS month_running_total
SELECT
ordered_date, daily_total,
sum(daily_total) OVER(ORDER BY ordered_date
ROWS UNBOUNDED PRECEDING) AS running_total,
sum(daily_total) OVER(PARTITION BY ordered_month)
AS month_total,
sum(daily_total) OVER(ORDER BY ordered_month)
AS running_month_total,
sum(daily_total) OVER(PARTITION BY ordered_month
ORDER BY ordered_date ROWS UNBOUNDED PRECEDING)
AS month_running_total
FROM daily_sales
ORDER BY ordered_date;
You’ll see something like this (the column names have been abbreviated to
fit in the page):
The names may be somewhat confusing, so here’s a table of what’s going on:
(Again, the column names have been abbreviated to make it all fit.)
Notice how we’re using the group column ordered_month both to
partition and for a running total. Because its default frame is RANGE ..., it
will produce the total for all of the values so far, which effectively is a total for
the whole month. This is the sort of thing you can expect if you order by a non-
unique row.
The hardest part of it all is thinking of good names for the results.
As summaries, these are all good candidates for saving as a view.
Note, however, that in SQL Server only, you cannot include an ORDER
BY clause in a view without additional trickery. As a result, you should at
least make sure that your SELECT statement includes the columns you want
to order by, and then include the ORDER BY clause when using the view.
-- customer_sales
SELECT c.id AS customerid, c.state, c.town, total
FROM customerdetails AS c JOIN sales AS s
ON c.id=s.customerid
We’ll then want to summarize the data by grouping by state, town, and
customer id. Again, that will go into another CTE:
-- totals
SELECT state, town, customerid, sum(total) AS total
FROM customer_sales
GROUP BY state, town, customerid
WITH
customer_sales AS (
SELECT c.id AS customerid, c.state, c.town,
total
FROM customerdetails AS c JOIN sales AS s
ON c.id=s.customerid
),
totals AS (
SELECT state, town, customerid, sum(total) AS total
FROM customer_sales
GROUP BY state, town, customerid
)
SELECT state, town, customerid, total AS customer_total
FROM totals
ORDER BY state, customerid;
Now for the window functions. First, to get the group total by state, we can
use sum(total) OVER(PARTITION BY state).
To get the group total per town, remember that the town name can appear in
more than one state. To use PARTITION BY town would be a mistake, as the
town names would be conflated. Instead, we use sum(total)
OVER(PARTITION BY state, town).
WITH
customer_sales AS (
SELECT c.id AS customerid, c.state, c.town,
total
FROM customerdetails AS c JOIN sales AS s
ON c.id=s.customerid
),
totals AS (
SELECT state, town, customerid, sum(total) AS total
FROM customer_sales
GROUP BY state, town, customerid
)
SELECT
state, town, customerid, total AS customer_total,
sum(total) OVER(PARTITION BY state) AS state_total,
sum(total) OVER(PARTITION BY state, town) AS town_total
FROM totals
ORDER BY state, customerid;
SELECT
id, givenname, familyname,
height,
count(*) OVER (ORDER BY height
ROWS UNBOUNDED PRECEDING) AS running_count
FROM customers
WHERE height IS NOT NULL
ORDER BY height;
SELECT
id, givenname, familyname,
height,
row_number() OVER (ORDER BY height) AS running_count
FROM customers
WHERE height IS NOT NULL
ORDER BY height;
SELECT
id, givenname, familyname,
height,
row_number() OVER (ORDER BY height) AS row_number,
count(*) OVER (ORDER BY height) AS count,
rank() OVER (ORDER BY height) AS rank,
dense_rank() OVER (ORDER BY height) AS dense_rank
FROM customers
WHERE height IS NOT NULL
ORDER BY height;
-- PostgreSQL, SQLite
SELECT
id, givenname, familyname,
height,
row_number() OVER () AS row_number,
count(*) OVER () AS count,
rank() OVER () AS rank,
dense_rank() OVER () AS dense_rank
FROM customers
WHERE height IS NOT NULL
ORDER BY height;
SELECT
id, ordered_date, total,
row_number() OVER (PARTITION BY ordered_date) AS row_number
FROM sales
ORDER BY ordered;
The row numbers may not be in the expected order, since the order wasn’t
specified. To finish the job, we should also include that:
SELECT
id, ordered_date, total,
row_number() OVER (
PARTITION BY ordered_date ORDER BY ordered
) AS row_number
FROM sales
ORDER BY ordered;
You can use the group row number in a creative way. For example, you
might want to show the date for only the first sale for the day. You can show the
date selectively using a CASE ... END expression:
CASE
WHEN row_number() OVER
(PARTITION BY ordered_date ORDER BY
ordered)=1
THEN CAST(ordered_date AS varchar(16))
ELSE ''
END AS ordered_date,
SELECT
id,
CASE
WHEN row_number() OVER
(PARTITION BY ordered_date ORDER BY
ordered)=1
THEN CAST(ordered_date AS varchar(16))
ELSE ''
END AS ordered_date,
row_number() OVER (PARTITION BY ordered_date) AS item,
total
FROM sales
ORDER BY ordered;
Paging Results
One reason why you might want the overall row number is that you might want
to break up your results into pages. For example, suppose you want your results
in pages of, say, twenty, and you now want to display page 3 of that.
We can start with our pricelist view and include the row_number()
window function:
SELECT
id, title, published, author,
price, tax, inc,
row_number() OVER(ORDER BY id) AS row_number
FROM aupricelist;
WITH cte AS (
SELECT
id, title, published, author,
price, tax, inc,
row_number() OVER(ORDER BY id) AS row_number
FROM aupricelist
)
SELECT *
FROM cte
WHERE row_number BETWEEN 41 AND 60
ORDER BY id;
Oracle has a built-in value called rownum. Sadly, you still need to use it
from a CTE or a subquery.
Of course, you don’t have to order by the id. You can use the title, or the
price, as long as you include it in both the window function and in the ORDER
BY clause. And, of course, you can also use DESC.
There is an alternative way to do this. Officially, you can use the OFFSET
... FETCH ... clause:
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY
This skips over the first 40 rows and fetches the next 20 rows after that.
Unofficially, some DBMSs support LIMIT ... OFFSET:
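For example, in SQLite (here driven from Python, with an invented books table standing in for the price list), page 3 of 20 rows per page looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO books VALUES (?, ?)",
                 [(i, f"Book {i}") for i in range(1, 101)])

# Page 3, 20 rows per page: skip the first 40 rows, take the next 20.
page = conn.execute(
    "SELECT id, title FROM books ORDER BY id LIMIT 20 OFFSET 40"
).fetchall()
print(page[0], page[-1])
```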
Of course, these two alternatives are much simpler than using the window
function technique, but there is an advantage with using the window function.
Suppose you’re sorting by something non-unique, such as the price. The
problem with the normal paging techniques, including the row_number()
earlier, is that the page stops strictly at the number of rows (or less if there are no
more).
If you decide to keep the prices together, you can instead use something like
WITH cte AS (
SELECT
id, title, published, author,
price, tax, inc,
rank() OVER(ORDER BY price) AS rank
FROM aupricelist
)
SELECT *
FROM cte
WHERE rank BETWEEN 41 AND 60
ORDER BY price;
As long as the groupings aren’t too big, it should give you nearly the same
results, but with all the books of one price together.
SELECT
id, givenname, familyname, height,
ntile(10) OVER (order by height) AS decile
FROM customers
WHERE height IS NOT NULL;
In this sample, you’ll see that three customers have the same height (167.1),
but one of them didn’t fit in the earlier decile, so was pushed into the next.
That’s more of the unfairness mentioned earlier, as is due to the fact that
ntiles are calculated purely on the row number and the value.
If you were, for example, awarding prizes or discounts to customers in
certain deciles, it would be unfair to miss out just because the sort order is
unpredictable.
This might be a deal breaker, if you rely on the ntile. There is, however, a
workaround.
We’ll call this value bin, which is a common statistical name for groups.
We can put that into a CTE and run the following:
SQLite doesn’t have a floor() function, but you can use cast(... AS
int) instead:
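The book’s bin query isn’t reproduced above. One way to build a value-stable bin, so that equal heights always land in the same bin, is to scale rank() by the row count. This is a sketch of the idea, not necessarily the author’s exact formula; it relies on SQLite’s integer division in place of floor():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, height REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, 150), (2, 155), (3, 155), (4, 160), (5, 165),
                  (6, 165), (7, 170), (8, 175), (9, 180), (10, 185)])

# rank() gives tied heights the same rank, so scaling it into
# 5 bins keeps equal values together; integer division acts as floor().
rows = conn.execute("""
    WITH ranked AS (
        SELECT id, height,
               rank() OVER (ORDER BY height) AS rnk,
               count(*) OVER () AS n
        FROM customers
    )
    SELECT id, height,
           5 * (rnk - 1) / n + 1 AS bin
    FROM ranked
    ORDER BY height, id
""").fetchall()
print(rows)
```

Unlike ntile(), this can produce slightly uneven bin sizes, but no customer is split away from others with the same height.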
Here, as well as the OVER clause, we need to supply two values. The
column value refers to which data in the other row you want. The number
value refers to how many rows back or forward to get it from. If you want, you
can leave it out, in which case it will default to 1.
For example, suppose you want to look at sales for each day, as well as for
the previous and next days. You can write
SELECT
ordered_date, daily_total,
lag(daily_total) OVER (ORDER BY ordered_date)
AS previous,
lead(daily_total) OVER (ORDER BY ordered_date)
AS next
FROM daily_sales
ORDER BY ordered_date;
You’ll see:
You’ll notice that the previous for the first row is NULL; so is the next
for the last row.
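If those NULLs get in the way, lag() and lead() also accept an optional third argument: a default value to use when the offset runs off the edge of the window. A minimal SQLite sketch, with an invented daily_sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_sales (ordered_date TEXT, daily_total REAL);
INSERT INTO daily_sales VALUES
    ('2022-04-08', 97.5), ('2022-04-09', 96), ('2022-04-10', 191);
""")

# The third argument (0) replaces the NULL that lag() would
# otherwise return on the first row.
rows = conn.execute("""
    SELECT
        ordered_date, daily_total,
        lag(daily_total, 1, 0) OVER (ORDER BY ordered_date) AS previous
    FROM daily_sales
    ORDER BY ordered_date
""").fetchall()
print(rows)
```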
You might think that’s a bit pointless if you can just move your eyes to look
up or down a row. However, you can also incorporate the lag or lead in a
calculation. For example, suppose you want to compare sales for each day to a
week before. You could use
SELECT
ordered_date, daily_total,
lag(daily_total,7) OVER (ORDER BY ordered_date)
AS last_week,
daily_total
- lag(daily_total,7) OVER (ORDER BY ordered_date)
AS difference
FROM daily_sales
ORDER BY ordered_date;
This results in
Here, the expression lag(total,7) gets the value for seven rows before.
As you’d expect, the first seven rows have NULL for the value.
There are two important conditions if you want to use lag or lead
meaningfully:
There must be only one row for each instance you want to test. For example,
you can’t have two rows with the same date.
There must be no gaps. For example, there can’t be a missing date.
That’s because we’re interpreting each row as one day. If you’re just working
with a sequence or sales regardless of the date, it won’t matter.
If you look carefully (and patiently) through the data, you will find that there
are a few missing dates. That means that the previous row isn’t always
“yesterday,” and the seven rows previous isn’t always “last week.” We’ll see
how to plug these gaps in Chapter 9.
Summary
Window functions are functions which give a row-by-row value based on a
“window” or a group of rows.
Window functions include
Aggregate functions
These include all of the major nonwindow aggregate functions, such as
count() and sum().
Ranking functions and grouping
These include row_number(), rank(), and dense_rank() to
generate a position, as well as ntile() to generate ordered groups.
Functions which fetch data from other rows
These include lag() and lead().
Window Clauses
A window function features an OVER() clause:
Coming Up
In Chapter 7, we’ve already discussed how Common Table Expressions work. In
fact, we’ve used them pretty extensively throughout the book.
In the next chapter, we’ll have another look at CTEs and examine some of
their more sophisticated features. In particular, we’ll have a look at the dreaded
recursive CTE.
Footnotes
1 You’ll see this sort of thing in all of the newer features in SQL. You might say that SQL is the new
COBOL.
COBOL was (and still is) an early programming language which was supposed to appeal to less
mathematical business programmers. It is noted for its verbosity.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
M. Simon, Leveling Up with SQL
https://doi.org/10.1007/978-1-4842-9685-1_9
You have already made use of CTEs to prepare data for use in aggregates and
other operations.
Here, we will take a further look at some of the more powerful features of
CTEs.
CTEs As Variables
In Chapter 4, we tested some calculations with a test value:
WITH vars AS (
SELECT ' abcdefghijklmnop ' AS string
-- FROM dual -- Oracle
)
SELECT
string,
-- sample string functions
FROM vars;
Later in this chapter, we’ll see a more sophisticated version of this technique
when we look at table literals. For now, let’s look at how we can use this.
Some DBMSs as well as all programming languages have a concept of
variables. A variable is a temporary named value. Where the DBMS supports it,
you declare a variable name and assign a value which you use in a subsequent
step. For example, in MSSQL, you can write this:
-- MSSQL
DECLARE @taxrate decimal(4,2);
SET @taxrate = 12.5;
SELECT
id, title,
price, price*@taxrate/100 AS tax
FROM books;
To run this, you would need to highlight all of the statements and run in one
go.
This chapter won’t focus on these variables, but you’ll see more on using
variables in Chapter 10. Instead, we’ll have a look at using a common table
expression to do a similar job.
Strictly speaking, what we’re going to use is not variables but constants,
which means that we will set their value once only. However, we can get away
with using the looser term “variable,” as it’s more generic.
There are two main benefits to defining variables:
You can specify an arbitrary value once, but use it multiple times.
You move arbitrary values to a preparation section.
In the preceding CTE example, where we’re not working with real data, we
simply selected from the CTE itself. In more realistic examples, we will cross
join the CTE with other tables.
WITH vars AS (
SELECT 0.1 AS taxrate
-- FROM dual -- Oracle
)
We can now combine the CTE with the books table, using a simple cross
join:
WITH vars AS (
SELECT 0.1 AS taxrate
-- FROM dual -- Oracle
)
SELECT * FROM books, vars;
A cross join combines every row from one table to every row from another.
Since the vars CTE only has one row, the cross join simply has the effect of
adding another column to the books table.
SQL has a more modern syntax for a cross join: books CROSS JOIN
vars. Here, we’ll use the older syntax because it’s simpler and more readable.
We can now calculate the price list with tax:
This gives us
Of course, we could just as readily have used 0.1 instead of the taxrate
and dispensed with the CTE and the cross join. However, the CTE has the
benefit of allowing us to set the tax rate once at the beginning, where it’s easy to
maintain and can be used multiple times later.
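Putting it together as a runnable sketch (SQLite from Python, with an invented books table; the rounding is just for tidy output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (id INTEGER, title TEXT, price REAL);
INSERT INTO books VALUES (1, 'A', 10.0), (2, 'B', 20.0);
""")

# The vars CTE holds one row; the cross join adds taxrate
# as an extra column to every row of books.
rows = conn.execute("""
    WITH vars AS (SELECT 0.1 AS taxrate)
    SELECT id, title, price,
           round(price * taxrate, 2) AS tax,
           round(price * (1 + taxrate), 2) AS inc
    FROM books, vars
    ORDER BY id
""").fetchall()
print(rows)
```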
Deriving Constants
The values don’t need to be literal values. You can also derive the values from
another query. For example, to get the oldest and youngest customers, first set
the minimum and maximum dates in variables:
-- vars CTE
SELECT min(dob) AS oldest, max(dob) AS youngest
FROM customers
You can then cross join that with the customers table to get the matching
customers:
WITH vars AS (
SELECT min(dob) AS oldest, max(dob) AS youngest
FROM customers
)
SELECT *
FROM customers, vars
WHERE dob IN(oldest, youngest);
To get the shorter customers, you can set the average height in a variable:
This is the sort of thing you can’t do otherwise, because the average is an
aggregate.
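A runnable sketch of the average-height variable (SQLite from Python, with invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, height REAL);
INSERT INTO customers VALUES (1, 150), (2, 160), (3, 170), (4, 180);
""")

# The aggregate avg() goes into the vars CTE; the cross join then
# lets the WHERE clause compare each row against it.
shorter = conn.execute("""
    WITH vars AS (SELECT avg(height) AS avg_height FROM customers)
    SELECT id, height
    FROM customers, vars
    WHERE height < avg_height
    ORDER BY id
""").fetchall()
print(shorter)
```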
Customerid last_order
550 2023-04-18 09:18:51.933845
272 2023-04-28 09:15:17.85286
70 2023-04-19 14:00:44.880376
190 2023-04-09 10:12:53.416293
539 2023-04-22 16:14:16.173923
314 2023-04-11 03:33:57.825786
~ 269 rows ~
Here, we have two important pieces of data: the customer id and the date and
time of the most recent order. Using this in a subquery, we can join the results
with the customers and sales tables to get more details:
Note that the CTE was used to join the two tables and act as a filter. We don’t
actually need its results in the output.
-- cte
SELECT familyname, givenname FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
Here, customers are grouped by both names, and the groups are filtered for
more than one instance.
Putting that in a CTE, we can join that to the customers table:
WITH names AS (
SELECT familyname, givenname FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
)
SELECT
c.id, c.givenname, c.familyname,
c.email, c.phone
-- etc
FROM customers AS c
JOIN names ON c.givenname=names.givenname
AND c.familyname=names.familyname
ORDER BY c.familyname, c.givenname;
We’ve joined the CTE and the customers table using two columns and
included their email addresses and phone numbers (if any) so that we can chase
them up.
For the most part, it’s a matter of taste whether you do it this way or add the
aliases inside the CTE. If you do include the names, they will override any
aliases in the CTE.
One reason you might prefer CTE parameter names is if you think it’s more
readable, as you have all the names in one place. Later, we’ll be writing more
complex CTEs which involve multiple CTEs and unions, and it will definitely be
easier to follow with parameter names, so you’ll be seeing more of that style
from here on.
SELECT columns
FROM (
SELECT columns FROM table
) AS sq;
A CTE can make this more manageable by putting this subquery at the
beginning:
WITH cte AS (
SELECT columns FROM table
)
SELECT columns
FROM cte;
SELECT columns
FROM (
SELECT columns FROM (
SELECT columns FROM table
) AS sq1
) AS sq2;
That’s called nesting subqueries, and it can become a nightmare if things get
too complex.
Thankfully, CTEs work much more simply:
WITH
sq1 AS (SELECT columns FROM table),
sq2 AS (SELECT columns FROM sq1)
SELECT columns FROM sq2;
You can have multiple CTEs chained this way, as long as you remember to
separate them with a comma. As you see in this example, each subquery can
refer to a previous one in the chain.
We’ll build this up a little more later, and we’ll see that additional CTEs
don’t necessarily have to refer to the previous ones.
WITH names AS (
SELECT familyname, givenname FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
)
SELECT
c.id, c.givenname, c.familyname,
c.email, c.phone
FROM customers AS c
JOIN names ON c.givenname=names.givenname
AND c.familyname=names.familyname
ORDER BY c.familyname, c.givenname;
WITH
names AS (
SELECT familyname, givenname FROM customers
GROUP BY familyname, givenname HAVING count(*)>1
),
duplicates(givenname, familyname, info) AS (
SELECT
c.givenname, c.familyname,
cast(c.id AS varchar(5)) || ': ' ||
c.email
-- MSSQL: Use +
FROM customers AS c -- Oracle: No AS
JOIN names ON c.givenname=names.givenname
AND c.familyname=names.familyname
)
SELECT * from duplicates
ORDER BY familyname, givenname;
Note
The duplicates CTE has the parameter names for simplicity. There’s
no need to do that with the names CTE, as there are no calculated values;
however, you may want to do that for consistency.
Instead of listing the id separately, we’ve cast it to a string and
concatenated it to the email address. This is to get ready for what follows.
For simplicity, we’ve ignored the phone number, since it may be missing.
The next step is to consolidate them by combining the info column values:
WITH
names AS ( ),
duplicates(givenname, familyname, info) AS ( )
SELECT
givenname, familyname, count(*),
-- PostgreSQL, MSSQL
string_agg(info,', ') AS info
-- MySQL/MariaDB
-- group_concat(info SEPARATOR ', ') AS info
-- SQLite
-- group_concat(info,', ') AS info
-- Oracle
-- listagg(info,', ') AS info
FROM duplicates
GROUP BY familyname, givenname
ORDER BY familyname, givenname;
Recursive CTEs
As you’ve seen, a feature of using CTEs is that one CTE can refer to a previous
CTE. Another feature is that a CTE can refer to itself.
Anything which refers to itself is said to be recursive. If you’re a
programmer, recursive functions are functions which call themselves and are
very risky if not handled properly. Similarly, a recursive CTE can be very risky
if you’re not careful.
A recursive CTE takes one of two forms, depending on your DBMS: most
DBMSs use WITH RECURSIVE cte AS (...), while MSSQL and Oracle leave out
the RECURSIVE keyword and use WITH cte AS (...).
N
1
2
3
...
8
9
10
Generating a Sequence
We’ve already seen how to generate a sequence of numbers:
WITH cte AS (
-- Anchor
SELECT 0 AS n
UNION ALL
-- Recursive
SELECT n+1 FROM cte WHERE n<100
)
SELECT * FROM cte;
The thing to remember is that the recursive member has a WHERE clause to
limit the sequence. Without that, the recursive query would try to run forever,
and as you know, nothing lasts forever.
MSSQL has a built-in safety limit of 100 recursions, which we’ll have to
circumvent later:
-- MSSQL
WITH cte AS (
)
SELECT ... FROM cte OPTION(MAXRECURSION ...);
The others don’t, but for PostgreSQL, MariaDB, and MySQL, you can
readily set a time limit:
-- PostgreSQL
SET statement_timeout TO '5s';
-- MariaDB
SET MAX_STATEMENT_TIME=1; -- seconds
-- MySQL
SET MAX_EXECUTION_TIME=1000; -- milliseconds
If you’re sure about your recursion terminating properly, you don’t need to
worry about this. In MSSQL, you will, however, need to increase or disable the
recursion limit for some queries.
However, it won’t hurt to include a simple number sequence in what follows
just to be safe.
One case where a sequence can be useful is to get a sequence of dates. This
will simply define a start date and add one day in the recursive member.
The CTE starts simply enough:
Note that the first value, d, has been cast to a date, with the exception of
SQLite, which doesn’t have a date type. The n set to 1 is added as a sequence
number, but is really unnecessary. It’s added here to illustrate how you can use it
to stop overrunning your CTE.
The recursive part is also easy enough, but adding one day varies between
DBMSs:
-- PostgreSQL
WITH RECURSIVE dates(d, n) AS (
SELECT date'2023-01-01', 1
UNION
SELECT d+1, n+1 FROM dates
WHERE d<'2023-05-01' AND n<10000
)
SELECT * FROM dates;
-- MariaDB / MySQL
WITH RECURSIVE dates(d, n) AS (
SELECT date'2023-01-01', 1
UNION
SELECT date_add(d, interval 1 day), n+1 FROM dates
WHERE d<'2023-05-01' AND n<10000
)
SELECT * FROM dates;
-- MSSQL
WITH dates(d, n) AS (
SELECT cast('2023-01-01' as date), 1
UNION ALL
SELECT dateadd(day,1,d), n+1 FROM dates
WHERE d<'2023-05-01' AND n<10000
)
SELECT * FROM dates;
-- SQLite
WITH RECURSIVE dates(d, n) AS (
SELECT '2023-01-01', 1
UNION
SELECT strftime('%Y-%m-%d',d,'+1 day'), n+1
FROM dates
WHERE d<'2023-05-01' AND n<10000
)
SELECT * FROM dates;
-- Oracle
WITH dates(d, n) AS (
SELECT date '2023-01-01', 1 FROM dual
UNION ALL
SELECT d+1, n+1 FROM dates
WHERE d<date'2023-05-01' AND n<10000
)
SELECT * FROM dates;
D n
2023-01-01 1
2023-01-02 2
2023-01-03 3
2023-01-04 4
2023-01-05 5
2023-01-06 6
~ 121 rows ~
-- MSSQL, Oracle
WITH
allyears(year) AS (
SELECT 1940
UNION ALL
SELECT year+1 FROM allyears WHERE year<2010
)
Next, get the customer (id) and the year of birth of the customers:
You’ll need the LEFT JOIN to include all of the sequence of years even if it
doesn’t match a customer year; after all, that’s why it’s there.
year nums
1940 1
1941 1
1942 1
1943 1
1944 1
1945 1
~ 71 rows ~
SELECT *
FROM daily_sales
ORDER BY ordered_date;
However, if you look hard enough, you’ll find some dates missing. We’re
about to fill them in.
For this, we’ll need the following:
The daily_sales view
A CTE with the first and last dates of the daily sales
A sequence of dates
You already know how to generate a sequence of dates. This time, instead of
starting and stopping on arbitrary dates, we’ll start and stop on the first and last
dates of the daily_sales view. We can put those values in a CTE for
reference:
WITH
vars(first_date, last_date) AS (
SELECT min(ordered_date), max(ordered_date)
FROM daily_sales
)
-- PostgreSQL
WITH RECURSIVE
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION
SELECT d+1 FROM vars, dates WHERE d<last_date
)
-- MariaDB / MySQL
WITH RECURSIVE
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION
SELECT date_add(d, interval 1 day)
FROM vars, dates WHERE d<last_date
)
-- MSSQL
WITH
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION ALL
SELECT dateadd(day,1,d)
FROM vars, dates WHERE d<last_date
)
-- SQLite
WITH RECURSIVE
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION
SELECT strftime('%Y-%m-%d',d,'+1 day')
FROM vars, dates WHERE d<last_date
)
-- Oracle
WITH
vars(first_date, last_date) AS ( ),
dates(d) AS (
SELECT first_date FROM vars
UNION ALL
SELECT d+1 FROM vars, dates WHERE
d<last_date
)
For those DBMSs which use the keyword RECURSIVE, you use it once at
the beginning, even if some of the CTEs aren’t recursive.
Notice that we’ve cross-joined the vars and dates, which is the usual
technique of applying variables to another table. We could have written CROSS
JOIN, but it’s not worth the effort.
We can now complete our query using a LEFT JOIN to get all of the
sequence of dates:
ordered_date daily_total
2022-04-08 97.5
2022-04-09 96
2022-04-10 191
2022-04-11 201.5
2022-04-12 91
2022-04-13 160
~ 387 rows ~
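For reference, the completed query might look like this in PostgreSQL; the other DBMSs differ only in the date arithmetic shown earlier. Here, coalesce is used as one way to show the missing days with a zero total:

```sql
WITH RECURSIVE
vars(first_date, last_date) AS (
    SELECT min(ordered_date), max(ordered_date)
    FROM daily_sales
),
dates(d) AS (
    SELECT first_date FROM vars
    UNION
    SELECT d+1 FROM vars, dates WHERE d<last_date
)
SELECT
    dates.d AS ordered_date,
    coalesce(daily_sales.daily_total, 0) AS daily_total
FROM dates LEFT JOIN daily_sales
    ON dates.d=daily_sales.ordered_date
ORDER BY dates.d;
```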
Traversing a Hierarchy
Another use case for a recursive CTE is to traverse a hierarchy. The hierarchy
we’re going to look at is in the employees table:
Of course, in a real employees table, there would be more details; we’ve only
included enough here to make the point.
In particular, you’ll see that in the employees table, there is a
supervisorid column which is a foreign key to the same table:
employees.supervisorid ➤ employees.id
SELECT
e.id AS eid,
e.givenname, e.familyname,
s.id AS sid,
s.givenname||' '||s.familyname AS supervisor
-- MSSQL: s.givenname+' '+s.familyname AS supervisor
FROM employees AS e LEFT JOIN employees AS s -- Oracle: no AS
ON e.supervisorid=s.id
ORDER BY e.id;
The trick is, when joining a table to itself, you need to give the table two
different aliases to qualify the join.
The join is similar to the self-join earlier. The current employee is referred to
in the e table alias, and this aliased table is joined to the CTE, which will be the
supervisor. The raw data will come from the aliased table, while the supervisor’s
details will be concatenated into the new supervisors column.
Normally, you’d want to limit the recursion with a WHERE clause. For this
one, the join will do the job, as it will stop when there are no more to be joined.
The magic is in the expression for the supervisors string. In the
recursive member, the CTE represents inherited values.
SELECT
..., cast('' AS char(255)), 1
FROM employees WHERE supervisorid IS NULL
SELECT
..., cast('' AS nvarchar(255)), ...
FROM employees WHERE supervisorid IS NULL
UNION ALL
SELECT
...,
cast(cte.givenname+' '+cte.familyname
    +' < '+cte.supervisors as nvarchar(255)), ...
FROM cte JOIN employees AS e ON
cte.id=e.supervisorid
-- Others
cte.givenname||' '||cte.familyname
|| CASE WHEN n>1 THEN ' < ' ELSE '' END
|| cte.supervisors
-- MSSQL
cte.givenname+' '+cte.familyname
+ CASE WHEN n>1 THEN ' < ' ELSE '' END
+ cte.supervisors
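Assembled, the whole hierarchy query might look like the following. This is a sketch using PostgreSQL-style concatenation; the column list is assumed from the employees table described above:

```sql
WITH RECURSIVE
cte(id, givenname, familyname, supervisors, n) AS (
    SELECT id, givenname, familyname,
        cast('' AS char(255)), 1
    FROM employees WHERE supervisorid IS NULL
    UNION ALL
    SELECT e.id, e.givenname, e.familyname,
        cast(cte.givenname||' '||cte.familyname
            || CASE WHEN n>1 THEN ' < ' ELSE '' END
            || cte.supervisors AS char(255)),
        n+1
    FROM cte JOIN employees AS e ON cte.id=e.supervisorid
)
SELECT * FROM cte;
```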
inserts from a virtual table, generated by the VALUES clause. That also
means that, in principle, you should be able to use VALUES ... as a virtual
table without actually inserting anything. Unfortunately, it’s not quite so
straightforward.
A table literal is an expression which results in a collection of rows and
columns—a virtual table. If things go according to plan, it could look like this:
Not all DBMSs see it that way. Some DBMSs do allow just such an
expression, but others have something a little more complicated.
A little later, we’ll want to work with a virtual table to experiment with, so
the first step will be to put this into a CTE. Using the standard notation, you can
use
id value
a apple
b banana
c cherry
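That result comes from the standard form, which PostgreSQL, SQLite, and MariaDB accept directly:

```sql
WITH cte(id,value) AS (
    VALUES ('a','apple'), ('b','banana'), ('c','cherry')
)
SELECT * FROM cte;
```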
Note that we’ve included the column names in the CTE name.
For the other DBMSs, there are various alternatives:
-- MSSQL
WITH cte(id,value) AS (
SELECT * FROM
(VALUES ('a','apple'), ('b','banana'),
('c','cherry')) AS sq(a,b)
)
SELECT * FROM cte;
-- MySQL (not MariaDB)
WITH cte(id,value) AS (
VALUES ROW('a','apple'), ROW('b','banana'),
ROW('c','cherry')
)
SELECT * FROM cte;
-- Oracle
WITH cte(id,value) AS (
SELECT 'a','apple' FROM dual
UNION ALL SELECT 'b','banana' FROM dual
UNION ALL SELECT 'c','cherry' FROM dual
)
SELECT * FROM cte;
As you see, the prize for the most awkward version goes to Oracle, which
doesn’t yet support a proper table literal. Apparently, that’s coming soon.
MSSQL does support a table literal, but, for some unknown reason, it has to
be inside a subquery, complete with a dummy subquery name and dummy
column names.
MySQL also supports a table literal, but requires each row inside a ROW()
constructor, because MySQL has a non-standard values() function which
conflicts with using it simply as a table literal. This is one of the cases where
MariaDB and MySQL are not the same.
The actual code is commented out, because the DBMSs all have their own
ways. It gets further complicated because of the date literals.
We’re going to try this with the following series of dates:
dob today
1940-07-07 2023-01-01
1943-02-25 2023-01-01
1942-06-18 2023-01-01
1940-10-09 2023-01-01
1940-07-07 2022-12-31
1943-02-25 2022-12-31
1942-06-18 2022-12-31
1940-10-09 2022-12-31
1940-07-07 2023-07-07
1943-02-25 2023-02-25
1942-06-18 2023-06-18
1940-10-09 2023-10-09
-- Oracle
WITH dates(dob, today) AS (
SELECT date'1940-07-07',date'2023-01-01' FROM
dual
UNION ALL SELECT date'1943-02-25',date'2023-
01-01'
FROM dual
UNION ALL SELECT date'1942-06-18',date'2023-
01-01'
FROM dual
-- etc
)
Note
For PostgreSQL, MariaDB/MySQL, and Oracle, you can use the simple
expression date'...' to interpret a literal as a date.
You now have a virtual table with a collection of test dates. You can now try
out your age calculation:
-- PostgreSQL
WITH dates(dob, today) AS (
-- etc
)
SELECT
dob, today,
extract(year from age(today,dob)) AS age
FROM dates;
-- MariaDB/MySQL
WITH dates(dob, today) AS (
-- etc
)
SELECT
dob, today,
timestampdiff(year,dob,today) AS age
FROM dates;
-- MSSQL
WITH dates(dob, today) AS (
-- etc
)
SELECT
dob, today,
datediff(year,dob,today) AS age
FROM dates;
-- SQLite
WITH dates(dob, today) AS (
-- etc
)
SELECT
dob, today,
cast(
strftime('%Y.%m%d', today) -
strftime('%Y.%m%d', dob)
as int) AS age
FROM dates;
-- Oracle
WITH dates(dob, today) AS (
-- etc
)
SELECT
dob, today,
trunc(months_between(today,dob)/12) AS age
FROM dates;
We’ve already noted in Chapter 4 how MSSQL gets the age wrong, and this is
one way you can test that.
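As a quick check of that claim, MSSQL’s datediff counts calendar-year boundaries rather than completed years, so it can overstate the age:

```sql
-- MSSQL: counts the year boundaries crossed between the two dates,
-- giving 83, even though the 83rd birthday hasn't happened yet
SELECT datediff(year, '1940-07-07', '2023-01-01') AS age;
```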
The problem is that we’ve had to get the weekday number in order to sort
this correctly. It would have been nicer to use the weekday name instead. We can
then use an additional virtual table to sort the names.
First, let’s redo the data CTE with the day name:
-- PostgreSQL, Oracle
WITH data AS (
SELECT to_char(ordered,'FMDay') AS weekday,
total
FROM sales
)
-- MSSQL
WITH data AS (
SELECT datename(weekday,ordered) AS weekday,
total
FROM sales
)
-- MariaDB/MySQL
WITH data AS (
SELECT date_format(ordered,'%W') AS weekday, total
FROM sales
)
You’ll notice that SQLite isn’t included in the list. That’s because it doesn’t
have a method of getting the weekday name. If you need it, you’ll want the
reverse technique in the next section.
The summary CTE will now group by the weekday name:
WITH
data AS (
SELECT
... AS weekday,
total
FROM sales
),
summary AS (
SELECT weekday, sum(total) AS total
FROM data
GROUP BY weekday
)
-- etc
We’ll now need a table literal with the days of the week as well as a
sequence number.
sequence weekday
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
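Using the table literal forms from earlier, the weekdays CTE might be written like this in the standard VALUES notation (adjust for MSSQL, MySQL, or Oracle as shown before):

```sql
weekdays(sequence, weekday) AS (
    VALUES
        (1,'Monday'), (2,'Tuesday'), (3,'Wednesday'),
        (4,'Thursday'), (5,'Friday'), (6,'Saturday'),
        (7,'Sunday')
)
```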
Finally, to do the sorting, you can join the summary CTE with the
weekdays CTE and sort by the sequence number:
WITH
data AS ( ),
summary AS ( ),
weekdays(sequence, weekday) AS ( )
SELECT
summary.weekday, summary.total,
100*summary.total/sum(summary.total) OVER()
FROM summary JOIN weekdays
ON summary.weekday=weekdays.weekday
ORDER BY weekdays.sequence;
One advantage of this technique is that you can change the sequence
numbering in the table literal, for example, to start on Wednesday if that suits
you better.
By the way, if you’re going to sort by weekday, or anything like it, very
often, you might be better off saving the data in a permanent lookup table.
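As a sketch, such a lookup table (the names here are hypothetical) could be created once and then joined whenever it’s needed:

```sql
-- Hypothetical permanent lookup table for sorting by weekday name
CREATE TABLE weekdays (
    sequence int PRIMARY KEY,
    weekday varchar(16) NOT NULL
);
INSERT INTO weekdays(sequence, weekday)
VALUES
    (1,'Monday'), (2,'Tuesday'), (3,'Wednesday'),
    (4,'Thursday'), (5,'Friday'), (6,'Saturday'),
    (7,'Sunday');
```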
-- Oracle
WITH statuses(status,name) AS (
SELECT 1,'Gold' FROM DUAL
UNION ALL SELECT 2,'Silver' FROM DUAL
UNION ALL SELECT 3,'Bronze' FROM DUAL
)
Again, the benefit is that you can change the status names on the fly.
You can also do the same sort of thing with author and customer genders.
Another thing you can do with this technique is to translate from one set of
names to another set of names.
You may be wondering why we don’t include the full name of the gender
or the vip status in the table itself. Remember that you should only record a
piece of data once, and it should be the simplest version possible. Storing a
value as a single character, as with the gender, or an integer, as with the vip
status, reduces the possibility of data error or variation, and you can spell it
out later when you want.
Splitting a String
If you have the courage to look in the script which generated the database, you’ll
find two recursive CTEs near the end:
-- Populate Genres
INSERT INTO genres(genre)
WITH split(bookid,genre,rest,genres) AS (
...
)
SELECT DISTINCT genre
FROM split
WHERE split.genre IS NOT NULL;
-- Populate Book Genres
INSERT INTO bookgenres(bookid,genreid)
WITH split(bookid,genre,rest,genres) AS (
...
)
SELECT split.bookid,genres.id
FROM split JOIN genres ON split.genre=genres.genre
WHERE split.genre IS NOT NULL;
Some DBMSs don’t like string literals with a line break inside. For those
that will accept the line break, it will be part of the data, and we won’t want
that.
Be sure to write the string on one line, even if it’s very long.
For the recursive CTE, we’ll build two values: the individual item and a
string containing the rest of the original string. The CTE can be called split:
WITH
cte(fruits) AS (),
split(fruit, rest) AS (
The anchor member will get the first item from the string, up to the comma,
and the rest, after the comma:
WITH
cte(fruits) AS (),
-- PostgreSQL
split(fruit, rest) AS (
SELECT
substring(fruits,0,position(',' in fruits)),
substring(fruits,position(',' in fruits)+1)||','
FROM cte
)
-- MariaDB, MySQL
split(fruit, rest) AS (
SELECT
substring(fruits,1,position(',' in fruits)-1),
substring(fruits,position(',' in fruits)+1)||','
FROM cte
)
-- MSSQL
split(fruit, rest) AS (
SELECT
cast(substring(fruits,0,charindex(',',fruits)) as varchar(255)),
cast(substring(fruits,charindex(',',fruits)+1,255)+',' as varchar(255))
FROM cte
)
-- SQLite
split(fruit, rest) AS (
SELECT
substring(fruits,0,instr(fruits,',')),
substring(fruits,instr(fruits,',')+1)||','
FROM cte
)
-- Oracle
split(fruit, rest) AS (
SELECT
substr(fruits,1,instr(fruits,',')-1),
substr(fruits,instr(fruits,',')+1)||','
FROM cte
)
Note that for MSSQL we’ve had to cast the calculation to varchar(255)
because of a peculiarity with string compatibility.
For the recursive member, we use the rest value. First, we get the string up
to the first comma, which becomes the fruit value. Then, we get the rest of
the string from the comma, which becomes the new value for rest:
WITH
cte(fruits) AS (),
-- PostgreSQL
split(fruit, rest) AS (
SELECT ...
UNION
SELECT
substring(rest,0,position(',' in rest)),
substring(rest,position(',' in rest)+1)
FROM split WHERE rest<>''
)
-- MariaDB, MySQL
split(fruit, rest) AS (
SELECT ...
UNION
SELECT
substring(rest,1,position(',' in rest)-1),
substring(rest,position(',' in rest)+1)
FROM split WHERE rest<>''
)
-- MSSQL
split(fruit, rest) AS (
SELECT ...
UNION ALL
SELECT
substring(rest,0,charindex(',', rest)),
substring(rest,charindex(',', rest)+1,255)
FROM split WHERE rest<>''
)
-- SQLite
split(fruit, rest) AS (
SELECT ...
UNION
SELECT
substring(rest,0,instr(rest,',')),
substring(rest,instr(rest,',')+1)
FROM split WHERE rest<>''
)
-- Oracle
split(fruit, rest) AS (
SELECT ...
UNION ALL
SELECT
substr(rest,1,instr(rest,',')-1),
substr(rest,instr(rest,',')+1)
FROM split WHERE rest<>''
)
Note that we don’t add a comma to the rest value this time: that was just to
get started.
We have also added WHERE rest<>'' to the FROM clause. This is
because we need to stop recursing when there’s no more of the string to search.
You can now try it out:
WITH
cte(fruits) AS (),
split(fruit,rest) AS ()
SELECT * FROM split;
fruit rest
Apple Banana,Cherry,Date,Elderberry,Fig,
Banana Cherry,Date,Elderberry,Fig,
Cherry Date,Elderberry,Fig,
Date Elderberry,Fig,
Elderberry Fig,
Fig [NULL]
Of course, we don’t need to see the rest value in the output: it’s just there
so you can see its progress.
name list
colours Red,Orange,Yellow,Green,Blue,Indigo,Violet
elements Hydrogen,Helium,Lithium,Beryllium,Boron,Carbon
numbers One,Two,Three,Four,Five,Six,Seven,Eight,Nine
WITH
cte(name, items) AS (),
-- PostgreSQL
split(name, item, rest) AS (
SELECT
name,
substring(items,0,position(',' in items)),
substring(items,position(',' in items)+1)||','
FROM cte
)
-- MariaDB, MySQL
split(name, item, rest) AS (
SELECT
name,
substring(items,1,position(',' in items)-1),
substring(items,position(',' in items)+1)||','
FROM cte
)
-- MSSQL
split(name, item, rest) AS (
SELECT
name,
cast(substring(items,0,charindex(',',items)) as varchar(255)),
substring(items,charindex(',',items)+1,255)+','
FROM cte
)
-- SQLite
split(name, item, rest) AS (
SELECT
name,
substring(items,0,instr(items,',')),
substring(items,instr(items,',')+1)||','
FROM cte
)
-- Oracle
split(name, item, rest) AS (
SELECT
name,
substr(items,1,instr(items,',')-1),
substr(items,instr(items,',')+1)||','
FROM cte
)
As for the recursive member, again it’s the same idea, with the name value
included:
WITH
cte(name, items) AS (),
-- PostgreSQL
split(name, item, rest) AS (
SELECT ...
UNION
SELECT
name,
substring(rest,0,position(',' in rest)),
substring(rest,position(',' in rest)+1)
FROM split WHERE rest<>''
)
-- MariaDB, MySQL
split(name, item, rest) AS (
SELECT ...
UNION
SELECT
name,
substring(rest,1,position(',' in rest)-1),
substring(rest,position(',' in rest)+1)
FROM split WHERE rest<>''
)
-- MSSQL
split(name, item, rest) AS (
SELECT ...
UNION ALL
SELECT
name,
cast(substring(rest,0,charindex(',',rest)) as varchar(255)),
substring(rest,charindex(',', rest)+1,255)
FROM split WHERE rest<>''
)
-- SQLite
split(name, item, rest) AS (
SELECT ...
UNION
SELECT
name,
substring(rest,0,instr(rest,',')),
substring(rest,instr(rest,',')+1)
FROM split WHERE rest<>''
)
-- Oracle
split(name, item, rest) AS (
SELECT ...
UNION ALL
SELECT
name,
substr(rest,1,instr(rest,',')-1),
substr(rest,instr(rest,',')+1)
FROM split WHERE rest<>''
)
WITH
cte(name, items) AS (),
split(name, item, rest) AS ()
SELECT *
FROM split
ORDER BY name, item;
When it’s all running, you should see something like the following:
As you see, the recursive CTE was able to work with multiple rows of data.
Summary
In this chapter, we had a closer look at using Common Table Expressions. A
common table expression generates a virtual table that you can use later in the
main query. In the past, you would make do with a subquery in the FROM clause.
The reason why you would use a CTE or a FROM subquery is that you might
need to prepare data but you don’t want to go to the trouble of saving it either in
a view or a temporary table. CTEs are more ephemeral than temporary tables in
that they are not saved at all.
CTEs have a number of advantages over FROM subqueries:
You define the CTE before using it, making the query more readable and more
manageable.
You can chain multiple dependent or independent CTEs simply. If you wanted
to do that with FROM subqueries, you would have to nest them, which gets
unwieldy very quickly.
CTEs can be recursive, so you can use them to iterate through data.
Simple CTEs
The simplest use of a CTE is to prepare data for further processing. Some uses
include
Defining a set of constant values, either as literals or as calculated values
Preparing aggregate data, to be combined with non-aggregate queries
Parameter Names
A CTE is expected to have a name or alias for each column. You can define the
names inside the CTE, or you can define them as part of the CTE definition.
Multiple CTEs
Some queries involve multiple steps. These steps can be implemented by
chaining multiple CTEs.
Recursive CTEs
A recursive CTE is one which references itself. It can be used for iterating
through a set of data.
Some uses of recursive CTEs include
Generating a sequence of values
Traversing a hierarchy through a self-join
Splitting strings into smaller parts
Coming Up
So far, we’ve worked on a number of important major concepts. In the next
chapter, we’ll have a look at a few additional techniques you can use to work
smarter with your database:
Triggers allow you to automate a process whenever some of the data changes.
Pivot tables are basically a two-dimensional aggregate query.
Variables allow you to hold interim values when there’s too much going on.
Footnotes
1 Some SQLs, but not all, include additional structures such as DO ... WHILE in an SQL script. They’re
not really a standard part of the SQL language, but can be used in situations where you’re desperate to do
something iteratively.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
M. Simon, Leveling Up with SQL
https://doi.org/10.1007/978-1-4842-9685-1_10
Throughout the book, we’ve looked at pushing our knowledge and application of
SQL a little further and explored a number of techniques, some new and some
not so new.
When looking at some techniques, in particular, those involving aggregates
and common table expressions, we also got a sense of pushing SQL deeper, with
multitiered statements.
In this chapter, we’ll go a little beyond simple SQL and explore a few
techniques which supplement SQL. They’re not directly related to each other,
but they all allow you to do more in working with your data.
SQL triggers are small blocks of code which run automatically in response to
some database event. We’ll look at how these work and how you
would write one. In particular, we’ll look at a trigger to automatically archive
data which has been deleted.
Pivot tables are aggregates in two dimensions. They allow you to build
summaries in both row and column data. We’ll look at an example of preparing
data to be summarized and how we produce a pivot table.
Variables are pieces of temporary data which can be used to maintain values
between statements. They allow us to run a group of SQL statements, while they
hold interim values which are passed from one statement to another. In this
chapter, we’ll look at using variables to hold temporary values while we add data
to multiple tables.
Understanding Triggers
Sometimes, a simple SQL query isn’t quite enough. Sometimes, what you really
want is for a query to start off one or more additional queries. Sometimes, what
you want is a trigger.
A trigger is a small block of code which will be run automatically when
something happens to the database. There are various types of triggers, including
DML (Data Manipulation Language) triggers run when some change is made
to the data tables, as when the INSERT, UPDATE, or DELETE statements are
executed.
DDL (Data Definition Language) triggers run when changes are made to the
structure of the database, such as when CREATE, ALTER, or DROP statements
are executed.
Logon triggers run when a user has logged in.
One reason you might use DDL or Logon triggers is if you want to track
activity by storing this in a logging table.
Here, we’re going to look more at a DML trigger.
Triggers can be used to fill in some shortcomings of standard DBMS
behavior. Here are some examples which might call for a trigger:
You might have an activity table which wants a date column updated every
time you make a change. You can use a trigger to set the column for every
insert or update.
Suppose you have a rental table, where you enter a start and a finish date.
You’d like the finish date to default to the start date if it isn’t entered. SQL
defaults aren’t quite so clever, but you can set a trigger to set the finish date
when you insert a new row.
SQL has no auditing in the normal sense of the word. You can create a trigger
to add some data to a logging table every time a row is added, updated, or
deleted.
In this example, we’re going to create a trigger to keep a copy of data which
we’re going to delete from the sales table.
In some of the preceding chapters, we’ve had to contend with the fact that in
the sales table, some rows have NULLs for the ordered date/time. Presumably,
those sales never checked out.
We’ve been pretty forgiving so far and filtered them out from time to time,
but the time has come to deal with them. We can delete all of the NULL sales as
follows:
-- Not Yet!
DELETE FROM sales WHERE ordered IS NULL;
Note that there’s a foreign key from the saleitems table to the sales
table, which would normally disallow deleting the sales if there are any items
attached. However, if you check the script which generates the sample database,
you’ll notice the ON DELETE CASCADE clause, which will automatically
delete the orphaned sale items.
When should you delete data? The short answer is never. The longer
answer is more complicated. You would delete data that was entered in error,
or you would delete test data when you’ve finished testing.
In this case, we’re going to delete the sales with a NULL for the
ordered date; we’ll assume that the sale was never checked out and that the
customer won’t ever come back and finish it. However, we’ll keep a copy of
it anyway, just in case.
Most DBMSs handle triggers in a very similar way, but there are variations.
We’ll go over the basics first and then the details for individual DBMSs.
None of the DBMSs does it exactly this way, but the following is roughly right:
The trigger, of course, has a name: CREATE TRIGGER something.
The trigger is attached to a table: ON some_table.
The trigger is attached to an event.
The event is typically one of BEFORE, AFTER, or INSTEAD OF, followed
by one of the DML statements. In this example, we want to do something with
the old data before it’s deleted.
For the sample trigger, we’re going to copy the old data into a table called
deleted_sales. This means that we’re going to have to get to the data
before it’s vanished. The appropriate event is
BEFORE DELETE
It’s going to be a little complicated, because we want to copy not only the
data from the sales table but also from the saleitems table. We’ll do that
by concatenating those items into one string. You really shouldn’t keep multiple
items that way, but it’s good enough for an archive, and you can always pull it
apart if you ever need to.
The archive table looks something like this:
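From the INSERT that follows, a plausible shape (column names from the INSERT; types are assumptions) would be:

```sql
-- Assumed shape; the exact types may differ by DBMS
CREATE TABLE deleted_sales (
    saleid int,
    customerid int,
    items varchar(255),
    deleted_date timestamp
);
```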
-- PostgreSQL, MSSQL
WITH cte AS (
...
)
INSERT INTO deleted_sales(saleid, customerid,
items,
deleted_date)
SELECT saleid,customerid, items, current_timestamp
FROM cte;
As you see, with some DBMSs you start with the CTE, as you would using a
SELECT statement, while in others you start with the INSERT clause.
As for the CTE itself, we’ll derive that from the data to be deleted.
For most DBMSs, each row to be deleted is represented in a virtual row
called old (:old in Oracle). MSSQL instead has a virtual table called
deleted.
If we were simply archiving from one table, we wouldn’t need the CTE, and
we could simply copy the rows with
However, it’s not so simple when there’s another table involved. Here, the
plan is to read the book ids and quantities from the other table and combine them
using string_agg, group_concat, or listagg according to DBMS.
To generate the data, we’ll use a join and aggregate the results:
WITH cte(saleid,customerid,items) AS (
SELECT
s.id, s.customerid,
string_agg(si.bookid||':'||si.quantity,';')
FROM sales AS s JOIN saleitems AS si ON
s.id=si.saleid
WHERE s.id=old.id
GROUP BY s.id, s.customerid
)
The preceding sample is for PostgreSQL, but the others are nearly identical
—just the variations in the string_agg() function, concatenation, and table
aliases.
The items string will contain something like the following:
123:3;456:1;789:2
-- Before
SELECT * FROM sales order by id;
SELECT * FROM saleitems order by id;
SELECT * FROM deleted_sales order by id;
-- Delete with Trigger
DELETE FROM sales WHERE ordered IS NULL;
-- After
SELECT * FROM sales order by id;
SELECT * FROM saleitems order by id;
SELECT * FROM deleted_sales order by id;
PostgreSQL Triggers
PostgreSQL has the least convenient form of trigger, in that you first need to
prepare a function to contain the trigger code. A function is a named block of
code, which can be called later at any time.
To prepare for the function and trigger, we can start with a few DROP
statements:
As you see, the function has the code for the CTE and for copying the data
into the deleted_sales table. Here are a few points about the function itself:
A function has a name (do_archive_sales) and returns a result of a
certain type, in this case a TRIGGER.
PostgreSQL has a number of alternative coding languages you can use to write
a function, but the standard one is called plpgsql.
Technically, a function definition is a string. However, using single quotes
would interfere with single quotes inside the function definition. PostgreSQL
allows an alternative string delimiter, in this case the $$ code. This is the most
mysterious part of writing PostgreSQL functions.
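A sketch of what that function might look like, with the body following the CTE discussed earlier (names and details assumed from the surrounding discussion):

```sql
-- Sketch; assumes the deleted_sales table described earlier
CREATE FUNCTION do_archive_sales() RETURNS TRIGGER AS $$
BEGIN
    WITH cte(saleid, customerid, items) AS (
        SELECT
            s.id, s.customerid,
            string_agg(si.bookid||':'||si.quantity, ';')
        FROM sales AS s JOIN saleitems AS si
            ON s.id=si.saleid
        WHERE s.id=old.id
        GROUP BY s.id, s.customerid
    )
    INSERT INTO deleted_sales(saleid, customerid, items, deleted_date)
    SELECT saleid, customerid, items, current_timestamp
    FROM cte;
    RETURN old;  -- allow the DELETE to proceed
END;
$$ LANGUAGE plpgsql;
```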
Once you have the function in place, creating the trigger is simple:
CREATE TRIGGER archive_sales_trigger
BEFORE DELETE ON sales
FOR EACH ROW
EXECUTE FUNCTION do_archive_sales();
MySQL/MariaDB Triggers
With MariaDB/MySQL, the trigger can be written in a single block. First, we’ll
write the code to drop the trigger:
DELIMITER $$
DELIMITER ;
Here, the delimiter is changed to $$. It doesn’t have to be that, but it’s a
combination you’re unlikely to use for anything else. The new delimiter is used
to mark the end of the code and switched back to the semicolon after that.
After that, the trigger code is much as described:
DELIMITER $$
CREATE TRIGGER archive_sales_trigger
BEFORE DELETE ON sales
FOR EACH ROW
BEGIN
INSERT INTO
deleted_sales(saleid,customerid,items,deleted_date)
WITH cte(saleid,customerid,items) AS (
SELECT
s.id, s.customerid,
group_concat(si.bookid||':'||si.quantity
SEPARATOR ';')
FROM sales AS s JOIN saleitems AS si ON
s.id=si.saleid
WHERE s.id=old.id
GROUP BY s.id, s.customerid
)
SELECT saleid,customerid,items,current_timestamp
FROM cte;
END; $$
DELIMITER ;
MSSQL Triggers
MSSQL also has a simple, direct way of creating a trigger. However, there’s a
complicating factor, which we’ll need to work around.
Before that, however, we’ll add the code to drop the trigger:
With other DBMSs, you create a BEFORE DELETE trigger to capture the
data before it’s gone. With MSSQL, you don’t have that option: there’s only
AFTER DELETE and INSTEAD OF DELETE. In both cases, there is a virtual
table called deleted which has the rows to be deleted.
The problem with AFTER DELETE is that, even though the deleted
virtual table has the deleted rows from the sales table, it’s too late to get the
rows from the saleitems table, as they have also been deleted, but there’s no
virtual table for that.
For that, we’ll take a different approach. We’ll use an INSTEAD OF
DELETE event, which is to say that MSSQL will run the trigger instead of
actually deleting the data. The trick is to finish off the trigger by doing the delete
at the end:
The deleted virtual table still has the rows which haven’t actually been
deleted, but were going to be before the trigger stepped in. All we need from that
is the id to identify the sales which should be deleted at the end, together with
the cascaded sale items.
The other complication is that MSSQL won’t let you concatenate strings
with numbers, so you’ll have to cast the numbers as strings:
cast(si.bookid AS varchar)+':'+cast(si.quantity AS
varchar)
SQLite Triggers
Of all the DBMSs in this book, SQLite has by far the simplest and most direct
version of coding a trigger.
First, we can write the code to drop the trigger:
The code to create the trigger is almost identical to the discussion earlier:
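As a sketch, the SQLite version could look like this, using a direct INSERT ... SELECT rather than a CTE (the deleted_sales table is as described earlier):

```sql
CREATE TRIGGER archive_sales_trigger
BEFORE DELETE ON sales
FOR EACH ROW
BEGIN
    INSERT INTO deleted_sales(saleid, customerid, items, deleted_date)
    SELECT
        s.id, s.customerid,
        group_concat(si.bookid||':'||si.quantity, ';'),
        current_timestamp
    FROM sales AS s JOIN saleitems AS si ON s.id=si.saleid
    WHERE s.id=old.id
    GROUP BY s.id, s.customerid;
END;
```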
Oracle Triggers
Writing trigger code in Oracle is similar to the basic code outlined earlier, but
there are a few complicating factors which we’ll need to work around.
Before that, we can write the code to drop the trigger:
/
CREATE TRIGGER archive_sales_trigger
BEFORE DELETE ON sales
FOR EACH ROW
BEGIN
...
END;
/
The forward slash (/) before and after the code defines the block. Everything
between the slashes, including the statements terminated with a semicolon, will
be treated as one block of code.
The second complication is that Oracle doesn’t like making changes to the
table doing the triggering. The solution is to tell Oracle that code is part of a
separate transaction:
/
CREATE TRIGGER archive_sales_trigger
BEFORE DELETE ON sales
FOR EACH ROW
DECLARE
PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
INSERT INTO deleted_sales(saleid, customerid,
items, deleted_date)
WITH cte(saleid,customerid,items) AS (
SELECT
s.id, s.customerid,
listagg(si.bookid||':'||si.quantity,';')
FROM sales s JOIN saleitems si ON
s.id=si.saleid
WHERE s.id=:old.id
GROUP BY s.id, s.customerid
)
SELECT saleid, customerid, items,
current_timestamp
FROM cte;
COMMIT;
END;
/
Pivoting Data
One of the important principles of good database design is that each column
does a different job. On top of that, each column is independent of the other
columns. That’s one reason why we put so much effort separating out the town
details from the customers table in Chapter 2.
There are some situations, however, where this sort of design doesn’t suit
analysis. Take, for example, a typical ledger type of table:
date description food travel accommodation misc
... ... ... ... ... ...
... ... ... ... ... ...
This is a layout that’s very easy to understand and analyze. If you want to get
the totals for a particular category, just add down the column. If you want to get
the totals for a particular item, just add across. This sort of thing used to be done
by hand until spreadsheets were invented to let the computer do all the hard
work.
You may see this sort of design in database tables you come across.
However, it’s not a good design for SQL tables:
Putting a value in one column precludes putting it in another: the columns are
deeply dependent.
You will end up with very many empty spaces.
A new category means adding a new column to the table design. You may end
up with a huge number of columns.
The data is harder to analyze, because now you need to calculate across columns:
SQL aggregate functions are designed to aggregate across rows.
A better design would be
We’re going to see how to pivot data from the sales and customers
tables to get total sales by state and VIP categories. The result will look
something like this:
In principle, you could transpose the table and have the VIP groups go down,
with the states going across. This version, however, will look neater.
Manually Pivoting Data
As we’ve already seen before, you often need to prepare the data before you
aggregate it. This particular summary will need data from four tables:
customers, towns, vip, and sales. Fortunately, the customerdetails
view already combines the customers and towns tables, so we can reduce
the number to three.
All the preparation will be done in multiple CTEs:
WITH
statuses AS (
...
),
customerinfo AS (
...
),
salesdata AS (
)
...
The vip table has a status number. The statuses CTE will be a table
literal which allocates a name to the number.
The customerinfo CTE will join the tables together and select the
columns we want to summarize.
The salesdata will be an aggregate query which will be a first step in our
pivot table summary.
With those CTEs, we’ll run another aggregate query which will result in our
pivot table.
The statuses CTE is simple. We just need to match status numbers with
names:
WITH
statuses(status, statusname) AS (
-- PostgreSQL, SQLite, MariaDB (Not MySQL):
VALUES (1,'Gold'), (2,'Silver'), (3,'Bronze')
-- MySQL:
VALUES row(1,'Gold'), row(2,'Silver'),
row(3,'Bronze')
-- MSSQL:
SELECT * FROM (VALUES (1,'Gold'),(2,'Silver'),
(3,'Bronze')) AS sq(a,b)
-- Oracle:
SELECT 1,'Gold' FROM dual
UNION ALL SELECT 2,'Silver' FROM dual
UNION ALL SELECT 3,'Bronze' FROM dual
)
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
SELECT customerdetails.id, state,
statuses.statusname
FROM
customerdetails
LEFT JOIN vip ON customerdetails.id=vip.id
LEFT JOIN statuses ON
vip.status=statuses.status
)
SELECT *
FROM customerinfo;
Id state statusname
407 NSW Bronze
299 QLD Gold
21 [NULL] Gold
597 TAS [NULL]
106 NSW Gold
26 VIC Gold
~ 303 rows ~
At this point, you can group it by state or status name to see how many of
each you have, but we’re more interested in the total sales.
For that, we’ll need to join the preceding with the sales table in another
CTE:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
SELECT state, statusname, total
FROM customerinfo JOIN sales
ON customerinfo.id=sales.customerid
)
SELECT *
FROM salesdata;
All of this is just to get the data ready. What we’re going to do now is
generate our group rows.
Obviously, you’ll need an aggregate query, grouping by state. Normally, it
would look something like this:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
)
SELECT state, sum(total)
FROM salesdata
GROUP BY state;
to give us this:
State sum
WA 20274
ACT 6781.5
TAS 28193
VIC 79199.5
NSW 101889
NT 6151
QLD 53331.5
SA 30977.5
However, to get that ledger table appearance, we’ll use aggregate filters to
generate three separate totals:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
)
SELECT
state,
sum(CASE WHEN statusname='Gold' THEN total END) AS
gold,
sum(CASE WHEN statusname='Silver' THEN total END)
AS silver,
sum(CASE WHEN statusname='Bronze' THEN total END)
AS bronze
FROM salesdata
GROUP BY state;
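To see the filtered-aggregate technique in action end to end, here is a self-contained sketch using Python’s sqlite3 module. The salesdata rows are made up for illustration; they don’t come from the book’s sample database:

```python
import sqlite3

# In-memory database with a tiny, made-up salesdata table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salesdata(state TEXT, statusname TEXT, total REAL)")
conn.executemany(
    "INSERT INTO salesdata VALUES(?,?,?)",
    [("NSW", "Gold", 100), ("NSW", "Silver", 50),
     ("VIC", "Gold", 80), ("VIC", "Bronze", 20)],
)

# One CASE-filtered sum per status, grouped by state: a manual pivot.
rows = conn.execute("""
    SELECT state,
           sum(CASE WHEN statusname='Gold'   THEN total END) AS gold,
           sum(CASE WHEN statusname='Silver' THEN total END) AS silver,
           sum(CASE WHEN statusname='Bronze' THEN total END) AS bronze
    FROM salesdata
    GROUP BY state
    ORDER BY state
""").fetchall()
print(rows)  # [('NSW', 100.0, 50.0, None), ('VIC', 80.0, None, 20.0)]
conn.close()
```

Note that a state with no sales in a category gets a NULL (Python None), not a zero, because the CASE expression returns NULL for the other statuses and sum() ignores NULLs.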
You might be tempted to ask whether there’s an easier way to do it. The
answer is not really. The hard part was always going to be the preparation of the
data for pivoting.
However, for a few DBMSs, the final step can be achieved with a built-in
feature.
SELECT ...
FROM ...
PIVOT (aggregate FOR column IN(columnnames)) AS alias
The aggregate is the aggregate function you want to apply. In this case, it’s
sum(total).
The column is the column whose values you want across the table. In this
case, it’s statusname.
The columnnames is a list of values which will be the columns across the
pivot table. In this case, it’s Gold, Silver, Bronze.
The alias is any alias you want to give. It’s not used here, but it’s required.
The pivot table is, after all, a virtual table.
In our case, the pivot table will look like this:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
)
SELECT *
FROM salesdata
-- MSSQL:
PIVOT (sum(total) FOR statusname IN (Gold, Silver,
Bronze))
AS whatever
-- Oracle:
PIVOT (sum(total) FOR statusname IN ('Gold' AS Gold,
'Silver'
AS Silver, 'Bronze' AS Bronze))
;
This is a little bit simpler than the filtered aggregates we used previously.
However, note that there are some quirks with this technique.
The syntax for MSSQL and Oracle is not identical:
In MSSQL, the column names list is a plain list of names. Also, note that the
PIVOT clause requires an alias.
In Oracle, the list of column names is a list of strings; however, they are
aliased to prevent the single quotes from appearing in the names. The PIVOT
clause itself does not require an alias.
You’ll notice that the state doesn’t make an appearance in the PIVOT
clause; only the statusname and total. Any column not mentioned in the
PIVOT clause will appear as grouping rows. You can have more complex pivot
tables if there’s more than one such column, but you need to make sure that the
(virtual) table you want to pivot doesn’t have any stray unwanted columns.
You’ll also notice that the IN expression isn’t a normal IN expression. To
begin with, it’s not a list of values, but a list of column names.
On top of that, you can’t use a subquery to get the list of column names. You
have to know ahead of time what the column names are going to be, and you’ll
have to type them in yourself.
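If you can’t know the column values ahead of time, one common workaround, outside the PIVOT clause itself, is to let the host language discover the values first and build the SQL dynamically. A sketch using sqlite3 and CASE-filtered sums, with made-up data (in real code, you should validate the discovered names before splicing them into SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salesdata(state TEXT, statusname TEXT, total REAL)")
conn.executemany("INSERT INTO salesdata VALUES(?,?,?)",
                 [("NSW", "Gold", 100), ("VIC", "Silver", 60)])

# First query: discover the distinct status names.
names = [r[0] for r in conn.execute(
    "SELECT DISTINCT statusname FROM salesdata ORDER BY statusname")]

# Build one CASE-filtered aggregate per discovered name.
cols = ", ".join(
    f"sum(CASE WHEN statusname='{n}' THEN total END) AS {n.lower()}"
    for n in names)
sql = f"SELECT state, {cols} FROM salesdata GROUP BY state ORDER BY state"
print(conn.execute(sql).fetchall())
# [('NSW', 100.0, None), ('VIC', None, 60.0)]
```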
Using the pivot feature is not quite as convenient as it might have been, but,
if it’s available, is still simpler than the filtered aggregates. However, you will
still need to put in some effort in preparing your data first.
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
), -- extra comma
pivottable AS (
SELECT *
FROM salesdata
PIVOT ...
)
SELECT *
FROM pivottable
;
If you run this, you’ll get the same result as before; we’ve just put the result
into the pivottable CTE.
The next step is to add the UNPIVOT clause at the end of the SELECT
statement:
WITH
statuses(status, statusname) AS (
...
),
customerinfo(id, state, statusname) AS (
...
),
salesdata(state, statusname, total) AS (
...
),
pivottable AS (
...
)
SELECT *
FROM pivottable
-- MSSQL:
UNPIVOT (
total FOR statuses IN (Gold,Silver,Bronze)
) AS w
-- Oracle:
UNPIVOT (
total FOR statuses IN (Gold,Silver,Bronze)
)
The UNPIVOT clause is even more mysterious than the PIVOT clause. The
only column that’s specifically mentioned is the statuses column, and, again,
you need to list the possible values. From there, the DBMS magically works out
that there is a state column, and whatever’s left will appear in another column,
which we have called total.
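For DBMSs without an UNPIVOT clause, you can get the same effect manually with one SELECT per column, stacked with UNION ALL. A sketch in sqlite3, with a made-up pivoted table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pivoted(state TEXT, gold REAL, silver REAL)")
conn.execute("INSERT INTO pivoted VALUES('NSW', 100, 50)")

# One SELECT per status column, stacked with UNION ALL.
rows = conn.execute("""
    SELECT state, 'Gold' AS statusname, gold AS total FROM pivoted
    UNION ALL
    SELECT state, 'Silver', silver FROM pivoted
""").fetchall()
print(rows)  # [('NSW', 'Gold', 100.0), ('NSW', 'Silver', 50.0)]
```

Adding a WHERE total IS NOT NULL filter around the result would drop the empty cells, which UNPIVOT does by default.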
Code Blocks
If you’re using a client which makes it easy to run one statement at a time, you
may find it gets a little confused when working with blocks of multiple
statements. It will be easier to work with if you surround your block with
delimiters.
For the various DBMSs, the delimiters look like this:
-- PostgreSQL
DO $$
...
END $$;
-- MariaDB/MySQL
DELIMITER $$
...
$$
DELIMITER ;
-- MSSQL
GO
...
GO
-- Oracle
/
...
/
In the end, you will probably just highlight all of the lines of code and run
them together. That’s what we recommend when trying the following code. Don’t
try running just one line at a time.
In the following code, we’ll do what we did in Chapter 3 in adding a new
sale. Then, we made a point of recording the new sale id, so that we could use it
in subsequent statements. This time, however, we’ll use variables to store interim
values, so we can run the code in a single batch.
The code will broadly follow these steps:
1.
Set up the data to be used.
2.
Insert the sale.
3.
Get the new sale id into a variable.
4.
Insert the sale items, using the sale id.
5.
Update the sale items with their prices, using the sale id.
6. Update the new sale with the total, using the sale id, of course.
While we’re at it, we’ll set a few other variables:
A variable to store the customer’s id
A variable to store the ordered date/time
It would be nice to have another variable with the sale items. However, most
DBMSs aren’t adept at defining multivalued variables without a lot of extra fuss
in defining custom data types to do the job. Here, we’re trying to keep things
simple.
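Since SQLite has no SQL variables at all, the same steps are usually driven from the host language, which holds the interim values instead. Here is a sketch in Python’s sqlite3 with simplified, made-up table definitions; the cursor’s lastrowid plays the role of the sid variable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(id INTEGER PRIMARY KEY, customerid INT,
                       ordered TEXT, total REAL);
    CREATE TABLE books(id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE saleitems(saleid INT, bookid INT, quantity INT, price REAL);
    INSERT INTO books(id, price) VALUES (1, 20), (2, 35);
""")

cid = 42                                  # customer id, held in Python
cur = conn.execute(
    "INSERT INTO sales(customerid, ordered) VALUES(?, datetime('now'))", (cid,))
sid = cur.lastrowid                       # the new sale id

conn.executemany(
    "INSERT INTO saleitems(saleid, bookid, quantity) VALUES(?,?,?)",
    [(sid, 1, 2), (sid, 2, 1)])
conn.execute("""UPDATE saleitems
    SET price=(SELECT price FROM books WHERE books.id=saleitems.bookid)
    WHERE saleid=?""", (sid,))
conn.execute("""UPDATE sales
    SET total=(SELECT sum(price*quantity) FROM saleitems WHERE saleid=?)
    WHERE id=?""", (sid, sid))
print(conn.execute("SELECT total FROM sales WHERE id=?", (sid,)).fetchone())
# (75.0,)
```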
What follows will be four similar versions of how to write the code block.
DO $$
...
END $$ ;
DO $$
DECLARE
cid INT := 42;
od TIMESTAMP := current_timestamp;
sid INT;
END $$ ;
The variable names can be anything you like, but you run the risk of
competing with column names in the following code. Some developers prefix
the names with an underscore (such as _cid).
The sid variable is an integer which will be assigned later. The cid and od
variables are for the customer id and ordered date/time. They are assigned from
the beginning with the special operator :=.
The code proper is inside a BEGIN ... END block. It will be all of the
code you used in Chapter 3, but run together. The important part is that the
variable sid is used to manage the new sale id:
DO $$
DECLARE
cid INT := 42;
od TIMESTAMP := current_timestamp;
sid INT;
BEGIN
INSERT INTO sales(customerid, ordered)
VALUES(cid, od)
RETURNING id INTO sid;
UPDATE saleitems AS si
SET price=(SELECT price FROM books AS b
WHERE b.id=si.bookid)
WHERE saleid=sid;
UPDATE sales
SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=sid)
WHERE id=sid;
END $$;
The sid variable gets its value from the RETURNING clause in the first
INSERT statement. From there on, it’s used in the remaining statements.
You can test the results using
You should see the new sale and sale items at the top.
DELIMITER $$
BEGIN
END; $$
DELIMITER ;
DELIMITER $$
BEGIN
SET @cid = 42;
SET @od = current_timestamp;
SET @sid = NULL;
END; $$
DELIMITER ;
Variables are prefixed with the @ character. This makes them a little more
obvious and avoids possible conflict with column names.
The statement SET @sid = NULL; is unnecessary. Since you don’t
declare variables, we’ve included the statement just to make it clear that we’ll be
using the @sid variable a little later.
The whole code looks like this:
DELIMITER $$
BEGIN
SET @cid = 42;
SET @od = current_timestamp;
SET @sid = NULL; -- unnecessary; just to make clear
INSERT INTO sales(customerid, ordered)
VALUES(@cid, @od);
SET @sid = last_insert_id();
UPDATE saleitems
SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid)
WHERE saleid=@sid;
UPDATE sales
SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=@sid)
WHERE id=@sid;
END;
$$
DELIMITER ;
When you add a new row with an autogenerated primary key, you need to
get the new value to use later. The last_insert_id() function fetches the
most recent autogenerated value in the current session. You’ll notice that it
doesn’t specify which table: that’s why you need to call it immediately after the
INSERT statement.
As you see, the rest of the code is generally the same as in Chapter 3, with
the @sid variable used to manage the new sale id.
You can test the results using
You should see the new sale and sale items at the top.
GO
...
GO
The GO keyword isn’t actually a part of Microsoft’s SQL language (or any
other SQL, for that matter). It’s actually an instruction to the client software to
treat what’s inside as a single batch and to run it as such. Some clients allow you
to indent the keyword, and some allow you to add semicolons and comments on
the same line, but the safest thing is not to indent it and not add anything else to
the line.
Microsoft doesn’t have a block to declare variables, but it does have a
DECLARE statement. To declare three variables, you can use three statements:
GO
DECLARE @cid INT = 42;
DECLARE @od datetime2 = current_timestamp;
DECLARE @sid INT;
GO
or you can use a single statement with the variables separated by commas:
GO
DECLARE
@cid INT = 42,
@od datetime2 = current_timestamp,
@sid INT;
GO
Variables are prefixed with the @ character, which makes them easy to spot
and easy to distinguish from column names.
The @sid variable is an integer which will be assigned later.
The rest of the code is similar to what we did in Chapter 3, but the new sale
id will be managed in the @sid variable:
GO
DECLARE @cid INT = 42;
DECLARE @od datetime2 = current_timestamp;
DECLARE @sid INT;
INSERT INTO sales(customerid, ordered)
VALUES(@cid, @od);
SET @sid = scope_identity();
UPDATE saleitems
SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid)
WHERE saleid=@sid;
UPDATE sales
SET total=(SELECT sum(price*quantity)
FROM saleitems WHERE saleid=@sid)
WHERE id=@sid;
GO
The @sid variable gets its value from the scope_identity() function.
You’ll notice that it doesn’t specify which table: that’s why you need to call it
immediately after the INSERT statement. From there on, it’s used in the
remaining statements.
You can test the results using
You should see the new sale and sale items at the top.
When the time comes, the whole block will be run as a single batch.
Variables are declared inside a DECLARE section:
/
DECLARE
cid INT := 42;
od TIMESTAMP := current_timestamp;
sid INT;
/
The variable names can be anything you like, but you run the risk of
competing with column names in the following code. Some developers prefix
the names with an underscore (such as _cid).
The sid variable is an integer which will be assigned later. The cid and od
variables are for the customer id and ordered date/time. They are assigned from
the beginning with the special operator :=.
The code proper is inside a BEGIN ... END block. It will be all of the
code you used in Chapter 3, but run together. The important part is that the
variable sid is used to manage the new sale id:
/
DECLARE
cid INT := 42;
od TIMESTAMP := current_timestamp;
sid INT;
BEGIN
INSERT INTO sales(customerid,ordered)
VALUES(cid, od)
RETURNING id INTO sid;
UPDATE saleitems
SET price=(SELECT price FROM books
WHERE books.id=saleitems.bookid)
WHERE saleid=sid;
UPDATE sales
SET total=(SELECT sum(price*quantity) FROM
saleitems
WHERE saleid=sid)
WHERE id=sid;
END;
/
The sid variable gets its value from the RETURNING clause in the first
INSERT statement. From there on, it’s used in the remaining statements.
You can test the results using
You should see the new sale and sale items at the top.
Review
In this chapter, we’ve looked at a few additional techniques that can be used to
get more out of our database.
Triggers
Triggers are code scripts which run in response to something happening in the
database. Typically, these include INSERT, UPDATE, and DELETE events.
Using a trigger, you can intercept the event and make your own additional
changes to the affected table or another table. Some triggers can go further and
work more closely with the DBMS or operating system.
We explored the concept by creating a trigger which responds to deleting
from the sales table. In this case, we copied data from the sale and matching
sale item into an archive table.
Different DBMSs vary in detail, but generally they follow the same
principles:
A trigger is defined for an event on a table.
The trigger code has access to the data about to be affected.
Using this data, the trigger code can go ahead and perform additional SQL
operations.
Pivot Tables
A pivot table is a virtual table which summarizes data in both rows and columns.
It’s a sort of two-dimensional aggregate.
For the most part, raw table data isn’t ready to be summarized this way. You
would put some effort into preparing the data in the right form and making it
available in one or more CTEs.
You can create a pivot table manually using a combination of two
techniques:
An aggregate query generates the vertical groups and the data to be
summarized.
A SELECT statement with aggregate filters generates a summary for each
horizontal category.
MSSQL and Oracle both have a non-standard PIVOT clause which will, to
some extent, automate the second of these steps. However, it still requires some
input from the SQL developer to finish the job.
SQL Variables
In this chapter, we used variables to streamline the code first introduced in
Chapter 3, which adds a sale by inserting into multiple tables and updating them.
Most of the SQL we’ve worked with involved single statements. Some of
those statements were effectively multipart statements with the use of CTEs to
generate interim data.
In the case where you need more complex code to run in multiple statements,
you may need to store interim values. These values are held in variables, which
are temporary pieces of data.
In this chapter, we used variables for two purposes:
To hold fixed values to be used in the code
To store an interim value generated by some of the code
In most DBMSs, variables are declared and used within a block of code. In
most cases, the variables and their values will vaporize after the code block is
run. MariaDB/MySQL, however, will retain variables beyond the run.
SQLite doesn’t support variables. It is expected that the hosting application
will handle the temporary data that variables are supposed to manage.
Summary
Although you can go a long way with straightforward SQL statements and
features, you can often get more out of your DBMS with some additional
features:
Triggers are used to run some code in response to some database event. They
can be used to add some further processing to your database automatically.
Pivot tables are virtual tables which provide a compact view of your
summaries. You can generate a pivot table using a combination of aggregate
queries, but some DBMSs offer a pivot feature to simplify the process.
SQL variables are used to store temporary values between other SQL
statements. They can be used to store interim values that can be used in
subsequent statements.
Using what you’ve learned here and in previous chapters, you can build
more complex queries to work with and analyze your database.
Appendix A: Cultural Notes
The sample database was based on the way we do things in Australia. This is
pretty similar to the rest of the world, of course, but there are some details that
might need clearing up.
Australian addresses don’t make much use of cities, which have a pretty
broad definition in Australia.
Towns
Depending on how you define a town, there are about 15,000–20,000 towns in
Australia.
In the sample database, town names have been deliberately selected as those
occurring at least three times in Australia, though not necessarily in the sample.
States
Australia has eight geographical states. Technically, two of them are territories,
since they don’t have the same political features.
Each state has a two- or three-letter code.
Name Code
Northern Territory NT
New South Wales NSW
Australian Capital Territory ACT
Victoria VIC
Queensland QLD
South Australia SA
Western Australia WA
Tasmania TAS
Postcodes
A postcode is a four-digit code typically, though not exclusively, associated with
a town:
Two adjacent towns may have the same postcode.
A large town may have more than one postcode.
A large organization may have its own postcode.
The postcode is closely associated with the state, though some towns close to
the border may have a postcode from the neighboring state.
Phone Numbers
In Australia, a normal phone number has ten digits. For nonmobile numbers, the
first two digits are an area code, starting with 0, which indicates one of four
major regions. Mobile phones have a region code of 04.
There are also special types of phone numbers. Numbers beginning with
1800 are toll free, while numbers starting with 1300 are used for large
businesses that are prepared to pay for them.
Shorter numbers starting with 13 are for very large organizations. Other
shorter numbers are for special purposes, such as emergency numbers.
Australia maintains a group of fake phone numbers, and all of the phone
numbers used in the database are, of course, fake. Don’t waste your time trying
to phone one.
Email Addresses
There are a number of special domains reserved for testing or teaching. These
include example.com and example.net, which is why all of the email
addresses use them.
This is true all over the world.
Dates
Short dates in Australia are in the day/month/year format, which can get
particularly confusing when mixed with American and Canadian dates. It is for
this reason that we recommend using the month name instead of the month
number or, better still, the ISO8601 format.
Writing SQL
In general, all DBMSs write the actual SQL in the same way. There are a few
differences in syntax and in some of the data types.
Semicolons
MSSQL does not require the semicolon between statements. However, apart
from being best practice to use it, Microsoft has stated that it will be required in
a future version,1 so you should always use one.
Data Types
All DBMSs have their own variations on data types, but they have a lot in
common:
SQLite doesn’t enforce data types, but has general type affinities.
PostgreSQL, MySQL/MariaDB, and SQLite support boolean types, while
MSSQL and Oracle don’t. MySQL/MariaDB tends to treat boolean values as
integers.
Dates
Oracle doesn’t like ISO8601 date literals (yyyy-mm-dd). However, it is easy
enough to get this to work. You can also use the to_date() function or the
to_timestamp() function to accept different date formats.
MariaDB/MySQL only accepts ISO8601 date literals. If you want to feed it a
different format, you can use the str_to_date() function.
SQLite doesn’t actually have a date data type, so it’s a bit more complicated.
Generally, it’s simplest to use a TEXT type to store ISO8601 strings, with
appropriate functions to process it.
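For example, SQLite’s date functions work directly on ISO8601 TEXT values. A small sketch via Python’s sqlite3 (the sample date is ours):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dates stored as ISO8601 TEXT; SQLite's date functions understand them.
row = conn.execute(
    "SELECT date('2024-03-15', '+1 month'), strftime('%Y', '2024-03-15')"
).fetchone()
print(row)  # ('2024-04-15', '2024')
```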
Case Sensitivity
Generally, the SQL language is case insensitive. However
MySQL/MariaDB as well as Oracle may have issues with table names,
depending on the underlying operating system.
Strings may well be case sensitive depending on the DBMS defaults and
additional options when creating the database or table. By default
MSSQL and MySQL/MariaDB are case insensitive.
PostgreSQL, SQLite, and Oracle are case sensitive.
There’s one more peculiarity in SQLite:
Matching strings is case sensitive.
Matching patterns (LIKE) is case insensitive.
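You can see this SQLite peculiarity directly (via Python’s sqlite3; note that LIKE is only case insensitive for ASCII letters by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
eq, like = conn.execute(
    "SELECT 'Hello' = 'hello', 'Hello' LIKE 'hello'").fetchone()
print(eq, like)  # 0 1  (= is case sensitive, LIKE is not)
```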
Quote Marks
In standard SQL
Single quotes are for 'values'.
Double quotes are for "names".
However
MySQL/MariaDB has two modes. In traditional mode, double quotes are also
used for values, and you need the unofficial backtick for names. In ANSI
mode, double quotes are for names.
MSSQL also allows (and seems to prefer) square brackets for names.
Personally, I discourage this, so it’s not an issue.
Limiting Results
This is a feature omitted in the original SQL standards, so DBMSs have followed
their own paths. However
PostgreSQL, Oracle, and MSSQL all now use the OFFSET ... FETCH
... standard, with some minor variations.
PostgreSQL, MySQL/MariaDB, and SQLite all support the non-standard
LIMIT ... OFFSET ... clause. (That’s right, PostgreSQL has both.)
MSSQL also has its own non-standard TOP clause.
Oracle also supports a non-standard row number.
Filtering (WHERE)
DBMSs also vary in how values are matched for filtering.
Unlike most DBMSs, SQLite will allow you to use an alias from the
SELECT clause in the WHERE clause, which contradicts the standard clause
order.
Case Sensitivity
This is discussed earlier.
String Comparisons
In standard SQL, trailing spaces are ignored for string comparisons, presumably
to accommodate CHAR padding. More technically, the shorter string is right-
padded with spaces to the length of the longer one.
PostgreSQL, SQLite, and Oracle ignore this standard, so trailing spaces are
significant. MSSQL and MySQL/MariaDB follow the standard.
Dates
Oracle’s date handling is mentioned earlier. This will affect how you express a
date comparison.
There is also the issue of how the ??/??/???? format is interpreted. It may be
the US m/d/y format, but it may not. It is always better to avoid this format.
Wildcard Matching
All DBMSs support the basic wildcard matches with the LIKE operator.
PostgreSQL doesn’t support wildcard matching with non-string data.
As for extensions to wildcards
PostgreSQL, MySQL/MariaDB, and Oracle support regular expressions, but
each one handles them differently.
MSSQL doesn’t support regular expressions, but does have a simple set of
extensions to basic wildcards.
SQLite has recently added native support for regular expressions
(www.sqlite.org/releaselog/3_36_0.html).
Calculations
Basic calculations are the same, with the exceptions as follows. Functions, on
the other hand, are very different.
Of the DBMSs listed earlier, SQLite has the fewest built-in functions,
assuming that the work would be done mostly in the host application.
Arithmetic
Arithmetic is mostly the same, but working with integers varies slightly:
PostgreSQL, SQLite, and MSSQL will truncate integer division; Oracle and
MySQL/MariaDB will return a decimal.
Oracle doesn’t support the remainder operator (%), but uses the mod()
function.
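SQLite’s truncating integer division is easy to demonstrate (via Python’s sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute("SELECT 7 / 2, 7 % 2, 7 / 2.0").fetchone()
print(row)  # (3, 1, 3.5) -- integer division truncates in SQLite
```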
Formatting Functions
Generally, they’re all different. However
PostgreSQL and Oracle both have the to_char() function.
Microsoft has the format() function.
SQLite only has a format() function, a.k.a. printf(), and is the most
limited.
MySQL/MariaDB has various specialized functions.
Date Functions
Again, all of the DBMSs have different sets of functions. However, for simple
offsetting
PostgreSQL and Oracle have the INTERVAL type, which makes adding to and
subtracting from a date simple.
MySQL/MariaDB has something similar, but less flexible.
MSSQL relies on the dateadd() function.
SQLite doesn’t do dates, but it has some functions to process date-like strings.
Concatenation
This is a basic operation for strings:
MSSQL uses the non-standard + operator to concatenate. Others use the ||
operator, with the partial exception of MySQL/MariaDB as follows.
MySQL/MariaDB has two modes. In traditional mode, there is no
concatenation operator; in ANSI mode, the standard || operator works.
All DBMSs support the non-standard concat() function, with the exception of
SQLite.
Oracle treats the NULL string as an empty string. This is particularly
noticeable when concatenating with a NULL which doesn’t produce a NULL
result as expected.
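In DBMSs that follow the standard (SQLite here, via Python’s sqlite3), concatenating with NULL does produce NULL, and coalesce() is the usual guard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'a' || NULL is NULL; coalesce() substitutes an empty string first.
row = conn.execute(
    "SELECT 'a' || NULL, 'a' || coalesce(NULL, '')").fetchone()
print(row)  # (None, 'a')
```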
String Functions
Suffice to say that although there are some SQL standards
Most DBMSs ignore them.
Those that support them also have additional variations and functions.
This means that these examples will all require special attention.
Generally, the DBMSs support the popular string functions, such as
lower() and upper() but sometimes in different ways. There is, however, a
good deal of overlap between DBMSs.
Joining Tables
Everything is mostly the same. However
Oracle doesn’t permit the keyword AS for table aliases.
SQLite doesn’t support the RIGHT join.
Nobody knows why.
Aggregate Functions
The basic aggregate functions are generally the same between DBMSs. Some of
the more esoteric functions are not so well supported by some.
PostgreSQL, Oracle, and MSSQL support an optional explicit GROUP BY
() clause, which doesn’t actually do anything important, but helps to illustrate a
point. The others don’t.
Manipulating Data
All DBMSs support the same basic operations. However
Oracle doesn’t support INSERT with multiple values without a messy workaround,
though there is talk of supporting it soon. MSSQL supports them, but only to a
limit of 1000 rows; there is a less messy workaround for this limit.
The rest are OK.
Manipulating Tables
All DBMSs support the same basic operations, but each one has its own
variation on actual data type and autogenerated numbers.
Among other things, this means that the create table scripts are not cross-
DBMS compatible.
MSSQL has a quirk regarding unique indexes on nullable columns, for which
there is a workaround.
For PostgreSQL, you may need to reset the sequence after adding rows with
explicit id values. For example:
SELECT setval(pg_get_serial_sequence('customers',
'id'), max(id))
FROM customers;
For Oracle, alter the table you’ve just added data to. For example:
In MSSQL, some CREATE statements need to be in a batch of their own:
GO
CREATE something AS
...
;
GO
That doesn’t include CREATE TABLE, which will happily mix in with the
rest of the statements.
SELECT
id, customers.*
FROM customers;
For the LIMIT ... OFFSET ... clause, which fetches a limited number
of rows, the OFFSET value cannot be calculated.
As you know, in a GROUP BY query, you can only select aggregates or
what’s in the GROUP BY clause. With MariaDB/MySQL, that won’t work if the
GROUP BY column is calculated. You really should be using CTEs anyway.
Don’t forget to set your session to ANSI mode to have MariaDB/MySQL
behave like the rest in the use of double quotes and concatenation:
In particular, we’ll assume that, apart from the basics, you know about
collections such as tuples, lists, and dictionaries. Of course, you’ll be familiar
with creating a function. You’ll also need to know about installing and
importing modules.
Before any of this can happen, however, you will probably have to install the
appropriate module.
Once you’ve done that, we’ll go through the following steps:
1.
Import the database module.
2.
Make a connection to the database and store the connection object and a
corresponding cursor object.
3.
Run your SQL and process the results.
4.
Close the connection.
A connection object represents a connection to the database, and you can use
it to manage your database session.
More importantly, a cursor object is what you’ll use to send SQL to the
database and to send and receive the data involved. The connection object also
has some data manipulation methods, but what they really do is create a cursor
and pass on the rest of the work to a cursor.
# MariaDB/MySQL
pip3 install mysql-connector-python
# PostgreSQL
pip3 install psycopg2-binary
# Oracle
pip3 install oracledb
The module for the preceding MariaDB and MySQL is the same. However,
there is a dedicated MariaDB module if you need more specialized features.
# MSSQL (Windows)
pip3 install pyodbc
import pyodbc
print(pyodbc.drivers())
You’ll see a collection of one or more drivers. The one you want will be
something like
The command is too long to fit on this page. You should enter the command on
one line, with no break or spaces in the URL.
Once you’ve got Homebrew installed, you can use it to install the correct driver
for MSSQL:
Again, the command is too long to fit. You can write it on two lines as long as the
first line ends with a backslash; otherwise, write it on one line without the
backslash.
But wait, there’s more. You then need to install the next part, at the same time
accepting the license agreement:
Now, you can install the module. You may have trouble installing it simply,
especially if you’re using an M1 Macintosh, so it’s safer to run this:
After this, you will need to get the name of the driver.
In Python, run the following:
import pyodbc
print(pyodbc.drivers())
You’ll see a collection of one or more drivers. The one you want will be
something like
Creating a Connection
Overall, to make a connection and cursor to the database, your code will look
something like this:
import dbmodule
connection = dbmodule.connect(...)
cursor = connection.cursor()
connection.close()
where dbmodule is the relevant module for the DBMS. Specifically, for the
various DBMSs, the code will be as follows.
Connecting to SQLite
The relevant module for SQLite is called sqlite3. After importing the
module, you need to make the connection to the database.
SQLite databases are in simple files. You’ll find there are no further
credentials to worry about, since that’s supposed to be handled in the host
application. All you need to do is to reference the file.
To connect to SQLite
import sqlite3
connection = sqlite3.connect(file) # path name of
the file
cursor = connection.cursor()
The file string is the full or relative path name of the SQLite file.
Connecting to MSSQL
The module for MSSQL is called pyodbc. In principle, it can be used for any
database which supports ODBC.
A connection in MSSQL can be a string with all of the connection details.
This string is called a DSN—a Data Source Name. However, for readability and
maintainability, it’s easier to add the details as separate function parameters. In
general, it looks like this:
import pyodbc
connection = pyodbc.connect(
driver='ODBC Driver 18 for SQL Server',
TrustServerCertificate='yes',
server='...',
database='bookshop',
uid='...',
pwd='...'
)
cursor = connection.cursor()
server='...,1432'
Connecting to MariaDB/MySQL
The relevant module to connect to MariaDB/MySQL is called
mysql.connector. To connect to the database, you will need to indicate
which server and database, as well as your username and password:
import mysql.connector
connection = mysql.connector.connect(
user='...',
password='...',
host='...',
database='bookshop'
)
cursor = connection.cursor()
The host is typically the IP address of the database server. The standard
port number is 3306. If you need to change the port number, you can add it as
another parameter: port=3305.
Connecting to PostgreSQL
The module to connect to PostgreSQL is called psycopg2. To connect to the
database, you will need to indicate which server and database, as well as your
username and password:
import psycopg2
connection = psycopg2.connect(
database='...',
user='...',
password='...',
host='...'
)
cursor = connection.cursor()
The host is typically the IP address of the database server. The standard
port number is 5432. If you need to change the port number, you can add it as
another parameter: port=5433.
Connecting to Oracle
The module to connect to Oracle is called oracledb. To connect to the
database, you will need to indicate which server and database, as well as your
username and password:
import oracledb
connection = oracledb.connect(
user='...',
password='...',
host='...',
service_name='...'
)
cursor = connection.cursor()
The host is typically the IP address of the database server. The standard
port number is 1521. If you need to change the port number, you can add it as
another parameter: port=1522.
cursor.execute(sql)
Before we process the data, we’ll want to get a list of column names. This
information is available in the cursor.description object. The
cursor.description object is a tuple of tuples, one for each column. The
data inside each of the tuples may include information about the type of data, but
that’s not available for all DBMS connections.
The column names will be the first item of each tuple. We can gather the
names using a list comprehension:
This adds the first member of each tuple to the columns list.
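Filled in, the comprehension takes the first member of each tuple in cursor.description (run here against SQLite just to have live data; any DB-API cursor works the same way):

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('SELECT 1 AS id, 2.5 AS total')

# cursor.description is a tuple of tuples; item [0] is the column name
columns = [description[0] for description in cursor.description]
print(columns)  # ['id', 'total']
```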
The data from the SELECT statement will be available from the cursor
object. The object includes methods to fetch one or more rows, but can also be
iterated to fetch the rows.
For each row in the cursor, you can combine the column names with the row values:
zip(columns, row)
Here, the result will be a collection of tuples, with the first member being a column name and the second member being a corresponding value from the row. Technically, it's not a collection but an iterator, which is close enough for the next step.
Our next step will be to turn that into a dictionary object, using the first
member of each tuple as keys for the second member of the tuple.
This will produce a set of dictionary objects:
data = []
for row in cursor:
    data.append(dict(zip(columns, row)))
print(data)
connection.close()
import ...
connection = ... . connect(...)
data = []
for row in cursor:
    data.append(dict(zip(columns, row)))
print(data)
connection.close()
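A runnable version of that skeleton, using SQLite and a made-up books table (the table and its rows here are for illustration only):

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE books(id INTEGER, title TEXT)')
cursor.executemany('INSERT INTO books VALUES(?,?)',
                   [(1, 'Emma'), (2, 'Dracula')])

cursor.execute('SELECT id, title FROM books')
# gather the column names, then build one dictionary per row
columns = [description[0] for description in cursor.description]
data = []
for row in cursor:
    data.append(dict(zip(columns, row)))
print(data)  # [{'id': 1, 'title': 'Emma'}, {'id': 2, 'title': 'Dracula'}]
connection.close()
```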
That will work, but it's too hard-coded to be useful. Instead, we'll get the customer id from the user.
If we pasted that value directly into the sql string, a malicious user could enter something like
42 OR 1=1
and change the meaning of the query. Instead, we keep a placeholder in the SQL string and pass the value separately, as a tuple:
(customerid,)
Remember that a tuple with a single value requires a comma at the end.
The code should now look like
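As a sketch of the idea, using SQLite's ? placeholder (pyodbc also uses ?; mysql.connector and psycopg2 use %s, and oracledb uses :1). The customers table and its columns here are made up for illustration:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE customers(id INTEGER, givenname TEXT)')
cursor.execute('INSERT INTO customers VALUES(?,?)', (42, 'Alice'))

customerid = 42  # in the real script this would come from the user
sql = 'SELECT givenname FROM customers WHERE id=?'
cursor.execute(sql, (customerid,))  # note the single-value tuple
print(cursor.fetchone()[0])  # Alice
```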
We’ll get to those strings in a moment. Before we do, we need to look out for
the new sale id.
In SQL, there are two main methods of getting a newly generated id:
Return it from the INSERT statement.
Fetch it in a separate step.
The first method is better, but isn’t supported by all DBMSs at this stage.
We’ll need to take that into account with the first SQL string.
The other thing is that we’ll include placeholders in these strings. That’s not
strictly necessary at this point, since we’re not including user input. However,
it’s safer and makes adding the values easier.
To make the code a little more reusable, we’ll wrap it inside a function:
def addsale(customerid, items, date):
    insertsale = '...'   # Add new sale
    insertitems = '...'  # Add sale items with books
    updateitems = '...'  # Update sale items with book prices
    updatesale = '...'   # Update sale with total

    return saleid
multiline = '''
Multi
Line
String
'''
The other thing is whether you use single or double quotes. Many developers
use double quotes both for single-line strings and multiline strings. In this
appendix, we’re using single quotes. It doesn’t matter, as long as you’re
consistent.
SQL Strings for PostgreSQL
PostgreSQL can return the new id from a RETURNING clause in the INSERT
statement. Later, we’ll fetch that value.
The strings look like this:
insertsale = '''
INSERT INTO sales(customerid, ordered)
VALUES(%s,%s) RETURNING id;
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES(%s,%s,%s);
'''
updateitems = '''
UPDATE saleitems
SET price=(SELECT price FROM books WHERE books.id=saleitems.bookid)
WHERE saleid=%s;
'''
updatesale = '''
UPDATE sales
SET total=(SELECT sum(price*quantity) FROM saleitems WHERE saleid=%s)
WHERE id=%s;
'''
SQL Strings for SQLite
SQLite uses the ? placeholder; recent versions also support the RETURNING clause. The strings look like this:
insertsale = '''
INSERT INTO sales(customerid, ordered) VALUES(?,?)
RETURNING id;
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES(?,?,?);
'''
updateitems = '''
UPDATE saleitems
SET price=(SELECT price FROM books WHERE books.id=saleitems.bookid)
WHERE saleid=?;
'''
updatesale = '''
UPDATE sales
SET total=(SELECT sum(price*quantity) FROM saleitems WHERE saleid=?)
WHERE id=?;
'''
SQL Strings for MSSQL
MSSQL also uses the ? placeholder, and returns the new id with an OUTPUT clause in the INSERT statement. The strings look like this:
insertsale = '''
INSERT INTO sales(customerid, ordered)
OUTPUT inserted.id VALUES(?,?);
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES(?,?,?);
'''
updateitems = '''
UPDATE saleitems
SET price=(SELECT price FROM books WHERE books.id=saleitems.bookid)
WHERE saleid=?;
'''
updatesale = '''
UPDATE sales
SET total=(SELECT sum(price*quantity) FROM saleitems WHERE saleid=?)
WHERE id=?;
'''
Apart from the OUTPUT clause, these are basically the statements we used
earlier.
SQL Strings for MariaDB/MySQL
MariaDB/MySQL doesn’t return the new id from the INSERT statement, so
we’ll have to get that later using a different technique. The strings look like this:
insertsale = '''
INSERT INTO sales(customerid, ordered)
VALUES(%s,%s);
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES(%s,%s,%s);
'''
updateitems = '''
UPDATE saleitems
SET price=(SELECT price FROM books WHERE books.id=saleitems.bookid)
WHERE saleid=%s;
'''
updatesale = '''
UPDATE sales
SET total=(SELECT sum(price*quantity) FROM saleitems WHERE saleid=%s)
WHERE id=%s;
'''
SQL Strings for Oracle
Oracle uses numbered placeholders, such as :1, and can return the new id with a RETURNING … INTO clause. The strings look like this:
insertsale = '''
INSERT INTO sales(customerid, ordered)
VALUES(:1, to_timestamp(:2,'YYYY-MM-DD HH24:MI:SS'))
RETURNING id INTO :3
'''
insertitems = '''
INSERT INTO saleitems(saleid, bookid, quantity)
VALUES(:1,:2,:3)
'''
updateitems = '''
UPDATE saleitems
SET price=(SELECT price FROM books WHERE books.id=saleitems.bookid)
WHERE saleid=:1
'''
updatesale = '''
UPDATE sales
SET total=(SELECT sum(price*quantity) FROM saleitems WHERE saleid=:1)
WHERE id=:2
'''
Watch out for this quirk: you cannot end the statements with a semicolon! If
you do, you’ll get an error message: SQL command not properly
ended, which is somewhat counterintuitive.
Note that the insertsale string includes an expression with single
quotes. That’s OK if the string is delimited with triple characters. If you’re
writing it on one line, you might need to use double quotes for the string.
To run the insertsale statement, execute it with the customer id and the date:
# Not Oracle
cursor.execute(insertsale, (customerid, date))
For Oracle, you need to define an additional variable to capture the new id:
# Oracle
id = cursor.var(oracledb.NUMBER)
cursor.execute(insertsale, (customerid, date, id))
How we retrieve the new sale id depends on whether it's returned from the INSERT statement or not.
For PostgreSQL and MSSQL, which return a value, you can fetch that value using
# PostgreSQL, MSSQL
saleid = cursor.fetchone()[0]
The fetchone() method returns the next row from the result set as a tuple. Here, we want the first and only item.
For SQLite and MariaDB/MySQL, which don’t return a value, there is a
special lastrowid property:
# SQLite, MariaDB/MySQL
saleid = cursor.lastrowid
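For example, with SQLite (a minimal made-up sales table, just to see lastrowid in action):

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute(
    'CREATE TABLE sales(id INTEGER PRIMARY KEY, customerid INTEGER)'
)
cursor.execute('INSERT INTO sales(customerid) VALUES(?)', (42,))

# the id generated by the INSERT is available on the cursor
saleid = cursor.lastrowid
print(saleid)  # 1
```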
For Oracle, the new sale id is wrapped up in the id variable, and you still need to extract it:
# Oracle
saleid = int(id.getvalue()[0])
Putting that together, here is the variation for each DBMS:
# PostgreSQL, MSSQL
cursor.execute(insertsale, (customerid, date))
saleid = cursor.fetchone()[0]

# SQLite, MariaDB/MySQL
cursor.execute(insertsale, (customerid, date))
saleid = cursor.lastrowid

# Oracle
id = cursor.var(oracledb.NUMBER)
cursor.execute(insertsale, (customerid, date, id))
saleid = int(id.getvalue()[0])

return saleid
Remember not to mess around with indentation. All of the code should be
one level in to be part of the addsale() function.
(
{ 'bookid': 123, 'quantity': 3},
{ 'bookid': 456, 'quantity': 1},
{ 'bookid': 789, 'quantity': 2},
)
Within the function, the tuple will appear in the items variable. We can
iterate through the tuple using the for loop.
In each iteration, we’ll execute the insertitems statement, which inserts
one item at a time. The data will be a tuple with the sale id from the previous
step, as well as the bookid and quantity members of the dictionary object.
The code will look like this:
# cursor.execute
# saleid

for item in items:
    cursor.execute(
        insertitems,
        (saleid, item['bookid'], item['quantity'])
    )

return saleid
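As an aside, not from the original text: DB-API cursors also offer executemany(), which runs one statement over a sequence of parameter tuples. A sketch against SQLite, with a made-up saleitems table:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE saleitems(saleid INTEGER, bookid INTEGER, '
               'quantity INTEGER)')

saleid = 1  # would come from the INSERT step above
items = (
    {'bookid': 123, 'quantity': 3},
    {'bookid': 456, 'quantity': 1},
)
# one executemany() call instead of a Python-level loop
cursor.executemany(
    'INSERT INTO saleitems(saleid, bookid, quantity) VALUES(?,?,?)',
    [(saleid, item['bookid'], item['quantity']) for item in items]
)
cursor.execute('SELECT count(*) FROM saleitems')
print(cursor.fetchone()[0])  # 2
```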
Finally, we run the two UPDATE statements and commit:
cursor.execute(updateitems, (saleid,))
cursor.execute(updatesale, (saleid, saleid))
connection.commit()
The updateitems query needs only the sale id. Even though it’s only one
value, it still needs to be in a tuple, which is why there’s the extra comma at the
end. The updatesale query needs the sale id twice, once for the main query
and once for its subquery.
At the end of the job, you need to commit the transaction, which means to
save the changes permanently in the database. Otherwise, the whole process is a
waste of time.
The function now looks like this:
# cursor.execute
# saleid

for item in items:
    cursor.execute(
        insertitems,
        (saleid, item['bookid'], item['quantity'])
    )

cursor.execute(updateitems, (saleid,))
cursor.execute(updatesale, (saleid, saleid))
connection.commit()

return saleid
from datetime import datetime  # for datetime.now()

addsale(
    42,  # customer id
    (    # items
        { 'bookid': 123, 'quantity': 3},
        { 'bookid': 456, 'quantity': 1},
        { 'bookid': 789, 'quantity': 2},
    ),
    datetime.now()  # current date/time
)
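To see the whole flow end to end, here is a self-contained SQLite sketch of addsale(). The minimal schema below is made up for illustration (the real bookshop tables have more columns), and it uses the lastrowid variant of the function:

```python
import sqlite3
from datetime import datetime

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
# made-up minimal schema, enough to exercise the four statements
cursor.executescript('''
    CREATE TABLE books(id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE sales(id INTEGER PRIMARY KEY, customerid INTEGER,
                       ordered TEXT, total REAL);
    CREATE TABLE saleitems(saleid INTEGER, bookid INTEGER,
                           quantity INTEGER, price REAL);
    INSERT INTO books(id, price) VALUES(123, 10.0), (456, 20.0), (789, 5.0);
''')

insertsale = 'INSERT INTO sales(customerid, ordered) VALUES(?,?);'
insertitems = 'INSERT INTO saleitems(saleid, bookid, quantity) VALUES(?,?,?);'
updateitems = '''UPDATE saleitems SET price=(SELECT price FROM books
    WHERE books.id=saleitems.bookid) WHERE saleid=?;'''
updatesale = '''UPDATE sales SET total=(SELECT sum(price*quantity)
    FROM saleitems WHERE saleid=?) WHERE id=?;'''

def addsale(customerid, items, date):
    cursor.execute(insertsale, (customerid, date))
    saleid = cursor.lastrowid        # SQLite doesn't need RETURNING here
    for item in items:
        cursor.execute(insertitems,
                       (saleid, item['bookid'], item['quantity']))
    cursor.execute(updateitems, (saleid,))
    cursor.execute(updatesale, (saleid, saleid))
    connection.commit()              # save the changes permanently
    return saleid

saleid = addsale(
    42,
    (
        {'bookid': 123, 'quantity': 3},
        {'bookid': 456, 'quantity': 1},
        {'bookid': 789, 'quantity': 2},
    ),
    datetime.now().isoformat()
)
print(saleid)  # 1
# total is 3*10.0 + 1*20.0 + 2*5.0 = 60.0
print(cursor.execute('SELECT total FROM sales WHERE id=?',
                     (saleid,)).fetchone()[0])  # 60.0
```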
Index
A
Aggregate filters
Aggregate functions
basic functions
contexts
count(*) OVER ()
CTE
daily totals vs. grand totals
day-by-day summary
day number
DBMS
descriptions
each day sales
NULL
numerical statistics
OVER ()
percentage symbol
sales totals
strings and dates
total/sum(total) OVER()
weekday, percentage/sorting
Aggregate queries
Aggregate window functions
daily sales view
framing clause
ORDER BY clause
sliding window
daily totals
dates
framing clauses
sliding averages
week averages
Aggregating data
aggregate filter
calculated values
arbitrary strings
CASE statements
CTE
customers
delivery statistics
GROUP BY clause
month name
monthnumber
clause order
distinct values
error message
FROM/WHERE clauses
GROUP BY () clause
group concatenation
grouping sets
CUBE
data
GROUP BY clause
renaming values, Oracle
ROLLUP
sorting results
totals
query
subtotals
UNION clause
CTE
grand total
levels
query
SELECT statements
sorting order
state/customer ids
summaries
virtual table
Aggregating process
Aliases
ALTER TABLE statements
The American Standard Code for Information Interchange (ASCII)
Arithmetic mean
B
Books and authors
BookWorks
Business rules
C
Caching table
Calculations
Calculations in SQL
CASE expression
CASE … END expression
ELSE expression
with NULLs
coalesce()
using aliases
AS keyword
Calculations in SQL
built-in functions
CASE expression
coalesce
short-circuited
uses of CASE
casting
coding languages
data types
date operations
date arithmetic
date extracting in Microsoft SQL
date extracting in PostgreSQL, MariaDB/MySQL and Oracle
date/time, entering and storing
formatting a date
getting, date/time
grouping and sorting, date/time
forms
FROM clause
individual/multiple columns
with NULLs
author names
numeric
See Numeric calculations
ORDER BY clause
SELECT clause
using aliases
alias names
AS is optional
basic SQL clauses
WHERE clause
cast() function
Casting types
CHAR(length)
CHECK constraint
ck_customers_postcode
coalesce() function
Collation
Columns
changing the town
countries table
CREATE VIEW statement
DROP COLUMN
foreign key
old address columns
primary key
SELECT statement
street address column
UPDATE statement
Common Table Expressions (CTEs)
aggregates
duplicate names
most recent sale, per customer
benefits
calculations
monthly totals
price groups
query
sales table
WITH clause
constants
deriving constants
hard-coded
duplicated names
consolidated list
id
info column values
layout
parameter names
phone number
query
results
FROM subquery
hard-coded constants
multiple chain
multiple CTEs
nesting subqueries
parameter names
recursive
See Recursive CTEs
subquery
syntax
uses
variables
virtual table
Computed column/calculated column
creation
data
DBMSs
mini-view
ordered datetime column
read-only virtual column
types
VIRTUAL
Concatenation
Constants
deriving constants
hard-coded
Correlated subquery
countries.sql
CREATE TABLE statement
CROSS JOIN
Cultural notes
address
pattern
postcodes
states
towns
currencies
dates
email addresses
measurements
phone numbers
prices
D
Data
Database
Database design
Database integrity
CHECK constraint
column constraints
domain
familyname
nullable column
ALTER TABLE statement
CHECK constraint
changes in SQLite
DEFAULT value
NOT NULL constraint
standard constraint types
suggestions
table constraint
UNSIGNED INT
Database Management Software (DBMS)
aggregate functions
calculations
arithmetic
concatenation
date functions
formatting functions
SELECT without FROM
string functions
database client
data manipulation
filtering (WHERE)
case sensitivity
dates
string comparisons
wildcard matching
joining tables
MariaDB
primary keys, autoincrementing
quirks and variations
See Quirks and variations
rule
sample database
sorting (ORDER BY)
table literals
table manipulation
triggers
writing code
Database tables
basic principles
changes to table structures
columns
customers table
improved database design
indexes
See Index
table design and columns
See Columns
temporary table
virtual table
well-designed table
Data Definition Language (DDL)
Data Manipulation Language (DML)
Data rules
Data types
aggregate queries
calculating columns
calculating with NULLs
aliases
CASE expression
subqueries
views
date literals
joins
join types
ON clause
syntax
number literals
string literals
Date functions
Deciles
Denormalized data
Design principles
Domains
E
extract() function
F
Foreign key
Formatting functions
Frequency table
G
GROUP BY clause
grouping() function
H
Histograms
I
Index
anonymous index
author’s name
books table
clustered index/index organized table
costs
CREATE INDEX
customers table
HAVING clause
primary key
SELECT clause
UNIQUE column
Unique Index
IN operator
Information
J, K
Joining tables
L
LATERAL JOIN (CROSS APPLY)
adding columns
expression
principle
multiple columns
aggregate query
FROM clause
list of customers
results
SELECT clause
query
SELECT clause
WHERE clause
ltrim() function
M
Many-to-many relationship
associated data
association
associative/bridging table
book genres
book, multiple authors
book table
combination
CTE/aggregate query
data
database
genres table
list of books
result
sales and saleitems tables
SELECT statements
table designing
tables
table structure
UNIQUE constraint
Many-to-many tables
associative table
bookgenres table
book’s id
INNER JOIN
number of rows
results
MariaDB
MariaDB/MySQL
connection
quirks and variations
SQL strings
triggers
variables
Median
Microsoft quirks and variations
Microsoft SQL
connection
module, windows
recursive CTEs
SQL strings
table literals
triggers
variables
Mode
Multiple tables
adding author
adding book
authors table
books table
child table
joins
new book
new sale
addition
process
sale completion
sales items
sales table
parent table
query
Multiple values
bookgenres table
CTE
GROUP BY query
id and title columns
joins
authors/genres tables
bookdetails view
books table
dataset
filtered list
filtering
genre details
genre names
genres table
query
side effects
list
multiple genres, book
principles
SELECT statement
string_agg(column,separator) function
MySQL
N
Natural key
Normalization
Normalized database
Normalized tables
properties
ntile()
cast(… AS int)
customer heights
decile/row_decile
deciles
group size
NULL heights
rank_decile/count_decile
NULLs
NULL strings
Numeric calculations
approximation functions
basic arithmetic
formatting functions
mathematical functions
string
See String calculations
O
ON DELETE CASCADE clause
One-to-many relationship
books and authors
books and authors view
child table/parent table
JOIN
NOT IN(…)
one-to-many joins
books and authors
combinations
FULL JOIN
INNER JOIN
LEFT JOIN
NOT NULL
NULL
options
OUTER JOIN
rows
subquery
unmatched parents
Oracle
uses
One-to-many tables
One-to-maybe relationships
contradiction
customers table
customers
join
LEFT JOIN
secondary table
SELECT *
vip table
VIP columns
One-to-one relationship
Oracle
connection
quirks and variations
SQL strings
triggers
variables
ORDER BY clause
P
Pentiles
percentile_cont() function
Percentiles
Pivoting data
aggregate query
customerdetails view
customerinfo CTE
database tables
definition
design
general rule
grouping
layout
ledger table
multiple CTEs
pivot feature
purpose
separate totals
spreadsheet program
status CTE
testing
total sales
UNPIVOT feature
Pivot tables
advantages
creation
definition
MSSQL/Oracle
raw table data
Planned relationships
PostgreSQL
connection
quirks and variations
SQL strings
triggers
variables
Previous and next rows
comparing sales
daily sales
lag and lead
missing dates
OVER clause
Python
connection
cursor
mysql.connector
oracledb
psycopg2
pyodbc
sqlite3
database connector module
exceptions
shell/command line
fetching database
module
MSSQL module, windows
new sale
addition
code
completion
customerid
methods
sale items
SQL strings
steps
parameters, query
SQL strings
MariaDB/MySQL
MSSQL
Oracle
PostgreSQL
SQLite
triple quote characters
Q
Quirks and variations
MariaDB/MySQL
Microsoft
Oracle
PostgreSQL
R
Ranking functions
basic functions
count(*)
customer heights
dense_rank()
examples
exceptions
expressions
framing clause
ORDER BY value
paging results
CTE
OFFSET … FETCH … clause
pricelist view
prices
PARTITION BY
CASE … END expression
columns
expected order
order date
row_number()
rank()
row_number() function
Recursive CTEs
(cte(n))
daily comparison, missing days
daily_sales view
DBMSs
finding dates
LEFT JOIN
sequence of dates
vars and dates
forms
JOIN, missing values
parts
sequence
adding day
creation
dates
MSSQL
series of number/dates
WHERE clause
traversing hierarchy
cleaner result
employees table
multilevel
single-level
supervisorid column
supervisor’s name
uses
Relational model
Relationships
planned
types
unplanned
S
sales table
Scalar function
SELECT statement
Single value query
SQL
basic SQL
data types
dates
feature
query
quotes
semicolon
writing
SQL clauses
clause order
limiting results
multiple assertions
ORDER BY clause
SELECT clause
sort strings
WHERE clause
wildcard patterns
SQLite
connection
SQL strings
triggers
Standard deviation
Statistics
String calculations
ASCII and Unicode
case sensitivity
CHAR(length)
concatenation
data types for strings
string functions
VARCHAR(length)
String functions
Subqueries
column names
complex query
correlated
cost
definition
expression
FROM clause
NULL
price groups, books
SELECT statement
summarizing table
GROUP BY clause
IN() expression
nested subqueries
non-correlated
ORDER BY clause
SELECT clause
aggregate query
correlated subquery
join
non-correlated subquery
window functions
uses
WHERE clause
aggregates
big spenders
duplicate customers
last order
WHERE EXISTS (…)
correlated subquery
FROM dual
IN() expression
non-correlated subquery
SELECT NULL/SELECT 1/0
testing
T
Table
Table design
constraints
data manipulation statements
foreign key
indexes
set operations
types of data
Table literals
data
anchor member
CTE
recursive member
DBMSs
definition
lookup table
MSSQL
sorting
advantage
data CTE
names
sales per weekday
sequence number
strings
summary CTE
standard notation
statement
string
anchor member
recursive CTE
recursive member
rest
splitting
WHERE rest<>
testing
age calculation
dates CTE
series of dates
virtual table
Table Valued Function (TVF)
DBMSs
definition
Microsoft SQL
PostgreSQL
pricelist()
Temporary table
benefits
creation
database
INSERT … SELECT … statement
query
SELECT statement
TEMP
uses
towns.sql
Towns table
Triggers
activity table
archive table
creation
data, archiving
data deletion
DBMSs
definition
deleted_sales
foreign key
logging table
Logon triggers
MariaDB/MySQL
MSSQL
NULL sales
Oracle
PostgreSQL
pros and cons
rental table
sales table
sales deletion
SQLite
syntax
types
uses
Triggers
U
Unicode
UNIQUE clause
Unplanned relationships
V
Value
Value functions
VARCHAR(length)
Variables
code blocks
DBMSs
definition
function/procedure
MariaDB/MySQL
MSSQL
Oracle
PostgreSQL
purposes
statements
system variables
uses
Variables
Views
aupricelist
benefits
caching data
cascade views
conditions
convenience
CREATE VIEW … AS clause
DBMS
external applications
importance
interface
limitations
materialized views
ORDER BY clause
pricelist view
SELECT *
SELECT statement
syntax
TVF
See Table Valued Function (TVF)
uses
Virtual tables
multiple rows and multiple columns
one column and multiple rows
one row and one column
query
W, X, Y, Z
WHERE clause
Window clauses
Window functions
aggregate windows
ORDER BY clause
OVER () clause
PARTITION BY clause
subtotals
expressions
monthly totals
ordered_month
PARTITION BY multiple columns
PARTITION BY/ORDER BY
syntax
window
Footnotes
1 Microsoft’s comment on semicolons: https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql#transact-sql-syntax-conventions-transact-sql. TLDR: Semicolons are recommended and will be required in the future.
2 Don’t even think about storing passwords simply in a database table. This isn’t the place to discuss how
to manage user data safely, but storing plain passwords is very dangerous and irresponsible.