Section I - Setup: 2.1A - Scalar Subqueries
Data Analysis.
Section I - Setup
To set up, you will need two things: SQLiteStudio and the files for this class (which you
likely have already if you are reading this document).
SQLiteStudio can be downloaded from its website:
https://sqlitestudio.pl/index.rvt?act=download
The files for this class can be downloaded here:
https://github.com/thomasnield/oreilly_advanced_sql_for_data
Import the “thunderbird_manufacturing.db” database file, which we will be using for
almost all of the examples.
SELECT *
FROM CUSTOMER_ORDER
WHERE CUSTOMER_ID IN (
    SELECT CUSTOMER_ID
    FROM CUSTOMER
    WHERE STATE = 'TX'
)
Depending on how they are used, subqueries can be more expensive or less expensive than
joins. Subqueries that generate a value for each record tend to be more expensive, like the
example above.
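SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ORDER.CUSTOMER_ID,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
QUANTITY,
avg_qty
FROM CUSTOMER_ORDER
INNER JOIN
(
    SELECT CUSTOMER_ID,
    PRODUCT_ID,
    AVG(QUANTITY) as avg_qty
    FROM CUSTOMER_ORDER
    GROUP BY 1, 2
) cust_avgs
ON CUSTOMER_ORDER.CUSTOMER_ID = cust_avgs.CUSTOMER_ID
AND CUSTOMER_ORDER.PRODUCT_ID = cust_avgs.PRODUCT_ID
-- the same query with the derived table named as a common table expression
WITH cust_avgs AS (
    SELECT CUSTOMER_ID,
    PRODUCT_ID,
    AVG(QUANTITY) as avg_qty
    FROM CUSTOMER_ORDER
    GROUP BY 1, 2
)
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ORDER.CUSTOMER_ID,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
QUANTITY,
avg_qty
FROM CUSTOMER_ORDER
INNER JOIN cust_avgs
ON CUSTOMER_ORDER.CUSTOMER_ID = cust_avgs.CUSTOMER_ID
AND CUSTOMER_ORDER.PRODUCT_ID = cust_avgs.PRODUCT_ID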
For instance, we can create two derived tables “TX_CUSTOMERS” and “TX_ORDERS” but
give them names as common table expressions. Then we can proceed to use those two
derived tables like this.
WITH TX_CUSTOMERS AS
(
SELECT * FROM CUSTOMER
WHERE STATE = 'TX'
),
TX_ORDERS AS
(
SELECT * FROM CUSTOMER_ORDER
WHERE CUSTOMER_ID IN (SELECT CUSTOMER_ID FROM TX_CUSTOMERS)
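)
SELECT * FROM TX_ORDERS INNER JOIN TX_CUSTOMERS
ON TX_ORDERS.CUSTOMER_ID = TX_CUSTOMERS.CUSTOMER_ID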
2.5 - Unions
To simply append two queries (with identical fields) together, put a UNION ALL between
them.
SELECT
'FEB' AS MONTH,
PRODUCT.PRODUCT_ID,
PRODUCT_NAME,
SUM(PRICE * QUANTITY) AS REV
FROM PRODUCT LEFT JOIN CUSTOMER_ORDER
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
WHERE ORDER_DATE BETWEEN '2017-02-01' AND '2017-02-28'
GROUP BY 1, 2, 3
UNION ALL
SELECT
'MAR' AS MONTH,
PRODUCT.PRODUCT_ID,
PRODUCT_NAME,
SUM(PRICE * QUANTITY) AS REV
FROM PRODUCT LEFT JOIN CUSTOMER_ORDER
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
WHERE ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31'
GROUP BY 1, 2, 3
Using UNION instead of UNION ALL will remove duplicates, which should not be necessary in
this case.
You should strive not to use unions as they often encourage bad, inefficient SQL. Strive to
use CASE statements or other tools instead. In this example, it would have been better to do
this:
SELECT
CASE
WHEN ORDER_DATE BETWEEN '2017-02-01' AND '2017-02-28' THEN 'FEB'
WHEN ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31' THEN 'MAR'
END AS MONTH,
PRODUCT.PRODUCT_ID,
PRODUCT_NAME,
SUM(PRICE * QUANTITY) AS REV
FROM PRODUCT LEFT JOIN CUSTOMER_ORDER
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
WHERE ORDER_DATE BETWEEN '2017-02-01' AND '2017-03-31'
GROUP BY 1, 2, 3
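To check that a single CASE query can stand in for the two-query UNION ALL, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory tables and rows are made up for illustration (they are not the class database), and the sketch assumes each query is filtered to its date range and grouped by its first three columns.

```python
import sqlite3

# Hypothetical toy data, not the contents of thunderbird_manufacturing.db
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT (PRODUCT_ID INTEGER, PRODUCT_NAME TEXT, PRICE REAL);
CREATE TABLE CUSTOMER_ORDER (PRODUCT_ID INTEGER, ORDER_DATE TEXT, QUANTITY INTEGER);
INSERT INTO PRODUCT VALUES (1, 'Alpha', 2.0), (2, 'Beta', 3.0);
INSERT INTO CUSTOMER_ORDER VALUES
    (1, '2017-02-10', 5), (1, '2017-03-05', 2),
    (2, '2017-03-07', 1), (2, '2017-01-15', 9);
""")

# Two separate monthly queries appended with UNION ALL
union_sql = """
SELECT 'FEB' AS MONTH, PRODUCT.PRODUCT_ID, PRODUCT_NAME,
       SUM(PRICE * QUANTITY) AS REV
FROM PRODUCT LEFT JOIN CUSTOMER_ORDER
  ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
WHERE ORDER_DATE BETWEEN '2017-02-01' AND '2017-02-28'
GROUP BY 1, 2, 3
UNION ALL
SELECT 'MAR' AS MONTH, PRODUCT.PRODUCT_ID, PRODUCT_NAME,
       SUM(PRICE * QUANTITY) AS REV
FROM PRODUCT LEFT JOIN CUSTOMER_ORDER
  ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
WHERE ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31'
GROUP BY 1, 2, 3
"""

# One pass over the data, labeling each row's month with CASE
case_sql = """
SELECT CASE
         WHEN ORDER_DATE BETWEEN '2017-02-01' AND '2017-02-28' THEN 'FEB'
         WHEN ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31' THEN 'MAR'
       END AS MONTH,
       PRODUCT.PRODUCT_ID, PRODUCT_NAME,
       SUM(PRICE * QUANTITY) AS REV
FROM PRODUCT LEFT JOIN CUSTOMER_ORDER
  ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
WHERE ORDER_DATE BETWEEN '2017-02-01' AND '2017-03-31'
GROUP BY 1, 2, 3
"""

# Sort in Python so the comparison does not depend on row order
union_rows = sorted(conn.execute(union_sql).fetchall())
case_rows = sorted(conn.execute(case_sql).fetchall())
print(union_rows == case_rows)
```

On this toy data both queries return the same three rows, but the CASE version reads the joined tables once instead of twice.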
SELECT ORDER_DATE,
group_concat(PRODUCT_ID) as product_ids_ordered
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-02-01' AND '2017-02-28'
GROUP BY ORDER_DATE
Putting the DISTINCT keyword inside group_concat() will concatenate only the distinct product IDs.
SELECT ORDER_DATE,
group_concat(DISTINCT PRODUCT_ID) as product_ids_ordered
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-02-01' AND '2017-02-28'
GROUP BY ORDER_DATE
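The effect of DISTINCT inside group_concat() is easy to see on a tiny table. The sketch below uses Python's built-in sqlite3 module with made-up rows (an assumption for illustration, not the class data). Because SQLite does not promise the order of concatenated values, the results are compared after sorting.

```python
import sqlite3

# Hypothetical toy data: product 1 is ordered twice on the same day
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER_ORDER (ORDER_DATE TEXT, PRODUCT_ID INTEGER)")
conn.executemany(
    "INSERT INTO CUSTOMER_ORDER VALUES (?, ?)",
    [("2017-02-01", 1), ("2017-02-01", 1), ("2017-02-01", 3), ("2017-02-02", 2)],
)

# Every product ID per date, duplicates included
all_ids = dict(conn.execute("""
SELECT ORDER_DATE, group_concat(PRODUCT_ID)
FROM CUSTOMER_ORDER
GROUP BY ORDER_DATE
""").fetchall())

# Only the distinct product IDs per date
distinct_ids = dict(conn.execute("""
SELECT ORDER_DATE, group_concat(DISTINCT PRODUCT_ID)
FROM CUSTOMER_ORDER
GROUP BY ORDER_DATE
""").fetchall())

print(sorted(all_ids["2017-02-01"].split(",")))       # duplicates kept
print(sorted(distinct_ids["2017-02-01"].split(",")))  # duplicates removed
```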
Exercise 1
Bring in all records for CUSTOMER_ORDER, but also bring in the total quantities ever ordered
for each given PRODUCT_ID and CUSTOMER_ID.
ANSWER:
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ORDER.CUSTOMER_ID,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
QUANTITY,
sum_qty
FROM CUSTOMER_ORDER
INNER JOIN
(
SELECT CUSTOMER_ID,
PRODUCT_ID,
SUM(QUANTITY) AS sum_qty
FROM CUSTOMER_ORDER
GROUP BY 1, 2
) total_ordered
ON CUSTOMER_ORDER.CUSTOMER_ID = total_ordered.CUSTOMER_ID
AND CUSTOMER_ORDER.PRODUCT_ID = total_ordered.PRODUCT_ID
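-- the same answer using a common table expression
WITH total_ordered AS (
    SELECT CUSTOMER_ID,
    PRODUCT_ID,
    SUM(QUANTITY) AS sum_qty
    FROM CUSTOMER_ORDER
    GROUP BY 1, 2
)
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ORDER.CUSTOMER_ID,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
QUANTITY,
sum_qty
FROM CUSTOMER_ORDER
INNER JOIN total_ordered
ON CUSTOMER_ORDER.CUSTOMER_ID = total_ordered.CUSTOMER_ID
AND CUSTOMER_ORDER.PRODUCT_ID = total_ordered.PRODUCT_ID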
3.1 - Literals
Literals are characters in a regex pattern that have no special function, and represent that
character verbatim. Numbers and letters are literals. For example, the regex TX will match
the string TX:
SELECT 'TX' REGEXP 'TX' --true
Some characters, as we have seen, have special functionality in a regex. If you want to use
these characters as literals, sometimes you have to escape them with a preceding \. These
characters include:
[\^$.|?*+()
So to qualify a U.S. currency amount, you will need to escape the dollar sign $ and the
decimal point .
SELECT '$181.12' REGEXP '\$181\.12' -- true
We can also specify certain characters, and they don’t necessarily have to be ranges:
SELECT 'A6' REGEXP '[ATUX][469]' --true
SELECT 'B8' REGEXP '[ATUX][469]' --false
Conversely, we can negate a set of characters by starting the range with ^:
SELECT 'A6' REGEXP '[^ATUX][^469]' --false
SELECT 'B8' REGEXP '[^ATUX][^469]' --true
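If you want to try these patterns outside SQLiteStudio, note that Python's built-in sqlite3 module does not ship a REGEXP implementation, so one has to be registered first. The sketch below is one way to do that, assuming Python's re module is an acceptable stand-in for the regex engine (flavors can differ slightly between tools). The helper names regexp and matches are made up for this example.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value)
    if value is None:
        return None
    return 1 if re.search(pattern, value) else 0

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

def matches(value, pattern):
    # Hypothetical helper: runs SELECT value REGEXP pattern and returns a bool
    row = conn.execute("SELECT ? REGEXP ?", (value, pattern)).fetchone()
    return bool(row[0])

literal_match = matches("TX", "TX")                # plain literals
escaped_match = matches("$181.12", r"\$181\.12")   # escaped $ and .
set_match = matches("A6", "[ATUX][469]")           # both characters are in their sets
set_miss = matches("B8", "[ATUX][469]")            # B is not in [ATUX]
negated_miss = matches("A6", "[^ATUX][^469]")      # A is excluded by [^ATUX]
negated_match = matches("B8", "[^ATUX][^469]")     # neither character is excluded

print(literal_match, escaped_match, set_match, set_miss, negated_miss, negated_match)
```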
3.2 - Anchoring
If you don’t want partial matches but rather full matches, you have to anchor the beginning
and end of the string with ^ and $ respectively.
For instance, [A-Z][A-Z] would qualify with SMU. This is because it found two alphabetic
characters within those three characters.
SELECT 'SMU' REGEXP '[A-Z][A-Z]' --true
If you don’t want that, you will need to qualify start and end anchors, which effectively
demands a full match when both are used:
SELECT 'SMU' REGEXP '^[A-Z][A-Z]$' --false
You can also anchor to just the beginning or end of the string to check, for instance, if a
string starts with a number followed by an alphabetic character:
SELECT '9FN' REGEXP '^[0-9][A-Z]' --true
SELECT 'RFX' REGEXP '^[0-9][A-Z]' --false
3.3 - Repeaters
Sometimes we simply want to qualify a repeated pattern in our regular expression. For
example, this is redundant:
SELECT 'ASU' REGEXP '^[A-Z][A-Z][A-Z]$' --true
We can instead explicitly identify in curly brackets we want to repeat that alphabetic
character 3 times, by following it with a {3}.
SELECT 'ASU' REGEXP '^[A-Z]{3}$' --true
We can also specify a min and max number of repetitions, such as a minimum of 2 but max
of 3.
SELECT 'ASU' REGEXP '^[A-Z]{2,3}$' --true
SELECT 'TX' REGEXP '^[A-Z]{2,3}$' --true
Leaving the second argument blank will result in only requiring a minimum of repetitions:
SELECT 'A' REGEXP '^[A-Z]{2,}$' --false
SELECT 'ASDIKJFSKJJNXVNJGTHEWIROQWERKJTX' REGEXP '^[A-Z]{2,}$' --true
To allow 1 or more repetitions, use the +. This will qualify with 1 or more alphanumeric
characters.
SELECT 'ASDFJSKJ4892KSFJJ2843KJSNBKW' REGEXP '^[A-Z0-9]+$' --true
SELECT 'SDFJSDKJF/&SSDKJ$#SDFKSDFKJ' REGEXP '^[A-Z0-9]+$' --false
To allow 0 or 1 repetitions (an optional character), follow the item with a ?. The pattern below
allows two alphabetic characters to be preceded by a digit, but does not require the digit:
SELECT '9FX' REGEXP '^[0-9]?[A-Z]{2}$' --true
SELECT 'AX' REGEXP '^[0-9]?[A-Z]{2}$' --true
You can use several repeaters for different clauses in a regex. Below, we qualify a string of
alphabetic characters, a dash - followed by a string of numbers, and then another - with a
string of alphabetic characters.
SELECT 'ASJSDFH-32423522-HUETHNB' REGEXP '^[A-Z]+-[0-9]+-[A-Z]+$' --true
3.4 Wildcards
A dot . represents any character, even whitespace.
SELECT 'A-3' REGEXP '...' --true
You can also use it with repeaters to create broad wildcards for any number of characters.
SELECT 'A-3' REGEXP '.{3}' --true
SELECT 'A-3' REGEXP '.+' --true
SELECT 'A-3' REGEXP '.*' --true
SELECT 'WHISKY/23482374/ZULU/23423234/FOXTROT/6453' REGEXP '^([A-Z]+/[0-9]+/?)+$' --true
The pipe | operator functions as an alternation operator, or effectively an OR. It allows you
to qualify any number of regular expressions where at least one of them must be true:
SELECT 'ALPHA' REGEXP '^(FOXTROT|ZULU|ALPHA|TANGO)$' --true
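The anchoring, repeater, grouping, and alternation patterns from this section can be spot-checked with Python's re module. This is only a sketch: it assumes re.search behaves like the REGEXP operator for these particular patterns, which holds for the simple syntax used here but is not guaranteed for every regex flavor. The qualifies helper is a made-up name.

```python
import re

def qualifies(value, pattern):
    # re.search allows partial matches, so ^ and $ are needed to force a full match
    return re.search(pattern, value) is not None

checks = {
    "partial": qualifies("SMU", "[A-Z][A-Z]"),          # two letters found inside three
    "anchored": qualifies("SMU", "^[A-Z][A-Z]$"),       # exactly two letters required
    "range_repeat": qualifies("TX", "^[A-Z]{2,3}$"),    # two or three letters
    "min_repeat": qualifies("A", "^[A-Z]{2,}$"),        # at least two letters
    "optional": qualifies("AX", "^[0-9]?[A-Z]{2}$"),    # leading digit is optional
    "grouped": qualifies(
        "WHISKY/23482374/ZULU/23423234/FOXTROT/6453",
        "^([A-Z]+/[0-9]+/?)+$",                         # repeated letters/digits groups
    ),
    "alternation": qualifies("ALPHA", "^(FOXTROT|ZULU|ALPHA|TANGO)$"),
}

for name, result in checks.items():
    print(name, result)
```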
EXERCISE
Find all customers with an address ending in “Blvd” or “St”:
SELECT * FROM CUSTOMER
WHERE ADDRESS REGEXP '.*(Blvd|St)$'
FROM CUSTOMER_ORDER
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
AND ORDER_DATE = '2017-03-01'
GROUP BY 1, 2
Note you can also create a temporary (or permanent) table from a SELECT query. This is
helpful to persist expensive query results and reuse them multiple times during a session.
Doing this in SQLite is a bit more convoluted than on other platforms:
CREATE TEMP TABLE ORDER_TOTALS_BY_DATE AS
WITH ORDER_TOTALS_BY_DATE AS (
SELECT ORDER_DATE,
SUM(QUANTITY) AS TOTAL_QUANTITY
FROM CUSTOMER_ORDER
GROUP BY 1
)
SELECT * FROM ORDER_TOTALS_BY_DATE
FROM CUSTOMER_ORDER
INNER JOIN CUSTOMER
ON CUSTOMER_ORDER.CUSTOMER_ID = CUSTOMER.CUSTOMER_ID
If you expect records to possibly get multiple discounts, then sum the discounts and GROUP
BY everything else:
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_NAME,
STATE,
ORDER_DATE,
CUSTOMER_ORDER.PRODUCT_ID,
PRODUCT_NAME,
PRODUCT_GROUP,
QUANTITY,
PRICE,
SUM(DISCOUNT_RATE) as TOTAL_DISCOUNT_RATE,
PRICE * (1 - SUM(DISCOUNT_RATE)) AS DISCOUNTED_PRICE
FROM CUSTOMER_ORDER
INNER JOIN CUSTOMER
ON CUSTOMER_ORDER.CUSTOMER_ID = CUSTOMER.CUSTOMER_ID
SELECT o1.CUSTOMER_ORDER_ID,
o1.CUSTOMER_ID,
o1.PRODUCT_ID,
o1.ORDER_DATE,
o1.QUANTITY,
o2.QUANTITY AS PREV_DAY_QUANTITY
FROM CUSTOMER_ORDER o1
LEFT JOIN CUSTOMER_ORDER o2
ON o1.CUSTOMER_ID = o2.CUSTOMER_ID
AND o1.PRODUCT_ID = o2.PRODUCT_ID
AND o2.ORDER_DATE = date(o1.ORDER_DATE, '-1 day')
Note if you want to get the previous quantity ordered for that record’s given CUSTOMER_ID
and PRODUCT_ID, even if it wasn’t strictly the day before, you can use a subquery instead
that qualifies previous dates and orders them descending. Then you can use LIMIT 1 to
grab the most recent at the top.
SELECT ORDER_DATE,
PRODUCT_ID,
CUSTOMER_ID,
QUANTITY,
(
SELECT QUANTITY
FROM CUSTOMER_ORDER c2
WHERE c1.ORDER_DATE > c2.ORDER_DATE
AND c1.PRODUCT_ID = c2.PRODUCT_ID
AND c1.CUSTOMER_ID = c2.CUSTOMER_ID
ORDER BY ORDER_DATE DESC
LIMIT 1
) as PREV_QTY
FROM CUSTOMER_ORDER c1
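The difference between the two approaches shows up when a customer skips a day. The sketch below runs both queries with Python's built-in sqlite3 module on a handful of made-up orders (an assumption for illustration; the class database has more columns and rows).

```python
import sqlite3

# Hypothetical toy orders: customer 1 orders product 1 on Mar 1, 2 and 5
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER_ORDER (
    CUSTOMER_ORDER_ID INTEGER, CUSTOMER_ID INTEGER,
    PRODUCT_ID INTEGER, ORDER_DATE TEXT, QUANTITY INTEGER);
INSERT INTO CUSTOMER_ORDER VALUES
    (1, 1, 1, '2017-03-01', 10),
    (2, 1, 1, '2017-03-02', 20),
    (3, 1, 1, '2017-03-05', 30),
    (4, 2, 1, '2017-03-05', 5);
""")

# Self join: only finds an order placed exactly one day earlier
prev_day = conn.execute("""
SELECT o1.CUSTOMER_ORDER_ID, o2.QUANTITY AS PREV_DAY_QUANTITY
FROM CUSTOMER_ORDER o1
LEFT JOIN CUSTOMER_ORDER o2
  ON o1.CUSTOMER_ID = o2.CUSTOMER_ID
 AND o1.PRODUCT_ID = o2.PRODUCT_ID
 AND o2.ORDER_DATE = date(o1.ORDER_DATE, '-1 day')
ORDER BY o1.CUSTOMER_ORDER_ID
""").fetchall()

# Correlated subquery: finds the most recent earlier order, whatever its date
prev_any = conn.execute("""
SELECT c1.CUSTOMER_ORDER_ID,
       (
           SELECT c2.QUANTITY
           FROM CUSTOMER_ORDER c2
           WHERE c1.ORDER_DATE > c2.ORDER_DATE
             AND c1.PRODUCT_ID = c2.PRODUCT_ID
             AND c1.CUSTOMER_ID = c2.CUSTOMER_ID
           ORDER BY c2.ORDER_DATE DESC
           LIMIT 1
       ) AS PREV_QTY
FROM CUSTOMER_ORDER c1
ORDER BY c1.CUSTOMER_ORDER_ID
""").fetchall()

print(prev_day)
print(prev_any)
```

Order 3 has no order on the day before it, so the self join leaves its previous quantity empty, while the subquery reaches back to order 2.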
4.5B Recursive Self Joins
At some point in your career, you may encounter a table that is inherently designed to be
self-joined. For instance, run this query:
SELECT * FROM EMPLOYEE
This is a table containing employee information, including their manager via a MANAGER_ID
field. Here is a sample of the results below.
This MANAGER_ID points to another EMPLOYEE record. If you want to bring in Daniel and his
superior’s information, this isn’t hard to do with a self join.
SELECT e1.FIRST_NAME,
e1.LAST_NAME,
e1.TITLE,
e2.FIRST_NAME AS MANAGER_FIRST_NAME,
e2.LAST_NAME AS MANAGER_LAST_NAME
FROM EMPLOYEE e1 INNER JOIN EMPLOYEE e2
ON e1.MANAGER_ID = e2.ID
WHERE e1.FIRST_NAME = 'Daniel'
But what if you wanted to display the entire hierarchy above Daniel? Well shoot, this is
hard because now you have to do several self joins to daisy-chain your way to the top. What
makes this even harder is you don’t know how many self joins you will need to do. For
cases like this, it can be helpful to leverage recursive queries.
A recursion is a special type of common table expression (CTE). Typically, you “seed” a
starting value and then use UNION or UNION ALL to append the results of a query that uses
each “seed”, and the result becomes the next seed.
In this case, we will use a RECURSIVE common table expression to seed Daniel’s ID, and then
append each MANAGER_ID of each EMPLOYEE_ID that matches the seed. This will give a set of
ID’s for employees hierarchical to Daniel. We can then use these ID’s to navigate Daniel’s
hierarchy via JOINS, IN, or other SQL operators.
-- generates a list of employee ID's hierarchical to Daniel
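The seed-and-append idea described above can be sketched end to end with Python's built-in sqlite3 module. Everything about the data here is an assumption for illustration: the EMPLOYEE rows are invented, the table is cut down to three columns, and Daniel's ID is looked up by first name rather than hard-coded.

```python
import sqlite3

# Hypothetical hierarchy: Ava <- Ben <- Cara <- Daniel, and Ava <- Eli
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE (ID INTEGER, FIRST_NAME TEXT, MANAGER_ID INTEGER);
INSERT INTO EMPLOYEE VALUES
    (1, 'Ava', NULL),
    (2, 'Ben', 1),
    (3, 'Cara', 2),
    (4, 'Daniel', 3),
    (5, 'Eli', 1);
""")

hierarchy = conn.execute("""
WITH RECURSIVE hierarchy_of_daniel(x) AS (
    -- seed: start with Daniel's own ID
    SELECT ID FROM EMPLOYEE WHERE FIRST_NAME = 'Daniel'
    UNION ALL
    -- append the manager of each ID produced by the previous step
    SELECT MANAGER_ID
    FROM hierarchy_of_daniel INNER JOIN EMPLOYEE
      ON EMPLOYEE.ID = hierarchy_of_daniel.x
)
SELECT ID, FIRST_NAME
FROM EMPLOYEE
WHERE ID IN (SELECT x FROM hierarchy_of_daniel)
ORDER BY ID
""").fetchall()

print(hierarchy)
```

The recursion stops on its own once it reaches the top employee, because a NULL MANAGER_ID matches no EMPLOYEE.ID in the next step. Eli reports to Ava but is not above Daniel, so he is left out.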
Recursive queries are a bit tricky to get right, but practice them if you deal frequently with
hierarchical records. You will likely use them with a specific part of the hierarchy in focus
(e.g. Daniel’s superiors). It’s harder to show the hierarchy for everyone at once, but there
are ways. For instance, you can put a RECURSIVE operation in a subquery and use
GROUP_CONCAT.
SELECT e1.* ,
(
    WITH RECURSIVE hierarchy_of(x) AS (
        SELECT e1.ID
        UNION ALL -- append each manager ID recursively
        SELECT MANAGER_ID
        FROM hierarchy_of INNER JOIN EMPLOYEE
        ON EMPLOYEE.ID = hierarchy_of.x -- employee ID must equal previous recursion
    )
    SELECT GROUP_CONCAT(x) FROM hierarchy_of
) AS HIERARCHY_IDS
FROM EMPLOYEE e1
Note recursive queries also can be used to improvise a set of consecutive values without
creating a table. For instance, we can generate a set of consecutive integers. Here is how
you create a set of integers from 1 to 1000.
WITH RECURSIVE my_integers(x) AS (
SELECT 1
UNION ALL
SELECT x + 1
FROM my_integers
WHERE x < 1000
)
SELECT * FROM my_integers
You can apply the same concept to generate a set of chronological dates. This recursive
query will generate all dates from today to ‘2030-12-31’:
WITH RECURSIVE my_dates(x) AS (
SELECT date('now')
UNION ALL
SELECT date(x, '+1 day')
FROM my_dates
WHERE x < '2030-12-31'
)
SELECT * FROM my_dates
FROM CUSTOMER_ORDER
GROUP BY 1, 2
We should use a cross join to resolve this problem. For instance, we can leverage a CROSS
JOIN query to generate every possible combination of PRODUCT_ID and CUSTOMER_ID.
SELECT
CUSTOMER_ID,
PRODUCT_ID
FROM CUSTOMER
CROSS JOIN PRODUCT
In this case we should bring in CALENDAR_DATE and cross join it with PRODUCT_ID to get
every possible combination of calendar date and product. Note the CALENDAR_DATE comes
from the CALENDAR table, which acts as a simple list of consecutive calendar dates. Note we
could also have used a recursive query, as shown in the previous example, to generate the
dates. We’ll stick with a simple table instead for now in case you are not comfortable with
recursion yet. We should filter the calendar to only a date range of interest, such as 2017-01-01
through 2017-03-31.
SELECT
CALENDAR_DATE,
PRODUCT_ID
FROM PRODUCT
CROSS JOIN CALENDAR
WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
Then we can LEFT JOIN to our previous query to get every product quantity sold by
calendar date, even if there were no orders that day:
SELECT CALENDAR_DATE,
all_combos.PRODUCT_ID,
TOTAL_QTY
FROM
(
SELECT
CALENDAR_DATE,
PRODUCT_ID
FROM PRODUCT
CROSS JOIN CALENDAR
WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
) all_combos
LEFT JOIN
(
SELECT ORDER_DATE,
PRODUCT_ID,
SUM(QUANTITY) as TOTAL_QTY
FROM CUSTOMER_ORDER
GROUP BY 1, 2
) totals
ON all_combos.CALENDAR_DATE = totals.ORDER_DATE
AND all_combos.PRODUCT_ID = totals.PRODUCT_ID
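To see what the cross join buys you, the sketch below runs the same shape of query with Python's built-in sqlite3 module on a tiny, made-up data set: two products, three calendar dates, and three orders. These tables are assumptions for illustration and hold only the columns the query needs.

```python
import sqlite3

# Hypothetical toy data: 2 products x 3 dates = 6 combinations
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT (PRODUCT_ID INTEGER);
CREATE TABLE CALENDAR (CALENDAR_DATE TEXT);
CREATE TABLE CUSTOMER_ORDER (ORDER_DATE TEXT, PRODUCT_ID INTEGER, QUANTITY INTEGER);
INSERT INTO PRODUCT VALUES (1), (2);
INSERT INTO CALENDAR VALUES ('2017-01-01'), ('2017-01-02'), ('2017-01-03');
INSERT INTO CUSTOMER_ORDER VALUES
    ('2017-01-01', 1, 5), ('2017-01-01', 1, 7), ('2017-01-03', 2, 4);
""")

daily_totals = conn.execute("""
SELECT CALENDAR_DATE, all_combos.PRODUCT_ID, TOTAL_QTY
FROM (
    -- every date paired with every product
    SELECT CALENDAR_DATE, PRODUCT_ID
    FROM PRODUCT CROSS JOIN CALENDAR
    WHERE CALENDAR_DATE BETWEEN '2017-01-01' AND '2017-03-31'
) all_combos
LEFT JOIN (
    -- totals only exist for dates that actually had orders
    SELECT ORDER_DATE, PRODUCT_ID, SUM(QUANTITY) AS TOTAL_QTY
    FROM CUSTOMER_ORDER
    GROUP BY 1, 2
) totals
  ON all_combos.CALENDAR_DATE = totals.ORDER_DATE
 AND all_combos.PRODUCT_ID = totals.PRODUCT_ID
ORDER BY 1, 2
""").fetchall()

for row in daily_totals:
    print(row)
```

All six date and product combinations come back. The ones with no orders carry a NULL total (None in Python), which coalesce(TOTAL_QTY, 0) would turn into 0.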
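-- the same query using common table expressions
WITH all_combos AS (
    SELECT
    CALENDAR_DATE,
    PRODUCT_ID
    FROM PRODUCT
    CROSS JOIN CALENDAR
    WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
),
totals AS (
    SELECT ORDER_DATE,
    PRODUCT_ID,
    SUM(QUANTITY) as TOTAL_QTY
    FROM CUSTOMER_ORDER
    GROUP BY 1, 2
)
SELECT CALENDAR_DATE,
all_combos.PRODUCT_ID,
TOTAL_QTY
FROM all_combos LEFT JOIN totals
ON all_combos.CALENDAR_DATE = totals.ORDER_DATE
AND all_combos.PRODUCT_ID = totals.PRODUCT_ID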
GROUP BY 1, 2, 3, 4
Exercise 4
For every CALENDAR_DATE and CUSTOMER_ID, show the total QUANTITY ordered for the date
range of 2017-01-01 to 2017-03-31:
ANSWER:
SELECT CALENDAR_DATE,
all_combos.CUSTOMER_ID,
coalesce(TOTAL_QTY, 0) AS TOTAL_QTY
FROM
(
SELECT
CALENDAR_DATE,
CUSTOMER_ID
FROM CUSTOMER
CROSS JOIN CALENDAR
WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
) all_combos
LEFT JOIN
(
SELECT ORDER_DATE,
CUSTOMER_ID,
SUM(QUANTITY) as TOTAL_QTY
FROM CUSTOMER_ORDER
GROUP BY 1, 2
) totals
ON all_combos.CALENDAR_DATE = totals.ORDER_DATE
AND all_combos.CUSTOMER_ID = totals.CUSTOMER_ID
WITH all_combos AS (
SELECT
CALENDAR_DATE,
CUSTOMER_ID
FROM CUSTOMER
CROSS JOIN CALENDAR
WHERE CALENDAR_DATE BETWEEN '2017-01-01' and '2017-03-31'
),
totals AS (
SELECT ORDER_DATE,
CUSTOMER_ID,
SUM(QUANTITY) as TOTAL_QTY
FROM CUSTOMER_ORDER
GROUP BY 1, 2
)
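SELECT CALENDAR_DATE,
all_combos.CUSTOMER_ID,
coalesce(TOTAL_QTY, 0) AS TOTAL_QTY
FROM all_combos LEFT JOIN totals
ON all_combos.CALENDAR_DATE = totals.ORDER_DATE
AND all_combos.CUSTOMER_ID = totals.CUSTOMER_ID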
Section V - Windowing
Windowing functions allow you to create contextual aggregations in ways much more
flexible than GROUP BY. Many major database platforms support windowing functions.
SQLite versions before 3.25.0 do not support windowing functions, so we are going to use
PostgreSQL. While PostgreSQL is free and open-source, there are a few steps in getting it set
up. Therefore, to save time, we are going to use Rextester, a web-based client that can run
PostgreSQL queries.
http://rextester.com/l/postgresql_online_compiler
In the resources for this class, you should find a “customer_order.sql” file which can be
opened with any text editor. Inside you will see some SQL commands to create and
populate a CUSTOMER_ORDER table and then SELECT all the records from it. Copy/Paste the
contents to Rextester and then click the “Run it (F8)” button.
Notice it will create the table and populate it, and the final SELECT query will execute and
display the results. Note that the table is not persisted after the operation finishes, so you
will need to precede each SELECT exercise with this table creation and population before
your SELECT.
5.1 PARTITION BY
Sometimes it can be helpful to create a contextual aggregation for each record in a query.
Windowing functions can make this much easier and save us a lot of subquery work.
For instance, it may be helpful to not only get each CUSTOMER_ORDER for the month of
March, but also the maximum quantity that customer purchased for that PRODUCT_ID. We
can do that with an OVER PARTITION BY combined with the MAX() function.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
MAX(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as
MAX_PRODUCT_QTY_ORDERED
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31'
ORDER BY CUSTOMER_ORDER_ID
Each MAX_PRODUCT_QTY_ORDERED will be the maximum QUANTITY for that given record’s
PRODUCT_ID and CUSTOMER_ID. The WHERE will also limit that scope to orders within March.
You can have multiple windowed fields in a query. Below, we get a MIN, MAX, and AVG for
that given window.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
MIN(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as
MIN_PRODUCT_QTY_ORDERED,
MAX(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as
MAX_PRODUCT_QTY_ORDERED,
AVG(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as
AVG_PRODUCT_QTY_ORDERED
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
You can also mix and match scopes, which is difficult to do with derived tables.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
MIN(QUANTITY) OVER(PARTITION BY PRODUCT_ID, CUSTOMER_ID) as
MIN_PRODUCT_CUSTOMER_QTY_ORDERED,
MIN(QUANTITY) OVER(PARTITION BY PRODUCT_ID) as MIN_PRODUCT_QTY_ORDERED,
MIN(QUANTITY) OVER(PARTITION BY CUSTOMER_ID) as MIN_CUSTOMER_QTY_ORDERED
FROM CUSTOMER_ORDER
When you are declaring your window redundantly, you can reuse it using a WINDOW
declaration, which goes between the WHERE and the ORDER BY.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
MIN(QUANTITY) OVER(w) as MIN_PRODUCT_QTY_ORDERED,
MAX(QUANTITY) OVER(w) as MAX_PRODUCT_QTY_ORDERED,
AVG(QUANTITY) OVER(w) as AVG_PRODUCT_QTY_ORDERED
FROM CUSTOMER_ORDER
WINDOW w AS (PARTITION BY PRODUCT_ID, CUSTOMER_ID)
ORDER BY CUSTOMER_ORDER_ID
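For a self-contained way to try a named window without a PostgreSQL session, the sketch below uses Python's built-in sqlite3 module. It assumes the bundled SQLite is version 3.25.0 or newer (the release that added window functions), and the three orders are made-up data. It writes OVER w, which references the named window directly.

```python
import sqlite3

# Hypothetical toy orders: customer 1 bought product 1 twice, customer 2 once
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER_ORDER (
    CUSTOMER_ORDER_ID INTEGER, CUSTOMER_ID INTEGER,
    PRODUCT_ID INTEGER, QUANTITY INTEGER);
INSERT INTO CUSTOMER_ORDER VALUES
    (1, 1, 1, 10), (2, 1, 1, 30), (3, 2, 1, 5);
""")

window_rows = conn.execute("""
SELECT CUSTOMER_ORDER_ID,
       QUANTITY,
       MIN(QUANTITY) OVER w AS MIN_PRODUCT_QTY_ORDERED,
       MAX(QUANTITY) OVER w AS MAX_PRODUCT_QTY_ORDERED
FROM CUSTOMER_ORDER
WINDOW w AS (PARTITION BY PRODUCT_ID, CUSTOMER_ID)  -- declared once, used twice
ORDER BY CUSTOMER_ORDER_ID
""").fetchall()

for row in window_rows:
    print(row)
```

Every order keeps its own row, and the minimum and maximum are computed across the other orders sharing its PRODUCT_ID and CUSTOMER_ID.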
5.2 ORDER BY
You can also use an ORDER BY in your window to only consider values that comparatively
come before that record.
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
Note you can follow the ORDER BY expression in the window with the DESC keyword to window
in the opposite direction.
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
If you want to incrementally roll the quantity by each row’s physical order (not logical
order by the entire ORDER_DATE), you can use ROWS BETWEEN instead of RANGE BETWEEN.
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_ID,
ORDER_DATE,
PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW) AS ROLLING_QUANTITY
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
Note the AND CURRENT ROW is a default, so you can shorthand it like this:
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE ROWS UNBOUNDED PRECEDING) AS
ROLLING_QUANTITY
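The logical versus physical distinction matters most when several orders share an ORDER_DATE. Here is a minimal sketch with Python's built-in sqlite3 module, assuming SQLite 3.25.0 or newer and three made-up orders. A CUSTOMER_ORDER_ID tie-breaker is added to the ROWS window (an addition for this example) so the physical row order is deterministic.

```python
import sqlite3

# Hypothetical toy orders: two on March 1, one on March 2
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER_ORDER (
    CUSTOMER_ORDER_ID INTEGER, ORDER_DATE TEXT, QUANTITY INTEGER);
INSERT INTO CUSTOMER_ORDER VALUES
    (1, '2017-03-01', 10), (2, '2017-03-01', 20), (3, '2017-03-02', 5);
""")

rolling = conn.execute("""
SELECT CUSTOMER_ORDER_ID,
       -- default frame is RANGE: rows with the same ORDER_DATE are summed together
       SUM(QUANTITY) OVER (ORDER BY ORDER_DATE) AS RANGE_TOTAL,
       -- ROWS: the sum grows one physical row at a time
       SUM(QUANTITY) OVER (ORDER BY ORDER_DATE, CUSTOMER_ORDER_ID
                           ROWS UNBOUNDED PRECEDING) AS ROWS_TOTAL
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
""").fetchall()

for row in rolling:
    print(row)
```

Both March 1 orders show the full day's total under RANGE, while under ROWS the first of them shows only its own quantity.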
In this particular example, you could have avoided using a physical boundary by specifying
your window with an ORDER BY CUSTOMER_ORDER_ID. But we covered the previous strategy anyway
to see how to execute physical boundaries. Here is an excellent overview of windowing
functions and bounds:
http://mysqlserverteam.com/mysql-8-0-2-introducing-window-functions/
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31'
You need to be very careful mixing PARTITION BY with an ORDER BY that uses a physical
boundary! If you sort the results, it can get confusing very quickly because you lose that
physical ordered context.
SELECT CUSTOMER_ORDER_ID,
ORDER_DATE,
CUSTOMER_ID,
PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER(PARTITION BY PRODUCT_ID ORDER BY ORDER_DATE) as
total_qty_for_product
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31'
ORDER BY ORDER_DATE
FROM CUSTOMER_ORDER
ORDER BY CUSTOMER_ORDER_ID
EXERCISE
For the month of March, bring in the rolling sum of quantity ordered (up to each
ORDER_DATE) by CUSTOMER_ID and PRODUCT_ID.
SELECT CUSTOMER_ORDER_ID,
ORDER_DATE,
CUSTOMER_ID,
PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER(PARTITION BY CUSTOMER_ID, PRODUCT_ID ORDER BY ORDER_DATE)
as total_qty_for_customer_and_product
FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2017-03-01' AND '2017-03-31'
ORDER BY CUSTOMER_ORDER_ID
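from sqlalchemy import create_engine, text
engine = create_engine('sqlite:///thunderbird_manufacturing.db')
conn = engine.connect()
stmt = text("SELECT * FROM CUSTOMER")
results = conn.execute(stmt)
for r in results:
    print(r)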
6.1B Using SQL with Python
You can package up interactions with a database into helper functions. Below, we create a
function called get_all_customers() which returns the results as a List of tuples:
from sqlalchemy import create_engine, text
engine = create_engine('sqlite:///thunderbird_manufacturing.db')
conn = engine.connect()
def get_all_customers():
    stmt = text("SELECT * FROM CUSTOMER")
    return list(conn.execute(stmt))
print(get_all_customers())
from sqlalchemy import create_engine, text
engine = create_engine('sqlite:///thunderbird_manufacturing.db')
conn = engine.connect()
def get_all_customers():
    stmt = text("SELECT * FROM CUSTOMER")
    return list(conn.execute(stmt))
def customer_for_id(customer_id):
    stmt = text("SELECT * FROM CUSTOMER WHERE CUSTOMER_ID = :id")
    # bind parameters are passed as a dictionary, which works in SQLAlchemy 1.x and 2.x
    return conn.execute(stmt, {"id": customer_id}).first()
print(customer_for_id(3))
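library(DBI)
library(RSQLite)
db <- dbConnect(SQLite(), dbname='thunderbird_manufacturing.db')
# the query below is an illustrative example
myQuery <- dbSendQuery(db, "SELECT * FROM CUSTOMER")
myData <- dbFetch(myQuery, n = -1)
dbClearResult(myQuery)
print(myData)
remove(myQuery)
dbDisconnect(db)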
You can get detailed information on how to work with R and SQL in the official DBI
documentation:
• DBI Interface: https://cran.r-project.org/web/packages/DBI/index.html
• DBI PDF: https://cran.r-project.org/web/packages/DBI/DBI.pdf
• jOOQ - A more modern (but commercial) ORM that fluently allows working with
databases in a type-safe manner.
• Speedment - Another fast turnaround, fluent API that compiles pure Java code from
table schemas to work with databases.
If you are going to go the vanilla JDBC route, it is a good idea to use a connection pool so
you can persist and reuse several connections safely in a multithreaded environment.
HikariCP is a leading option to achieve this and provides an optimal DataSource
implementation, which is Java’s recommended interface for a database connection pool.
A helpful resource for learning how to work with JDBC is Jenkov’s in-depth tutorial:
http://tutorials.jenkov.com/jdbc/index.html
6.3A - Selecting Data with JDBC and HikariCP
To connect to a database using JDBC and HikariCP, you will need the appropriate JDBC
drivers for your database platform (e.g. SQLite) as well as HikariCP.
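dependencies {
    // "implementation" replaces the older "compile" configuration, which Gradle 7 removed
    implementation 'org.xerial:sqlite-jdbc:3.19.3'
    implementation 'com.zaxxer:HikariCP:2.6.3'
    implementation 'org.slf4j:slf4j-simple:1.7.25'
}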
Below, we create a simple Java application that creates a Hikari data source with a
minimum of 1 connection and a maximum of 5. Then we create a query and loop through
its ResultSet.
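import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class JavaLauncher { // class name is illustrative
    public static void main(String[] args) {
        try {
            // configure a pool holding between 1 and 5 connections
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:sqlite:/c:/git/oreilly_advanced_sql_for_data/thunderbird_manufacturing.db");
            config.setMinimumIdle(1);
            config.setMaximumPoolSize(5);
            HikariDataSource ds = new HikariDataSource(config);
            // borrow a connection from the pool and run the query
            Connection conn = ds.getConnection();
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT * FROM CUSTOMER");
            while (rs.next()) {
                System.out.println(rs.getInt("CUSTOMER_ID") + " " +
                    rs.getString("CUSTOMER_NAME"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}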
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
try {
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:sqlite:/c:/git/oreilly_advanced_sql_for_data/thunderbird_manufacturing.db");
config.setMinimumIdle(1);
config.setMaximumPoolSize(5);
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
System.out.println(rs.getInt("CUSTOMER_ID") + " " +
rs.getString("CUSTOMER_NAME"));
}
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
try {
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:sqlite:/c:/git/oreilly_advanced_sql_for_data/thunderbird_manufacturing.db");
config.setMinimumIdle(1);
config.setMaximumPoolSize(5);
stmt.setString(1,"Kry Kall");
stmt.setString(2,"BETA");
stmt.setBigDecimal(3, BigDecimal.valueOf(35.0));
stmt.executeUpdate();