SQL - 05 - Window Functions
Subqueries
Problem Statement:
You are a Data Analyst at the Food Corporation of India (FCI). You have been tasked with
studying the farmers’ markets, known as mandis.
What if we want to operate on multiple rows but don’t want the records to be
grouped in the output?
SELECT
vendor_id,
MAX(original_price) AS highest_price
FROM farmers_market.vendor_inventory
GROUP BY vendor_id
This returns the highest price per vendor, but GROUP BY collapses each vendor’s rows into
one, so we cannot see which item has that price.
We need a function that lets us rank rows by a value (in our case, ranking products
per vendor by price) while keeping every row: ROW_NUMBER().
SELECT
vendor_id,
market_date,
product_id,
original_price,
ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY original_price
DESC) AS price_rank
FROM farmers_market.vendor_inventory
Syntax breakdown:
● I would interpret the ROW_NUMBER() line as “number the inventory rows per
vendor, sorted by original price, in descending order.”
● OVER() - tells the DBMS to apply the function over a window of rows.
● The part inside the parentheses says how to apply the ROW_NUMBER()
function.
● We’re going to PARTITION BY vendor_id (you can think of this like a GROUP
BY without actually combining the rows, so we’re telling it how to split the rows
into groups without aggregating).
● The ORDER BY indicates how to sort the rows. So, we’ll sort the rows by price,
high to low, within each vendor_id partition, and then number each row as per
their price.
● The highest-priced item per vendor will be assigned row number 1.
Output explanation:
● For each vendor, the products are sorted by original_price, high to low, and
the row numbering column is called price_rank.
● The row numbering restarts when you get to the next vendor_id, so the most
expensive item per vendor has a price_rank of 1.
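The numbering behaviour described above can be tried on a few rows. This is only a minimal sketch: it assumes Python's built-in sqlite3 module (with SQLite 3.25 or newer, which supports window functions) and a small made-up vendor_inventory table standing in for farmers_market.vendor_inventory.

```python
import sqlite3

# Made-up sample rows (an assumption): (vendor_id, product_id, original_price).
sample_rows = [
    (1, 10, 4.00),
    (1, 11, 6.50),
    (1, 12, 2.25),
    (2, 20, 3.00),
    (2, 21, 9.00),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_inventory "
    "(vendor_id INTEGER, product_id INTEGER, original_price REAL)"
)
conn.executemany("INSERT INTO vendor_inventory VALUES (?, ?, ?)", sample_rows)

ranked = conn.execute(
    """
    SELECT
        vendor_id,
        product_id,
        original_price,
        ROW_NUMBER() OVER (
            PARTITION BY vendor_id
            ORDER BY original_price DESC
        ) AS price_rank
    FROM vendor_inventory
    ORDER BY vendor_id, price_rank
    """
).fetchall()

for row in ranked:
    print(row)  # the numbering restarts at 1 for each vendor_id
```

Every input row is still present in the output; only the price_rank column is new.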
Subquery - You can also return the record of the highest-priced item per vendor by
querying the results of the query we’ve just written:
SELECT * FROM
(
SELECT
vendor_id,
market_date,
product_id,
original_price,
ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY original_price DESC)
AS price_rank
FROM farmers_market.vendor_inventory ORDER BY vendor_id) x
WHERE x.price_rank = 1
Query Breakdown
● You’ll notice that the preceding query has a different structure than the queries
we have written so far.
● The concept of subqueries comes up again: there is one query embedded inside
the other. This is also called “querying from a derived table.”
● We’re treating the results of the “inner” SELECT statement like a table, here
given the table alias x, selecting all columns from it, and filtering to only the rows
with a particular ROW_NUMBER.
Why not put the WHERE clause in the main query itself? - Execution Order
● If we didn’t use a subquery and had attempted to filter based on the values in the
price_rank field by adding a WHERE clause to the first query with the
ROW_NUMBER function, we would get an error.
● The price_rank value is unknown at the time the WHERE clause conditions
are evaluated per row because the window functions have not yet had a chance
to check the entire dataset to determine the ranking.
The logical order in which the DBMS evaluates a query is:
● FROM - the database gets the data from the tables in the FROM clause and, if
necessary, performs the JOINs,
● WHERE - the rows are filtered with the conditions specified in the WHERE clause,
● GROUP BY - the rows are grouped by the expressions specified in the GROUP BY
clause,
● Aggregate functions - the aggregate functions are applied to the groups
created in the GROUP BY phase,
● HAVING - the groups are filtered with the given condition,
● Window functions - the window functions are computed over the rows that remain,
● SELECT - the database selects the given columns,
● DISTINCT - repeated values are removed,
● UNION/INTERSECT/EXCEPT - the database applies set operations,
● ORDER BY - the results are sorted,
● OFFSET - the first rows are skipped,
● LIMIT/FETCH/TOP - only the first rows are selected.
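To see the effect of this order, the sketch below tries both approaches on a tiny made-up table. It assumes Python's built-in sqlite3 module (SQLite 3.25+); the exact error text differs between database systems, so the code only records that the direct filter is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_inventory "
    "(vendor_id INTEGER, product_id INTEGER, original_price REAL)"
)
# Made-up sample rows (an assumption), two products for each of two vendors.
conn.executemany(
    "INSERT INTO vendor_inventory VALUES (?, ?, ?)",
    [(1, 10, 4.0), (1, 11, 6.5), (2, 20, 3.0), (2, 21, 9.0)],
)

window_sql = """
SELECT vendor_id, product_id, original_price,
       ROW_NUMBER() OVER (
           PARTITION BY vendor_id ORDER BY original_price DESC
       ) AS price_rank
FROM vendor_inventory
"""

# Attempt 1: filter on price_rank at the same query level.
# WHERE is evaluated before the window function, so this is rejected.
try:
    conn.execute(window_sql + " WHERE price_rank = 1").fetchall()
    direct_filter_failed = False
except sqlite3.Error as exc:
    direct_filter_failed = True
    print("rejected:", exc)

# Attempt 2: wrap the query in a derived table and filter outside it.
top_per_vendor = conn.execute(
    "SELECT * FROM (" + window_sql + ") x "
    "WHERE x.price_rank = 1 ORDER BY x.vendor_id"
).fetchall()
print(top_per_vendor)
```

By the time the outer WHERE runs, the inner query has already finished, so price_rank is an ordinary column.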
Note : You can also use ROW_NUMBER without a PARTITION BY clause to number
every record across the whole result (instead of numbering per partition).
Transition -> The problem with the ROW_NUMBER() output is that rows with the same value
still get different numbers. What if you want the same rank assigned to equal values?
To return all products with the highest price per vendor when more than one product
has that price, use the RANK function.
The RANK function numbers the results just like ROW_NUMBER does, but gives
rows with the same value the same ranking.
Output Breakdown
● Notice that the ranking for vendor_id 1 goes from 1 to 2 to 4, skipping 3. That’s
because there’s a tie for second place, so there’s no third place.
● If you don’t want to skip numbers like this in your ranking when there is a tie, use
the DENSE_RANK function.
● And if you don’t want the ties at all, use the ROW_NUMBER function.
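The three functions can be compared side by side on a tie. The sketch below assumes Python's built-in sqlite3 module (SQLite 3.25+) and four made-up products for one vendor, two of which share a price. A product_id tie-breaker is added to the ROW_NUMBER ordering only so that its output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_inventory "
    "(vendor_id INTEGER, product_id INTEGER, original_price REAL)"
)
# Made-up sample rows (an assumption): products 2 and 3 tie on price.
conn.executemany(
    "INSERT INTO vendor_inventory VALUES (?, ?, ?)",
    [(1, 1, 10.0), (1, 2, 8.0), (1, 3, 8.0), (1, 4, 5.0)],
)

comparison = conn.execute(
    """
    SELECT
        product_id,
        original_price,
        ROW_NUMBER() OVER (
            PARTITION BY vendor_id
            ORDER BY original_price DESC, product_id
        ) AS row_num,
        RANK() OVER (
            PARTITION BY vendor_id ORDER BY original_price DESC
        ) AS price_rank,
        DENSE_RANK() OVER (
            PARTITION BY vendor_id ORDER BY original_price DESC
        ) AS price_dense_rank
    FROM vendor_inventory
    ORDER BY row_num
    """
).fetchall()

for row in comparison:
    print(row)
# (1, 10.0, 1, 1, 1)
# (2, 8.0, 2, 2, 2)
# (3, 8.0, 3, 2, 2)
# (4, 5.0, 4, 4, 3)
```

RANK skips 3 after the tie, DENSE_RANK does not, and ROW_NUMBER never repeats a number.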
The ROW_NUMBER() and RANK() functions can help answer a question that asks
something like
● “What are the top 10 items sold at the farmer’s market, by price?” (by filtering the
results to rows numbered less than or equal to 10).
Transition
But what if you were asked to return the “top tenth” of the inventory when sorted by
price?
● You could start by running a query that used the COUNT() function,
● dividing the number returned by 10,
● then writing another query that numbers the rows, and
● filtering to those with a row number less than or equal to the number you just
determined.
But that isn’t a dynamic solution; you’d have to modify it as the number of rows
in the database changed.
NTILE function
NTILE() is a window function that distributes the rows of an ordered partition into a
specified number of approximately equal groups, or buckets. The buckets are numbered
starting from one, and for each row NTILE() returns the number of the bucket the row
belongs to.
SELECT
vendor_id,
market_date,
product_id,
original_price,
NTILE(10) OVER (ORDER BY original_price DESC) AS price_ntile
FROM farmers_market.vendor_inventory
ORDER BY original_price DESC
● If the number of rows in the results set can be divided evenly, the results will be
broken up into n equally sized groups, labeled 1 to n.
● If they can’t be divided up evenly, some groups will end up with one more row
than others.
● Note that the NTILE is only using the count of rows to split the groups, and
is not using a field value to determine where to make the splits.
● Therefore, it’s possible that two rows with the same value specified in
ORDER BY clause will end up in two different NTILE groups.
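The last two points are easy to check on a small example. This sketch assumes Python's built-in sqlite3 module (SQLite 3.25+), ten made-up prices, and NTILE(4) instead of NTILE(10) so the buckets stay small enough to read.

```python
import sqlite3
from collections import Counter

# Ten made-up prices (an assumption); the value 8.0 appears twice.
prices = [10.0, 9.0, 8.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_inventory (product_id INTEGER, original_price REAL)"
)
conn.executemany(
    "INSERT INTO vendor_inventory VALUES (?, ?)",
    list(enumerate(prices, start=1)),
)

buckets = conn.execute(
    """
    SELECT
        original_price,
        NTILE(4) OVER (ORDER BY original_price DESC) AS price_ntile
    FROM vendor_inventory
    ORDER BY original_price DESC, price_ntile
    """
).fetchall()

# 10 rows cannot be split evenly into 4 buckets: the first two get 3 rows each.
bucket_sizes = Counter(ntile for _, ntile in buckets)
print(dict(bucket_sizes))  # {1: 3, 2: 3, 3: 2, 4: 2}

# The split is made by row count only, so the two 8.0 rows are separated.
buckets_for_tied_price = sorted(n for price, n in buckets if price == 8.0)
print(buckets_for_tied_price)  # [1, 2]
```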
SELECT
vendor_id,
market_date,
product_id,
original_price,
AVG(original_price) OVER (PARTITION BY market_date) AS
average_cost_product_by_market_date
FROM farmers_market.vendor_inventory
Breakdown
● The AVG() function in this query is structured as a window function, meaning it
has “OVER (PARTITION BY __ ORDER BY __)” syntax, so instead of returning a
single row per group with the average for that group, like you would get with
GROUP BY, this function displays the average for each partition in every row
within the partition.
● When you get to a new market_date value in the results dataset, the
average_cost_product_by_market_date value changes.
● Using a subquery, we can filter the results to a single vendor, with vendor_id 8,
and only display products that have prices above the market date’s average
product cost.
● Here we will also format the average_cost_product_by_market_date to two
digits after the decimal point using the ROUND() function:
SELECT * FROM
(
SELECT
vendor_id,
market_date,
product_id,
original_price,
ROUND(AVG(original_price) OVER (PARTITION BY market_date ORDER
BY market_date), 2) AS average_cost_product_by_market_date
FROM farmers_market.vendor_inventory )x
WHERE x.vendor_id = 8
AND x.original_price > x.average_cost_product_by_market_date
ORDER BY x.market_date, x.original_price DESC
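A small run of the same pattern shows that the average is computed over all vendors on a market date before the outer WHERE narrows the output to vendor 8. This is a sketch assuming Python's built-in sqlite3 module (SQLite 3.25+) and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_inventory (vendor_id INTEGER, market_date TEXT, "
    "product_id INTEGER, original_price REAL)"
)
# Made-up sample rows (an assumption). Vendor 7 only contributes to the average.
conn.executemany(
    "INSERT INTO vendor_inventory VALUES (?, ?, ?, ?)",
    [
        (8, "2019-07-03", 1, 2.0),
        (8, "2019-07-03", 2, 4.0),
        (7, "2019-07-03", 3, 3.0),
        (8, "2019-07-06", 1, 1.0),
        (8, "2019-07-06", 2, 3.0),
    ],
)

above_average = conn.execute(
    """
    SELECT * FROM
    (
        SELECT
            vendor_id,
            market_date,
            product_id,
            original_price,
            ROUND(AVG(original_price) OVER (PARTITION BY market_date), 2)
                AS average_cost_product_by_market_date
        FROM vendor_inventory
    ) x
    WHERE x.vendor_id = 8
      AND x.original_price > x.average_cost_product_by_market_date
    ORDER BY x.market_date, x.original_price DESC
    """
).fetchall()

for row in above_average:
    print(row)
# (8, '2019-07-03', 2, 4.0, 3.0)
# (8, '2019-07-06', 2, 3.0, 2.0)
```

The 2019-07-03 average of 3.0 includes vendor 7's product, even though vendor 7 never appears in the output.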
Another Example
● Another use of an aggregate window function is to count how many items are in
each partition.
● Showing that count on every row tells you that the row you’re looking at
represents just one of the products in a counted set:
SELECT
vendor_id,
market_date,
product_id,
original_price,
COUNT(product_id) OVER (PARTITION BY market_date, vendor_id) AS
vendor_product_count_per_market_date
FROM farmers_market.vendor_inventory
ORDER BY vendor_id, market_date, original_price DESC
Output:
● You can see that even if I’m only looking at one row for vendor 7 on July 6, 2019,
I would know that it is one of 4 products that vendor had in their inventory on
that market date.
Adding an ORDER BY inside the window turns an aggregate into a running calculation.
Here, SUM produces a running total of each customer’s spending:
SELECT customer_id,
market_date,
vendor_id,
product_id,
quantity * cost_to_customer_per_qty AS price,
SUM(quantity * cost_to_customer_per_qty) OVER (PARTITION BY
customer_id ORDER BY market_date, transaction_time, product_id) AS
customer_spend_running_total
FROM farmers_market.customer_purchases
● We showed what happens when there is only an ORDER BY clause, and when both
clauses are present.
● What do you expect to happen when there is only a PARTITION BY clause (and
no ORDER BY clause)?
SELECT customer_id,
market_date,
vendor_id,
product_id,
ROUND(quantity * cost_to_customer_per_qty, 2) AS price,
ROUND(SUM(quantity * cost_to_customer_per_qty) OVER (PARTITION BY
customer_id), 2) AS customer_spend_total
FROM farmers_market.customer_purchases
● This version, with no in-partition sorting, calculates the total spent by the
customer and displays that summary total on every row.
● So, without the ORDER BY, the SUM is calculated across the entire partition
instead of as a per-row running total.
● We also added the ROUND() function so this final output displays the prices with
two numbers after the decimal point.
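Both variants can be placed in one query to compare them row by row. The sketch assumes Python's built-in sqlite3 module (SQLite 3.25+) and a made-up, cut-down customer_purchases table with one purchase per customer per market date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases (customer_id INTEGER, market_date TEXT, "
    "quantity INTEGER, cost_to_customer_per_qty REAL)"
)
# Made-up sample rows (an assumption).
conn.executemany(
    "INSERT INTO customer_purchases VALUES (?, ?, ?, ?)",
    [
        (1, "2019-07-03", 2, 1.5),
        (1, "2019-07-06", 1, 4.0),
        (1, "2019-07-10", 3, 1.0),
        (2, "2019-07-03", 1, 2.5),
        (2, "2019-07-06", 2, 2.5),
    ],
)

totals = conn.execute(
    """
    SELECT
        customer_id,
        market_date,
        quantity * cost_to_customer_per_qty AS price,
        -- PARTITION BY + ORDER BY: running total within each customer
        SUM(quantity * cost_to_customer_per_qty) OVER (
            PARTITION BY customer_id ORDER BY market_date
        ) AS customer_spend_running_total,
        -- PARTITION BY only: the same grand total on every row
        SUM(quantity * cost_to_customer_per_qty) OVER (
            PARTITION BY customer_id
        ) AS customer_spend_total
    FROM customer_purchases
    ORDER BY customer_id, market_date
    """
).fetchall()

for row in totals:
    print(row)
# (1, '2019-07-03', 3.0, 3.0, 10.0)
# (1, '2019-07-06', 4.0, 7.0, 10.0)
# (1, '2019-07-10', 3.0, 10.0, 10.0)
# (2, '2019-07-03', 2.5, 2.5, 7.5)
# (2, '2019-07-06', 5.0, 7.5, 7.5)
```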
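Now we can see how SQL can calculate changes over time.
● LAG retrieves data from a row that is a selected number of rows back in the
dataset. You can set the number of rows (offset) to any integer value x to count x
rows backward, following the sort order specified in the ORDER BY section of
the window function.
● In the example below, we partition by vendor_id so that each vendor’s rows are
only compared with that vendor’s own earlier rows.
Syntax (LEAD, covered later, takes the same arguments):
LAG(expr, N, default)
OVER (Window_specification | Window_name)
Parameters used:
● expr - the column or expression whose value is returned.
● N - the number of rows to look back (the offset); it defaults to 1 if omitted.
● default - the value returned when there is no row N rows back; it defaults to
NULL if omitted.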
SELECT
market_date,
vendor_id,
booth_number,
LAG(booth_number,1) OVER (PARTITION BY vendor_id ORDER BY
market_date, vendor_id) AS previous_booth_number
FROM farmers_market.vendor_booth_assignments
ORDER BY market_date, vendor_id, booth_number
The LAG() function returns a value from a row that precedes the current row.
Output:
● In this case, for each vendor_id for each market_date, we’re pulling the
booth_number the vendor had 1 market date in the past.
● The values are all NULL for the first market date because there is no prior market
date to pull values from.
Breakdown:
● To report booth changes, we wrap the query with the LAG function in another
query.
● We can then use the inner query’s results to filter to vendors whose current
booth_number is different from their previous_booth_number. A condition on
x.market_date can be added with AND to restrict the report to one market date:
SELECT * FROM
(
SELECT
market_date,
vendor_id,
booth_number,
LAG(booth_number,1) OVER (PARTITION BY vendor_id ORDER BY
market_date, vendor_id) AS previous_booth_number
FROM farmers_market.vendor_booth_assignments
)x
WHERE x.booth_number <> x.previous_booth_number
ORDER BY x.market_date, x.vendor_id, x.booth_number
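The wrapped LAG query can be tried on a few assignments. The sketch assumes Python's built-in sqlite3 module (SQLite 3.25+) and made-up booth data for two vendors over three market dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_booth_assignments "
    "(vendor_id INTEGER, booth_number INTEGER, market_date TEXT)"
)
# Made-up sample rows (an assumption): each vendor changes booth once.
conn.executemany(
    "INSERT INTO vendor_booth_assignments VALUES (?, ?, ?)",
    [
        (1, 2, "2019-07-03"), (1, 2, "2019-07-06"), (1, 5, "2019-07-10"),
        (2, 7, "2019-07-03"), (2, 8, "2019-07-06"), (2, 8, "2019-07-10"),
    ],
)

lag_sql = """
SELECT
    market_date,
    vendor_id,
    booth_number,
    LAG(booth_number, 1) OVER (
        PARTITION BY vendor_id ORDER BY market_date
    ) AS previous_booth_number
FROM vendor_booth_assignments
"""

# Keep only the rows where the booth differs from the previous market date.
# Rows whose previous booth is NULL are dropped, because NULL <> x is not true.
booth_changes = conn.execute(
    "SELECT * FROM (" + lag_sql + ") x "
    "WHERE x.booth_number <> x.previous_booth_number "
    "ORDER BY x.market_date, x.vendor_id"
).fetchall()

for row in booth_changes:
    print(row)
# ('2019-07-06', 2, 8, 7)
# ('2019-07-10', 1, 5, 2)
```

If first-time assignments should also be reported, an extra `OR x.previous_booth_number IS NULL` condition would include them.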
Question: Let’s say you want to find out if the total sales on each
market date are higher or lower than they were on the previous
market date.
Breakdown (crux: this needs both GROUP BY, for the total sales per day, and a
window function, LAG):
● We are going to use the customer_purchases table from the Farmer’s Market
database, and also add a GROUP BY clause, which the previous examples
did not include.
● The window functions are calculated after the grouping and aggregation occur.
● First, we need to get the total sales per market date, using a GROUP BY and
regular aggregate SUM.
SELECT
market_date,
SUM(quantity * cost_to_customer_per_qty) AS market_date_total_sales
FROM farmers_market.customer_purchases
GROUP BY market_date
Then, we can add the LAG() window function to output the previous market_date’s
calculated sum on each row.
We ORDER BY market_date in the window function to ensure it’s the previous market
date we’re comparing to and not another date.
SELECT
market_date,
SUM(quantity * cost_to_customer_per_qty) AS market_date_total_sales,
LAG(SUM(quantity * cost_to_customer_per_qty), 1) OVER (ORDER BY
market_date) AS previous_market_date_total_sales
FROM farmers_market.customer_purchases
GROUP BY market_date
LEAD works the same way as LAG, but it gets the value from the next row instead of
the previous row (assuming the offset integer is 1). You can set the offset integer to any
value x to count x rows forward, following the sort order specified in the ORDER BY
section of the window function.
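LAG and LEAD over a grouped total can be shown together. This is a sketch assuming Python's built-in sqlite3 module (SQLite 3.25+) and a made-up, cut-down customer_purchases table covering three market dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases "
    "(market_date TEXT, quantity INTEGER, cost_to_customer_per_qty REAL)"
)
# Made-up sample rows (an assumption).
conn.executemany(
    "INSERT INTO customer_purchases VALUES (?, ?, ?)",
    [
        ("2019-07-03", 2, 1.5),
        ("2019-07-03", 1, 2.5),
        ("2019-07-06", 1, 4.0),
        ("2019-07-06", 2, 2.5),
        ("2019-07-10", 3, 1.0),
    ],
)

# GROUP BY runs first, so LAG and LEAD see one row per market_date.
sales_by_date = conn.execute(
    """
    SELECT
        market_date,
        SUM(quantity * cost_to_customer_per_qty) AS market_date_total_sales,
        LAG(SUM(quantity * cost_to_customer_per_qty), 1)
            OVER (ORDER BY market_date) AS previous_market_date_total_sales,
        LEAD(SUM(quantity * cost_to_customer_per_qty), 1)
            OVER (ORDER BY market_date) AS next_market_date_total_sales
    FROM customer_purchases
    GROUP BY market_date
    ORDER BY market_date
    """
).fetchall()

for row in sales_by_date:
    print(row)
# ('2019-07-03', 5.5, None, 9.0)
# ('2019-07-06', 9.0, 5.5, 3.0)
# ('2019-07-10', 3.0, 9.0, None)
```

The first date has no previous row and the last date has no next row, so those cells are NULL (None in Python).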
Rolling Average - Window Frame
What if we want a running (cumulative) total, such as the running costs for each customer?
The same pattern works on any table; here it is applied to a sales table:
SELECT employee,
sale,
date,
SUM(sale) OVER (ORDER BY date) AS cum_sales
FROM sales;
The window frame determines which rows around the current row feed into the
calculation. Its boundaries are written with these keywords:
● UNBOUNDED PRECEDING indicates that the window starts at the first row of
the partition. It is the default frame start.
● CURRENT ROW indicates that the window begins or ends at the current row.
● UNBOUNDED FOLLOWING indicates that the window ends at the last row of
the partition.
Both the ROWS and RANGE clauses limit the rows considered by the window
function within a partition.
The ROWS clause does that quite literally. It specifies a fixed number of
rows that precede or follow the current row, regardless of their value. These
rows are used in the window function.
The RANGE clause, on the other hand, limits the rows logically. It considers rows
based on their ORDER BY value compared to the current row’s value, so rows that
tie with the current row are always included together.
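The difference only becomes visible when the ORDER BY column has ties. The sketch below assumes Python's built-in sqlite3 module (SQLite 3.25+) and a made-up sales table in which two sales share a date. The column names sale_id and sale_date are assumptions; sale_id is added to the ROWS ordering only as a tie-breaker so the result is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER, sale_date TEXT, sale INTEGER)"
)
# Made-up sample rows (an assumption): sales 2 and 3 fall on the same date.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 10),
        (2, "2024-01-02", 20),
        (3, "2024-01-02", 30),
        (4, "2024-01-03", 40),
    ],
)

frames = conn.execute(
    """
    SELECT
        sale_id,
        sale_date,
        sale,
        -- ROWS: counts physical rows up to the current one
        SUM(sale) OVER (
            ORDER BY sale_date, sale_id
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS rows_total,
        -- RANGE: includes every row with the same sale_date as the current row
        SUM(sale) OVER (
            ORDER BY sale_date
            RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS range_total,
        -- Rolling average over the current row and the one before it
        AVG(sale) OVER (
            ORDER BY sale_date, sale_id
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        ) AS rolling_avg
    FROM sales
    ORDER BY sale_id
    """
).fetchall()

for row in frames:
    print(row)
# (1, '2024-01-01', 10, 10, 10, 10.0)
# (2, '2024-01-02', 20, 30, 60, 15.0)
# (3, '2024-01-02', 30, 60, 60, 25.0)
# (4, '2024-01-03', 40, 100, 100, 35.0)
```

For sale 2, ROWS stops at that row (10 + 20), while RANGE also pulls in sale 3 because it has the same date (10 + 20 + 30).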
NTH_VALUE(expression, N)
FROM FIRST
OVER (
partition_clause
order_clause
frame_clause
)
● The NTH_VALUE() function returns the value of expression from the Nth row of
the window frame. If that Nth row does not exist, the function returns NULL. N
must be a positive integer e.g., 1, 2, and 3.
● The FROM FIRST instructs the NTH_VALUE() function to begin calculation at the
first row of the window frame.
Reference: https://dev.mysql.com/blog-archive/mysql-8-0-2-introducing-window-functions/
SELECT
first_name,
department_id,
salary,
NTH_VALUE(first_name, 2) OVER (
PARTITION BY department_id
ORDER BY salary DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) second_highest_paid_employee
FROM
farmers_market.employee;
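A quick run shows both the normal case and the NULL case. The sketch assumes Python's built-in sqlite3 module (SQLite 3.25+) and a made-up employee table. SQLite does not accept the optional FROM FIRST keywords, so they are left out here; counting from the first row is the default behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee "
    "(first_name TEXT, department_id INTEGER, salary INTEGER)"
)
# Made-up sample rows (an assumption): department 2 has a single employee.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ana", 1, 5000), ("Ben", 1, 7000), ("Cy", 1, 6000), ("Dee", 2, 4000)],
)

second_paid = conn.execute(
    """
    SELECT
        first_name,
        department_id,
        salary,
        NTH_VALUE(first_name, 2) OVER (
            PARTITION BY department_id
            ORDER BY salary DESC
            RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS second_highest_paid
    FROM employee
    ORDER BY department_id, salary DESC
    """
).fetchall()

for row in second_paid:
    print(row)
# ('Ben', 1, 7000, 'Cy')
# ('Cy', 1, 6000, 'Cy')
# ('Ana', 1, 5000, 'Cy')
# ('Dee', 2, 4000, None)
```

The frame spans the whole partition, so every row in department 1 sees the same second row. Department 2 has no second row, so the result is NULL.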
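[TODO Question] - use subqueries to find the 2nd highest or 3rd/5th highest item/value.
For example, to get the 3rd highest price, the inner query keeps the 3 highest prices and
the outer query picks the lowest of those (add DISTINCT to the inner SELECT if tied
prices should count once):
SELECT original_price
FROM
(SELECT original_price
FROM farmers_market.vendor_inventory
ORDER BY original_price DESC
LIMIT 3) AS Comp
ORDER BY original_price
LIMIT 1;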