SQL Window Functions
SQL for Data Science
Learn SQL online at www.DataCamp.com

A window function makes a calculation across multiple rows that are related to
the current row. For example, a window function allows you to calculate:
- Running totals (i.e. sum values from all the rows before the current row)
- 7 day moving averages (i.e. average values from 7 rows before the current row)
- Rankings

> Partition by
We can use PARTITION BY together with OVER to specify the column over
which the aggregation is performed.
Comparing PARTITION BY with GROUP BY, we find the following similarity and
difference:
- Just like GROUP BY, the PARTITION BY subclause of OVER splits the rows into
  as many partitions as there are unique values in a column.
- Unlike GROUP BY, which returns one row per group, a window function keeps
  every row and is calculated for each partition independently. Without the
  PARTITION BY clause, the result set is one single partition.
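To make the similarity and difference between GROUP BY and PARTITION BY concrete, here is a minimal sketch that runs the three cases through Python's built-in sqlite3 module. The three product rows are made up for illustration and are not part of the cheat sheet's dataset, and the alias names are hypothetical. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical sample rows (product_name, model_year, list_price), for illustration only.
products = [
    ("Bike A", 2016, 300.0),
    ("Bike B", 2016, 500.0),
    ("Bike C", 2017, 1000.0),
]

conn = sqlite3.connect(":memory:")  # window functions need SQLite 3.25 or newer
conn.execute(
    "CREATE TABLE products (product_name TEXT, model_year INTEGER, list_price REAL)"
)
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", products)

# GROUP BY collapses each model year into a single row.
grouped = conn.execute("""
    SELECT model_year, AVG(list_price) AS avg_price
    FROM products
    GROUP BY model_year
    ORDER BY model_year
""").fetchall()

# The window versions keep every product row.
# PARTITION BY model_year: one average per model year.
# Empty OVER (): the whole result set is a single partition.
windowed = conn.execute("""
    SELECT
        product_name,
        model_year,
        list_price,
        AVG(list_price) OVER (PARTITION BY model_year) AS avg_price_of_year,
        AVG(list_price) OVER () AS avg_price_overall
    FROM products
    ORDER BY product_name
""").fetchall()
conn.close()

print(grouped)   # one row per model year
for row in windowed:
    print(row)   # one row per product
```

With this sample data, GROUP BY returns two rows (one per model year), while the window query returns all three products, each carrying the average of its own model year next to the overall average.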
[Figure: AGGREGATE FUNCTION compared with WINDOW FUNCTION]

For example, using GROUP BY, we can calculate the average price of bicycles
per model year using the following query.

SELECT
    model_year,
    AVG(list_price) avg_price
FROM products
GROUP BY model_year

What if we want to compare each product's price with the average price of
that year? To do that, we use the AVG() window function and PARTITION BY the
model year.

SELECT
    model_year,
    product_name,
    list_price,
    AVG(list_price) OVER
        (PARTITION BY model_year)
        avg_price
FROM products

> Syntax
Windows can be defined in the SELECT section of the query.

SELECT
    window_function() OVER (
        PARTITION BY partition_expression
        ORDER BY order_expression
        window_frame_extent
    ) AS window_column_alias
FROM table_name

> Example dataset
The product table contains the types of bicycles sold, their model year, and list
price.

product_id  product_name     model_year  list_price
1           Trek 820 - 2016  2016        379.99
The [order] table
The order table contains the order_id and its date.

order_id  order_date
1         2016-01-01T00:00:00.000Z
2         2016-01-01T00:00:00.000Z
3         2016-01-02T00:00:00.000Z
4         2016-01-03T00:00:00.000Z
5         2016-01-03T00:00:00.000Z

The [order_items] table
The order_items table lists the orders of a bicycle store. For each order_id, there
are several products sold (product_id). Each product_id has a discount value.

order_id  product_id  discount
1         20          0.2
1         8           0.07
1         10          0.05
1         16          0.05
1         4           0.2
2         20          0.07

> Accompanying Material
You can use this link (https://bit.ly/3scZtOK) to run any of the queries explained in
this cheat sheet.

> Named windows
To reuse the same window with several window functions, define a named
window using the WINDOW keyword. This appears in the query after the
HAVING section and before the ORDER BY section.

SELECT
    window_function() OVER (window_name)
FROM table_name
[HAVING ...]
WINDOW window_name AS (
    PARTITION BY partition_expression
    ORDER BY order_expression
    window_frame_extent
)
[ORDER BY ...]

> Order by
ORDER BY is a subclause within the OVER clause. ORDER BY changes the basis on
which the function assigns numbers to rows.
It is a must-have for window functions that assign sequences to rows, including
RANK and ROW_NUMBER. For example, if we ORDER BY the expression `price` in
ascending order, then the lowest-priced item will have the lowest rank.
Let's compare the following two queries which differ only in the ORDER BY clause.

SELECT
    product_name,
    list_price,
    RANK() OVER
        (ORDER BY list_price DESC) rank
FROM products

SELECT
    product_name,
    list_price,
    RANK() OVER
        (ORDER BY list_price ASC) rank
FROM products

> Window frame extent
A window frame is the selected set of rows in the partition over which
aggregation will occur. Put simply, it is a set of rows that are somehow
related to the current row.
A window frame is defined by a lower bound and an upper bound relative to
the current row. The lowest possible bound is the first row, which is known as
UNBOUNDED PRECEDING. The highest possible bound is the last row, which is
known as UNBOUNDED FOLLOWING. For example, if we only want to get 5
rows before the current row, then we will specify the range using 5 PRECEDING.

[Figure: window frame around the CURRENT ROW, from UNBOUNDED PRECEDING through N PRECEDING (N ROWS) to M FOLLOWING (M ROWS)]
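As an illustration of window frames, the sketch below computes a running total and a 3-row moving average, the two frame-based calculations mentioned at the start of this cheat sheet. It is only a sketch under stated assumptions: the `daily_sales` table, its columns, and its five rows are hypothetical and not part of the example dataset, and the SQL is run through Python's sqlite3 module (SQLite 3.25 or newer).

```python
import sqlite3

# Hypothetical data for illustration only: one sales amount per day.
daily_sales = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0)]

conn = sqlite3.connect(":memory:")  # window functions need SQLite 3.25 or newer
conn.execute("CREATE TABLE daily_sales (day INTEGER, amount REAL)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)", daily_sales)

query = """
SELECT
    day,
    amount,
    -- Lower bound: first row of the partition. Upper bound: the current row.
    SUM(amount) OVER (
        ORDER BY day
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    -- Lower bound: 2 rows before the current row. Upper bound: the current row.
    AVG(amount) OVER (
        ORDER BY day
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3
FROM daily_sales
ORDER BY day
"""
frame_result = conn.execute(query).fetchall()
conn.close()

for row in frame_result:
    print(row)
```

For the first rows, fewer than 2 preceding rows exist, so the moving average is taken over the rows that are available. A 7-day moving average would use `6 PRECEDING` with the same upper bound, assuming one row per day.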
> Ranking window functions
There are several window functions for assigning rankings to rows. Each of
these functions is used with an ORDER BY subclause within the OVER clause.
The following are the ranking window functions and their description:

ROW_NUMBER()
    Assigns a sequential integer to each row within the partition.
    Row numbers are not repeated within each partition.
RANK()
    Assigns a rank number to each row in the partition.
    Tied values are given the same rank.
PERCENT_RANK()
    Assigns the rank number of each row in the partition as a percentage.
    Tied values are given the same rank.
NTILE(n)
    Distributes the rows of the partition into n buckets.
CUME_DIST()
    Returns the cumulative distribution: the fraction of rows with a value
    less than or equal to that of the current row.

We can use these functions to rank the product according to their prices.

SELECT
    product_name,
    list_price,
    ROW_NUMBER() OVER (ORDER BY list_price) AS row_num,
    RANK() OVER (ORDER BY list_price) AS rank,
    PERCENT_RANK() OVER (ORDER BY list_price) AS pct_rank,
    NTILE(75) OVER (ORDER BY list_price) AS ntile,
    CUME_DIST() OVER (ORDER BY list_price) AS cume_dist
FROM products

> Value window functions
FIRST_VALUE() and LAST_VALUE() return the first and the last value
from an ordered list of rows, where the order is defined by ORDER BY.

FIRST_VALUE(value_to_return) OVER (ORDER BY value_to_order_by)
    Returns the first value in an ordered set of values.
LAST_VALUE(value_to_return) OVER (ORDER BY value_to_order_by)
    Returns the last value in an ordered set of values.
NTH_VALUE(value_to_return, n) OVER (ORDER BY value_to_order_by)
    Returns the nth value in an ordered set of values.

To compare the price of a particular bicycle model with the cheapest (or most
expensive) alternative, we can use the FIRST_VALUE (or LAST_VALUE).

/* Find the difference in price from the cheapest alternative */
SELECT
    product_name,
    list_price,
    FIRST_VALUE(list_price) OVER (
        ORDER BY list_price
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS cheapest_price
FROM products

/* Find the difference in price from the priciest alternative */
SELECT
    product_name,
    list_price,
    LAST_VALUE(list_price) OVER (
        ORDER BY list_price
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS highest_price
FROM products

> Aggregate window functions
Aggregate functions available for GROUP BY, such as COUNT(), MIN(), MAX(),
and AVG(), can also be used as window functions.

COUNT(expression) OVER (PARTITION BY partition_column)
    Count the rows with a non-null expression in the partition.
MIN(expression) OVER (PARTITION BY partition_column)
    Find the minimum of the expression in the partition.
MAX(expression) OVER (PARTITION BY partition_column)
    Find the maximum of the expression in the partition.
AVG(expression) OVER (PARTITION BY partition_column)
    Find the mean (average) of the expression in the partition.

Suppose we want to find the average, maximum and minimum discount for
each product.

SELECT
    order_id,
    product_id,
    discount,
    AVG(discount) OVER (PARTITION BY product_id) AS avg_discount,
    MIN(discount) OVER (PARTITION BY product_id) AS min_discount,
    MAX(discount) OVER (PARTITION BY product_id) AS max_discount
FROM order_items

> LEAD, LAG
LEAD(expression [,offset[,default_value]]) OVER(ORDER BY columns)
    Accesses the value stored in a row after the current row.
LAG(expression [,offset[,default_value]]) OVER(ORDER BY columns)
    Accesses the value stored in a row before the current row.

Both LEAD and LAG take three arguments:
- Expression: the name of the column from which the value is retrieved
- Offset: the number of rows to skip. Defaults to 1
- Default_value: the value returned when there is no row at that offset.
  Defaults to NULL.

With LAG and LEAD, you must specify ORDER BY in the OVER clause.
LEAD and LAG are most commonly used to find the value of a previous row or
the next row. For example, they are useful for calculating the year-on-year
increase of business metrics like revenue.
Here is an example of using LAG to compare this year's sales to last year's.

/* Find the number of orders in a year */
WITH yearly_orders AS (
    SELECT
        year(order_date) AS year,
        COUNT(order_id) AS num_orders
    FROM orders
    GROUP BY year(order_date)
)
/* Compare the number of orders with last year */
SELECT
    year,
    num_orders,
    LAG(num_orders) OVER (ORDER BY year) last_year_order,
    LAG(num_orders) OVER (ORDER BY year) - num_orders diff_from_last_year
FROM yearly_orders

/* Find the number of orders in a year */
WITH yearly_orders AS (
    SELECT
        year(order_date) AS year,
        COUNT(order_id) AS num_orders
    FROM orders
    GROUP BY year(order_date)
)
/* Compare the number of orders with next year */
SELECT
    year,
    num_orders,
    LEAD(num_orders) OVER (ORDER BY year) next_year_order,
    LEAD(num_orders) OVER (ORDER BY year) - num_orders diff_from_next_year
FROM yearly_orders
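The behaviour of LAG and LEAD at the edges of the ordered rows, and the effect of the optional offset and default_value arguments, can be tried out with the following minimal sketch. It runs through Python's sqlite3 module (SQLite 3.25 or newer). The `yearly_orders` rows are hypothetical numbers chosen for illustration, and the column alias `change_from_last_year` is an assumption. Unlike the query in this cheat sheet, the sketch subtracts last year's count from this year's, so a positive number means an increase.

```python
import sqlite3

# Hypothetical yearly order counts, for illustration only.
yearly_orders = [(2016, 600), (2017, 700), (2018, 300)]

conn = sqlite3.connect(":memory:")  # window functions need SQLite 3.25 or newer
conn.execute("CREATE TABLE yearly_orders (year INTEGER, num_orders INTEGER)")
conn.executemany("INSERT INTO yearly_orders VALUES (?, ?)", yearly_orders)

query = """
SELECT
    year,
    num_orders,
    -- No previous row for the first year, so LAG returns NULL there.
    LAG(num_orders) OVER (ORDER BY year) AS last_year_order,
    num_orders - LAG(num_orders) OVER (ORDER BY year) AS change_from_last_year,
    -- offset = 1, default_value = 0: the last year gets 0 instead of NULL.
    LEAD(num_orders, 1, 0) OVER (ORDER BY year) AS next_year_order
FROM yearly_orders
ORDER BY year
"""
lag_lead_result = conn.execute(query).fetchall()
conn.close()

for row in lag_lead_result:
    print(row)
```

The first year has no previous row, so both LAG-based columns come back as NULL (None in Python), while the explicit default of 0 in LEAD replaces the NULL that the last year would otherwise get.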