SQL Window Functions

Window functions perform calculations across rows that are related to the current row, such as running totals, moving averages, and rankings. Like GROUP BY, they aggregate over partitions of rows, but they return a result for every row instead of collapsing each partition into a single row. This makes them useful for comparing a row's value with an aggregate, for example a product's price with the average price of all products from the same model year.

Uploaded by

Clóvis Nóbrega
Copyright: © All Rights Reserved

SQL for Data Science
Learn SQL online at www.DataCamp.com

What are Window Functions?

A window function performs a calculation across multiple rows that are related to the current row. For example, a window function lets you calculate:

- Running totals (i.e. the sum of the values from all rows up to the current row)
- 7-day moving averages (i.e. the average of the values from the 7 rows before the current row)
- Rankings

Like an aggregate function used with GROUP BY, a window function performs an operation across multiple rows. Unlike an aggregate function, a window function does not group rows into one single row: every input row is kept in the output.

[Figure: AGGREGATE FUNCTION vs. WINDOW FUNCTION]

> Example dataset

We will use a dataset on the sales of bicycles as a sample. This dataset includes:

The [product] table

The product table (queried as products in the examples) contains the types of bicycles sold, their model year, and list price.

product_id | product_name                       | model_year | list_price
1          | Trek 820 - 2016                    | 2016       | 379.99
2          | Ritchey Timberwolf Frameset - 2016 | 2016       | 749.99
3          | Surly Wednesday Frameset - 2016    | 2016       | 999.99
4          | Trek Fuel EX 8 29 - 2016           | 2016       | 2899.99
5          | Heller Shagamaw Frame - 2016       | 2016       | 1320.99

> Partition by

We can use PARTITION BY together with OVER to specify the column over which the aggregation is performed.

Comparing PARTITION BY with GROUP BY, we find the following similarity and difference:

- Just like GROUP BY, the PARTITION BY subclause splits the rows into as many partitions as there are unique values in a column.
- However, while GROUP BY collapses each group into a single result row, a window function using PARTITION BY aggregates each partition independently and returns the result on every row of that partition. Without the PARTITION BY clause, the whole result set is one single partition.

For example, using GROUP BY, we can calculate the average price of bicycles per model year using the following query.

SELECT
    model_year,
    AVG(list_price) avg_price
FROM products
GROUP BY model_year

What if we want to compare each product's price with the average price of that year? To do that, we use the AVG() window function and PARTITION BY the model year, as such.

SELECT
    model_year,
    product_name,
    list_price,
    AVG(list_price) OVER
        (PARTITION BY model_year) avg_price
FROM products

Notice how the avg_price of a given year, such as 2018, is exactly the same whether we use the PARTITION BY clause or the GROUP BY clause.

> Syntax

Windows can be defined in the SELECT section of the query.

SELECT
    window_function() OVER (
        PARTITION BY partition_expression
        ORDER BY order_expression
        window_frame_extent
    ) AS window_column_alias
FROM table_name
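As a rough illustration of the syntax template, the sketch below fills in each placeholder (partition_expression, order_expression, window_frame_extent, window_column_alias) with a concrete value. It assumes SQLite 3.25 or newer, driven from Python's built-in sqlite3 module, as the engine; the cheat sheet does not name a database, and other engines may differ in small details. Rows 1 to 5 follow the sample product table, while rows 6 and 7 are hypothetical and exist only to create a second partition.

```python
import sqlite3

# In-memory SQLite database (assumption: SQLite 3.25+ for window functions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER, product_name TEXT, "
    "model_year INTEGER, list_price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        # Rows 1-5 come from the sample product table.
        (1, "Trek 820 - 2016", 2016, 379.99),
        (2, "Ritchey Timberwolf Frameset - 2016", 2016, 749.99),
        (3, "Surly Wednesday Frameset - 2016", 2016, 999.99),
        (4, "Trek Fuel EX 8 29 - 2016", 2016, 2899.99),
        (5, "Heller Shagamaw Frame - 2016", 2016, 1320.99),
        # Rows 6-7 are hypothetical, added only to get a second partition.
        (6, "Hypothetical Bike A - 2017", 2017, 500.00),
        (7, "Hypothetical Bike B - 2017", 2017, 1500.00),
    ],
)

query = """
SELECT
    model_year,
    product_name,
    list_price,
    SUM(list_price) OVER (
        PARTITION BY model_year            -- partition_expression
        ORDER BY list_price                -- order_expression
        ROWS BETWEEN UNBOUNDED PRECEDING
             AND CURRENT ROW               -- window_frame_extent
    ) AS running_total                     -- window_column_alias
FROM products
ORDER BY model_year, list_price
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each output row keeps its own list_price, and running_total restarts at the first row of every model_year partition.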

To reuse the same window with several window functions, define a named window using the WINDOW keyword. This appears in the query after the HAVING section and before the ORDER BY section.

SELECT
    window_function() OVER (window_name)
FROM table_name
[HAVING ...]
WINDOW window_name AS (
    PARTITION BY partition_expression
    ORDER BY order_expression
    window_frame_extent
)
[ORDER BY ...]

> Order by

ORDER BY is a subclause within the OVER clause. ORDER BY changes the basis on which the function assigns numbers to rows.

It is a must-have for window functions that assign sequences to rows, including RANK and ROW_NUMBER. For example, if we ORDER BY the expression `price` in ascending order, then the lowest-priced item will have the lowest rank.

Let's compare the following two queries, which differ only in the ORDER BY clause.

/* Rank price from LOW->HIGH */
SELECT
    product_name,
    list_price,
    RANK() OVER
        (ORDER BY list_price ASC) rank
FROM products

/* Rank price from HIGH->LOW */
SELECT
    product_name,
    list_price,
    RANK() OVER
        (ORDER BY list_price DESC) rank
FROM products

> Accompanying Material

You can use this link (https://bit.ly/3scZtOK) to run any of the queries explained in this cheat sheet.

The example dataset also includes the following two tables.

The [order] table

The order table contains the order_id and its date.

order_id | order_date
1        | 2016-01-01T00:00:00.000Z
2        | 2016-01-01T00:00:00.000Z
3        | 2016-01-02T00:00:00.000Z
4        | 2016-01-03T00:00:00.000Z
5        | 2016-01-03T00:00:00.000Z

The [order_items] table

The order_items table lists the orders of a bicycle store. For each order_id, there are several products sold (product_id). Each product_id has a discount value.

order_id | product_id | discount
1        | 20         | 0.2
1        | 8          | 0.07
1        | 10         | 0.05
1        | 16         | 0.05
1        | 4          | 0.2
2        | 20         | 0.07

> Window frame extent

A window frame is the selected set of rows in the partition over which aggregation will occur. Put simply, it is a set of rows that are somehow related to the current row.

A window frame is defined by a lower bound and an upper bound relative to the current row. The lowest possible bound is the first row of the partition, which is known as UNBOUNDED PRECEDING. The highest possible bound is the last row of the partition, which is known as UNBOUNDED FOLLOWING. For example, if we only want to get 5 rows before the current row, then we will specify the range using 5 PRECEDING.

[Figure: window frame bounds relative to the current row, from the first to the last row of the partition: UNBOUNDED PRECEDING, N PRECEDING (N rows before), CURRENT ROW, M FOLLOWING (M rows after), UNBOUNDED FOLLOWING]
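The frame bounds described under "Window frame extent" can be tried out with a small sketch of the two calculations mentioned at the start of the cheat sheet: a running total and a moving average. The daily_sales table, its columns, and its values are hypothetical, and the moving average spans 3 rows instead of 7 to keep the output short. SQLite 3.25 or newer (via Python's sqlite3 module) is assumed as the engine.

```python
import sqlite3

# In-memory SQLite database (assumption: SQLite 3.25+ for window functions).
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per day with that day's sales amount.
conn.execute("CREATE TABLE daily_sales (day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)],
)

query = """
SELECT
    day,
    amount,
    -- Frame: from the first row of the partition up to the current row.
    SUM(amount) OVER (
        ORDER BY day
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    -- Frame: the 2 rows before the current row plus the current row.
    AVG(amount) OVER (
        ORDER BY day
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3
FROM daily_sales
ORDER BY day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Near the start of the partition the moving-average frame simply contains fewer rows, so the first value averages one row and the second averages two.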

> Ranking window functions

There are several window functions for assigning rankings to rows. Each of these functions requires an ORDER BY sub-clause within the OVER clause.

The following are the ranking window functions and their description:

Function Syntax | Function Description | Additional notes
ROW_NUMBER() | Assigns a sequential integer to each row within the partition of a result set. | Row numbers are not repeated within each partition.
RANK() | Assigns a rank number to each row in a partition. | Tied values are given the same rank. The next rankings are skipped.
DENSE_RANK() | Assigns a rank number to each row in a partition. | Tied values are given the same rank. The next rankings are not skipped.
PERCENT_RANK() | Assigns the rank of each row in a partition as a percentage. | Tied values are given the same rank. Computed as (rank - 1) / (rows in the partition - 1), i.e. the fraction of the other rows that rank below the current row.
NTILE(n_buckets) | Distributes the rows of a partition into a specified number of buckets. | For example, if we perform the window function NTILE(5) on a table with 100 rows, rows 1 to 20 will be in bucket 1, rows 21 to 40 in bucket 2, rows 41 to 60 in bucket 3, et cetera.
CUME_DIST() | The cumulative distribution: the fraction of rows less than or equal to the current row. | It returns a value larger than 0 and at most 1. Tied values are given the same cumulative distribution value.

We can use these functions to rank the products according to their prices.

/* Rank all products by price */
SELECT
    product_name,
    list_price,
    ROW_NUMBER() OVER (ORDER BY list_price) AS row_num,
    DENSE_RANK() OVER (ORDER BY list_price) AS dense_rank,
    RANK() OVER (ORDER BY list_price) AS rank,
    PERCENT_RANK() OVER (ORDER BY list_price) AS pct_rank,
    NTILE(75) OVER (ORDER BY list_price) AS ntile,
    CUME_DIST() OVER (ORDER BY list_price) AS cume_dist
FROM products

> Value window functions

FIRST_VALUE() and LAST_VALUE() retrieve the first and last value respectively from an ordered list of rows, where the order is defined by ORDER BY.

Value window function | Function Description
FIRST_VALUE(value_to_return) OVER (ORDER BY value_to_order_by) | Returns the first value in an ordered set of values.
LAST_VALUE(value_to_return) OVER (ORDER BY value_to_order_by) | Returns the last value in an ordered set of values.
NTH_VALUE(value_to_return, n) OVER (ORDER BY value_to_order_by) | Returns the nth value in an ordered set of values.

To compare the price of a particular bicycle model with the cheapest (or most expensive) alternative, we can use FIRST_VALUE (or LAST_VALUE). The explicit frame matters for LAST_VALUE: with ORDER BY alone, the default frame ends at the current row (and its ties), so LAST_VALUE would not reach the last row of the partition.

/* Compare each price with the cheapest alternative */
SELECT
    product_name,
    list_price,
    FIRST_VALUE(list_price) OVER (
        ORDER BY list_price
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS cheapest_price
FROM products

/* Compare each price with the priciest alternative */
SELECT
    product_name,
    list_price,
    LAST_VALUE(list_price) OVER (
        ORDER BY list_price
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS highest_price
FROM products

> Aggregate window functions

Aggregate functions available for GROUP BY, such as COUNT(), MIN(), MAX(), SUM(), and AVG(), are also available as window functions.

Function Syntax | Function Description
COUNT(expression) OVER (PARTITION BY partition_column) | Counts the number of rows that have a non-null expression in the partition.
MIN(expression) OVER (PARTITION BY partition_column) | Finds the minimum of the expression in the partition.
MAX(expression) OVER (PARTITION BY partition_column) | Finds the maximum of the expression in the partition.
AVG(expression) OVER (PARTITION BY partition_column) | Finds the mean (average) of the expression in the partition.

Suppose we want to find the average, maximum and minimum discount for each product. We can achieve it as such.

SELECT
    order_id,
    product_id,
    discount,
    AVG(discount) OVER (PARTITION BY product_id) AS avg_discount,
    MIN(discount) OVER (PARTITION BY product_id) AS min_discount,
    MAX(discount) OVER (PARTITION BY product_id) AS max_discount
FROM order_items

> LEAD, LAG

The LEAD and LAG functions locate a row relative to the current row.

Function Syntax | Function Description
LEAD(expression [,offset[,default_value]]) OVER (ORDER BY columns) | Accesses the value stored in a row after the current row.
LAG(expression [,offset[,default_value]]) OVER (ORDER BY columns) | Accesses the value stored in a row before the current row.

Both LEAD and LAG take three arguments:

- Expression: the name of the column from which the value is retrieved.
- Offset: the number of rows to skip. Defaults to 1.
- Default_value: the value to be returned when the offset points outside the partition, so that there is no row to read. Defaults to NULL.

With LAG and LEAD, you must specify ORDER BY in the OVER clause.

LEAD and LAG are most commonly used to find the value of a previous row or the next row. For example, they are useful for calculating the year-on-year increase of business metrics like revenue.

Here is an example of using LAG to compare this year's sales to last year's.

/* Find the number of orders in a year */
WITH yearly_orders AS (
    SELECT
        year(order_date) AS year,
        COUNT(DISTINCT order_id) AS num_orders
    FROM sales.orders
    GROUP BY year(order_date)
)
/* Compare this year's sales to last year's */
SELECT
    *,
    LAG(num_orders) OVER (ORDER BY year) last_year_order,
    LAG(num_orders) OVER (ORDER BY year) - num_orders diff_from_last_year
FROM yearly_orders

Similarly, we can make a comparison of each year's orders with the next year's.

/* Find the number of orders in a year */
WITH yearly_orders AS (
    SELECT
        year(order_date) AS year,
        COUNT(DISTINCT order_id) AS num_orders
    FROM sales.orders
    GROUP BY year(order_date)
)
/* Compare the number of orders with next year's */
SELECT
    *,
    LEAD(num_orders) OVER (ORDER BY year) next_year_order,
    LEAD(num_orders) OVER (ORDER BY year) - num_orders diff_from_next_year
FROM yearly_orders
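The optional offset and default_value arguments of LEAD and LAG are not exercised by the queries in this cheat sheet, so the following is a minimal sketch of how they behave. It assumes SQLite 3.25 or newer (via Python's sqlite3 module) as the engine, and it replaces the yearly_orders common table expression with a small table of hypothetical order counts.

```python
import sqlite3

# In-memory SQLite database (assumption: SQLite 3.25+ for window functions).
conn = sqlite3.connect(":memory:")

# Hypothetical yearly order counts standing in for the yearly_orders CTE.
conn.execute("CREATE TABLE yearly_orders (year INTEGER, num_orders INTEGER)")
conn.executemany(
    "INSERT INTO yearly_orders VALUES (?, ?)",
    [(2016, 100), (2017, 150), (2018, 120)],
)

query = """
SELECT
    year,
    num_orders,
    -- Default offset (1) and default_value (NULL).
    LAG(num_orders) OVER (ORDER BY year) AS last_year_orders,
    -- Explicit default_value: 0 when there is no previous row.
    LAG(num_orders, 1, 0) OVER (ORDER BY year) AS last_year_or_zero,
    -- Explicit offset: look 2 rows ahead.
    LEAD(num_orders, 2) OVER (ORDER BY year) AS two_years_ahead,
    -- This year minus last year, so growth is positive.
    num_orders - LAG(num_orders) OVER (ORDER BY year) AS change_from_last_year
FROM yearly_orders
ORDER BY year
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that change_from_last_year subtracts last year from this year, which is the opposite sign of diff_from_last_year in the LAG query above. The first year has no previous row, so the columns without a default_value come back as NULL (None in Python) there.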
