Window Functions
Anil Reddy Chenchu
www.linkedin.com/in/chenchuanil
Window functions perform calculations across a set of table rows that
are related to the current row. Unlike aggregate functions used with
GROUP BY, they do not collapse the result set into a single output row
per group; instead, they return a result for every row in the original
result set. They're invaluable in data analysis and engineering. Let's
break down eight of the most widely used ones with clear examples and
code snippets!
1. ROW_NUMBER()
2. RANK()
3. DENSE_RANK()
4. LEAD()
5. LAG()
6. FIRST_VALUE()
7. LAST_VALUE()
8. NTILE()
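Before going function by function, the contrast with aggregates can be sketched in a few lines. This is a minimal illustration, not part of the original deck: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later, which supports window functions), a four-row sample of the employees data, and AVG as the example aggregate.

```python
import sqlite3

# In-memory database with a small, illustrative slice of the employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 10, 3000), (2, 10, 3200), (5, 20, 2500), (6, 20, 2700)],
)

# Aggregate with GROUP BY: each department collapses into a single row.
grouped = conn.execute("""
    SELECT department_id, AVG(salary) AS dept_avg
    FROM employees
    GROUP BY department_id
    ORDER BY department_id
""").fetchall()

# Window version: the same average is attached to every row; no row is lost.
windowed = conn.execute("""
    SELECT employee_id, salary,
           AVG(salary) OVER (PARTITION BY department_id) AS dept_avg
    FROM employees
    ORDER BY employee_id
""").fetchall()

print(grouped)   # two rows, one per department
print(windowed)  # four rows, one per employee
```

The grouped query returns one row per department, while the windowed query keeps all four employee rows and repeats the department average on each.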
Let me walk you through the tables we
will be working with
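Columns of the employees table: employee_id, first_name, last_name, department_id, salary, hire_date

ROW_NUMBER()
Results, rows 11 to 16 of each query

Without PARTITION BY:
employee_id  department_id  salary   row_num
3            10             3500.00  11
4            10             3500.00  12
11           30             4800.00  13
9            30             5000.00  14
10           30             5000.00  15
12           30             6200.00  16

With PARTITION BY department_id:
employee_id  department_id  salary   row_num
10           30             5000.00  3
12           30             6200.00  4
13           40             2900.00  1
14           40             3100.00  2
15           40             3300.00  3
16           40             3400.00  4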
ROW_NUMBER() (continued): Code Explanation
Let’s break down both queries, focusing on the OVER clause,
PARTITION BY, and the results they generate.
Explanation:
ROW_NUMBER(): This function assigns a unique sequential number
to rows, starting from 1. Rows with equal salaries still receive
different numbers, and their relative order is not guaranteed unless
the ORDER BY includes a tie-breaker column.
Query 1 (without PARTITION BY):
OVER (ORDER BY salary): The OVER clause defines how the window
function operates across the result set. Here, the ORDER BY salary
clause orders the rows by salary, and the ROW_NUMBER() function
assigns row numbers based on this order.
The row numbers are generated across the entire result set,
regardless of any grouping or partitioning (because there is no
PARTITION BY in this query).
Query 2 (with PARTITION BY):
PARTITION BY department_id: This divides (partitions) the result set into subsets based on
department_id. The ROW_NUMBER() function is applied separately to each partition (i.e., each
department), so the numbering restarts at 1 for every department.
ORDER BY salary: Within each partition (department), the rows are ordered by salary, and the
row numbers are assigned accordingly.
Summary of Differences
Query 1: The ROW_NUMBER() function assigns numbers sequentially over the entire dataset,
ordered by salary.
Query 2: The ROW_NUMBER() function assigns numbers separately within each department
(partitioned by department_id) and ordered by salary within each department.
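The two ROW_NUMBER() queries described above can be sketched as follows. This is a hedged reconstruction rather than the deck's original code: it assumes Python's sqlite3 module (SQLite 3.25+), a six-row sample of the employees data, and an extra employee_id tie-breaker in ORDER BY that I added so the numbering of equal salaries is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 10, 3000), (3, 10, 3500), (4, 10, 3500),
     (9, 30, 5000), (11, 30, 4800), (12, 30, 6200)],
)

# Query 1: one numbering sequence over the whole result set.
# employee_id is an added tie-breaker (assumption, not in the deck).
overall = conn.execute("""
    SELECT employee_id, department_id, salary,
           ROW_NUMBER() OVER (ORDER BY salary, employee_id) AS row_num
    FROM employees
    ORDER BY row_num
""").fetchall()

# Query 2: numbering restarts at 1 inside each department.
per_department = conn.execute("""
    SELECT employee_id, department_id, salary,
           ROW_NUMBER() OVER (PARTITION BY department_id
                              ORDER BY salary, employee_id) AS row_num
    FROM employees
    ORDER BY department_id, row_num
""").fetchall()

for row in overall:
    print("overall       ", row)
for row in per_department:
    print("per department", row)
```

Employees 3 and 4 share a salary of 3500 yet still get distinct row numbers, which is the main difference from RANK() and DENSE_RANK().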
RANK()
The RANK function provides a rank for each row within a partition of a result set,
with gaps between the ranks when there are ties. Rows with equal values receive
the same rank, and the next rank skips ahead by the number of tied rows (i.e., gaps are
left in the rankings).
Query 1 (without PARTITION BY):
SELECT
employee_id,
salary,
RANK() OVER (ORDER BY salary) AS rank
FROM employees;

Query 2 (with PARTITION BY):
SELECT
employee_id,
department_id,
salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary) AS rank
FROM employees;
Query 1 result (without PARTITION BY):
employee_id  salary   rank
5            2500.00  1
6            2700.00  2
13           2900.00  3
7            3000.00  4
8            3000.00  4
1            3000.00  4
14           3100.00  7
2            3200.00  8
15           3300.00  9
16           3400.00  10
3            3500.00  11
4            3500.00  11
11           4800.00  13
9            5000.00  14
10           5000.00  14
12           6200.00  16

Query 2 result (with PARTITION BY department_id):
employee_id  department_id  salary   rank
1            10             3000.00  1
2            10             3200.00  2
3            10             3500.00  3
4            10             3500.00  3
5            20             2500.00  1
6            20             2700.00  2
7            20             3000.00  3
8            20             3000.00  3
11           30             4800.00  1
9            30             5000.00  2
10           30             5000.00  2
12           30             6200.00  4
13           40             2900.00  1
14           40             3100.00  2
15           40             3300.00  3
16           40             3400.00  4
RANK() (continued): Code Explanation
Let’s break down the two queries above, one without PARTITION BY and one
with PARTITION BY, focusing on the OVER clause and how the results are generated.
Query 1 (without PARTITION BY):
OVER (ORDER BY salary): With no PARTITION BY, the whole result set is a single window. Rows are
ranked by salary in ascending order, so the lowest salary receives rank 1.
Ties: When two or more rows have the same salary, they receive the same rank, and the next rank
skips numbers to account for the tie. For example, if two rows tie at rank 2, the next row gets rank 4.
Result: The RANK() function assigns ranks based on salary across all employees, with shared
ranks for equal salaries and gaps after each tie.
Query 2 (with PARTITION BY):
PARTITION BY department_id: This divides the result set into partitions (groups) based on
department_id. The RANK() function is applied separately to each partition, so each department
has its own ranking sequence.
ORDER BY salary: Within each partition (department), the rows are ranked based on salary.
Ties: When employees within the same department have the same salary, they share the
same rank, and subsequent ranks skip numbers.
Result: The RANK() function assigns ranks separately within each partition (department), with
the ranking order determined by salary. Ties within the same department receive the same
rank, and the ranks that follow are skipped.
Summary of Differences:
Query 1 (without PARTITION BY): The RANK() function assigns ranks across the
entire result set, based on the order of salaries. Ties are handled by assigning the
same rank to tied rows, and the next rank skips numbers.
Query 2 (with PARTITION BY): The RANK() function assigns ranks within each
department (partition). Each department will have its own independent ranking
sequence, based on salary, with ties causing skipped ranks.
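To make the gaps concrete, here is a small runnable sketch of both RANK() queries. It is illustrative only and assumes Python's sqlite3 module (SQLite 3.25+) and the department 20 and 30 rows of the employees data. The alias salary_rank is my own choice; some engines, such as MySQL 8, treat RANK as a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(5, 20, 2500), (6, 20, 2700), (7, 20, 3000), (8, 20, 3000),
     (9, 30, 5000), (10, 30, 5000), (11, 30, 4800), (12, 30, 6200)],
)

# Query 1: ranks across the entire result set.
overall_ranks = conn.execute("""
    SELECT employee_id, salary,
           RANK() OVER (ORDER BY salary) AS salary_rank
    FROM employees
    ORDER BY salary, employee_id
""").fetchall()

# Query 2: an independent ranking sequence per department.
department_ranks = conn.execute("""
    SELECT employee_id, department_id, salary,
           RANK() OVER (PARTITION BY department_id ORDER BY salary) AS salary_rank
    FROM employees
    ORDER BY department_id, salary, employee_id
""").fetchall()

print([(emp, rnk) for emp, _, rnk in overall_ranks])
print([(emp, dept, rnk) for emp, dept, _, rnk in department_ranks])
```

In the overall ranking, employees 7 and 8 tie at rank 3, so rank 4 is never used. In department 30, employees 9 and 10 tie at rank 2, so employee 12 gets rank 4.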
DENSE_RANK()
The DENSE_RANK function is similar to RANK, but it does not leave gaps between ranking
numbers when there are ties: tied rows share a rank, and the next distinct value gets the next
consecutive rank.
Query 1 (without PARTITION BY):
SELECT
employee_id,
salary,
DENSE_RANK() OVER (ORDER BY salary) AS dense_rank
FROM employees;

Query 2 (with PARTITION BY):
SELECT
employee_id,
department_id,
salary,
DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary) AS dense_rank
FROM employees;
Query 1 result (without PARTITION BY):
employee_id  salary   dense_rank
5            2500.00  1
6            2700.00  2
13           2900.00  3
7            3000.00  4
8            3000.00  4
1            3000.00  4
14           3100.00  5
2            3200.00  6
15           3300.00  7
16           3400.00  8
3            3500.00  9
4            3500.00  9
11           4800.00  10
9            5000.00  11
10           5000.00  11
12           6200.00  12

Query 2 result (with PARTITION BY department_id):
employee_id  department_id  salary   dense_rank
1            10             3000.00  1
2            10             3200.00  2
3            10             3500.00  3
4            10             3500.00  3
5            20             2500.00  1
6            20             2700.00  2
7            20             3000.00  3
8            20             3000.00  3
11           30             4800.00  1
9            30             5000.00  2
10           30             5000.00  2
12           30             6200.00  3
13           40             2900.00  1
14           40             3100.00  2
15           40             3300.00  3
16           40             3400.00  4
DENSE_RANK() (continued): Code Explanation
Let’s break down the two queries above, one without PARTITION BY and one
with PARTITION BY, focusing on the OVER clause and how the results are generated.
Ties: When two or more rows have the same salary, they receive the same rank, but unlike
RANK(), the next rank does not skip numbers. For example, if two rows tie at rank 2, the next
row gets rank 3.
Result: The DENSE_RANK() function assigns ranks based on salary. Rows with the same salary
receive the same rank, and the next rank is sequential (i.e., no ranks are skipped).
Summary of Differences:
Query 1 (without PARTITION BY): The DENSE_RANK() function assigns ranks across the entire result set based
on the salary order. Rows with the same salary receive the same rank, and the next rank is the next consecutive
number (no gaps).
Query 2 (with PARTITION BY): The DENSE_RANK() function assigns ranks within each department (partition).
Each department has its own independent ranking sequence, based on salary, and ranks are sequential without
gaps, even when ties occur.
In summary, DENSE_RANK() behaves like RANK(), but it ensures there are no gaps in the ranking sequence
when ties occur.
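Putting RANK() and DENSE_RANK() side by side in one query makes the difference easy to see. The sketch below is an illustration under assumptions: Python's sqlite3 module (SQLite 3.25+), the department 20 and 30 rows of the employees data, and alias names of my own choosing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(5, 20, 2500), (6, 20, 2700), (7, 20, 3000), (8, 20, 3000),
     (9, 30, 5000), (10, 30, 5000), (11, 30, 4800), (12, 30, 6200)],
)

# Both functions use the same window, so any difference comes from tie handling.
comparison = conn.execute("""
    SELECT employee_id, salary,
           RANK()       OVER (ORDER BY salary) AS salary_rank,
           DENSE_RANK() OVER (ORDER BY salary) AS salary_dense_rank
    FROM employees
    ORDER BY salary, employee_id
""").fetchall()

ranks = [row[2] for row in comparison]
dense_ranks = [row[3] for row in comparison]
print(ranks)        # gaps appear after each tie
print(dense_ranks)  # consecutive numbers, no gaps
```

With two ties in the data, RANK() ends at 8 (the row count) while DENSE_RANK() ends at 6 (the number of distinct salaries).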
Below is the Sales table used in the LEAD, LAG, FIRST_VALUE, and LAST_VALUE examples.
SaleID  SaleDate    SaleAmount
1       2024-08-01  100.00
2       2024-08-02  150.00
3       2024-08-03  200.00
4       2024-08-04  250.00
5       2024-08-05  300.00
LEAD()
The LEAD function is used to access data from the next row in the result set without
the need for self-joins.
Scenario
Let's say you have a sales table, and you want to compare each sale amount with
the next day's sale amount.
SELECT
SaleID,
SaleDate,
SaleAmount,
LEAD(SaleAmount, 1) OVER (ORDER BY SaleDate) AS NextDaySaleAmount
FROM
Sales;
Explanation:
In this example, the LEAD function takes the SaleAmount from the next row (based on SaleDate order)
and places it in the NextDaySaleAmount column for each row. The second argument, 1, is the offset:
how many rows ahead to look. Because the table holds one row per day, the next row is the next day's
sale. For the last row, there is no subsequent row, so the value is NULL.
This is useful when you want to compare a value with the value in the following row in the same result
set.
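The LEAD query above can be run end to end against the five-row Sales table. This sketch assumes Python's sqlite3 module (SQLite 3.25+) and stores SaleDate as ISO text, which sorts chronologically; the column types are my assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (SaleID INTEGER, SaleDate TEXT, SaleAmount REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "2024-08-01", 100.00), (2, "2024-08-02", 150.00),
     (3, "2024-08-03", 200.00), (4, "2024-08-04", 250.00),
     (5, "2024-08-05", 300.00)],
)

lead_rows = conn.execute("""
    SELECT SaleID, SaleDate, SaleAmount,
           LEAD(SaleAmount, 1) OVER (ORDER BY SaleDate) AS NextDaySaleAmount
    FROM Sales
    ORDER BY SaleDate
""").fetchall()

# SQL NULL comes back as Python None for the last row.
next_amounts = [row[3] for row in lead_rows]
print(next_amounts)
```

Each row carries the following day's amount, and the final row has None because nothing follows it.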
LAG()
The LAG function allows you to access data from the previous row in a
result set without using a self-join.
Scenario
Suppose you want to compare each sale amount with the previous sale amount.
For example, in sales analysis, you might want to compute the difference between
consecutive sales to find out how much sales have increased or decreased from the
previous day. The 1 in the query below indicates that we want to look back 1 row.
SELECT
SaleID,
SaleDate,
SaleAmount,
LAG(SaleAmount, 1) OVER (ORDER BY SaleDate) AS PreviousSaleAmount,
SaleAmount - LAG(SaleAmount, 1) OVER (ORDER BY SaleDate) AS SaleDifference
FROM
Sales;
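Running the LAG query on the Sales table shows what the first row looks like when there is no previous row. As before, this is a sketch that assumes Python's sqlite3 module (SQLite 3.25+) and my own choice of column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (SaleID INTEGER, SaleDate TEXT, SaleAmount REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "2024-08-01", 100.00), (2, "2024-08-02", 150.00),
     (3, "2024-08-03", 200.00), (4, "2024-08-04", 250.00),
     (5, "2024-08-05", 300.00)],
)

lag_rows = conn.execute("""
    SELECT SaleID, SaleDate, SaleAmount,
           LAG(SaleAmount, 1) OVER (ORDER BY SaleDate) AS PreviousSaleAmount,
           SaleAmount - LAG(SaleAmount, 1) OVER (ORDER BY SaleDate) AS SaleDifference
    FROM Sales
    ORDER BY SaleDate
""").fetchall()

previous_amounts = [row[3] for row in lag_rows]
differences = [row[4] for row in lag_rows]
print(previous_amounts)
print(differences)
```

The first row has no predecessor, so PreviousSaleAmount is NULL (None in Python), and any arithmetic on NULL is also NULL, which is why SaleDifference is None there as well.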
FIRST_VALUE()
The FIRST_VALUE window function is used to return the first value within a
specified window or partition of data. This can be helpful when you need to retrieve
the earliest or first occurrence of a value from a set of rows, based on the order you
define.
Scenario
Let’s continue with the Sales table, and suppose you want to retrieve the first sale
amount for each row within the same window of data.
SELECT
SaleID,
SaleDate,
SaleAmount,
FIRST_VALUE(SaleAmount) OVER (PARTITION BY MONTH(SaleDate) ORDER BY SaleDate) AS FirstSaleAmount
FROM
Sales;
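SaleID  SaleDate    SaleAmount  FirstSaleAmount
1       2024-08-01  100.00      100.00
2       2024-08-02  150.00      100.00
3       2024-08-03  200.00      100.00
4       2024-08-04  250.00      100.00
5       2024-08-05  300.00      100.00
All five sample rows fall in August, so they form a single partition and every row shows the first
August sale. If there were sales across multiple months, the FIRST_VALUE function would return
the first sale amount of each month for the rows in that month's partition. Note that MONTH(SaleDate)
ignores the year, so data spanning several years should also partition by YEAR(SaleDate).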
The FIRST_VALUE function is helpful when you need to capture the earliest value within a specific window of
data, whether it’s the first sale, first event, or first occurrence of a value. This can be useful in scenarios like:
Tracking the first time a particular item was sold.
Finding the first occurrence of an event in a series of events.
Evaluating trends based on the earliest data points within a partition of time.
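To see the per-month behavior, the sketch below extends the Sales table with two September rows. Those two rows are hypothetical and not part of the original data. It also assumes Python's sqlite3 module (SQLite 3.25+); since SQLite has no MONTH() function, strftime('%Y-%m', SaleDate) stands in as the partition key, which also keeps different years apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (SaleID INTEGER, SaleDate TEXT, SaleAmount REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "2024-08-01", 100.00), (2, "2024-08-02", 150.00),
     (3, "2024-08-03", 200.00), (4, "2024-08-04", 250.00),
     (5, "2024-08-05", 300.00),
     # Hypothetical September rows, added only to create a second partition.
     (6, "2024-09-01", 120.00), (7, "2024-09-02", 180.00)],
)

first_rows = conn.execute("""
    SELECT SaleID, SaleDate, SaleAmount,
           FIRST_VALUE(SaleAmount) OVER (
               PARTITION BY strftime('%Y-%m', SaleDate)  -- SQLite stand-in for MONTH()
               ORDER BY SaleDate
           ) AS FirstSaleAmount
    FROM Sales
    ORDER BY SaleDate
""").fetchall()

first_amounts = [row[3] for row in first_rows]
print(first_amounts)
```

The August rows all report 100.0 and the September rows report 120.0, the earliest sale in each month.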
LAST_VALUE()
The LAST_VALUE window function is used to return the last value within a specified
window or partition of data. It is essentially the counterpart to the FIRST_VALUE
function, allowing you to retrieve the last value based on the order you define.
Scenario
Let’s continue with the Sales table and suppose you want to retrieve the last sale
amount for each row within the same window of data.
SELECT
SaleID,
SaleDate,
SaleAmount,
LAST_VALUE(SaleAmount) OVER (PARTITION BY MONTH(SaleDate) ORDER BY SaleDate
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastSaleAmount
FROM
Sales;
If there were sales across multiple months, the LAST_VALUE function would return the last sale
amount of each month for the rows in that month's partition.
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: This is an
important part of the query when using LAST_VALUE. It makes the window frame cover every row
from the start to the end of the partition, so that the last value in the partition is returned
for every row. Without this clause, an OVER clause with ORDER BY uses a default frame that ends
at the current row, so LAST_VALUE would typically just return the current row's own value rather
than the last value in the partition.
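The effect of the frame clause is easiest to see by running LAST_VALUE with and without it. This is a minimal sketch assuming Python's sqlite3 module (SQLite 3.25+); the PARTITION BY is dropped because all five sample rows fall in the same month, and the aliases default_frame and full_frame are names I made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (SaleID INTEGER, SaleDate TEXT, SaleAmount REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "2024-08-01", 100.00), (2, "2024-08-02", 150.00),
     (3, "2024-08-03", 200.00), (4, "2024-08-04", 250.00),
     (5, "2024-08-05", 300.00)],
)

last_rows = conn.execute("""
    SELECT SaleID, SaleAmount,
           -- Default frame: stops at the current row.
           LAST_VALUE(SaleAmount) OVER (ORDER BY SaleDate) AS default_frame,
           -- Explicit frame: spans the whole partition.
           LAST_VALUE(SaleAmount) OVER (
               ORDER BY SaleDate
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS full_frame
    FROM Sales
    ORDER BY SaleDate
""").fetchall()

default_values = [row[2] for row in last_rows]
full_values = [row[3] for row in last_rows]
print(default_values)
print(full_values)
```

With the default frame, each row simply echoes its own SaleAmount; with the explicit frame, every row reports 300.0, the last sale in the window.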
NTILE()
The NTILE() function is a window function in SQL that divides rows of a result set
into a specified number of approximately equal groups (or buckets). It assigns a
bucket number (or group number) to each row based on the sorting order. The
bucket number starts at 1 and increments by 1 until the number of buckets is
reached. When the row count is not evenly divisible, the earlier buckets receive
one extra row each.
SELECT
employee_id,
salary,
NTILE(4) OVER (ORDER BY salary) AS salary_quartile
FROM employees;
Result, rows 10 to 16 of 16:
employee_id  salary   salary_quartile
16           3400.00  3
3            3500.00  3
4            3500.00  3
11           4800.00  4
9            5000.00  4
10           5000.00  4
12           6200.00  4
Use Cases for NTILE()
Performance Evaluation: Divide employees or students into performance groups such as top
quartile, middle quartiles, and bottom quartile.
Salary Bands: Distribute employees into salary bands to identify high earners,
middle earners, and low earners.
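The bucket assignment can be checked with a short script. This sketch assumes Python's sqlite3 module (SQLite 3.25+) and uses the sixteen employee rows as they appear in the result tables in this deck. The employee_id tie-breaker in ORDER BY and the NTILE(3) variant over departments 10 and 20 are my additions, the latter to show what happens when the rows do not divide evenly.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 10, 3000), (2, 10, 3200), (3, 10, 3500), (4, 10, 3500),
     (5, 20, 2500), (6, 20, 2700), (7, 20, 3000), (8, 20, 3000),
     (9, 30, 5000), (10, 30, 5000), (11, 30, 4800), (12, 30, 6200),
     (13, 40, 2900), (14, 40, 3100), (15, 40, 3300), (16, 40, 3400)],
)

# 16 rows into 4 buckets: exactly 4 rows per quartile.
quartiles = conn.execute("""
    SELECT employee_id, salary,
           NTILE(4) OVER (ORDER BY salary, employee_id) AS salary_quartile
    FROM employees
    ORDER BY salary, employee_id
""").fetchall()
quartile_of = {emp: q for emp, _, q in quartiles}
quartile_sizes = Counter(quartile_of.values())

# 8 rows into 3 buckets: the earlier buckets take the extra rows.
uneven = conn.execute("""
    SELECT NTILE(3) OVER (ORDER BY salary, employee_id) AS bucket
    FROM employees
    WHERE department_id IN (10, 20)
""").fetchall()
uneven_sizes = Counter(bucket for (bucket,) in uneven)

print(dict(quartile_sizes))
print(dict(uneven_sizes))
```

Sixteen rows split cleanly into four quartiles of four, while eight rows split into three buckets give sizes 3, 3, and 2.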
Comprehensive Example with All Window Functions
WITH employee_data AS (
SELECT
employee_id,
department_id,
salary,
ROW_NUMBER() OVER (ORDER BY salary) AS row_num,
RANK() OVER (ORDER BY salary) AS rank,
DENSE_RANK() OVER (ORDER BY salary) AS dense_rank,
LEAD(salary) OVER (ORDER BY salary) AS next_salary,
LAG(salary) OVER (ORDER BY salary) AS previous_salary,
FIRST_VALUE(salary) OVER (PARTITION BY department_id ORDER BY salary) AS first_salary,
LAST_VALUE(salary) OVER (PARTITION BY department_id ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_salary,
NTILE(4) OVER (ORDER BY salary) AS salary_quartile
FROM employees
)
SELECT * FROM employee_data;