SQL Inter Q&A


🐳

Top N
Lesson Content
What is Top N Problem?
Top N Records
Top N Per Category
Top N Per Category With Ties

What is Top N Problem?

📌 Sample Questions
• What are the Top 5 highest-rated movies?
• What are the Top 3 highest paid employees per department?
• What are the Top 3 highest paid employees per department when there are ties?

Top N Records
Query the 5th largest value in the table t.

Assume the values are unique, and there are more than 5 values in the table.

Table t:

value

10

50

Although we return only one value rather than all of the top 5, this is still a Top N problem.

The only difference is that we exclude the first N - 1 rows from the result.

LIMIT and OFFSET
MySQL

SELECT value
FROM t
ORDER BY value DESC
LIMIT 1
OFFSET 4;

Select values from table t and sort them in descending order.

OFFSET 4 skips the four largest values, and LIMIT 1 returns the next row, which is the 5th largest.
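The LIMIT and OFFSET approach can be tried end to end with a small script. This is a minimal sketch, assuming an in-memory SQLite database as a stand-in for MySQL (SQLite accepts the same LIMIT ... OFFSET syntax); the sample values and the helper nth_largest are hypothetical and not part of the lesson.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; both accept LIMIT ... OFFSET.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INTEGER)")

# Hypothetical sample data: seven unique values.
conn.executemany(
    "INSERT INTO t (value) VALUES (?)",
    [(10,), (50,), (20,), (70,), (40,), (60,), (30,)],
)


def nth_largest(conn, n):
    """Return the n-th largest value in t, or None if t has fewer than n rows."""
    row = conn.execute(
        # Skip the n - 1 largest values, then keep exactly one row.
        "SELECT value FROM t ORDER BY value DESC LIMIT 1 OFFSET ?",
        (n - 1,),
    ).fetchone()
    return row[0] if row else None


fifth = nth_largest(conn, 5)
print(fifth)
```

Sorted in descending order the values are 70, 60, 50, 40, 30, 20, 10, so skipping four rows lands on 30.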

MS SQL Server

In MS SQL Server, the syntax is a little different.

SELECT value
FROM t
ORDER BY value DESC
OFFSET 4 ROWS
FETCH NEXT 1 ROWS ONLY;

There is no LIMIT keyword. Use the FETCH keyword to specify how many rows to return.

Window Functions

SELECT value
FROM (
    SELECT value,
           -- rn instead of row: ROW is a reserved word in MySQL 8.0
           ROW_NUMBER() OVER(ORDER BY value DESC) AS rn
    FROM t
) AS rk_table
WHERE rn = 5;

💡 The result of a window function cannot be filtered in the WHERE clause of the same query, because WHERE is evaluated before window functions → wrap the query in a subquery and filter there.

value  ROW_NUMBER  DENSE_RANK  RANK
5      1           1           1
4.9    2           2           2
4.9    3           2           2
4.8    4           3           4

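With RANK, the rank of a row is one plus the number of rows that rank ahead of it, so ties leave gaps. With DENSE_RANK, it is one plus the number of distinct values ahead of it, so there are no gaps. ROW_NUMBER numbers the rows consecutively and breaks ties arbitrarily.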
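The comparison table can be reproduced with one query. This is a sketch under assumptions: it runs on an in-memory SQLite database (window functions need SQLite 3.25 or newer), and the table name scores and the column aliases are made up for the illustration.

```python
import sqlite3

# Assumption: the bundled SQLite is 3.25+ so window functions are available.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (value REAL)")  # hypothetical table name
conn.executemany(
    "INSERT INTO scores (value) VALUES (?)",
    [(5.0,), (4.9,), (4.9,), (4.8,)],
)

rows = conn.execute(
    """
    SELECT value,
           ROW_NUMBER() OVER (ORDER BY value DESC) AS row_num,
           DENSE_RANK() OVER (ORDER BY value DESC) AS dense_rk,
           RANK()       OVER (ORDER BY value DESC) AS rk
    FROM scores
    ORDER BY value DESC, row_num
    """
).fetchall()

for row in rows:
    print(row)
```

The two 4.9 rows share DENSE_RANK 2 and RANK 2 but get distinct row numbers; the 4.8 row gets DENSE_RANK 3 and RANK 4.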

Top N Per Category

🌿 Cannot use LIMIT and OFFSET here; the window function is a better choice.

Use any of ROW_NUMBER(), RANK(), and DENSE_RANK().

Example: Query the 5 highest-rated restaurants in each city.

Highest-rated refers to the highest average rating.

If two restaurants have the same average rating, return either restaurant.

Table rating:

I.D.   Name           City           Rating
10010  Kim’s Kitchen  New York       4
10011  Super Dragon   San Francisco  3
…      …              …              …
12010  Tom’s Seafood  Tokyo          2

💡 Idea:
1. Compute average ratings for all the restaurants.
2. Sort ratings.
3. Select the top 5.

1. Compute average ratings for all the restaurants.

SELECT
    name,
    city,
    AVG(rating * 1.0) AS avg_rating
FROM rating
GROUP BY name, city;

Since the ratings are integers, multiply by 1.0 to avoid integer division.

Put this query in a WITH CTE and name it avg_ratings.

2. Sort ratings.

WITH avg_ratings AS (
    SELECT
        name, city,
        AVG(rating * 1.0) AS avg_rating
    FROM rating
    GROUP BY name, city
)

SELECT
    name, city, avg_rating,
    -- rn instead of row: ROW is a reserved word in MySQL 8.0
    ROW_NUMBER() OVER(PARTITION BY city ORDER BY avg_rating DESC) AS rn
FROM avg_ratings;

ROW_NUMBER fits here because we need at most 5 restaurants per city and ties can be broken arbitrarily.

Put this query in another WITH CTE and name it rating_rank.

3. Select the top 5.

WITH avg_ratings AS (
    SELECT
        name, city,
        AVG(rating * 1.0) AS avg_rating
    FROM rating
    GROUP BY name, city
),
rating_rank AS (
    SELECT
        name, city, avg_rating,  -- avg_ratings exposes avg_rating, not rating
        ROW_NUMBER() OVER(PARTITION BY city ORDER BY avg_rating DESC) AS rn
    FROM avg_ratings
)

SELECT
    name, city, avg_rating
FROM rating_rank
WHERE rn <= 5;

Keep only the rows whose row number is less than or equal to 5 → at most 5 top-rated restaurants per city.
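The three steps can be checked on a small data set. The sketch below assumes an in-memory SQLite database (3.25 or newer for window functions), uses made-up restaurants and ratings, and takes the top 2 per city instead of the top 5 so the output stays short. The id column and the restaurant names A to F are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rating (id INTEGER, name TEXT, city TEXT, rating INTEGER)"
)
# Hypothetical rows: one row per individual rating, no tied averages.
conn.executemany(
    "INSERT INTO rating VALUES (?, ?, ?, ?)",
    [
        (1, "A", "New York", 5), (2, "A", "New York", 4),  # A averages 4.5
        (3, "B", "New York", 3),                           # B averages 3.0
        (4, "C", "New York", 4),                           # C averages 4.0
        (5, "D", "Tokyo", 2), (6, "D", "Tokyo", 3),        # D averages 2.5
        (7, "E", "Tokyo", 5),                              # E averages 5.0
        (8, "F", "Tokyo", 4),                              # F averages 4.0
    ],
)

TOP_N_PER_CITY = """
WITH avg_ratings AS (              -- step 1: average rating per restaurant
    SELECT name, city, AVG(rating * 1.0) AS avg_rating
    FROM rating
    GROUP BY name, city
),
rating_rank AS (                   -- step 2: number restaurants within a city
    SELECT name, city, avg_rating,
           ROW_NUMBER() OVER (PARTITION BY city ORDER BY avg_rating DESC) AS rn
    FROM avg_ratings
)
SELECT name, city, avg_rating      -- step 3: keep the first N per city
FROM rating_rank
WHERE rn <= ?
ORDER BY city, rn
"""

top2 = conn.execute(TOP_N_PER_CITY, (2,)).fetchall()
for row in top2:
    print(row)
```

Passing N as a bound parameter keeps the query text fixed; with N = 5 it matches the query in the lesson.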

Top N Per Category With Ties

❓ What if there are ties in the ranks, and we want to get all the restaurants with the top 5 ratings
per city? How do we modify the query?

If the restaurants have the same average ratings, return all restaurants with the same ratings.

With ties, the number of restaurants returned per city can exceed 5.

Change the ranking function from ROW_NUMBER to DENSE_RANK .

value  ROW_NUMBER  DENSE_RANK  RANK
5      1           1           1
4.9    2           2           2
4.9    3           2           2
4.8    4           3           4

WITH avg_ratings AS (
    SELECT
        name, city,
        AVG(rating * 1.0) AS avg_rating
    FROM rating
    GROUP BY name, city
),
rating_rank AS (
    SELECT
        name, city, avg_rating,
        DENSE_RANK() OVER(PARTITION BY city ORDER BY avg_rating DESC) AS rk
    FROM avg_ratings
)

SELECT
    name, city, avg_rating
FROM rating_rank
WHERE rk <= 5;

🔥 During interviews:
• Clarify the logic - whether to output top N records, or all records (≥ N) that match the top N
scores.
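To see how the choice of ranking function changes the answer when there are ties, the following sketch runs the same filter with each function. It assumes an in-memory SQLite database (3.25 or newer), starts from an already aggregated avg_ratings table with made-up rows, and uses N = 2; the helper top_n is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE avg_ratings (name TEXT, city TEXT, avg_rating REAL)")
# Hypothetical averages: P and Q are tied for first place.
conn.executemany(
    "INSERT INTO avg_ratings VALUES (?, ?, ?)",
    [
        ("P", "Tokyo", 4.9),
        ("Q", "Tokyo", 4.9),
        ("R", "Tokyo", 4.8),
        ("S", "Tokyo", 4.5),
    ],
)


def top_n(conn, rank_func, n):
    """Names whose rank within their city is <= n under the given function."""
    # The function name is spliced into the SQL text, so only allow known names.
    if rank_func not in ("ROW_NUMBER", "RANK", "DENSE_RANK"):
        raise ValueError(f"unsupported ranking function: {rank_func}")
    sql = f"""
        WITH rating_rank AS (
            SELECT name, city, avg_rating,
                   {rank_func}() OVER (
                       PARTITION BY city ORDER BY avg_rating DESC
                   ) AS rk
            FROM avg_ratings
        )
        SELECT name FROM rating_rank WHERE rk <= ? ORDER BY name
    """
    return [row[0] for row in conn.execute(sql, (n,))]


by_row_number = top_n(conn, "ROW_NUMBER", 2)  # exactly 2 records
by_rank = top_n(conn, "RANK", 2)              # ranks 1, 1, 3, 4
by_dense_rank = top_n(conn, "DENSE_RANK", 2)  # ranks 1, 1, 2, 3

print(by_row_number, by_rank, by_dense_rank)
```

ROW_NUMBER and RANK both return P and Q here, while DENSE_RANK also returns R because 4.8 is the second-highest score. That difference is the point to clarify with the interviewer.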

🏍
Ratios
Calculating Ratios
Two Methods
Example: Subscription Rate
Example: Immediate Order

Calculating Ratios
The problem is to compute a ratio or a percentage given some data entries or system logs.

For example:

Query the percentage of users who had some behavior from a table with user behavior logs.

Query the percentage of products that satisfy some criteria based on a purchase history table.

Usually the numerator and the denominator are counts that come from the same table.

Two Methods

💡 There are 2 common ways to compute a ratio.

1. Subquery: Use a subquery to compute the denominator and the main query to compute the
numerator and ratio.

2. CASE WHEN : Use CASE WHEN to compute the numerator and the main query to compute the
denominator and ratio.

Example: Subscription Rate


Table: Subscription

user_id  premium
1        TRUE
2        FALSE
3        TRUE

The premium column shows whether the user has opted in for the premium subscription.

Write a query to calculate the premium subscription rate: the count of premium subscribers over the total number of users.

Subquery method

Use a subquery to get the denominator.

SELECT
COUNT(user_id) * 1.0 / (SELECT COUNT(user_id) FROM subscription)
AS ratio
FROM subscription
WHERE premium = 'TRUE';

In MS SQL Server, we need to multiply the numerator by 1.0 to avoid integer division, which would otherwise truncate the ratio to 0.

CASE WHEN method

Use CASE WHEN to return either 0 or 1 based on a certain condition.

SUM → numerator

First use the CASE WHEN statement, then SUM over all the numbers returned → the count of rows
that meet the condition we specified.

SUM(
CASE
WHEN condition THEN 1
ELSE 0
END
)

We usually use this method to compute the numerator.

SELECT
SUM(CASE WHEN premium = 'TRUE' THEN 1 ELSE 0 END) * 1.0 / COUNT(user_id)
AS ratio
FROM subscription;

The SUM(CASE WHEN..) statement will give us the count of premium subscribers.

Multiply the numerator by 1.0 to avoid the integer division.

AVG

An easier way to get the ratio: take the AVG of the 0/1 flags, which avoids computing the denominator separately.

SELECT
    AVG(CASE WHEN premium = 'TRUE' THEN 1.0 ELSE 0.0 END)
    AS ratio
FROM subscription;

This method only works when the denominator is the total count.

Note: Change the return values to the decimals 1.0 and 0.0 to avoid integer division in the AVG function.
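The three variants (subquery, SUM with CASE WHEN, AVG with CASE WHEN) should agree on the same data. The sketch below checks that on the three-row Subscription table from the example, assuming an in-memory SQLite database and a premium column stored as the text 'TRUE' or 'FALSE'. The dictionary keys are labels chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: premium is stored as text, matching the premium = 'TRUE' filter.
conn.execute("CREATE TABLE subscription (user_id INTEGER, premium TEXT)")
conn.executemany(
    "INSERT INTO subscription VALUES (?, ?)",
    [(1, "TRUE"), (2, "FALSE"), (3, "TRUE")],
)

queries = {
    # Denominator from a subquery, numerator from the filtered main query.
    "subquery": """
        SELECT COUNT(user_id) * 1.0 / (SELECT COUNT(user_id) FROM subscription)
        FROM subscription
        WHERE premium = 'TRUE'
    """,
    # Numerator from SUM(CASE WHEN ...), denominator from COUNT.
    "sum_case": """
        SELECT SUM(CASE WHEN premium = 'TRUE' THEN 1 ELSE 0 END) * 1.0
               / COUNT(user_id)
        FROM subscription
    """,
    # Average of 1.0 / 0.0 flags; no explicit denominator.
    "avg_case": """
        SELECT AVG(CASE WHEN premium = 'TRUE' THEN 1.0 ELSE 0.0 END)
        FROM subscription
    """,
}

ratios = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}

# Integer division for comparison: without the * 1.0, 2 / 3 truncates to 0.
int_div = conn.execute("SELECT 2 / 3").fetchone()[0]

print(ratios, int_div)
```

Two of the three users are premium, so each method returns roughly 0.667, while the plain integer division returns 0.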

Example: Immediate Order


Delivery Table Schema:

column name         type
customer_id         int
order_date          date
pref_delivery_date  date

Table: Delivery

customer_id  order_date  pref_delivery_date
1            2019-08-01  2019-08-02
2            2019-08-02  2019-08-02
1            2019-09-02  2019-09-04
3            2019-10-12  2019-10-12
3            2019-10-09  2019-10-11
2            2019-08-11  2019-08-13
4            2019-01-09  2019-01-09

Query the percentage of users who placed their first order as an immediate order.

The first order is the earliest order that a customer placed based on the order date.

An immediate order is a same-day order: the customer's preferred delivery date is the same as the order date.

Get the result as a decimal named immediate_percentage.

For the sample Delivery table above, the immediate percentage is 50%: customers with id 2 and 4 satisfy the criteria, and the other 2 don’t.

Result:

immediate_percentage
0.5

Subquery method

Numerator: Customers whose first order is an immediate order.

SELECT
    customer_id
FROM delivery
GROUP BY customer_id
-- no space before "(": MySQL rejects MIN (...) by default
HAVING MIN(order_date) = MIN(pref_delivery_date);

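This works assuming a preferred delivery date is never earlier than its order date: the two minimums can then be equal only when the earliest order is itself a same-day order.

Use a WITH common table expression to store the result we just got.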

WITH first_order AS (
    SELECT
        customer_id
    FROM delivery
    GROUP BY customer_id
    HAVING MIN(order_date) = MIN(pref_delivery_date)
)

SELECT
    COUNT(customer_id) * 1.0 /
    (SELECT COUNT(DISTINCT customer_id) FROM delivery)
    AS immediate_percentage
FROM first_order;

CASE WHEN method

SELECT
AVG(CASE
WHEN first_order_date = pref_delivery_date THEN 1.0
ELSE 0.0 END
) AS immediate_percentage
FROM ...

Need a table which contains the first order date for each customer → get the rankings of the
order dates and select the ranking = 1.

SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_rk
FROM delivery

Result:

customer_id  order_date  pref_delivery_date  order_rk
1            2019-08-01  2019-08-02          1
2            2019-08-02  2019-08-02          1
1            2019-09-02  2019-09-04          2
3            2019-10-12  2019-10-12          2
3            2019-10-09  2019-10-11          1
2            2019-08-11  2019-08-13          2
4            2019-01-09  2019-01-09          1

Put the query into a WITH CTE and call the table ordered_delivery.

WITH ordered_delivery AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_rk
    FROM delivery
)

SELECT
    AVG(CASE
            WHEN order_date = pref_delivery_date THEN 1.0
            ELSE 0.0 END
    ) AS immediate_percentage
FROM ordered_delivery
WHERE order_rk = 1; -- first order

In the main query, select the first order for each customer.
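Both methods can be run against the sample Delivery rows to confirm they give the same answer. This is a sketch, assuming an in-memory SQLite database (3.25 or newer for ROW_NUMBER) and dates stored as ISO-8601 text, which compares correctly as strings. The variable names are chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: dates are ISO-8601 text, so MIN and ORDER BY sort them correctly.
conn.execute(
    "CREATE TABLE delivery "
    "(customer_id INTEGER, order_date TEXT, pref_delivery_date TEXT)"
)
conn.executemany(
    "INSERT INTO delivery VALUES (?, ?, ?)",
    [
        (1, "2019-08-01", "2019-08-02"),
        (2, "2019-08-02", "2019-08-02"),
        (1, "2019-09-02", "2019-09-04"),
        (3, "2019-10-12", "2019-10-12"),
        (3, "2019-10-09", "2019-10-11"),
        (2, "2019-08-11", "2019-08-13"),
        (4, "2019-01-09", "2019-01-09"),
    ],
)

# Subquery method: count qualifying customers over all distinct customers.
subquery_pct = conn.execute(
    """
    WITH first_order AS (
        SELECT customer_id
        FROM delivery
        GROUP BY customer_id
        HAVING MIN(order_date) = MIN(pref_delivery_date)
    )
    SELECT COUNT(customer_id) * 1.0
           / (SELECT COUNT(DISTINCT customer_id) FROM delivery)
    FROM first_order
    """
).fetchone()[0]

# CASE WHEN method: average a 1.0 / 0.0 flag over each customer's first order.
case_when_pct = conn.execute(
    """
    WITH ordered_delivery AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY order_date
               ) AS order_rk
        FROM delivery
    )
    SELECT AVG(CASE WHEN order_date = pref_delivery_date THEN 1.0 ELSE 0.0 END)
    FROM ordered_delivery
    WHERE order_rk = 1  -- first order
    """
).fetchone()[0]

print(subquery_pct, case_when_pct)
```

Customers 2 and 4 placed their first order as a same-day order, so both queries return 0.5.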
