Cracking The SQL Interview For: Data Scientists
LEON WEI
"SQLPad Helped me gain mastery of all the
core SQL concepts in a structured manner with
a thoughtfully designed business schema"
- Nisheeth, Data Scientist
INTRODUCTION
CONGRATULATIONS!
INTRODUCTION
However, many college graduates or young professionals start their job searches without solid SQL
coding skills, which ultimately costs them their dream jobs.
This ebook includes 90 SQL interview questions and solutions, which will help data candidates
learn or refresh their SQL coding skills and get ready for a SQL interview.
This book is also accompanied by the Cracking the SQL Interview for Data Scientists Course
(sold separately), which includes 21 video lectures with tips and techniques to ace a SQL
interview.
If you have any questions or feedback, please feel free to send me an email at [email protected]
Leon Wei,
March 2021
Leon Wei is the founder of sqlpad.io and instamentor.com, two interview preparation sites for data
professionals.
Most recently, he was a senior manager of machine learning at Apple, where he led a team of
data scientists and engineers building large-scale machine learning systems for Apple's
billion-dollar businesses.
Before that, he ran the machine learning and data science team at Chegg, a leading education
technology company. He also worked as a research scientist at Amazon developing its real-time
pricing engine using machine learning.
If you need any help with your job search or career, you can hire him as your coach/mentor
on instamentor.com.
SELECTED CUSTOMER REVIEWS
“Worked through all the problems at sqlpad.io - Great teaching / refresher tool and highly
recommended!”
— Mike Metzger
“I have been relearning SQL and sqlpad.io has been a great resource. There is a good ratio
of a new concept to practice questions! Highly recommended.”
— Anup Kalburgi
“I signed up on SQLPad and was pleasantly surprised when Leon helped me out personally
on the website. In two mock interviews with him from instamentor.com, he was meticulously
prepared and very professional.”
— Rahul Nayak
“SQLPad is the best website I have used for practicing SQL. The databases and practice
problems resemble real-world data and daily tasks in a data scientist / data analyst role.”
— William
“SQLPad helped me gain mastery of all the core SQL concepts in a structured manner with
a thoughtfully designed business schema. ”
— Nisheeth
“This course helped me get into a final round of a data scientist interview at Facebook.”
— Justin
— Juhi
“I am currently in question 30, I really like some of the questions you posted here. Also thank
you for your excellent customer service !!!! ”
— Mamath
“Enjoying it so far, love the mix with practical exercises and the focus to land a job, that's
really important because normal courses do not prepare you for interviews. Also, Leon is
really kind and helpful.”
— Jose
“This site is a great resource for SQL interview practice questions. The interface is excellent!
And even as someone who currently uses SQL for their day job, I have definitely improved
my skills by working through these problems.”
— Kyle
WHAT A SQL INTERVIEW MAY LOOK LIKE AT A TOP TECH COMPANY
Summary
Let me walk you through a typical SQL interview at one of the top tech firms, so you can get a
sense of how a SQL interview is conducted and what the main criteria are to determine your
performance. I will also share a few tips and techniques to give you an edge for your next
interview.
Introduction
SQL and its related database management systems, such as Hive, BigQuery, and Spark SQL, are the
#1 toolset for managing and processing data in the industry. If you are interviewing for a data-related
job, you need to prove your SQL proficiency.
Here is a list of typical roles that will require at least one round of SQL interviews:
1. Data analyst
2. Data scientist
3. Data engineer
4. Business intelligence engineer
5. Product analyst
6. Decision scientist
7. Research scientist
8. Software engineer (especially those on the backend side)
A SQL interview can sometimes bear other names, such as Technical Data Interview or Data
Processing Interview. The primary purpose is to make sure a candidate is hands-on with data and
can contribute immediately after joining the company.
You will be asked to perform a series of SQL coding exercises to extract facts or insights from the
given data and answer follow-up questions based on different scenarios.
I have been using SQL for 10+ years, and it is still my #1 choice for preparing data.
However, I have interviewed many candidates who started their job searches without solid
SQL coding skills. In the end, the SQL skills gap cost them their dream jobs.
Before COVID-19, a typical interview process included 2 or 3 rounds of phone screens and a final
round of onsite interviews.
Nowadays, most interviews are online, often assisted by video conferencing tools such as
Zoom, Webex, or Google Hangouts.
In the old days, during a final interview, a candidate would be asked to write their SQL code on a
whiteboard without any help from a modern IDE for syntax highlighting or auto-completion.
Nowadays, the whiteboard has been replaced by interactive online coding tools such
as CoderPad or Google Docs.
Tips: Before your interview, make sure you familiarize yourself with those online coding
environments. If you don't know what tool they use, ask your recruiter.
Right before the interview, you will receive a link that points to an online coding environment, where
you will code up solutions in SQL.
Let's assume you are interviewing for a data scientist role at a major music streaming company
(e.g., Spotify).
TABLE 1: song
-- id: BIGINT
-- name: VARCHAR(255)
-- artist_id: BIGINT
The first table is a list of songs and their metadata, song names, and associated artists.
TABLE 2: artist
-- id: BIGINT
-- name: VARCHAR(255)
The second table is simply an artist table with only 2 columns: artist id and the artist name.
TABLE 3: daily_plays
-- date: DATE
-- country: VARCHAR(2) ('uk','us','in','jp','cn')
-- song_id: BIGINT
-- plays: BIGINT
The third table is a daily aggregate table that counts the number of times a song is played
in different countries.
Question 1:
Tips: a common mistake a beginner usually makes is diving into coding right away without fully
understanding the question.
An interviewer often starts with a vague question to see if a candidate is good at communicating.
It is crucial to clarify the following key points before diving into coding.
1. What does 'top' mean? What is its definition? Do you mean number of plays, number of
unique customers, or perhaps other metrics?
3. What if there are ties? Should I return only 1 row or all of them? (optional)
After clarifying the question, you can go ahead and start coding. Here is a sample solution:
SELECT
  S.id,
  S.name
FROM daily_plays P
INNER JOIN song S
  ON P.song_id = S.id
WHERE P.country = 'uk'
  AND P.date = CURRENT_DATE - 1
ORDER BY P.plays DESC
LIMIT 5;
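To make the "what if there are ties?" clarification concrete, here is a minimal sketch that runs both variants on toy data. It is only an illustration: it uses Python's built-in sqlite3 module as a stand-in database (an assumption, since the interview environment is not specified), the rows and the fixed "yesterday" date are hypothetical, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

# In-memory database with hypothetical rows for the song and daily_plays tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE song (id INTEGER, name TEXT, artist_id INTEGER);
CREATE TABLE daily_plays (date TEXT, country TEXT, song_id INTEGER, plays INTEGER);
INSERT INTO song VALUES (1, 'Song A', 10), (2, 'Song B', 10), (3, 'Song C', 20);
INSERT INTO daily_plays VALUES
  ('2021-03-01', 'uk', 1, 500),
  ('2021-03-01', 'uk', 2, 500),
  ('2021-03-01', 'uk', 3, 100),
  ('2021-03-01', 'us', 3, 900);
""")

YESTERDAY = "2021-03-01"  # fixed date so the example is reproducible

# Variant 1: LIMIT returns exactly N rows; a tie at the cutoff is broken arbitrarily.
top_n = conn.execute("""
SELECT S.id, S.name
FROM daily_plays P
INNER JOIN song S ON P.song_id = S.id
WHERE P.country = 'uk' AND P.date = ?
ORDER BY P.plays DESC
LIMIT 1
""", (YESTERDAY,)).fetchall()

# Variant 2: RANK keeps every song that is tied for first place.
top_with_ties = conn.execute("""
SELECT id, name
FROM (
  SELECT S.id, S.name, RANK() OVER (ORDER BY P.plays DESC) AS rnk
  FROM daily_plays P
  INNER JOIN song S ON P.song_id = S.id
  WHERE P.country = 'uk' AND P.date = ?
) X
WHERE rnk = 1
ORDER BY id
""", (YESTERDAY,)).fetchall()

print(top_n)
print(top_with_ties)
```

Which variant is right depends entirely on what the interviewer answers, which is why the question is worth asking before coding.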
Question 2:
Window functions are asked a lot, especially if the hiring company has a global business.
To identify top artists/songs/albums across different countries, you need to know how to
use ROW_NUMBER and RANK. We will dive into Window functions in our next section.
Again, before you start coding, try to clarify a few things with the interviewer:
2. Columns to return
4. This is probably subtle, but because multiple artists can perform a song, you need to ask the
interviewer how to 'count' the plays properly. For simplicity, let's assume a song only has one artist.
At this point, if you have completed the above 2 questions, the interviewer will have a good sense that
you are very good at SQL coding. They will likely pause the SQL coding and transition to follow-up
analytics questions.
Sample analytics question #1: Taylor Swift is a popular artist, but her songs' plays dropped
yesterday. How would you analyze this, what data would you use, and what would be your
process?
For this kind of question (data change: drop/increase), I have created a framework that you can
borrow.
1. What do you mean by 'dropped'? Are we comparing today’s data with yesterday's, or with the same day last week?
2. Did it drop a lot? If it only dropped a little, maybe we don't have to worry about it.
3. Suppose the interviewer tells you that it compares yesterday to the day before, and yes, it
did drop significantly.
In my 15-year career, the #1 cause of poor data quality has been technical issues, e.g., a
power outage, a bug in our logging code, someone checking in code without testing, or someone
forgetting to pay a license fee so that third-party software stopped working.
Even if it is a significant drop, we still don’t know whether we should treat it as a big problem. For example, it
could be due to seasonality: the day before yesterday was New Year's Day, so of course the
number would drop, and all other artists' plays dropped significantly as well.
Suppose the interviewer tells you that yes, it is a big problem, it is not due to seasonality, and it
only happened to Taylor.
After ruling out those potential issues, it's time to dive into the data.
4.1 Start with internal data (data that you already have full access to)
We can slice Taylor's song plays by country, genre, device, iOS vs. Android, Web vs.
Mobile, etc. The drop could come from one particular country; for example, we may have just lost a
popular song's license in that country to a competitor.
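As a rough illustration of slicing by country, the sketch below compares an artist's plays on two consecutive days per country, using the three tables from the interview setup. The data and dates are hypothetical, and Python's sqlite3 module is only a stand-in for whatever warehouse you would actually query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artist (id INTEGER, name TEXT);
CREATE TABLE song (id INTEGER, name TEXT, artist_id INTEGER);
CREATE TABLE daily_plays (date TEXT, country TEXT, song_id INTEGER, plays INTEGER);
INSERT INTO artist VALUES (1, 'Taylor Swift');
INSERT INTO song VALUES (1, 'Song A', 1), (2, 'Song B', 1);
-- Hypothetical plays: the US is flat, the UK collapses.
INSERT INTO daily_plays VALUES
  ('2021-03-01', 'us', 1, 1000), ('2021-03-02', 'us', 1, 990),
  ('2021-03-01', 'uk', 1, 800),  ('2021-03-02', 'uk', 1, 100),
  ('2021-03-01', 'uk', 2, 200),  ('2021-03-02', 'uk', 2, 50);
""")

# One row per country: plays on the previous day, plays on the current day,
# and the difference, with the largest drop listed first.
by_country = conn.execute("""
SELECT
  P.country,
  SUM(CASE WHEN P.date = :prev THEN P.plays ELSE 0 END) AS prev_plays,
  SUM(CASE WHEN P.date = :curr THEN P.plays ELSE 0 END) AS curr_plays,
  SUM(CASE WHEN P.date = :curr THEN P.plays ELSE -P.plays END) AS delta
FROM daily_plays P
INNER JOIN song S ON S.id = P.song_id
INNER JOIN artist A ON A.id = S.artist_id
WHERE A.name = 'Taylor Swift'
  AND P.date IN (:prev, :curr)
GROUP BY P.country
ORDER BY delta
""", {"prev": "2021-03-01", "curr": "2021-03-02"}).fetchall()

for row in by_country:
    print(row)
```

In this toy data almost the whole drop comes from one country, which is the kind of pattern that would point to a localized cause such as a licensing change.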
Finally, you can also mention external data impact, e.g., a competitor such as Amazon Music just
had a new product launch and is giving out free listening to all of Taylor's songs. Therefore, we lost
a bunch of users.
Maybe it is because this artist dropped out of a top chart on Billboard, or it could also be because the
artist lost a lot of followers after posting something controversial on social media.
Here are a few more questions that could be asked during an interview, in case you want to continue
practicing.
First, clarify the interviewer's requirements and goals: why do we need to build this,
what problems are we trying to solve, who are the customers we are serving, and why are you not
happy with the existing solution?
Always start small: define a handful of critical product features first, and only launch in one or two
countries to test it out. And don't forget to mention that you will keep iterating and fine-tuning the
algorithms/product features to expand the scope and bring it to more markets/countries.
Tips: For system design questions, many tech companies really care about how quickly they can launch
the first version. Bias towards action, and as Reid Hoffman famously put it: if you are not
embarrassed about your first product, you launched too late.
Conclusion
I have shown you what a typical SQL interview (sometimes called a technical data interview or data
processing interview) process may look like. I also gave you some tips and techniques and
a framework to answer those standard interview questions.
I hope you learned something today. If you have any questions or want some help with your job
search, please feel free to reach out to me at instamentor.
WINDOW FUNCTIONS
Introduction
WINDOW functions are a family of SQL functions that are frequently asked about during a data scientist
job interview.
However, writing a bug-free WINDOW function-based SQL query can be quite challenging for any
job candidate, especially those who are just getting started with SQL.
In this article, I will share some typical WINDOW function-specific interview questions, their
patterns, and step-by-step solutions.
Outline
I am going to break down this article into 4 sections:
1. In the first section, I will go through a few WINDOW functions based on regular aggregate
functions, such as AVG, MIN/MAX, COUNT, SUM.
2. In section 2, I will focus on rank-related functions, such as ROW_NUMBER, RANK,
and DENSE_RANK. Those functions are handy when generating ranking indexes, and you
need to be fluent in those functions before entering a data scientist SQL interview.
3. In the third section, I will talk about generating statistics (e.g., percentiles, quartiles, median,
etc.) with the NTILE function, a common task for a data scientist.
4. In the last section, let’s focus on LAG and LEAD, two functions that are super important if you
are interviewing for a role that requires dealing with time-series data.
Window functions are functions that perform calculations across a set of rows related to the current
row.
A window function is comparable to the type of calculation done with an aggregate function, but unlike regular
aggregate functions, window functions do not group several rows into a single output row; the
rows retain their own identities.
Behind the scenes, window functions can access more than just the current row of the query result.
All examples in this article are based on movie DVD rental business data. In this first example,
our goal is to compare each movie DVD's replacement cost to the average cost of movies sharing
the same MPAA rating.
SELECT
  title,
  rating,
  replacement_cost,
  AVG(replacement_cost) OVER(PARTITION BY rating) AS avg_cost
FROM film;
For those of you not based in the United States, an MPAA rating is a film rating system that
decides a film’s suitability for specific audiences, based on a film’s content. For example, G means
it’s appropriate for all ages, while PG-13 contains materials that could be inappropriate for children
under 13.
There is no GROUP BY clause for the AVG function, so how does the SQL engine know which
rows to use to compute the average? The answer is the PARTITION BY clause inside
the OVER() clause: we are calculating the average for each unique value of rating.
In the final output, every row has the average cost of its rating. You can perform other
analyses, such as dividing a movie’s cost by avg_cost to find out its expense relative to similar
movies.
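The idea of dividing a movie's cost by avg_cost can be sketched as follows. This is a toy example, not the SQLPad film table: the three rows are made up, the cost_ratio column name is my own, and SQLite (via Python's sqlite3, version 3.25 or newer for window functions) stands in for the playground database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (title TEXT, rating TEXT, replacement_cost REAL);
-- Hypothetical rows: two G-rated films and one PG-13 film.
INSERT INTO film VALUES
  ('FILM A', 'G', 10.0),
  ('FILM B', 'G', 20.0),
  ('FILM C', 'PG-13', 30.0);
""")

# AVG ... OVER (PARTITION BY rating) attaches the per-rating average to every
# row without collapsing rows, so the ratio can be computed row by row.
rows = conn.execute("""
SELECT
  title,
  rating,
  replacement_cost,
  AVG(replacement_cost) OVER (PARTITION BY rating) AS avg_cost,
  replacement_cost / AVG(replacement_cost) OVER (PARTITION BY rating) AS cost_ratio
FROM film
ORDER BY title
""").fetchall()

for row in rows:
    print(row)
```

All three input rows come back, each carrying the average of its own rating group; a GROUP BY version would have returned only two rows.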
https://sqlpad.io/playground/
All tables in this article are available on SQLPad’s online SQL playground. If you want to follow
along and submit queries against those tables, please feel free to go to sqlpad.io/playground and
have some fun.
Let’s take a look at another example. In this example, I want to compare every movie’s length (in
minutes) to the maximum length of movies from the same category.
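SELECT
  title,
  name,
  length,
  MAX(length) OVER(PARTITION BY name) AS max_length
FROM (
  SELECT F.title, C.name, F.length
  FROM film F
  INNER JOIN film_category FC
    ON FC.film_id = F.film_id
  INNER JOIN category C
    ON C.category_id = FC.category_id
) X;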
It’s very similar to the first example. Still, I combined a MAX function with OVER and PARTITION
BY to create a window function, which returned the maximum movie length inside the same movie
category.
For the first film: story side, its length is 163 minutes, and the maximum length of an action film
(same category) is 185. If I compare every action movie’s length to 185, I can get a sense of how
long this specific movie is, relative to its category, as films from different genres tend to have
different durations.
SELECT
  film_id,
  title,
  length,
  SUM(length) OVER() AS total_length,
  SUM(length) OVER(ORDER BY film_id) AS running_total
FROM film
ORDER BY film_id;
Let’s take a look at a more complicated example, where we calculated a running sum with a
window function.
Assuming it’s the holiday season, I want to binge-watch all 1000 movies, starting from movie id=1.
After finishing each film, I want to know what my overall progress is. I can use SUM and OVER to
calculate a running total of time to get my progress.
Notice that there is no PARTITION BY clause because I am not grouping those movies into any
sub-categories. I want to compute my overall progress but not based on any subgroups or
categories.
Another thing to notice is that if I don’t add anything inside the OVER() function, I get the total
number of minutes from the entire movie catalog. As you can see from the second last column:
they all have the same value of 115267, but after I add the ORDER BY clause, I get the running
total of the minutes up to that specific row (running_total column).
Again, please feel free to go to sqlpad’s playground and play with this film table until you become
comfortable with the syntax.
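If you prefer to experiment locally first, the difference between an empty OVER() and OVER(ORDER BY ...) can be reproduced on three made-up films. This is a sketch only: the rows and the total_length alias are hypothetical, and SQLite 3.25+ (through Python's sqlite3) replaces the playground database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER, title TEXT, length INTEGER);
-- Hypothetical lengths in minutes.
INSERT INTO film VALUES (1, 'FILM A', 86), (2, 'FILM B', 48), (3, 'FILM C', 50);
""")

# OVER () sums over the whole result set; adding ORDER BY film_id turns the
# same SUM into a running total up to and including the current row.
rows = conn.execute("""
SELECT
  film_id,
  length,
  SUM(length) OVER () AS total_length,
  SUM(length) OVER (ORDER BY film_id) AS running_total
FROM film
ORDER BY film_id
""").fetchall()

for row in rows:
    print(row)
```

Every row shows the same total, while the running total grows row by row and equals the total on the last row.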
If you are interested in practicing a few more WINDOW functions that we just covered, here are 4
exercises for you to reinforce your learning.
SELECT
  F.film_id,
  F.title,
  F.length,
  ROW_NUMBER() OVER(ORDER BY F.length DESC) AS row_number
FROM film F
ORDER BY row_number;
In this example, our goal is to create a ranking index based on the movie's length for the entire
movie catalog.
But movies with the same length were given DIFFERENT row numbers, because the database
arbitrarily assigns a unique number to each row when there is a tie.
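SELECT
  F.film_id,
  F.title,
  F.length,
  C.name AS category,
  ROW_NUMBER() OVER(PARTITION BY C.name ORDER BY F.length DESC) AS row_num
FROM film F
INNER JOIN film_category FC
  ON FC.film_id = F.film_id
INNER JOIN category C
  ON C.category_id = FC.category_id;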
Let’s take a look at another example. Instead of comparing a movie’s length to all other films from
the entire catalog, we can rank them within each movie category using PARTITION BY.
For example, imagine you are working at an e-commerce company, and it has a global business.
Your boss asks you to send her a list of best sellers for each country. You can
use ROW_NUMBER and PARTITION BY to generate this list quickly.
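A minimal sketch of that best-seller list is shown below. The sales table, its columns, and its rows are all hypothetical (they are not part of the book's schema), and SQLite 3.25+ through Python's sqlite3 is used only so the example can run anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table: units sold per product per country.
CREATE TABLE sales (country TEXT, product TEXT, units INTEGER);
INSERT INTO sales VALUES
  ('us', 'A', 50), ('us', 'B', 70), ('us', 'C', 10),
  ('uk', 'A', 5),  ('uk', 'C', 9);
""")

# Number the products inside each country from best to worst seller,
# then keep only the top 2 per country in the outer query.
top_sellers = conn.execute("""
SELECT country, product, row_num
FROM (
  SELECT
    country,
    product,
    ROW_NUMBER() OVER (PARTITION BY country ORDER BY units DESC) AS row_num
  FROM sales
) X
WHERE row_num <= 2
ORDER BY country, row_num
""").fetchall()

for row in top_sellers:
    print(row)
```

The window function has to live in a subquery (or CTE) because its result cannot be filtered in the WHERE clause of the same SELECT.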
SELECT
  F.film_id,
  F.title,
  F.length,
  RANK() OVER(ORDER BY F.length DESC) AS ranking
FROM film F
ORDER BY ranking;
Let’s take a look at the RANK function, which is very similar to ROW_NUMBER. The difference
between RANK and ROW_NUMBER is that RANK assigns the same value to rows that
tie, and the next value skips ahead to the number of preceding rows plus one. Notice how it jumps from
1 to 11.
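SELECT
  F.film_id,
  F.title,
  F.length,
  C.name AS category,
  RANK() OVER(PARTITION BY C.name ORDER BY F.length DESC) AS ranking
FROM film F
INNER JOIN film_category FC
  ON FC.film_id = F.film_id
INNER JOIN category C
  ON C.category_id = FC.category_id;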
Similarly, we can also generate rankings within a subgroup with the help of PARTITION BY.
SELECT
  F.film_id,
  F.title,
  F.length,
  DENSE_RANK() OVER(ORDER BY F.length DESC) AS ranking
FROM film F
ORDER BY ranking;
The last function I want to show you is DENSE_RANK. It is very similar to RANK but differs in how
it handles ties. It restarts with the following immediate consecutive value rather than creating a
gap.
As you can see here, for the first 2 rows, two movies both have a value of 1. Instead of restarting
from 3, the next dense_rank value starts as 2.
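To see the three ranking functions side by side, here is a small sketch on four made-up films where the two longest are tied. The data is hypothetical, SQLite 3.25+ (via Python's sqlite3) stands in for the playground, and film_id is added as a tie-breaker to ROW_NUMBER purely so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER, title TEXT, length INTEGER);
-- Hypothetical data: films 1 and 2 tie for the longest length.
INSERT INTO film VALUES
  (1, 'FILM A', 185), (2, 'FILM B', 185), (3, 'FILM C', 184), (4, 'FILM D', 183);
""")

rows = conn.execute("""
SELECT
  film_id,
  length,
  ROW_NUMBER() OVER (ORDER BY length DESC, film_id) AS row_num,
  RANK()       OVER (ORDER BY length DESC)          AS rnk,
  DENSE_RANK() OVER (ORDER BY length DESC)          AS dense_rnk
FROM film
ORDER BY film_id
""").fetchall()

for row in rows:
    print(row)
```

ROW_NUMBER gives 1, 2, 3, 4; RANK gives 1, 1, 3, 4 (a gap after the tie); DENSE_RANK gives 1, 1, 2, 3 (no gap).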
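SELECT
  F.film_id,
  F.title,
  F.length,
  C.name AS category,
  DENSE_RANK() OVER(PARTITION BY C.name ORDER BY F.length DESC) AS ranking
FROM film F
INNER JOIN film_category FC
  ON FC.film_id = F.film_id
INNER JOIN category C
  ON C.category_id = FC.category_id;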
Time for some exercises. I have prepared 3 exercises to help with your understanding.
Section 3: NTILE
In this section, I am going to show you how to create statistics using NTILE.
NTILE is a handy function, especially for data analytics professionals. For example, as a data
scientist, you probably need to create robust statistics such as quartile, quintile, median, etc., in
your daily job, and NTILE makes it very easy to generate those numbers.
NTILE takes the number of buckets as its argument and divides the rows into that many buckets, as
equally as possible, based on how the rows are partitioned and ordered inside the OVER clause.
SELECT
  film_id,
  title,
  length,
  NTILE(100) OVER(ORDER BY length DESC) AS percentile
FROM film
ORDER BY percentile;
Let’s take a look at example 1, where we created 100 buckets, and we ordered all of the movies by
their length descendingly. Therefore, the longest ones are assigned to bucket #1, and the shortest
ones #100.
For the second example, we created a few more statistics, such as DECILES (10 buckets)
and QUARTILES (4 buckets). We also partitioned them by MPAA ratings, so the statistics are
relative to each unique MPAA rating.
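The query for this second example did not survive in this copy, so the following is only a sketch of what a partitioned NTILE query can look like, not the original listing. The seven rows are made up, the decile and quartile aliases are my own, and SQLite 3.25+ (via Python's sqlite3) is a stand-in for the playground.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (title TEXT, rating TEXT, length INTEGER);
-- Hypothetical data: five G-rated films and two PG-rated films.
INSERT INTO film VALUES
  ('G1', 'G', 100), ('G2', 'G', 90), ('G3', 'G', 80), ('G4', 'G', 70), ('G5', 'G', 60),
  ('P1', 'PG', 50), ('P2', 'PG', 40);
""")

# Buckets are computed independently inside each rating because of PARTITION BY.
rows = conn.execute("""
SELECT
  rating,
  length,
  NTILE(10) OVER (PARTITION BY rating ORDER BY length DESC) AS decile,
  NTILE(4)  OVER (PARTITION BY rating ORDER BY length DESC) AS quartile
FROM film
ORDER BY rating, length DESC
""").fetchall()

for row in rows:
    print(row)
```

With 5 rows and 4 buckets the rows cannot be split evenly, so the first bucket receives the extra row; with fewer rows than buckets (2 rows, 4 buckets) only the first buckets are used.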
NTILE is a very straightforward window function that can be very useful for your daily job as a data
scientist. Let’s do some exercises to help you remember its syntax and reinforce your learning in
this lecture.
LAG and LEAD's main difference is that LAG gets data from previous rows, while LEAD is the
opposite, which fetches data from the following rows.
We can use either one of the two functions to compare month-over-month growth. As a data
analytics professional, you are very likely to work on time-related data. If you can
use LAG or LEAD efficiently, you will be a very productive data scientist.
Their syntax is very similar to other window functions. Instead of focusing on the format of the
syntax, let me show you a couple of examples.
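WITH daily_revenue AS (
  SELECT
    DATE(payment_ts) date,
    SUM(amount) revenue
  FROM payment
  GROUP BY DATE(payment_ts)
)
SELECT
  date,
  revenue,
  LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_revenue,
  revenue / LAG(revenue, 1) OVER (ORDER BY date) AS dod_growth
FROM daily_revenue
ORDER BY date;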
1. In the first step, we created daily movie rental revenue with CTE (common table expression).
2. And in the second step, we appended the previous day’s revenue to the current day’s using the
LAG function.
3. Notice that the last 2 columns of the first row are empty. It’s simply because May 24th is the first
available day.
4. We also specified the offset, which is 1, so we fetch the previous row. If you change this number to
2, then you compare the current day’s revenue to the day before the previous day.
5. Finally, we divided the current day’s revenue by the previous day’s to create our daily revenue
growth.
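The steps above can be tried on a handful of made-up payments. This is a sketch under stated assumptions: the rows are hypothetical, the column aliases (prev_revenue, two_days_ago, ratio) are my own, and SQLite 3.25+ through Python's sqlite3 stands in for the playground's database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment (payment_ts TEXT, amount REAL);
-- Hypothetical payments over three consecutive days.
INSERT INTO payment VALUES
  ('2020-05-24 10:00:00', 10.0), ('2020-05-24 18:30:00', 20.0),
  ('2020-05-25 09:15:00', 45.0),
  ('2020-05-26 12:00:00', 90.0);
""")

rows = conn.execute("""
WITH daily_revenue AS (
  SELECT DATE(payment_ts) AS date, SUM(amount) AS revenue
  FROM payment
  GROUP BY DATE(payment_ts)
)
SELECT
  date,
  revenue,
  LAG(revenue, 1) OVER (ORDER BY date) AS prev_revenue,   -- previous day
  LAG(revenue, 2) OVER (ORDER BY date) AS two_days_ago,   -- offset of 2
  revenue / LAG(revenue, 1) OVER (ORDER BY date) AS ratio -- current / previous
FROM daily_revenue
ORDER BY date
""").fetchall()

for row in rows:
    print(row)
```

The first day has no previous row, so LAG returns NULL (None in Python) and the ratio is NULL as well; with an offset of 2 the first two days are NULL.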
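WITH daily_revenue AS (
  SELECT
    DATE(payment_ts) date,
    SUM(amount) revenue
  FROM payment
  GROUP BY DATE(payment_ts)
)
SELECT
  date,
  revenue,
  LEAD(revenue, 1) OVER (ORDER BY date) AS next_day_revenue,
  LEAD(revenue, 1) OVER (ORDER BY date) / revenue AS dod_growth
FROM daily_revenue
ORDER BY date;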
Let’s take a look at another example. It’s very similar to the previous one, but instead of appending
the previous day’s revenue, we used the LEAD function with an offset of 1 to get the next day’s
movie rental revenue.
We then divided the next day’s revenue by the current day’s revenue to get the day-over-day
growth.
For this lecture, you can try the following 2 exercises to help you get familiar with the syntax.
Summary
Great job. If you have followed through all the examples, you have seen most of the
common WINDOW functions/patterns.
WINDOW functions are a family of SQL utilities that are often asked during a data scientist job
interview.
Writing a bug-free WINDOW function query could be quite challenging. It takes time and practice to
become a master, and you are getting there soon, once you finish this book. 🏆
QUESTIONS
SINGLE TABLE OPERATIONS
Sample results
store | manager
-----------+--------------
Woodridge | Jon Stephens
Sample results
category
-----------
Category 1
Category 2
Category 3
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
title
---------------------
MOVIE 1
MOVIE 2
MOVIE 3
MOVIE 4
MOVIE 5
col_name | col_type
-------------+--------------------------
staff_id | integer
first_name | text
last_name | text
address_id | smallint
email | text
store_id | smallint
active | boolean
username | text
picture | character varying
Sample results
first_name | last_name
------------+-----------
Jon | Snow
Sample results
year | mon | rev
------+-----+----------
2020 | 1 | 123.45
2020 | 2 | 234.56
2020 | 3 | 345.67
amount | numeric
payment_ts | timestamp with time zone
Sample results
dt | sum
------------+---------
2020-06-14 | 57.84
2020-06-15 | 1376.52
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
year | mon | uu_cnt
------+-----+--------
2020 | 1 | 123
2020 | 2 | 456
2020 | 3 | 789
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Sample results
year | mon | avg_spend
------+-----+--------------------
2020 | 2 | 3.2543037974683544
2020 | 5 | 9.1301559454191033
Sample results
year | mon | num_hp_customers
------+-----+------------------
2020 | 2 | 158
2020 | 5 | 520
Sample results
min_spend | max_spend
-----------+-----------
0.99 | 52.90
Table: actor
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Sample results
last_name | count
-----------+-------
ALLEN | 3
DAVIS | 3
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Sample results
last_name | count
-----------+-------
ALLEN | 3
BERGEN | 1
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Sample results
actor_category | count
----------------+-------
a_actors | 13
b_actors | 8
Instruction
• Write a query to return the number of good days and bad days in May 2020 based on
the number of daily rentals.
• Return the results in one row with 2 columns from left to right: good_days, bad_days.
• good day: > 100 rentals.
• bad day: <= 100 rentals.
• Hint (for users who already know OUTER JOIN): you can use the dates table.
• Hint: be super careful about datetime columns.
• Hint: this problem could be tricky, so feel free to explore the rental table and take a look
at some data.
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
good_days | bad_days
-----------+----------
7 | 24
Instruction
• Write a query to return the number of fast movie watchers vs. slow movie watchers.
• fast movie watcher: on average, returns their rentals within 5 days.
• slow movie watcher: takes an average of >5 days to return their rentals.
• Most customers have multiple rentals over time. You need to first compute the number
of days for each rental transaction, then compute the average of the rounded-up days.
e.g., if the rental period is 1 day and 10 hours, count it as 2 days.
• Skip the rentals that have not been returned yet, i.e., return_ts IS NULL.
• The order of your results doesn't matter.
• A customer can only rent one movie per transaction.
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
watcher_category | count
------------------+-------
fast_watcher | 112
slow_watcher | 487
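The rounding rule in the instruction (1 day and 10 hours counts as 2 days) is the step that is easiest to get wrong, so here is one possible way to sketch it. This is not the book's official solution: it runs on SQLite through Python's sqlite3, uses integer arithmetic on Unix seconds instead of PostgreSQL date functions, and the four rental rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rental (rental_id INTEGER, customer_id INTEGER, rental_ts TEXT, return_ts TEXT);
INSERT INTO rental VALUES
  (1, 1, '2020-05-01 08:00:00', '2020-05-02 18:00:00'),  -- 1 day 10 hours -> 2 days
  (2, 1, '2020-05-10 08:00:00', '2020-05-13 08:00:00'),  -- exactly 3 days -> 3 days
  (3, 2, '2020-05-01 08:00:00', '2020-05-07 09:00:00'),  -- 6 days 1 hour  -> 7 days
  (4, 3, '2020-05-20 08:00:00', NULL);                   -- not returned yet, skipped
""")

rows = conn.execute("""
WITH rental_days AS (
  SELECT
    customer_id,
    -- Ceiling of elapsed seconds / 86400, done with integer division.
    (CAST(strftime('%s', return_ts) AS INTEGER)
     - CAST(strftime('%s', rental_ts) AS INTEGER) + 86399) / 86400 AS days
  FROM rental
  WHERE return_ts IS NOT NULL
),
customer_avg AS (
  SELECT customer_id, AVG(days) AS avg_days
  FROM rental_days
  GROUP BY customer_id
)
SELECT
  CASE WHEN avg_days <= 5 THEN 'fast_watcher' ELSE 'slow_watcher' END AS watcher_category,
  COUNT(*) AS cnt
FROM customer_avg
GROUP BY watcher_category
ORDER BY watcher_category
""").fetchall()

print(rows)
```

Customer 1 averages 2.5 rounded-up days and counts as a fast watcher, customer 2 averages 7 and counts as a slow watcher, and customer 3 does not appear because their only rental has not been returned.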
col_name | col_type
----------+----------
id | integer
name | text
address | text
zip code | text
phone | text
city | text
country | text
sid | smallint
Sample results
name
--------------
Jon Stephens
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Sample results
actor_id
----------
1
Instruction
• Write a query to return the film category id with the most films, as well as the number of
films in that category.
Table: film_category
A film can only belong to one category.
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Sample results
category_id | film_cnt
------------+----------
1 | 2
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Table: film_actor
Films and their casts
col_name | col_type
-------------+--------------------------
actor_id | smallint
film_id | smallint
Sample results
first_name | last_name
------------+-----------
MICHAEL | JACKSON
• Write a query to return the first and last name of the customer who spent the most on
movie rentals in Feb 2020.
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
first_name | last_name
-----------+-----------
JAMES | BOND
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
first_name | last_name
------------+-----------
JENNIFER | ANISTON
Sample results
avg
--------------------
1.234567
Sample results
avg
--------------------
1.23456789
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
title
------------------------
ACADEMY DINOSAUR
ARABIA DOGMA
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
title
--------------
SHORT FILM
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
title
--------------
SECOND SHORTEST
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
title
------------------
LARGEST MOVIE
• Write a query to return the title of the film with the second largest cast.
• If there are ties, e.g., two movies have the same number of actors, return either one of
the movies.
Table: film_actor
Films and their casts
col_name | col_type
-------------+--------------------------
actor_id | smallint
film_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
title
-----------
SECOND LARGEST
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
first_name | last_name
------------+-----------
MARK | ZUCKERBERG
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
count
-------
1234
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
Sample results
title
-------------------------
AGENT TRUMAN
ALABAMA DEVIL
AMERICAN CIRCUS
ANGELS LIFE
Instructions
• Write a query to return the number of films with no rentals in Feb 2020.
• Count the entire movie catalog from the film table.
Table: inventory
Each row is unique; inventory_id is the primary key of the table
col_name | col_type
--------------+--------------------------
inventory_id | integer
film_id | smallint
store_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
count
-------
123
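A minimal sketch of one way to answer this, assuming PostgreSQL: start from the full film catalog and keep only films for which no February 2020 rental exists through any of their inventory copies. The half-open date range is a choice made here, not something the question prescribes.

```sql
-- Sketch only: counts every film, including films with no inventory at all.
SELECT COUNT(*)
FROM film AS f
WHERE NOT EXISTS (
    SELECT 1
    FROM inventory AS i
    JOIN rental AS r ON r.inventory_id = i.inventory_id
    WHERE i.film_id = f.film_id
      AND r.rental_ts >= '2020-02-01'
      AND r.rental_ts <  '2020-03-01'   -- literals use the session time zone
);
```

Starting from film rather than inventory matters: films that were never stocked still belong to the catalog and must be counted.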
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
count
-------
123
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
title
------------------------
ACADEMY DINOSAUR
APACHE DIVINE
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
film_category | count
---------------+-------
medium | 1
long | 2
short | 3
Table: search
col_name| col_type
----------+------------
country | varchar(2)
date | date
user_id | integer
search_id | integer
query | text
Sample results
num_searches
--------------------
1234567899
Table: search
col_name| col_type
----------+------------
country | varchar(2)
date | date
user_id | integer
search_id | integer
query | text
Sample results
top_search_term
--------------------
Joe Biden
• Write a query to compute the click-through rate for the search results on New Year's Day (2021-01-01).
• Click-through rate: the share of searches that end up with at least one click.
• Convert your result into a percentage (* 100.0).
Table: search_result
col_name | col_type
-------------+--------------------------
date | date
search_id | bigint
result_id | bigint
result_type | varchar(20)
action | varchar(20)
Sample results
ctr
----------
2.34
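The computation can be sketched as follows, assuming PostgreSQL. Two assumptions are not stated in the question: a click is recorded as action = 'click', and every search of the day has at least one row in search_result.

```sql
-- Sketch only: the literal 'click' is an assumed value of the action column.
SELECT 100.0
       * COUNT(DISTINCT CASE WHEN action = 'click' THEN search_id END)
       / NULLIF(COUNT(DISTINCT search_id), 0) AS ctr   -- NULLIF avoids division by zero
FROM search_result
WHERE date = '2021-01-01';
```

COUNT(DISTINCT ...) is used on both sides because one search can return many results and receive several clicks.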
Question 85. Top 5 queries based on click-through rate on New Year's Day difficult
Instructions
• Write a query to return the top 5 search terms with the highest click-through rate on New Year's Day (2021-01-01).
• The search term has to be searched by more than 2 (>2) distinct users.
• Click-through rate: the share of searches that end up with at least one click.
Table: search
col_name| col_type
----------+------------
country | varchar(2)
date | date
user_id | integer
search_id | integer
query | text
Table: search_result
col_name | col_type
-------------+--------------------------
date | date
search_id | bigint
result_id | bigint
result_type | varchar(20)
action | varchar(20)
Sample results
query
------------
covid
exit
biden
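A hedged sketch for Question 85, assuming PostgreSQL and assuming again that a click is stored as action = 'click' (the question does not list the possible values). The CTE name clicked is illustrative.

```sql
-- Sketch only: 'click' is an assumed value; clicked is an illustrative name.
WITH clicked AS (
    SELECT DISTINCT search_id
    FROM search_result
    WHERE date = '2021-01-01'
      AND action = 'click'
)
SELECT s.query
FROM search AS s
LEFT JOIN clicked AS c ON c.search_id = s.search_id
WHERE s.date = '2021-01-01'
GROUP BY s.query
HAVING COUNT(DISTINCT s.user_id) > 2            -- more than 2 distinct users
ORDER BY COUNT(DISTINCT c.search_id) * 1.0
         / COUNT(DISTINCT s.search_id) DESC     -- clicked searches / all searches
LIMIT 5;
```

The LEFT JOIN keeps searches without any click in the denominator, so they lower the rate instead of disappearing.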
col_name | col_type
-------------+--------------------------
song_id | bigint
title | varchar(1000)
artist_id | bigint
Table: song_plays
Number of times a song is played (streamed), aggregated on daily basis.
col_name | col_type
-------------+--------------------------
date | date
country | varchar(2)
song_id | bigint
num_plays | bigint
Sample results
title
------------
Eminence Front
Table: artist
col_name | col_type
-------------+--------------------------
artist_id | bigint
name | VARCHAR(255)
Table: actor
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Table: film_actor
Films and their casts
col_name | col_type
-------------+--------------------------
actor_id | smallint
film_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
first_name | last_name
------------+-----------
GARY | PHOENIX
DUSTIN | TAUTOU
Instructions
• Return the name of the category that has the most films.
• If there are ties, return just one of them.
Table: film_category
A film can only belong to one category
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Table: category
Movie categories.
col_name | col_type
-------------+--------------------------
category_id | integer
name | text
Sample results
name
--------
Category Name
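One way this could look, as a sketch in PostgreSQL-style SQL: join film_category to category, count the films per category, and keep the top row.

```sql
-- Sketch only: LIMIT 1 picks an arbitrary category when several are tied.
SELECT c.name
FROM film_category AS fc
JOIN category AS c ON c.category_id = fc.category_id
GROUP BY c.category_id, c.name
ORDER BY COUNT(*) DESC
LIMIT 1;
```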
Table: category
Movie categories.
col_name | col_type
-------------+--------------------------
category_id | integer
name | text
Sample results
category_id | name
-------------+--------
123 | Category
Table: actor
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Table: film_actor
Films and their casts
col_name | col_type
-------------+--------------------------
actor_id | smallint
film_id | smallint
Sample results
actor_id | first_name | last_name
----------+------------+-----------
1234 | FIRST_NAME | LAST_NAME
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
film_id | title
---------+--------------------
12345 | MOVIE TITLE 1
12346 | MOVIE TITLE 2
12347 | MOVIE TITLE 3
12348 | MOVIE TITLE 4
12349 | MOVIE TITLE 5
Table: actor
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Table: film_actor
Films and their casts
col_name | col_type
-------------+--------------------------
actor_id | smallint
film_id | smallint
Sample results
actor_category | count
-----------------+-------
less productive | 123
productive | 456
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
in_stock | count
--------------+-------
in stock | 123
not in stock | 456
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
has_rented | count
--------------+-------
rented | 123
never-rented | 456
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
demand_category | count
-----------------+-------
in-demand | 123
not-in-demand | 456
Instructions
• For movies that are not in demand (rentals = 0 in May 2020), we want to remove them from our inventory.
• Write a query to return the number of unique inventory_id values from those movies with 0 demand.
• Hint: a movie can have multiple inventory_id values.
Table: inventory
Each row is unique; inventory_id is the primary key of the table
col_name | col_type
--------------+--------------------------
inventory_id | integer
film_id | smallint
store_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
count
-------
12345
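A sketch of one reading of this question, assuming PostgreSQL: a movie has zero demand when none of its inventory copies was rented in May 2020, and every copy of such a movie is counted. The aliases i2 and r are illustrative.

```sql
-- Sketch only: counts copies whose film had no rental at all in May 2020.
SELECT COUNT(*)
FROM inventory AS i
WHERE NOT EXISTS (
    SELECT 1
    FROM inventory AS i2
    JOIN rental AS r ON r.inventory_id = i2.inventory_id
    WHERE i2.film_id = i.film_id        -- any copy of the same film
      AND r.rental_ts >= '2020-05-01'
      AND r.rental_ts <  '2020-06-01'
);
```

Because inventory_id is the primary key, COUNT(*) already counts unique inventory ids. The check is per film, not per copy: an idle copy of a film that was rented through another copy is not counted.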
Question 46. Actors and customers whose last name starts with 'A' easy
Instructions
• Write a query to return unique names (first_name, last_name) of our customers and actors whose last name starts with the letter 'A'.
Table: actor
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
first_name | last_name
------------+-----------
KENT | ARSENAULT
JOSE | ANDREW
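A short sketch, assuming PostgreSQL and names stored in upper case as in the sample results. UNION (without ALL) removes duplicate name pairs across the two tables.

```sql
-- Sketch only: UNION deduplicates (first_name, last_name) pairs.
SELECT first_name, last_name
FROM customer
WHERE last_name LIKE 'A%'
UNION
SELECT first_name, last_name
FROM actor
WHERE last_name LIKE 'A%';
```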
• Write a query to return all actors and customers whose first names end in 'D'.
• Return their ids (for actor: use actor_id, customer: customer_id), first_name and last_name.
• The order of your results doesn't matter.
Table: actor
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
customer_id | first_name | last_name
-------------+------------+--------------
55 | DORIS | REED
65 | ROSE | HOWARD
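A possible sketch, assuming PostgreSQL. UNION ALL is used here on the assumption that an actor and a customer who happen to share an id and a name are still two different people; the question itself does not say which set operator to use.

```sql
-- Sketch only: output column names come from the first SELECT.
SELECT customer_id, first_name, last_name
FROM customer
WHERE first_name LIKE '%D'
UNION ALL
SELECT actor_id, first_name, last_name
FROM actor
WHERE first_name LIKE '%D';
```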
Table: actor_tv
Actors who appeared in a TV show.
col_name | col_type
------------+-------------------
actor_id | integer
first_name | character varying
last_name | character varying
Sample results
actor_id | first_name | last_name
----------+-------------+-------------
1 | PENELOPE | GUINESS
4 | JENNIFER | DAVIS
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: film_category
A film can only belong to one category
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Table: category
Movie categories.
col_name | col_type
-------------+--------------------------
category_id | integer
name | text
Sample results
name | revenue
-------------+---------
Sports | 123
Sci-Fi | 456
Animation | 789
• Write a query to return the names of the top 5 cities with the most rental revenues in
2020.
• Include each city's revenue in the second column.
• The order of your results doesn't matter.
• If there are ties, return any one of them.
• Your results should have exactly 5 rows.
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: address
col_name | col_type
-------------+--------------------------
address_id | integer
address | text
address2 | text
district | text
city_id | smallint
postal_code | text
phone | text
Table: city
col_name | col_type
-------------+--------------------------
city_id | integer
city | text
country_id | smallint
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
city | sum
----------------------------+--------
Cape Coral | 221.55
Saint-Denis | 216.54
Aurora | 198.50
City 4 | 12.34
City 5 | 1.23
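One way to sketch this in PostgreSQL-style SQL, assuming revenue is attributed to the city of the paying customer's address (the question does not say whether the customer's or the store's city is meant):

```sql
-- Sketch only: revenue is attributed to the customer's city (an assumption).
SELECT ci.city, SUM(p.amount)
FROM payment AS p
JOIN customer AS cu ON cu.customer_id = p.customer_id
JOIN address  AS a  ON a.address_id = cu.address_id
JOIN city     AS ci ON ci.city_id = a.city_id
WHERE p.payment_ts >= '2020-01-01'
  AND p.payment_ts <  '2021-01-01'
GROUP BY ci.city_id, ci.city       -- city_id keeps same-named cities apart
ORDER BY SUM(p.amount) DESC
LIMIT 5;
```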
Table: actor_tv
Actors who appeared in a TV show.
col_name | col_type
------------+-------------------
actor_id | integer
first_name | character varying
last_name | character varying
Sample results
first_name | last_name
-------------+-------------
ED | CHASE
ZERO | CAGE
CUBA | OLIVIER
Instructions
• Write a query to return the film_id with movie-only casts (actors who never appeared in TV).
• The order of your results doesn't matter.
• You should exclude movies with one or more TV actors.
Table: film_actor
Films and their casts
col_name | col_type
-------------+--------------------------
actor_id | smallint
film_id | smallint
Table: actor_tv
Actors who appeared in a TV show.
col_name | col_type
------------+-------------------
actor_id | integer
first_name | character varying
last_name | character varying
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Sample results
film_id
---------
174
201
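A minimal sketch, assuming PostgreSQL: left join each cast member to actor_tv and keep the films where no cast member found a match.

```sql
-- Sketch only: films with no rows in film_actor are not returned.
SELECT fa.film_id
FROM film_actor AS fa
LEFT JOIN actor_tv AS t ON t.actor_id = fa.actor_id
GROUP BY fa.film_id
HAVING COUNT(t.actor_id) = 0;   -- no cast member appears in actor_tv
```

COUNT(t.actor_id) ignores the NULLs produced by the LEFT JOIN, so it counts only the TV actors in each cast.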
Instructions
• Write a query to return the number of films in 3 separate groups: high, medium, low.
• The order of your results doesn't matter.
Definition
• high: revenue >= $100.
• medium: revenue >= $20, < $100.
• low: revenue < $20.
Hint
• If a movie has no rental revenue, it belongs to the low group.
Table: inventory
Each row is unique; inventory_id is the primary key of the table
col_name | col_type
--------------+--------------------------
inventory_id | integer
film_id | smallint
store_id | smallint
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
film_group | count
------------+-------
medium | 123
high | 456
low | 789
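The grouping could be sketched as below, assuming PostgreSQL. LEFT JOINs from film keep movies without any payment, and COALESCE turns their missing revenue into 0 so they land in the low group. The CTE name film_revenue is illustrative.

```sql
-- Sketch only: film_revenue is an illustrative name.
WITH film_revenue AS (
    SELECT f.film_id, COALESCE(SUM(p.amount), 0) AS revenue
    FROM film AS f
    LEFT JOIN inventory AS i ON i.film_id = f.film_id
    LEFT JOIN rental    AS r ON r.inventory_id = i.inventory_id
    LEFT JOIN payment   AS p ON p.rental_id = r.rental_id
    GROUP BY f.film_id
)
SELECT CASE
           WHEN revenue >= 100 THEN 'high'
           WHEN revenue >= 20  THEN 'medium'
           ELSE 'low'
       END AS film_group,
       COUNT(*)
FROM film_revenue
GROUP BY 1;
```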
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
customer_group | count
---------------+-------
high | 123
medium | 456
low | 789
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
date_category | count
---------------+-------
busy | 10
slow | 21
• Write a query to return the total number of actors from actor_tv, actor_movie with FULL OUTER JOIN.
• Use COALESCE to return the first non-null value from a list.
• Actors who appear in both TV and movies share the same value of actor_id in both actor_tv and actor_movie tables.
Table: actor_movie
Actors who appeared in a movie.
col_name | col_type
------------+-------------------
actor_id | integer
first_name | character varying
last_name | character varying
Table: actor_tv
Actors who appeared in a TV show.
col_name | col_type
------------+-------------------
actor_id | integer
first_name | character varying
last_name | character varying
Sample results
count
-------
123
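A short sketch of the hinted approach, assuming PostgreSQL: the FULL OUTER JOIN keeps actors from either table, and COALESCE picks whichever actor_id is present.

```sql
-- Sketch only: each actor is counted once, whether in TV, movies, or both.
SELECT COUNT(DISTINCT COALESCE(t.actor_id, m.actor_id))
FROM actor_tv AS t
FULL OUTER JOIN actor_movie AS m ON m.actor_id = t.actor_id;
```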
Table: actor_tv
Actors who appeared in a TV show.
col_name | col_type
------------+-------------------
actor_id | integer
first_name | character varying
last_name | character varying
Sample results
count
-------
123
col_name | col_type
-------------+--------------------------
song_id | bigint
title | varchar(1000)
artist_id | bigint
Table: song_plays
Number of times a song is played (streamed), aggregated on daily basis.
col_name | col_type
-------------+--------------------------
date | date
country | varchar(2)
song_id | bigint
num_plays | bigint
Sample results
country | song_name
--------------+-----------------
US | Superhero
UK | Eminence Front
WINDOW FUNCTIONS
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
film_id | revenue_percentage
---------+------------------------
1 | 0.05454153589380405482
2 | 0.07851192534291674250
3 | 0.05618801685225177037
4 | 0.13612392572679896957
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: film_category
A film can only belong to one category
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Table: category
Movie categories.
col_name | col_type
-------------+--------------------------
category_id | integer
name | text
Sample results
film_id | category_name | revenue_percent_category
---------+---------------+--------------------------
1 | Documentary | 0.87183937479845975834
2 | Horror | 1.4218786097664498
3 | Documentary | 0.89815815929740700696
4 | Horror | 2.4652522202582108
5 | Family | 1.2276180943524362
Table: film_category
A film can only belong to one category
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Table: category
Movie categories.
col_name | col_type
-------------+--------------------------
category_id | integer
name | text
Sample results
film_id | category_name | rentals | avg_rentals_category
---------+---------------+---------+----------------------
1 | Documentary | 23 | 16.6666666666666667
2 | Horror | 7 | 15.9622641509433962
3 | Documentary | 12 | 16.6666666666666667
4 | Horror | 23 | 15.9622641509433962
Instructions
• Write a query to return a customer's lifetime value (LTV) for the following: customer_id IN (1, 100, 101, 200, 201, 300, 301, 400, 401, 500).
• Add a column to compute the average LTV of all customers from the same store.
• Return 4 columns: customer_id, store_id, customer total spend, average customer spend from the same store.
• The order of your results doesn't matter.
Hint
• Assumption: a customer can only be associated with one store.
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
customer_id | store_id | ltd_spend | avg
-------------+----------+-----------+----------------------
101 | 1 | 96.76 | 113.3933333333333333
500 | 1 | 115.72 | 113.3933333333333333
201 | 1 | 108.75 | 113.3933333333333333
300 | 1 | 137.69 | 113.3933333333333333
100 | 1 | 102.76 | 113.3933333333333333
1 | 1 | 118.68 | 113.3933333333333333
200 | 2 | 136.73 | 107.2550000000000000
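A sketch using a window function, assuming PostgreSQL. In the sample results, the store 1 average (113.39...) equals the mean of the six listed store 1 customers, so this sketch filters to the listed customers first and then averages per store. The CTE name customer_spend is illustrative.

```sql
-- Sketch only: the average is taken over the listed customers, per store,
-- which is what the sample results appear to show.
WITH customer_spend AS (
    SELECT p.customer_id, c.store_id, SUM(p.amount) AS ltd_spend
    FROM payment AS p
    JOIN customer AS c ON c.customer_id = p.customer_id
    WHERE p.customer_id IN (1, 100, 101, 200, 201, 300, 301, 400, 401, 500)
    GROUP BY p.customer_id, c.store_id
)
SELECT customer_id,
       store_id,
       ltd_spend,
       AVG(ltd_spend) OVER (PARTITION BY store_id) AS avg
FROM customer_spend;
```

To average over every customer of the store instead, move the customer_id filter from the CTE to the outer query.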
film_id | smallint
category_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: category
Movie categories.
col_name | col_type
-------------+--------------------------
category_id | integer
name | text
Sample results
film_id | title | length | category | row_num
---------+---------------------+--------+-------------+---------
869 | SUSPECTS QUILLS | 47 | Action | 1
243 | DOORS PRESIDENT | 49 | Animation | 1
505 | LABYRINTH LEAGUE | 44 | Children | 1
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
store_id | customer_id | revenue | ranking
----------+-------------+---------+---------
1 | 148 | 216.54 | 1
1 | 144 | 195.58 | 2
1 | 459 | 186.62 | 3
1 | 468 | 175.61 | 4
1 | 236 | 175.58 | 5
2 | 526 | 221.55 | 1
2 | 178 | 194.61 | 2
2 | 137 | 194.61 | 2
2 | 469 | 177.60 | 3
2 | 181 | 174.66 | 4
2 | 259 | 170.67 | 5
Instructions
• Write a query to return the top 2 films based on their rental revenues in their category.
• A film can only belong to one category.
• The order of your results doesn't matter.
• If there are ties, return just one of them.
• Return the following columns: category, film_id, revenue, row_num
Table: inventory
Each row is unique; inventory_id is the primary key of the table
col_name | col_type
--------------+--------------------------
inventory_id | integer
film_id | smallint
store_id | smallint
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: film_category
A film can only belong to one category
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Table: category
Movie categories.
col_name | col_type
-------------+--------------------------
category_id | integer
name | text
Sample results
category | film_id | revenue | row_num
-------------+---------+---------+---------
Action | 327 | 175.77 | 1
Action | 21 | 167.78 | 2
Animation | 239 | 178.70 | 1
Animation | 865 | 170.76 | 2
Children | 48 | 158.81 | 1
Children | 409 | 132.80 | 2
Classics | 843 | 141.77 | 1
Classics | 131 | 137.76 | 2
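One possible sketch, assuming PostgreSQL: compute revenue per film, number the films within each category by revenue, and keep the first two. ROW_NUMBER is used because ties should yield just one row per position. The CTE names film_revenue and ranked are illustrative.

```sql
-- Sketch only: films without any payment are not ranked.
WITH film_revenue AS (
    SELECT c.name AS category, fc.film_id, SUM(p.amount) AS revenue
    FROM payment AS p
    JOIN rental        AS r  ON r.rental_id = p.rental_id
    JOIN inventory     AS i  ON i.inventory_id = r.inventory_id
    JOIN film_category AS fc ON fc.film_id = i.film_id
    JOIN category      AS c  ON c.category_id = fc.category_id
    GROUP BY c.name, fc.film_id
),
ranked AS (
    SELECT category, film_id, revenue,
           ROW_NUMBER() OVER (PARTITION BY category
                              ORDER BY revenue DESC) AS row_num
    FROM film_revenue
)
SELECT category, film_id, revenue, row_num
FROM ranked
WHERE row_num <= 2;
```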
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
film_id | revenue | percentile
---------+---------+------------
11 | 35.76 | 23
1 | 36.77 | 24
30 | 46.91 | 35
film_id | smallint
store_id | smallint
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: film_category
A film can only belong to one category
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Table: category
Movie categories.
col_name | col_type
-------------+--------------------------
category_id | integer
name | text
Sample results
category | film_id | revenue | row_num
-------------+---------+---------+---------
Action | 19 | 33.79 | 11
Animation | 18 | 32.78 | 13
Comedy | 7 | 82.85 | 35
Documentary | 1 | 36.77 | 17
Documentary | 3 | 37.88 | 19
Family | 5 | 51.88 | 30
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
film_id | num_rentals | quartile
---------+-------------+----------
30 | 9 | 1
20 | 10 | 1
21 | 22 | 4
Instructions
• Write a query to return the difference of the spend amount between the following customers' first movie rental and their second rental.
• customer_id in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10).
• Use first spend - second spend to compute the difference.
• Skip users who only rented once.
Hint
• You can use ROW_NUMBER to identify the first and second transactions.
• You can use LAG or LEAD to find the previous or following transaction amount.
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Sample results
customer_id | delta
-------------+-------
1 | 2.00
2 | 2.00
3 | -1.00
4 | 4.00
5 | 0.00
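Following the hint, a sketch in PostgreSQL-style SQL: number each customer's payments in time order and use LEAD to bring the second amount onto the first row. Ordering by payment_ts, with payment_id as a tie-breaker, is an assumption about what "first" and "second" rental mean. The names ordered, next_amount and rn are illustrative.

```sql
-- Sketch only: "first" and "second" are decided by payment_ts, then payment_id.
WITH ordered AS (
    SELECT customer_id,
           amount,
           LEAD(amount) OVER (PARTITION BY customer_id
                              ORDER BY payment_ts, payment_id) AS next_amount,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY payment_ts, payment_id) AS rn
    FROM payment
    WHERE customer_id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT customer_id,
       amount - next_amount AS delta      -- first spend - second spend
FROM ordered
WHERE rn = 1
  AND next_amount IS NOT NULL;            -- skip customers with a single payment
```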
Instructions
• Write a query to return the number of happy customers from May 24 (inclusive) to May
31 (inclusive).
Definition
• Happy customer: a customer who made at least 1 rental on each day of any 2 consecutive days.
Hint
• For customer 1, you can create the following temporary table:
• customer 1, first rental date, second rental date
• customer 1, second rental date, third rental date
• ..............
• customer 1, second last rental date, last rental date
• customer 1, last rental date, NULL
• As long as there is at least one row where the delta of the last 2 columns is not null and is less than or equal to 1 day, this customer must be a happy customer.
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
count
-------
123
(1 row)
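The hint above can be sketched as follows, assuming PostgreSQL. This variant first reduces rentals to distinct calendar dates per customer, so the delta between neighbouring rows is never 0 and the check becomes "exactly 1 day". The CTE names daily and paired are illustrative.

```sql
-- Sketch only: DATE(rental_ts) uses the session time zone.
WITH daily AS (
    SELECT DISTINCT customer_id, DATE(rental_ts) AS rental_date
    FROM rental
    WHERE DATE(rental_ts) BETWEEN '2020-05-24' AND '2020-05-31'
),
paired AS (
    SELECT customer_id,
           rental_date,
           LEAD(rental_date) OVER (PARTITION BY customer_id
                                   ORDER BY rental_date) AS next_date
    FROM daily
)
SELECT COUNT(DISTINCT customer_id)
FROM paired
WHERE next_date - rental_date = 1;   -- date - date yields a number of days
```

Rows whose next_date is NULL drop out on their own, because the comparison evaluates to NULL rather than true.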
Sample results
date | customer_id | daily_spend | cumulative_spend
------------+-------------+-------------+------------------
2020-05-25 | 1 | 2.99 | 2.99
2020-05-28 | 1 | 0.99 | 3.98
2020-06-15 | 1 | 16.97 | 20.95
2020-06-16 | 1 | 4.99 | 25.94
2020-06-18 | 1 | 5.98 | 31.92
2020-06-21 | 1 | 3.99 | 35.91
2020-07-08 | 1 | 11.98 | 47.89
2020-07-09 | 1 | 9.98 | 57.87
2020-07-11 | 1 | 7.99 | 65.86
2020-07-27 | 1 | 2.99 | 68.85
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
date | customer_id | daily_rental | cumulative_rentals
------------+-------------+--------------+--------------------
2020-05-27 | 3 | 1 | 1
2020-05-29 | 3 | 1 | 2
2020-06-16 | 3 | 2 | 4
2020-06-17 | 3 | 1 | 5
2020-06-19 | 3 | 1 | 6
2020-07-07 | 3 | 1 | 7
2020-07-08 | 3 | 1 | 8
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
customer_id | date
-------------+------------
1 | 2020-07-08
2 | 2020-07-29
3 | 2020-07-27
4 | 2020-07-30
5 | 2020-07-06
6 | 2020-07-10
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
avg
-------
65
col_name | col_type
-------------+--------------------------
actor_id | integer
first_name | text
last_name | text
Table: film_actor
Films and their casts
col_name | col_type
-------------+--------------------------
actor_id | smallint
film_id | smallint
Table: film_category
A film can only belong to one category
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Sample results
category_id | actor_id | num_movies
-------------+----------+------------
1 | 50 | 6
2 | 150 | 6
3 | 17 | 7
4 | 86 | 6
5 | 196 | 6
6 | 48 | 6
7 | 7 | 7
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: film_category
A film can only belong to one category
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
category_id | customer_id
-------------+-------------
1 | 363
2 | 526
3 | 467
4 | 293
5 | 459
col_name | col_type
-------------+--------------------------
address_id | integer
address | text
address2 | text
district | text
city_id | smallint
postal_code | text
phone | text
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
district | cat
---------+----------
New York | most
Kamado | least
Hint
• Use NTILE(100) to create percentiles.
• To save you some time, here is the CTE to create a movie's revenue by category.
WITH movie_rev_by_cat AS (
SELECT
F.film_id,
MAX(FC.category_id) AS category_id,
SUM(P.amount) AS revenue
FROM film F
INNER JOIN inventory I
ON I.film_id = F.film_id
INNER JOIN rental R
ON R.inventory_id = I.inventory_id
INNER JOIN payment P
ON P.rental_id = R.rental_id
INNER JOIN film_category FC
ON FC.film_id = F.film_id
GROUP BY F.film_id
)
Table: inventory
Each row is unique; inventory_id is the primary key of the table
col_name | col_type
--------------+--------------------------
inventory_id | integer
film_id | smallint
store_id | smallint
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: film_category
A film can only belong to one category
col_name | col_type
-------------+--------------------------
film_id | smallint
category_id | smallint
Table: film
col_name | col_type
----------------------+--------------------------
film_id | integer
title | text
description | text
release_year | integer
language_id | smallint
original_language_id | smallint
rental_duration | smallint
rental_rate | numeric
length | smallint
replacement_cost | numeric
rating | text
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
film_id | perc_by_cat
---------+-------------
1 | 17
3 | 19
5 | 30
2 | 22
4 | 36
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Table: customer
col_name | col_type
-------------+--------------------------
customer_id | integer
store_id | smallint
first_name | text
last_name | text
email | text
address_id | smallint
activebool | boolean
create_date | date
active | integer
Sample results
customer_id | store_id | quartile
-------------+----------+----------
1 | 1 | 2
2 | 1 | 2
3 | 1 | 1
4 | 2 | 1
Question 79. Spend difference between the last and the second last rentals difficult
Instruction
• Write a query to return the spend amount difference between the last and the second
last movie rentals for the following customers:
• customer_id IN (1,2,3,4,5,6,7,8,9,10).
• Skip customers if they made fewer than 2 rentals.
Hint
• Use ROW_NUMBER to determine the sequence of movie rentals.
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Sample results
customer_id | delta
-------------+-------
1 | 3.00
2 | 0.00
3 | 2.00
4 | -1.00
5 | -2.00
Instruction
• Write a query to return DoD (day-over-day) growth for each store from May 24
(inclusive) to May 31 (inclusive).
• DoD: (current_day / prev_day - 1) * 100.0
• Multiply DoD growth by 100.0 to get the percentage of growth.
• Use ROUND to convert DoD growth to the nearest integer.
Hint
• To save you some time, use the following CTE to create a store's daily revenue:
WITH store_daily_rev AS (
SELECT
I.store_id,
DATE(P.payment_ts) date,
SUM(amount) AS daily_rev
FROM
payment P
INNER JOIN rental R
ON R.rental_id = P.rental_id
INNER JOIN inventory I
ON I.inventory_id = R.inventory_id
WHERE DATE(P.payment_ts) >= '2020-05-01'
AND DATE(P.payment_ts) <= '2020-05-31'
GROUP BY I.store_id, DATE(P.payment_ts)
)
Table: inventory
Each row is unique; inventory_id is the primary key of the table
col_name | col_type
--------------+--------------------------
inventory_id | integer
film_id | smallint
store_id | smallint
Table: payment
Movie rental payment transactions table
col_name | col_type
--------------+--------------------------
payment_id | integer
customer_id | smallint
staff_id | smallint
rental_id | integer
amount | numeric
payment_ts | timestamp with time zone
Table: rental
col_name | col_type
--------------+--------------------------
rental_id | integer
rental_ts | timestamp with time zone
inventory_id | integer
customer_id | smallint
return_ts | timestamp with time zone
staff_id | smallint
Sample results
store_id | date | dod_growth
----------+------------+------------
1 | 2020-05-24 |
1 | 2020-05-25 | 2058
1 | 2020-05-26 | 137
1 | 2020-05-27 | 74
1 | 2020-05-28 | 166
1 | 2020-05-29 | 62
1 | 2020-05-30 | 109
1 | 2020-05-31 | 107
2 | 2020-05-24 |
2 | 2020-05-25 | 1794
• Write a query to return the top searched term in the US and UK on New Year's Day
(2021-01-01), separately.
• The order of your results doesn't matter.
• Rank them based on search volume.
Table: search
col_name| col_type
----------+------------
country | varchar(2)
date | date
user_id | integer
search_id | integer
query | text
Sample results
country | query
--------------+-------
US | Joe Biden
UK | David Beckham
Sample results
country | song_id
--------------+-------
US | 12345
UK | 23456
JP | 23456
CA | 12345
AU | 23456
Table: artist
col_name | col_type
-------------+--------------------------
artist_id | bigint
name | VARCHAR(255)
Table: song_plays
Number of times a song is played (streamed), aggregated on daily basis.
col_name | col_type
-------------+--------------------------
date | date
country | varchar(2)
song_id | bigint
num_plays | bigint
Sample results
country | artist_id
--------------+-------
US | 100
UK | 200
JP | 300
CA | 100
AU | 200
SOLUTIONS
SINGLE TABLE OPERATIONS
Instruction
• Write a query to return the total movie rental revenue for each month.
• You can use EXTRACT(MONTH FROM colname) and EXTRACT(YEAR FROM colname) to
extract month and year from a timestamp column.
Solution
SELECT
EXTRACT(YEAR FROM payment_ts) AS year,
EXTRACT(MONTH FROM payment_ts) AS mon,
SUM(amount) as rev
FROM payment
GROUP BY year, mon
ORDER BY year, mon;
WITH cust_tot_amt AS (
SELECT
customer_id,
SUM(amount) AS tot_amt
FROM payment
WHERE DATE(payment_ts) >= '2020-06-01'
AND DATE(payment_ts) <= '2020-06-30'
GROUP BY customer_id
)
SELECT
MIN(tot_amt) AS min_spend,
MAX(tot_amt) AS max_spend
FROM cust_tot_amt;
Solution
SELECT
last_name,
COUNT(*)
FROM actor
WHERE last_name IN ('DAVIS', 'BRODY', 'ALLEN', 'BERRY')
GROUP BY last_name;
Solution
SELECT
CASE WHEN first_name LIKE 'A%' THEN 'a_actors'
WHEN first_name LIKE 'B%' THEN 'b_actors'
WHEN first_name LIKE 'C%' THEN 'c_actors'
ELSE 'other_actors'
END AS actor_category,
COUNT(*)
FROM actor
GROUP BY actor_category;
Instruction
• Write a query to return the number of good days and bad days in May 2020 based on
the number of daily rentals.
• Return the results in one row with 2 columns from left to right: good_days, bad_days.
• good day: > 100 rentals.
• bad day: <= 100 rentals.
• Hint (for users who already know OUTER JOIN): you can use the dates table.
• Hint: be super careful about datetime columns.
• Hint: this problem could be tricky; feel free to explore the rental table and take a look
at some data.
Solution
-- For people following my course (who have not learned outer join yet)
WITH daily_rentals AS (
SELECT
DATE(rental_ts) AS dt,
COUNT(*) AS num_rentals
FROM rental
WHERE DATE(rental_ts) >= '2020-05-01'
AND DATE(rental_ts) <= '2020-05-31'
GROUP BY dt
)
SELECT
SUM(CASE WHEN num_rentals > 100 THEN 1
ELSE 0
END) AS good_days,
31 - SUM(CASE WHEN num_rentals > 100 THEN 1
ELSE 0
END) AS bad_days
FROM daily_rentals;
Solution
-- (For users who already know OUTER JOIN):
WITH daily_rentals AS (
SELECT
D.date AS dt,
COUNT(R.rental_id) AS num_rentals
FROM dates D
LEFT JOIN rental R
ON D.date = DATE(R.rental_ts)
WHERE D.date >= '2020-05-01'
Instruction
• Write a query to return the number of fast movie watchers vs slow movie watchers.
• fast movie watcher: on average, returns their rentals within 5 days.
• slow movie watcher: takes an average of >5 days to return their rentals.
• Most customers have multiple rentals over time; you need to first compute the number
of days for each rental transaction, then compute the average on the rounded up days.
e.g., if the rental period is 1 day and 10 hours, count it as 2 days.
• Skip the rentals that have not been returned yet, e.g., return_ts IS NULL.
• The order of your results doesn't matter.
• A customer can only rent one movie per transaction.
Solution
WITH average_rental_days AS (
SELECT
customer_id,
AVG(EXTRACT(days FROM (return_ts - rental_ts) ) + 1) AS average_days
FROM rental
WHERE return_ts IS NOT NULL
GROUP BY 1
)
SELECT CASE WHEN average_days <= 5 THEN 'fast_watcher'
WHEN average_days > 5 THEN 'slow_watcher'
ELSE NULL
END AS watcher_category,
COUNT(*)
FROM average_rental_days
GROUP BY watcher_category;
SELECT actor_id
FROM actor
WHERE first_name = 'GROUCHO'
AND last_name = 'WILLIAMS';
AND DATE(payment_ts) <= '2020-02-29'
GROUP BY customer_id
ORDER BY cust_amt DESC
LIMIT 1
)
SELECT first_name, last_name
FROM customer
WHERE customer_id IN (
SELECT customer_id
FROM cust_feb_spend
);
FROM payment
WHERE DATE(payment_ts) >= '2020-02-01'
AND DATE(payment_ts) <= '2020-02-29'
GROUP BY customer_id
)
SELECT AVG(cust_spend)
FROM cust_feb_spend
;
SELECT title
FROM film
WHERE film_id IN (
SELECT film_id
FROM film_casts_cnt
)
)
SELECT title
FROM film
WHERE film_id IN (
SELECT film_id
FROM shortest_2
ORDER BY length DESC
LIMIT 1
);
WITH out_film AS (
SELECT DISTINCT film_id
FROM inventory
WHERE inventory_id IN (
SELECT inventory_id
FROM rental
WHERE return_ts IS NULL
)
)
SELECT title
FROM film
WHERE film_id IN (
SELECT film_id
FROM out_film
)
;
Instruction
• Write a query to return the number of films with no rentals in Feb 2020.
• Count the entire movie catalog from the film table.
Solution
WITH rented_film AS (
SELECT DISTINCT film_id
FROM inventory
WHERE inventory_id IN(
SELECT inventory_id
FROM rental
WHERE DATE(rental_ts) >= '2020-02-01'
AND DATE(rental_ts) <= '2020-02-29'
)
)
SELECT COUNT(*)
FROM film
WHERE film_id NOT IN(
SELECT film_id
FROM rented_film
);
Write a query to return the total number of users who have searched on new year's day:
2021-01-01.
Solution
SELECT COUNT(DISTINCT user_id)
FROM search
WHERE date = '2021-01-01';
Solution
SELECT COUNT(user_id) FROM (
SELECT
user_id
FROM search
WHERE date = '2021-01-01'
GROUP BY 1
) X;
Question 85. Top 5 queries based on click through rate on new year's day difficult
Instruction
• Write a query to return the top 5 search terms with the highest click through rate on
new year's day (2021-01-01).
• The search term has to be searched by more than 2 (>2) distinct users.
• Click through rate: number of searches that end up with at least one click.
Solution
WITH click_through_rate AS (
SELECT
S.query,
Instruction
• For movies that are not in demand (rentals = 0 in May 2020), we want to remove them
from our inventory.
• Write a query to return the number of unique inventory_id from those movies with 0
demand.
• Hint: a movie can have multiple inventory_id.
Solution
SELECT COUNT(inventory_id )
FROM inventory I
INNER JOIN (
SELECT F.film_id
FROM film F
LEFT JOIN (
SELECT DISTINCT I.film_id
FROM inventory I
INNER JOIN (
SELECT inventory_id, rental_id
FROM rental
Question 46. Actors and customers whose last name starts with 'A' easy
Instruction
• Write a query to return unique names (first_name, last_name) of
our customers and actors whose last name starts with letter 'A'.
Solution
SELECT first_name, last_name
FROM customer
WHERE last_name LIKE 'A%'
UNION
SELECT first_name, last_name
FROM actor
WHERE last_name LIKE 'A%';
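The role of UNION in the query above can be seen on a small toy example. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for the book's PostgreSQL database, and the rows are made up for illustration.

```python
import sqlite3

# In-memory toy tables; the names and rows are hypothetical, not from the book's data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (first_name TEXT, last_name TEXT);
    CREATE TABLE actor (first_name TEXT, last_name TEXT);
    INSERT INTO customer VALUES ('ANNA', 'ADAMS'), ('BOB', 'BROWN');
    INSERT INTO actor VALUES ('ANNA', 'ADAMS'), ('KIM', 'ALLEN');
    """
)

query = """
    SELECT first_name, last_name FROM customer WHERE last_name LIKE 'A%'
    {op}
    SELECT first_name, last_name FROM actor WHERE last_name LIKE 'A%'
    ORDER BY last_name, first_name
"""

# UNION removes the duplicate ANNA ADAMS; UNION ALL keeps both copies.
unique_names = conn.execute(query.format(op="UNION")).fetchall()
all_names = conn.execute(query.format(op="UNION ALL")).fetchall()

print(unique_names)
print(len(all_names))
```

Because the question asks for unique names, UNION is the natural choice here; UNION ALL would return the shared name twice.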
Instruction
• Write a query to return the film_id with movie-only casts (actors who never appeared
in tv).
• The order of your results doesn't matter.
• You should exclude movies with one or more tv actors.
Solution
SELECT F.film_id
FROM film F
LEFT JOIN (
SELECT DISTINCT FA.film_id
FROM film_actor FA
INNER JOIN actor_tv T
ON T.actor_id = FA.actor_id
) X
ON F.film_id = X.film_id
WHERE X.film_id IS NULL;
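The LEFT JOIN ... IS NULL pattern above (often called an anti-join) can be checked on toy data. This is a sketch only, assuming Python's built-in sqlite3 module in place of PostgreSQL; the tables are cut down to the columns the query touches and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE film (film_id INTEGER);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    CREATE TABLE actor_tv (actor_id INTEGER);
    INSERT INTO film VALUES (1), (2), (3);
    INSERT INTO film_actor VALUES (10, 1), (11, 1), (11, 2), (12, 3);
    INSERT INTO actor_tv VALUES (10);  -- hypothetical: actor 10 also appeared on TV
    """
)

# X lists films with at least one TV actor; rows of film with no match in X
# come back with X.film_id = NULL, and those are the movie-only casts.
rows = conn.execute(
    """
    SELECT F.film_id
    FROM film F
    LEFT JOIN (
        SELECT DISTINCT FA.film_id
        FROM film_actor FA
        INNER JOIN actor_tv T ON T.actor_id = FA.actor_id
    ) X ON F.film_id = X.film_id
    WHERE X.film_id IS NULL
    ORDER BY F.film_id
    """
).fetchall()

print(rows)
```

Film 1 is excluded because one of its two actors is in actor_tv, even though the other is not.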
Instruction
• Write a query to return the number of films in 3 separate groups: high, medium, low.
• The order of your results doesn't matter.
Definition
• high: revenue >= $100.
• medium: revenue >= $20, <$100.
• low: revenue <$20.
Hint
• If a movie has no rental revenue, it belongs to the low group.
Solution
SELECT film_group, COUNT(*)
FROM (
SELECT
F.film_id,
CASE WHEN SUM(P.amount) >= 100 THEN 'high'
WHEN SUM(P.amount) >= 20 THEN 'medium'
FROM (
SELECT D.date,
CASE WHEN COUNT(*) >= 100 THEN 'busy' ELSE 'slow' END date_category
FROM dates D
LEFT JOIN (
SELECT * FROM rental
) R
ON D.date = DATE(R.rental_ts)
WHERE D.date >= '2020-05-01'
AND D.date <= '2020-05-31'
GROUP BY D.date
) X
GROUP BY date_category
;
Write a query to return the name of the top song in the US and UK yesterday,
respectively.
Solution
-- Parentheses keep each ORDER BY / LIMIT scoped to its own branch of the UNION ALL.
WITH top_song AS (
(
SELECT
P.song_id,
P.country,
MAX(S.name) AS song_name,
SUM(plays) num_plays
FROM daily_plays P
INNER JOIN song S
ON P.song_id = S.id
WHERE P.date = CURRENT_DATE - 1
AND country = 'US'
-- SUM() needs a GROUP BY to produce one row per song
GROUP BY P.song_id, P.country
ORDER BY num_plays DESC
LIMIT 1
)
UNION ALL
(
SELECT
P.song_id,
P.country,
MAX(S.name) AS song_name,
SUM(plays) num_plays
FROM daily_plays P
INNER JOIN song S
ON P.song_id = S.id
WHERE P.date = CURRENT_DATE - 1
AND country = 'UK'
GROUP BY P.song_id, P.country
ORDER BY num_plays DESC
LIMIT 1
)
)
SELECT country, song_name
FROM top_song;
WINDOW FUNCTIONS
Solution
WITH movie_revenue AS (
SELECT
I.film_id, SUM(P.amount) revenue
FROM payment P
INNER JOIN rental R
ON R.rental_id = P.rental_id
INNER JOIN inventory I
ON I.inventory_id = R.inventory_id
GROUP BY I.film_id
)
SELECT film_id, revenue * 100.0 / SUM(revenue) OVER() revenue_percentage
FROM movie_revenue
ORDER BY film_id
LIMIT 10
;
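To see what SUM(revenue) OVER() does in the query above, here is a minimal sketch on toy data. It assumes Python's built-in sqlite3 module (SQLite 3.25 or newer, for window functions) as a stand-in for PostgreSQL, and replaces the movie_revenue CTE with a small hypothetical table.

```python
import sqlite3

# Toy stand-in for the movie_revenue CTE; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_revenue (film_id INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO movie_revenue VALUES (?, ?)",
    [(1, 30.0), (2, 50.0), (3, 20.0)],
)

# OVER () has no PARTITION BY, so every row sees the grand total (100.0 here).
rows = conn.execute(
    """
    SELECT film_id,
           revenue * 100.0 / SUM(revenue) OVER () AS revenue_percentage
    FROM movie_revenue
    ORDER BY film_id
    """
).fetchall()

for film_id, pct in rows:
    print(film_id, pct)
```

Unlike a GROUP BY, the window aggregate keeps one output row per film while still exposing the total across all films.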
• Write a query to return the number of rentals per movie, and the average number of
rentals in its same category.
• You only need to return results for film_id <= 10.
• Return 4 columns: film_id, category name, number of rentals, and the average number
of rentals from its category.
Solution
WITH movie_rental AS (
SELECT
I.film_id,
COUNT(*) rentals
FROM rental R
INNER JOIN inventory I
ON I.inventory_id = R.inventory_id
GROUP BY I.film_id
)
SELECT
film_id,
category_name,
rentals,
avg_rentals_category
FROM (
SELECT
MR.film_id,
C.name category_name,
rentals,
AVG(rentals) OVER(PARTITION BY C.name) avg_rentals_category
FROM movie_rental MR
INNER JOIN film_category FC
ON FC.film_id = MR.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
) X
WHERE film_id <= 10
;
Instruction
• Write a query to return a customer's lifetime value for the following: customer_id IN
(1, 100, 101, 200, 201, 300, 301, 400, 401, 500).
• Add a column to compute the average LTV of all customers from the same store.
• Return 4 columns: customer_id, store_id, customer total spend, average customer
spend from the same store.
• The order of your results doesn't matter.
Hint
• Assumptions: a customer can only be associated with one store.
Solution
WITH customer_ltd_spend AS (
SELECT
P.customer_id,
MAX(store_id) store_id,
SUM(P.amount) ltd_spend
FROM payment P
INNER JOIN customer C
ON C.customer_id = P.customer_id
GROUP BY P.customer_id
)
SELECT
film_id,
title,
length,
category,
row_num
FROM movie_ranking
WHERE row_num = 1
;
Solution
SELECT
film_id,
title,
length,
category,
row_num
FROM (
SELECT
F.film_id,
F.title,
F.length,
C.name category,
ROW_NUMBER() OVER(PARTITION BY C.name ORDER BY F.length) row_num
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
) X
WHERE row_num = 1
;
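The ROW_NUMBER() ... PARTITION BY pattern above can be tried on a few rows. This is a sketch under assumptions: it uses Python's built-in sqlite3 module (SQLite 3.25+) rather than PostgreSQL, and it flattens film, film_category and category into one hypothetical table so the example stays short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical flattened table: category is stored directly on the film row.
    CREATE TABLE film (film_id INTEGER, title TEXT, length INTEGER, category TEXT);
    INSERT INTO film VALUES
        (1, 'A', 90, 'Action'),
        (2, 'B', 75, 'Action'),
        (3, 'C', 110, 'Comedy'),
        (4, 'D', 100, 'Comedy');
    """
)

# Numbering restarts at 1 inside each category; row_num = 1 is the shortest film.
rows = conn.execute(
    """
    SELECT film_id, category, length
    FROM (
        SELECT film_id, category, length,
               ROW_NUMBER() OVER (PARTITION BY category ORDER BY length) AS row_num
        FROM film
    ) X
    WHERE row_num = 1
    ORDER BY category
    """
).fetchall()

print(rows)
```

The filter on row_num has to sit in an outer query because a window function cannot be referenced in the WHERE clause of the same SELECT.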
Instruction
• Write a query to return top 2 films based on their rental revenues in their category.
• A film can only belong to one category.
• The order of your results doesn't matter.
• If there are ties, return just one of them.
• Return the following columns: category, film_id, revenue, row_num
Solution
WITH film_revenue AS (
SELECT
F.film_id,
MAX(C.name) AS category,
SUM(P.amount) revenue
FROM payment P
INNER JOIN rental R
ON R.rental_id = P.rental_id
INNER JOIN inventory I
ON I.inventory_id = R.inventory_id
INNER JOIN film F
ON F.film_id = I.film_id
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
GROUP BY F.film_id
)
SELECT * FROM (
SELECT
category,
FR.film_id,
revenue,
ROW_NUMBER() OVER(PARTITION BY category ORDER BY revenue DESC) row_num
FROM film_revenue FR
INNER JOIN film_category FC
ON FC.film_id = FR.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
) X
WHERE row_num <= 2;
SELECT
film_id,
revenue,
percentile
FROM film_revenue
WHERE film_id IN (1,10,11,20,21,30);
SELECT
category,
film_id,
revenue,
percentile
FROM (
SELECT
category,
FR.film_id,
revenue,
NTILE(100) OVER(PARTITION BY category ORDER BY revenue) percentile
FROM film_revenue_by_cat FR
INNER JOIN film_category FC
ON FC.film_id = FR.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
) X
WHERE film_id <=20
ORDER BY category, revenue;
Instruction
• Write a query to return the difference of the spend amount between the following
customers' first movie rental and their second rental.
• customer_id in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10).
• Use first spend - second spend to compute the difference.
• Skip users who only rented once.
Hint
• You can use ROW_NUMBER to identify the first and second transactions.
• You can use LAG or LEAD to find the previous or following transaction amount.
Solution
SELECT customer_id,
prev_amount - current_amount AS delta
FROM (
SELECT
customer_id,
payment_ts,
amount as current_amount,
LAG(amount, 1) OVER(PARTITION BY customer_id ORDER BY payment_ts ) AS
prev_amount,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_ts) AS
payment_idx
FROM payment
WHERE customer_id IN
(1,2,3,4,5,6,7,8,9,10)
) X
WHERE payment_idx = 2;
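A quick way to convince yourself that LAG plus ROW_NUMBER picks out "first spend - second spend" is to run the same shape of query on toy data. The sketch below assumes Python's built-in sqlite3 module (SQLite 3.25+) as a stand-in for PostgreSQL; the payment rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, payment_ts TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?, ?)",
    [
        (1, "2020-05-25 10:00:00", 5.0),  # customer 1, first payment
        (1, "2020-05-28 09:00:00", 3.0),  # customer 1, second payment
        (1, "2020-06-15 12:00:00", 7.0),
        (2, "2020-05-27 08:00:00", 4.0),  # customer 2 only paid once
    ],
)

# On the row with payment_idx = 2, LAG(amount, 1) holds the first payment.
rows = conn.execute(
    """
    SELECT customer_id, prev_amount - current_amount AS delta
    FROM (
        SELECT customer_id,
               amount AS current_amount,
               LAG(amount, 1) OVER (PARTITION BY customer_id ORDER BY payment_ts) AS prev_amount,
               ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY payment_ts) AS payment_idx
        FROM payment
    ) X
    WHERE payment_idx = 2
    """
).fetchall()

print(rows)
```

Customer 2 drops out on its own: with a single payment there is no row where payment_idx = 2.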
Instructions
• Write a query to return the number of happy customers from May 24 (inclusive) to May
31 (inclusive).
Definition
• Happy customer: a customer who made at least 1 rental on each day of any 2
consecutive days.
Hint
• For customer 1, you can create the following temporary table:
• customer 1, first rental date, second rental date
• customer 1, second rental date, third rental date
• ..............
• customer 1, second last rental date, last rental date
• customer 1, last rental date, NULL
• As long as there is at least one row where the delta between the last 2 columns is not NULL
and is less than or equal to 1 day, this customer must be a happy customer.
Solution
WITH customer_rental_date AS (
SELECT
customer_id,
DATE(rental_ts) AS rental_date
FROM rental
WHERE DATE(rental_ts) >= '2020-05-24'
AND DATE(rental_ts) <= '2020-05-31'
GROUP BY
customer_id,
DATE(rental_ts)
),
customer_rental_date_diff AS (
SELECT
customer_id,
rental_date AS current_rental_date,
LAG( rental_date, 1) OVER(PARTITION BY customer_id ORDER BY
rental_date) AS prev_rental_date
FROM customer_rental_date
)
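The listing above ends before the final SELECT. One possible way to finish the idea from the hint is sketched below; this is an assumption, not necessarily the author's query. It runs on Python's built-in sqlite3 module (SQLite 3.25+), where julianday() stands in for PostgreSQL date subtraction, and it starts from a hypothetical, already de-duplicated customer_rental_date table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_rental_date (customer_id INTEGER, rental_date TEXT)")
conn.executemany(
    "INSERT INTO customer_rental_date VALUES (?, ?)",
    [
        (1, "2020-05-24"), (1, "2020-05-25"), (1, "2020-05-30"),  # consecutive days
        (2, "2020-05-24"), (2, "2020-05-26"),                     # 2-day gap
        (3, "2020-05-31"),                                        # single rental day
    ],
)

# A NULL prev_rental_date makes the comparison NULL, so first rows are filtered out.
happy_customers = conn.execute(
    """
    WITH customer_rental_date_diff AS (
        SELECT customer_id,
               rental_date AS current_rental_date,
               LAG(rental_date, 1) OVER (PARTITION BY customer_id ORDER BY rental_date)
                   AS prev_rental_date
        FROM customer_rental_date
    )
    SELECT COUNT(DISTINCT customer_id)
    FROM customer_rental_date_diff
    WHERE julianday(current_rental_date) - julianday(prev_rental_date) <= 1
    """
).fetchone()[0]

print(happy_customers)
```

COUNT(DISTINCT customer_id) matters here because a customer with several consecutive-day pairs would otherwise be counted more than once.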
Instruction
• Write a query to return the cumulative daily spend for the following customers:
• customer_id in (1, 2, 3).
• Each day a user has a rental, return their total spent until that day.
• If there is no rental on that day, you can skip that day.
Solution
WITH customer_spend AS (
SELECT
DATE(payment_ts) date,
customer_id,
SUM(amount) AS daily_spend
FROM payment
WHERE customer_id IN (1, 2, 3)
GROUP BY DATE(payment_ts), customer_id
)
SELECT
date,
customer_id,
daily_spend,
SUM(daily_spend) OVER(PARTITION BY customer_id ORDER BY date)
cumulative_spend
FROM customer_spend;
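The running total produced by SUM(...) OVER(PARTITION BY ... ORDER BY ...) can be verified by hand on a few rows. The sketch below assumes Python's built-in sqlite3 module (SQLite 3.25+) in place of PostgreSQL, with the customer_spend CTE replaced by a hypothetical table of made-up daily totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_spend (date TEXT, customer_id INTEGER, daily_spend REAL)")
conn.executemany(
    "INSERT INTO customer_spend VALUES (?, ?, ?)",
    [
        ("2020-05-25", 1, 3.0),
        ("2020-05-28", 1, 1.0),
        ("2020-06-15", 1, 17.0),
        ("2020-05-27", 2, 5.0),
        ("2020-06-01", 2, 2.0),
    ],
)

# With ORDER BY inside OVER, the default frame runs from the partition's first
# row up to the current row, which is exactly a cumulative sum per customer.
rows = conn.execute(
    """
    SELECT date,
           customer_id,
           daily_spend,
           SUM(daily_spend) OVER (PARTITION BY customer_id ORDER BY date) AS cumulative_spend
    FROM customer_spend
    ORDER BY customer_id, date
    """
).fetchall()

for row in rows:
    print(row)
```

The total restarts for customer 2 because PARTITION BY customer_id gives each customer its own window.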
SELECT
date,
customer_id,
daily_rental,
SUM(daily_rental) OVER(PARTITION BY customer_id ORDER BY date)
cumulative_rentals
FROM customer_rentals;
• Any customers who made at least 10 movie rentals are happy customers. Write a query
to return the dates when the following customers became happy customers:
• customer_id in (1,2,3,4,5,6,7,8,9,10).
• You can skip a customer if he/she never became a 'happy customer'.
Solution
WITH cust_rental_dates AS (
SELECT
customer_id,
DATE(rental_ts) date,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY rental_ts)
rental_idx
FROM rental
WHERE customer_id IN (1,2,3,4,5,6,7,8,9,10)
)
SELECT
customer_id,
date
FROM cust_rental_dates
WHERE rental_idx = 10;
GROUP BY customer_id
) X
WHERE tenth_rental_ts IS NOT NULL
)Y;
ON R.rental_id = P.rental_id
INNER JOIN inventory I
ON I.inventory_id = R.inventory_id
INNER JOIN film F
ON F.film_id = I.film_id
INNER JOIN film_category FC
ON FC.film_id = F.film_id
GROUP BY P.customer_id, FC.category_id
)
Solution
WITH cust_revenue_by_cat AS (
SELECT
P.customer_id,
FC.category_id,
SUM(P.amount) AS revenue
FROM payment P
INNER JOIN rental R
ON R.rental_id = P.rental_id
INNER JOIN inventory I
ON I.inventory_id = R.inventory_id
INNER JOIN film F
ON F.film_id = I.film_id
INNER JOIN film_category FC
ON FC.film_id = F.film_id
GROUP BY P.customer_id, FC.category_id
)
SELECT category_id, customer_id
FROM (
SELECT
customer_id,
category_id,
ROW_NUMBER() OVER(PARTITION BY category_id ORDER BY revenue DESC) AS
rev_cat_idx
FROM cust_revenue_by_cat
) X
WHERE rev_cat_idx = 1;
ON A.address_id = C.address_id
GROUP BY A.district
)
SELECT
district,
'least' AS city_cat
FROM district_cust_cnt
WHERE cust_asc_idx = 1
UNION
SELECT
district,
'most' AS city_cat
FROM district_cust_cnt
WHERE cust_desc_idx = 1
;
ON FC.film_id = F.film_id
GROUP BY F.film_id
)
SELECT film_id, perc_by_cat
FROM (
SELECT film_id,
NTILE(100) OVER(PARTITION BY category_id ORDER BY revenue) AS
perc_by_cat
FROM movie_rev_by_cat
)X
WHERE film_id IN (1,2,3,4,5);
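NTILE(n) splits the ordered rows of each partition into n buckets of nearly equal size. The scaled-down sketch below uses NTILE(4) on 8 rows so the bucketing is easy to read; it assumes Python's built-in sqlite3 module (SQLite 3.25+) instead of PostgreSQL, and a hypothetical movie_rev_by_cat table with invented revenues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie_rev_by_cat (film_id INTEGER, category_id INTEGER, revenue REAL)"
)
# Eight films in a single category, with revenue 10, 20, ..., 80.
conn.executemany(
    "INSERT INTO movie_rev_by_cat VALUES (?, ?, ?)",
    [(i, 1, 10.0 * i) for i in range(1, 9)],
)

# NTILE(4) on 8 ordered rows puts 2 rows in each bucket; lowest revenue is bucket 1.
rows = conn.execute(
    """
    SELECT film_id,
           NTILE(4) OVER (PARTITION BY category_id ORDER BY revenue) AS quartile
    FROM movie_rev_by_cat
    ORDER BY film_id
    """
).fetchall()

print(rows)
```

With NTILE(100) the same logic applies, but a category with fewer than 100 films simply gives each film its own bucket number, starting from 1.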
Question 79. Spend difference between the last and the second last rentals difficult
Instruction
• Write a query to return the spend amount difference between the last and the second
last movie rentals for the following customers:
• customer_id IN (1,2,3,4,5,6,7,8,9,10).
• Skip customers if they made fewer than 2 rentals.
Hint
• Use ROW_NUMBER to determine the sequence of movie rentals.
Solution
WITH cust_spend_seq AS (
SELECT
customer_id,
payment_ts,
amount AS current_payment,
LAG(amount, 1) OVER(PARTITION BY customer_id ORDER BY payment_ts) AS
prev_payment,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_ts DESC)
AS payment_idx
FROM payment P
)
SELECT
customer_id,
current_payment - prev_payment AS delta
FROM cust_spend_seq
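WHERE customer_id IN(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
AND payment_idx = 1
-- skip customers with fewer than 2 payments (no previous payment to compare)
AND prev_payment IS NOT NULL;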
Instruction
• Write a query to return DoD (day-over-day) growth for each store from May 24
(inclusive) to May 31 (inclusive).
• DoD: (current_day / prev_day - 1) * 100.0
• Multiply DoD growth by 100.0 to get the percentage of growth.
• Use ROUND to convert DoD growth to the nearest integer.
Hint
• To save you some time, use the following CTE to create a store's daily revenue:
WITH store_daily_rev AS (
SELECT
I.store_id,
DATE(P.payment_ts) date,
SUM(amount) AS daily_rev
FROM
payment P
INNER JOIN rental R
ON R.rental_id = P.rental_id
INNER JOIN inventory I
ON I.inventory_id = R.inventory_id
WHERE DATE(P.payment_ts) >= '2020-05-01'
AND DATE(P.payment_ts) <= '2020-05-31'
GROUP BY I.store_id, DATE(P.payment_ts)
)
Solution
WITH store_daily_rev AS (
SELECT
I.store_id,
DATE(P.payment_ts) date,
SUM(amount) AS daily_rev
FROM
payment P
INNER JOIN rental R
ON R.rental_id = P.rental_id
INNER JOIN inventory I
ON I.inventory_id = R.inventory_id
WHERE DATE(P.payment_ts) >= '2020-05-01'
AND DATE(P.payment_ts) <= '2020-05-31'
GROUP BY I.store_id, DATE(P.payment_ts)
)
SELECT
store_id,
date,
ROUND( (daily_rev / LAG(daily_rev, 1) OVER(PARTITION BY store_id ORDER BY
date) -1) * 100.0 ) AS dod_growth
FROM store_daily_rev;
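The DoD formula can be checked on numbers small enough to compute by hand. This is a sketch only: it assumes Python's built-in sqlite3 module (SQLite 3.25+) rather than PostgreSQL, a hypothetical store_daily_rev table in place of the CTE, and revenue stored as REAL so the division is not integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store_daily_rev (store_id INTEGER, date TEXT, daily_rev REAL)")
conn.executemany(
    "INSERT INTO store_daily_rev VALUES (?, ?, ?)",
    [
        (1, "2020-05-24", 50.0),
        (1, "2020-05-25", 100.0),  # (100 / 50 - 1) * 100 = 100
        (1, "2020-05-26", 150.0),  # (150 / 100 - 1) * 100 = 50
        (2, "2020-05-24", 20.0),
        (2, "2020-05-25", 30.0),   # (30 / 20 - 1) * 100 = 50
    ],
)

# LAG returns NULL on each store's first day, so dod_growth is NULL there.
rows = conn.execute(
    """
    SELECT store_id,
           date,
           ROUND((daily_rev / LAG(daily_rev, 1) OVER (PARTITION BY store_id ORDER BY date) - 1)
                 * 100.0) AS dod_growth
    FROM store_daily_rev
    ORDER BY store_id, date
    """
).fetchall()

for row in rows:
    print(row)
```

The NULL on the first day matches the empty dod_growth cells for 2020-05-24 in the sample results.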
Solution
WITH song_rankings AS (
SELECT
P.song_id,
P.country,
ROW_NUMBER() OVER(PARTITION BY country ORDER BY num_plays DESC) AS
ranking
FROM daily_plays P
WHERE P.date = CURRENT_DATE - 1
)
SELECT
song_id,
country
FROM song_rankings
WHERE ranking = 1;
artist_ranking AS (
SELECT
artist_id,
country,
ROW_NUMBER() OVER(PARTITION BY country ORDER BY num_plays DESC) ranking
FROM artist_plays
)
RECAP
CONGRATULATIONS!
1. INNER JOIN;
3. NTILE;
4. LAG, LEAD
But more importantly, you've finished this ebook! That is incredible!
And if you are in a good mood 😁 , can I ask you for a favor?
I need to collect testimonials for this book. Could you please take a minute and write about your
experience on Twitter?
I hope you enjoyed this book, and good luck with your work, your job hunt, and your new SQL skills! If
you have any questions, please feel free to drop me a line at [email protected]
March 2021
Leon Wei