The Ultimate SQL Cheat Sheet For Data Scientists in Tech
This cheat sheet is a compilation of the top advanced SQL functions that I use the most as a Data
Scientist at Spotify. Some of these functions are more specific to the BigQuery environment.
Table of Contents
1. CTE (Common Table Expressions)
2. Window Functions
   ROW_NUMBER
   RANK
3. Conditional Functions
   CASE WHEN
   IF Statements
4. Running Totals
5. Moving Averages
6. LOGICAL_OR
7. COALESCE
8. GENERATE_DATE_ARRAY
9. ROLLUP
10. SAFE_DIVIDE
Disclaimer
This is not Spotify data nor is it based on it. I used ChatGPT to generate hypothetical data points. I
chose the music industry because it’s the type of data I’m the most proficient with and I can easily
draft queries from those samples. I also picked music data because it’s fun.
If you’ve found this Cheat Sheet useful, email me here to let me know →
[email protected]. I’d love to know that what I’m doing is useful and meaningful. It’ll
encourage me to work harder!
CTE (Common Table Expressions)
Caveats
In BigQuery, non-recursive CTEs are not materialized, meaning they are not stored as physical tables.
A CTE that is referenced several times in the main query may be recomputed for each reference,
which can lead to performance issues if the CTE is expensive.
Scenario
We want to find songs that have streams higher than their respective artist's average number of
streams.
Input Data → song_streams
3 Song C Artist X 50
Query
WITH ArtistAverage AS (
  SELECT
    artist,
    AVG(streams) AS avg_streams
  FROM song_streams
  GROUP BY artist
)
SELECT
  s.song_id,
  s.artist,
  s.streams,
  a.avg_streams
FROM song_streams s
JOIN ArtistAverage a ON s.artist = a.artist
WHERE s.streams > a.avg_streams
Output
"Song B" by "Artist X" has 150 streams, which is higher than the average of 100 streams for "Artist X".
Window Functions
Caveats
Window functions can lead to performance issues if used on large datasets without appropriate
indexing or partitioning.
Ensure that the ORDER BY clause in the window specification is set correctly. The order can
dramatically change the result.
Scenario
Let's use window functions to derive insights about song streams.
1. ROW_NUMBER
Objective: Assign a unique number to each row within a partition of the result set.
Query
SELECT
  artist,
  album,
  song_title,
  streams,
  ROW_NUMBER() OVER (PARTITION BY artist ORDER BY streams DESC) AS row_num
FROM song_streams
Output
The song with the highest streams for each artist gets the row number 1.
2. RANK
Objective: Assign a rank to each row within a partition of the result set. Rows with equal ORDER BY
values share the same rank.
Query
SELECT
  artist,
  album,
  song_title,
  streams,
  RANK() OVER (PARTITION BY artist ORDER BY streams DESC) AS rank
FROM song_streams
Output
Unlike ROW_NUMBER(), RANK() can assign the same rank to rows with the same streams count.
The difference between RANK() and ROW_NUMBER() lies in how they handle ties (duplicate values) within
the ordered data.
1. ROW_NUMBER():
It doesn't care about ties. Even if two (or more) rows have the same value in the order
column, they will be assigned different numbers. The specific row that gets the next number
in the case of a tie might be arbitrary depending on the database system.
2. RANK():
In the case of ties, it will assign the same rank to all tied rows. For instance, if two rows are
tied for 1st place, both will be given a rank of 1, but the next row will be given a rank of 3 (it
skips the rank 2).
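The tie behavior described above is easiest to see side by side. The sketch below uses a few hypothetical inline rows in place of the real song_streams table (two songs tied at 150 streams) and the alias stream_rank, which is illustrative only.

```sql
-- Hypothetical inline rows standing in for song_streams; Song A and Song B tie.
WITH song_streams AS (
  SELECT * FROM UNNEST([
    STRUCT('Artist X' AS artist, 'Song A' AS song_title, 150 AS streams),
    STRUCT('Artist X', 'Song B', 150),
    STRUCT('Artist X', 'Song C', 50)
  ])
)
SELECT
  artist,
  song_title,
  streams,
  -- Tied rows still get distinct numbers (1 and 2); which row gets 1 is not guaranteed.
  ROW_NUMBER() OVER (PARTITION BY artist ORDER BY streams DESC) AS row_num,
  -- Tied rows share rank 1; the next row gets rank 3.
  RANK() OVER (PARTITION BY artist ORDER BY streams DESC) AS stream_rank
FROM song_streams
ORDER BY artist, streams DESC, row_num
```

If the numbering from ROW_NUMBER() needs to be reproducible, adding a second ORDER BY key inside the window (for example song_title) breaks ties deterministically.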
Query
Output
For each song, the result displays the previous song and the next song (based on streams count) for
the same artist. If there's no previous or next song, the corresponding value is NULL.
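One way to produce the output described above is with LAG and LEAD. The sketch below is an assumption rather than the original query: it reuses the song_streams columns from the earlier examples, orders by streams descending, and uses the hypothetical aliases previous_song and next_song.

```sql
SELECT
  artist,
  song_title,
  streams,
  -- Song just before this one in the artist's ordering; NULL for the first row.
  LAG(song_title) OVER (PARTITION BY artist ORDER BY streams DESC) AS previous_song,
  -- Song just after this one in the artist's ordering; NULL for the last row.
  LEAD(song_title) OVER (PARTITION BY artist ORDER BY streams DESC) AS next_song
FROM song_streams
```

Both functions accept an optional offset and default value, e.g. LAG(song_title, 1, 'none'), if NULL is not the desired placeholder.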
Conditional Functions
1. CASE WHEN
CASE WHEN provides conditional logic for handling multiple conditions and results.
Caveats
Ensure that the conditions in the CASE WHEN statement are mutually exclusive.
The conditions are evaluated in the order they’re written. Once a condition is met, the
corresponding result is returned, and the remaining conditions are not evaluated. If conditions are
not mutually exclusive, only the first true condition encountered will determine the result.
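To illustrate why condition order matters, here is a small sketch with overlapping conditions. The inline rows, the 1000 threshold, and the 'Viral' label are hypothetical and only serve the illustration.

```sql
-- Hypothetical inline rows standing in for song_plays.
WITH song_plays AS (
  SELECT * FROM UNNEST([
    STRUCT(1 AS song_id, 1500 AS play_count),
    STRUCT(2, 150),
    STRUCT(3, 50)
  ])
)
SELECT
  song_id,
  play_count,
  -- Strictest condition first: 1500 is labeled 'Viral'.
  CASE
    WHEN play_count > 1000 THEN 'Viral'
    WHEN play_count > 100 THEN 'Popular'
    ELSE 'Less Popular'
  END AS tier_strict_first,
  -- Loosest condition first: 1500 already matches '> 100',
  -- so the 'Viral' branch is never reached.
  CASE
    WHEN play_count > 100 THEN 'Popular'
    WHEN play_count > 1000 THEN 'Viral'
    ELSE 'Less Popular'
  END AS tier_loose_first
FROM song_plays
ORDER BY song_id
```

For song 1 the two columns disagree ('Viral' vs. 'Popular'), which is the kind of silent bug that ordering conditions from most to least specific avoids.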
Scenario
Categorize songs as "Popular" if the play count is over 100 and "Less Popular" otherwise.
Input Data → song_plays
song_id play_count
1 150
2 50
Query
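SELECT
  song_id,
  play_count,
  CASE
    WHEN play_count > 100 THEN 'Popular'
    ELSE 'Less Popular'
  END AS popularity  -- alias name is assumed; the output column name is not shown
FROM song_plays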
Output
Songs with a play count greater than 100 are categorized as "Popular", while others are categorized
as "Less Popular".
1 150 Popular
2 50 Less Popular
2. IF Statements
The IF function is used to return one value if a condition is true, and another value if it's false.
Sometimes IF and CASE statements may be used interchangeably.
Caveats
IF statements are limited to just two outcomes. For multiple conditions, multiple nested IFs would
be needed, which can reduce readability.
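As a rough comparison, the sketch below writes the same three-way classification once with nested IFs and once with CASE. The inline rows, the release years, and the third category 'Classic' are hypothetical.

```sql
-- Hypothetical inline rows standing in for the songs table.
WITH songs AS (
  SELECT * FROM UNNEST([
    STRUCT(1 AS song_id, 'Song A' AS title, 2015 AS release_year),
    STRUCT(2, 'Song B', 2021),
    STRUCT(3, 'Song C', 1995)
  ])
)
SELECT
  song_id,
  title,
  -- Nested IFs: each extra outcome adds another level of nesting.
  IF(release_year >= 2020, 'Recent',
     IF(release_year >= 2000, 'Older', 'Classic')) AS nested_if_label,
  -- Equivalent CASE: one flat list of conditions.
  CASE
    WHEN release_year >= 2020 THEN 'Recent'
    WHEN release_year >= 2000 THEN 'Older'
    ELSE 'Classic'
  END AS case_label
FROM songs
ORDER BY song_id
```

Both columns return the same labels; the CASE form tends to stay readable as more categories are added.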
Scenario
We want to classify songs as "Recent" if they were released in 2020 or later, otherwise classify them
as "Older".
Input Data → songs
Query
SELECT
  song_id,
  title,
  IF(release_year >= 2020, 'Recent', 'Older') AS classification
FROM songs
Output
Songs released in 2020 or later are classified as "Recent", and those released before are classified
as "Older".
1 Song A Older
2 Song B Recent
Running Totals
Caveats
Running totals require ordered data. Ensure you've specified the correct order in your window
function.
When dealing with large datasets, computing running totals can be resource-intensive. Ensure
that your dataset is partitioned appropriately, and consider indexing if performance is an issue.
Scenario
We want to compute the running total of streams for each song, day by day.
Input Data → daily_song_streams
date song_name streams
2023-01-01 Song B 50
2023-01-02 Song B 75
2023-01-03 Song B 80
2023-01-04 Song B 95
Query
SELECT
  date,
  song_name,
  streams,
  SUM(streams) OVER (PARTITION BY song_name ORDER BY date ASC) AS running_total_streams
FROM daily_song_streams
ORDER BY song_name, date
Output
For "Song A", the running total starts with 100 streams on 2023-01-01 and accumulates to 700
streams by 2023-01-04. Similarly, for "Song B", the running total progresses from 50 streams on
2023-01-01 to 300 streams by 2023-01-04.
date song_name streams running_total_streams
2023-01-01 Song B 50 50
Moving Averages
Caveats
Ensure that the interval (e.g., 7 days) is appropriate for your data. Too short of an interval may not
smooth out fluctuations sufficiently, while too long of an interval might mask important details.
Scenario
We want to compute the 7-day moving average of streams for each song.
Input Data → daily_song_streams table from the previous example.
Query
SELECT
  date,
  song_name,
  streams,
  AVG(streams) OVER (
    PARTITION BY song_name
    ORDER BY date ASC
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_streams_7_day  -- renamed: an unquoted identifier cannot start with a digit
FROM daily_song_streams
ORDER BY song_name, date
Output
The 7-day moving average for "Song A" starts with 100 streams on 2023-01-01 and averages to 175
streams by 2023-01-04. For "Song B", the moving average starts with 50 streams and progresses to
75 streams by 2023-01-04.
2023-01-01 Song B 50 50
2023-01-04 Song B 95 75
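One thing to keep in mind: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW counts rows, not calendar days. If a song has no row for some dates, the window reaches back further than 7 days. A possible alternative, sketched below, is a RANGE frame over a numeric day number; the alias avg_streams_last_7_days is illustrative.

```sql
SELECT
  date,
  song_name,
  streams,
  AVG(streams) OVER (
    PARTITION BY song_name
    -- UNIX_DATE returns the number of days since 1970-01-01 as INT64,
    -- which lets RANGE measure the frame in calendar days.
    ORDER BY UNIX_DATE(date)
    RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_streams_last_7_days
FROM daily_song_streams
ORDER BY song_name, date
```

This averages only over the days that have a row. If missing days should count as 0 streams, the gaps need to be filled first (for example with a generated date range) before averaging.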
LOGICAL_OR
Caveats
Ensure that the data type you're working with is boolean. LOGICAL_OR operates on boolean
values and may throw errors if the datatype is incompatible.
Remember that it only checks if at least one value is TRUE. If you have multiple TRUE values,
the result is still TRUE.
Scenario
We want to identify users who played songs on the weekend and then count how many there are.
Input Data → user_song_play (Note: Here, 2023-01-01 and 2023-01-08 are Sundays and 2023-01-07 is a
Saturday, which are considered as the weekend.)
1 2023-01-01 Song A
1 2023-01-02 Song B
2 2023-01-02 Song A
2 2023-01-08 Song C
3 2023-01-03 Song B
4 2023-01-07 Song A
Query
WITH WeekendPlays AS (
  SELECT
    user_id,
    LOGICAL_OR(EXTRACT(DAYOFWEEK FROM date) IN (1, 7)) AS played_on_weekend
  FROM user_song_play
  GROUP BY user_id
)
SELECT
  COUNT(user_id) AS users_who_played_on_weekend
FROM WeekendPlays
WHERE played_on_weekend = TRUE
Output
In the example, users with IDs 1, 2, and 4 played songs during the weekend. So the total count of
users who played on the weekend is 3.
users_who_played_on_weekend
3
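To see what LOGICAL_OR returns before the final count, the intermediate step can be run on its own. The sketch below inlines the user and date columns from the input data above as a stand-in for user_song_play (the song column is left out because the flag does not depend on it).

```sql
-- Inline stand-in for user_song_play, restricted to user_id and date.
WITH user_song_play AS (
  SELECT * FROM UNNEST([
    STRUCT(1 AS user_id, DATE '2023-01-01' AS date),
    STRUCT(1, DATE '2023-01-02'),
    STRUCT(2, DATE '2023-01-02'),
    STRUCT(2, DATE '2023-01-08'),
    STRUCT(3, DATE '2023-01-03'),
    STRUCT(4, DATE '2023-01-07')
  ])
)
SELECT
  user_id,
  -- DAYOFWEEK is 1 for Sunday and 7 for Saturday in BigQuery.
  LOGICAL_OR(EXTRACT(DAYOFWEEK FROM date) IN (1, 7)) AS played_on_weekend
FROM user_song_play
GROUP BY user_id
ORDER BY user_id
-- Flags: user 1 TRUE, user 2 TRUE, user 3 FALSE, user 4 TRUE
```

User 1 has one weekend play and one weekday play; a single TRUE is enough for the aggregate to return TRUE.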
COALESCE
Caveats
Ensure the data types of the values you're evaluating with COALESCE are compatible, or you
may encounter errors or unexpected results. Example: COALESCE('Hello', DATE '2023-01-01') fails
because the arguments cannot be coerced to a common type.
Scenario
Display the subscription price for users, preferring the promotional price if available.
Input Data → subscription
1 10 NULL
2 NULL 5
Query
Output
For user 1, since the promotional price is NULL, the monthly price is taken as the subscription price.
For user 2, the promotional price is used.
user_id subscription_price
1 10
2 5
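A query along the following lines would produce the output above. It is a sketch under assumptions: the column names monthly_price and promotional_price are hypothetical, inferred from the description of the input data.

```sql
SELECT
  user_id,
  -- COALESCE returns the first non-NULL argument, so the promotional
  -- price wins whenever it is present; otherwise the monthly price is used.
  COALESCE(promotional_price, monthly_price) AS subscription_price
FROM subscription
```

If both prices are NULL for a user, the result is NULL; a literal default can be added as a final argument if that case needs a value.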
GENERATE_DATE_ARRAY
GENERATE_DATE_ARRAY produces an array of consecutive dates within a specified date range.
Caveats
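Ensure that the start date is earlier than or equal to the end date. With the default positive step,
a start date later than the end date returns an empty array rather than an error, so downstream
joins silently produce no rows.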
If you're joining this array with another dataset, be aware of potential Cartesian products if not
used correctly.
When is it Useful?
1. Data Gap Filling: If you have missing dates in a time series dataset, you can use this function to
generate a continuous sequence of dates and then LEFT JOIN your data to ensure there are
entries for every day, even if some days have no data. It helps me a lot in A/B testing.
3. Visualizations: For plotting data over a continuous date range, ensuring there are no date gaps
can help in producing consistent and accurate visualizations.
4. Analyzing Patterns: If you want to analyze user behavior or other patterns on specific days (e.g.,
weekends, holidays), you can generate a list of those dates within a range and then use it to filter
or aggregate your data based on those dates.
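As a small illustration of the pattern-analysis use case, the sketch below lists the weekend dates in a range. The date range is arbitrary and chosen only for the example.

```sql
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2023-10-01', DATE '2023-10-31')) AS day
-- DAYOFWEEK is 1 for Sunday and 7 for Saturday in BigQuery.
WHERE EXTRACT(DAYOFWEEK FROM day) IN (1, 7)
ORDER BY day
```

The resulting list can be joined to a fact table to keep only weekend activity. GENERATE_DATE_ARRAY also takes an optional step, e.g. INTERVAL 7 DAY, when only every n-th date is needed.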
Scenario
Generate a continuous sequence of dates from 2023-10-01 to 2023-10-06 and join it with the song
stream counts. If there's no data for a particular day, the stream count should be 0.
Input Data → daily_song_streams (Notice that the data for 2023-10-03 and 2023-10-05 are missing)
Query
WITH DateRange AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2023-10-01', '2023-10-06')) AS day
)
SELECT
  d.day AS date,
  COALESCE(s.song_id, 101) AS song_id, -- assuming 101 as default song_id
  COALESCE(s.streams, 0) AS streams
FROM DateRange d
LEFT JOIN daily_song_streams s ON d.day = s.date
ORDER BY d.day
Output
Using GENERATE_DATE_ARRAY, we've filled the data gaps for the missing dates with stream counts set to 0.
2023-10-03 101 0
2023-10-05 101 0
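The query above hard-codes a default song_id, which only works when the table holds a single song. A possible generalization, sketched below, builds every (date, song) combination first and then joins the data onto it. The Songs CTE and the alias g are illustrative names.

```sql
WITH DateRange AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2023-10-01', DATE '2023-10-06')) AS day
),
Songs AS (
  -- Every song that appears at least once in the data.
  SELECT DISTINCT song_id
  FROM daily_song_streams
)
SELECT
  d.day AS date,
  g.song_id,
  COALESCE(s.streams, 0) AS streams
FROM DateRange d
CROSS JOIN Songs g                      -- intentional: one row per date per song
LEFT JOIN daily_song_streams s
  ON s.date = d.day
 AND s.song_id = g.song_id
ORDER BY g.song_id, d.day
```

The CROSS JOIN here is deliberate and bounded (dates times songs); matching on both date and song_id in the LEFT JOIN is what prevents the accidental Cartesian product mentioned in the caveats.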
ROLLUP
Caveats
Ensure the columns in the ROLLUP are in a logical hierarchical order for meaningful aggregates.
NULL values in the ROLLUP output represent aggregated data (either subtotals or grand totals).
Consider using the COALESCE function if you want to replace these NULLs with a meaningful
descriptor.
Scenario
We want to get the total streams for songs by both artist and album, as well as subtotals by artist and
a grand total for all songs.
Input Data → song_streams
Query
SELECT
  artist,
  album,
  SUM(streams) AS total_streams
FROM song_streams
GROUP BY ROLLUP(artist, album)
Output
The ROLLUP function provides aggregates at multiple levels. Rows with NULL in the album column
represent the total streams for that artist across all albums. The row with both artist and album as
NULL represents the grand total streams for all songs.
1. The order of columns in the ROLLUP matters. The aggregation will be hierarchical based on the
order provided.
3. ROLLUP can be very useful for generating reports that require hierarchical aggregates without
running multiple queries.
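Following the COALESCE suggestion in the caveats, the subtotal and grand-total rows can be given readable labels. This is a sketch; the labels and the aliases artist_label and album_label are hypothetical. The aliases deliberately differ from the column names, since reusing a column name as an alias can make the GROUP BY reference ambiguous.

```sql
SELECT
  -- NULL produced by ROLLUP marks a subtotal or grand-total row.
  COALESCE(artist, 'All artists') AS artist_label,
  COALESCE(album, 'All albums') AS album_label,
  SUM(streams) AS total_streams
FROM song_streams
GROUP BY ROLLUP(artist, album)
ORDER BY artist_label, album_label
```

One limitation: if the data itself contains NULL artists or albums, COALESCE cannot tell those rows apart from the rollup rows, so this labeling is only safe when the grouped columns are never NULL.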
SAFE_DIVIDE
SAFE_DIVIDE is a function that safely divides two numbers, returning null if the denominator is zero
to prevent division by zero errors.
Caveats
While SAFE_DIVIDE protects against division by zero errors, ensure you handle or account for null
values in subsequent operations or aggregations.
Scenario
Calculate the ratio of current year's sales to last year's sales for various products.
Input Data → sales
102 400 0
Query
Output
product_id sales_growth_ratio
101 2.0
102 NULL
103
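A query of roughly this shape would produce the ratios shown above. The column names current_year_sales and last_year_sales are assumptions inferred from the scenario, not taken from the original table.

```sql
SELECT
  product_id,
  -- Returns NULL instead of raising an error when last_year_sales is 0.
  SAFE_DIVIDE(current_year_sales, last_year_sales) AS sales_growth_ratio
FROM sales
```

For product 102 the denominator is 0, so the ratio is NULL. Aggregates such as AVG skip NULLs, which means products with no prior-year sales silently drop out of an average growth figure unless they are handled explicitly.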