
The Ultimate SQL Cheat Sheet for Data Scientists in Tech


By Khouloud El Alami

This cheat sheet is a compilation of the top advanced SQL functions that I use the most as a Data
Scientist at Spotify. Some of these functions are more specific to the BigQuery environment.

Table of Contents
1. CTE (Common Table Expressions)

2. Window Functions

ROW_NUMBER

RANK

LAG and LEAD

3. Conditional Functions

CASE WHEN

IF Statements

4. Computing Running Totals

5. Computing Moving Averages over 7 days

6. LOGICAL_OR

7. COALESCE

8. GENERATE_DATE_ARRAY

9. ROLLUP

10. SAFE_DIVIDE

Disclaimer
This is not Spotify data nor is it based on it. I used ChatGPT to generate hypothetical data points. I
chose the music industry because it’s the type of data I’m the most proficient with and I can easily
draft queries from those samples. I also picked music data because it’s fun.

Three things you can do for me


Follow me on Medium, LinkedIn and X. Stay tuned, I’m working on a cool new project to help you
even more in your data journey. It’s coming soon!

If you’ve found this Cheat Sheet useful, email me here to let me know →
[email protected]. I’d love to know that what I’m doing is useful and meaningful. It’ll
encourage me to work harder!

CTE (Common Table Expressions)
CTEs provide a way to define temporary, named result sets that can be easily referenced within a SELECT,
INSERT, UPDATE, or DELETE statement in the same query.

Caveats
CTEs are not materialized, meaning they are not stored as physical tables. They are recomputed
every time they are referenced in the main query, which can lead to performance issues if not
used carefully.
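
If the same CTE ends up being referenced several times and recomputation becomes costly, one alternative in BigQuery scripting is to materialize the intermediate result as a temporary table. A minimal sketch, reusing the song_streams table from the scenario below:

-- Materialize the intermediate result once, then reference it as often as needed.
CREATE TEMP TABLE artist_average AS
SELECT
artist,
AVG(streams) AS avg_streams
FROM song_streams
GROUP BY artist;

SELECT
s.song_id,
s.artist,
s.streams,
a.avg_streams
FROM song_streams s
JOIN artist_average a ON s.artist = a.artist
WHERE s.streams > a.avg_streams;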

Scenario
We want to find songs that have streams higher than their respective artist's average number of
streams.

Input Data → song_streams

song_id song_name artist streams

1 Song A Artist X 100

2 Song B Artist X 150

3 Song C Artist X 50

4 Song D Artist Y 500

5 Song E Artist Y 250

6 Song F Artist Z 300

Query

WITH ArtistAverage AS (
SELECT
artist,
AVG(streams) AS avg_streams
FROM song_streams
GROUP BY artist
)
SELECT
s.song_id,
s.artist,
s.streams,
a.avg_streams
FROM song_streams s
JOIN ArtistAverage a ON s.artist = a.artist
WHERE s.streams > a.avg_streams

Output

"Song B" by "Artist X" has 150 streams, which is higher than the average of 100 streams for "Artist X".

song_id artist streams avg_streams

2 Artist X 150 100

4 Artist Y 500 375

Window Functions (The Beast)
Window functions allow us to perform calculations across a set of rows that are related to the
current row. Unlike aggregate functions used with GROUP BY, which collapse each group into a single
output row, window functions return a value for every row. The term "window" refers to the set of
rows the function operates over for the current row.
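
To make the contrast with aggregate functions concrete, here is a minimal sketch against the song_streams table introduced in the scenario below: the GROUP BY version collapses each artist into one row, while the window version keeps every song row and attaches the artist total to it.

-- Aggregate: one output row per artist
SELECT
artist,
SUM(streams) AS artist_streams
FROM song_streams
GROUP BY artist;

-- Window function: every song row is kept, with the artist total attached
SELECT
artist,
song_title,
streams,
SUM(streams) OVER(PARTITION BY artist) AS artist_streams
FROM song_streams;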

Caveats
Window functions can lead to performance issues if used on large datasets without appropriate
indexing or partitioning.

Ensure that the ORDER BY clause in the window specification is set correctly. The order can
dramatically change the result.

Scenario
Let's use window functions to derive insights about song streams.

Input Data → song_streams

song_id artist album song_title streams

1 ArtistA Album1 Song1 5000

2 ArtistA Album1 Song2 3000

3 ArtistA Album2 Song3 4000

4 ArtistB Album3 Song4 6000

5 ArtistB Album3 Song5 2000

1. ROW_NUMBER
Objective: Assign a unique number to each row within a partition of the result set.
Query

SELECT
artist,
album,
song_title,
streams,
ROW_NUMBER() OVER(PARTITION BY artist ORDER BY streams DESC) as row_num
FROM song_streams

Output

The song with the highest streams for each artist gets the row number 1.

artist album song_title streams row_num

ArtistA Album1 Song1 5000 1

ArtistA Album2 Song3 4000 2

ArtistA Album1 Song2 3000 3

ArtistB Album3 Song4 6000 1

ArtistB Album3 Song5 2000 2

2. RANK
Objective: Assign a rank to each row within a partition of the result set, with tied values receiving the same rank.
Query

SELECT
artist,
album,
song_title,
streams,
RANK() OVER(PARTITION BY artist ORDER BY streams DESC) as rank
FROM song_streams;

Output

Unlike ROW_NUMBER(), RANK() can assign the same rank to rows with the same streams count.

artist album song_title streams rank

ArtistA Album1 Song1 5000 1

ArtistA Album2 Song3 4000 2

ArtistA Album1 Song2 3000 3

ArtistB Album3 Song4 6000 1

ArtistB Album3 Song5 2000 2

The difference between RANK() and ROW_NUMBER() lies in how they handle ties (duplicate values) within
the ordered data.

1. ROW_NUMBER():

Assigns a unique number to each row within the partition.

It doesn't care about ties. Even if two (or more) rows have the same value in the order
column, they will be assigned different numbers. The specific row that gets the next number
in the case of a tie might be arbitrary depending on the database system.

2. RANK():

Assigns a rank to each row within the partition based on its ORDER BY value.

In the case of ties, it assigns the same rank to all tied rows. For instance, if two rows are
tied for 1st place, both are given a rank of 1, and the next row is given a rank of 3 (rank 2 is
skipped), as shown in the sketch below.
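
A minimal sketch of the tie behavior, using hypothetical inline data where two songs share the same stream count:

WITH tied_streams AS (
SELECT 'ArtistA' AS artist, 'Song1' AS song_title, 5000 AS streams UNION ALL
SELECT 'ArtistA', 'Song3', 5000 UNION ALL
SELECT 'ArtistA', 'Song2', 3000
)
SELECT
song_title,
streams,
ROW_NUMBER() OVER(PARTITION BY artist ORDER BY streams DESC) AS row_num, -- 1, 2, 3 (tie broken arbitrarily)
RANK() OVER(PARTITION BY artist ORDER BY streams DESC) AS rnk -- 1, 1, 3 (rank 2 is skipped)
FROM tied_streams;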

3. LAG and LEAD


Objective: Access data from the previous or next row in the result set.

Query

SELECT
artist,
album,
song_title,
streams,
LAG(song_title) OVER(PARTITION BY artist ORDER BY streams) as prev_song,
LEAD(song_title) OVER(PARTITION BY artist ORDER BY streams) as next_song
FROM song_streams;

Output
For each song, the result displays the previous song and the next song (based on streams count) for
the same artist. If there's no previous or next song, the corresponding value is NULL.

artist album song_title streams prev_song next_song

ArtistA Album1 Song2 3000 NULL Song3

ArtistA Album2 Song3 4000 Song2 Song1

ArtistA Album1 Song1 5000 Song3 NULL

ArtistB Album3 Song5 2000 NULL Song4

ArtistB Album3 Song4 6000 Song5 NULL

Conditional Functions

1. CASE WHEN
CASE WHEN provides conditional logic for handling multiple conditions and results.

Caveats
Ensure that the conditions in the CASE WHEN statement are mutually exclusive.

The conditions are evaluated in the order they’re written. Once a condition is met, the
corresponding result is returned, and the remaining conditions are not evaluated. If conditions are
not mutually exclusive, only the first true condition encountered will determine the result.
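A quick sketch of why ordering matters, reusing the song_plays table from the scenario below with a hypothetical extra 'Hit' tier: if the broader condition (> 100) were listed first, a song with 600 plays would be labelled 'Popular' and never reach 'Hit', so the most specific condition goes first.

SELECT
song_id,
play_count,
CASE
WHEN play_count > 500 THEN 'Hit' -- most specific condition first
WHEN play_count > 100 THEN 'Popular'
ELSE 'Less Popular'
END AS popularity
FROM song_plays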

Scenario
Categorize songs as "Popular" if the play count is over 100 and "Less Popular" otherwise.
Input Data → song_plays

song_id play_count

1 150

2 50

Query

SELECT
song_id,
play_count,
CASE
WHEN play_count > 100 THEN 'Popular'
ELSE 'Less Popular'
END AS popularity
FROM song_plays

Output
Songs with a play count greater than 100 are categorized as "Popular", while others are categorized
as "Less Popular".

song_id play_count popularity

1 150 Popular

2 50 Less Popular

2. IF Statements
The IF function is used to return one value if a condition is true, and another value if it's false.
Sometimes IF and CASE statements may be used interchangeably.

Caveats
IF statements are limited to just two outcomes. For multiple conditions, multiple nested IFs would
be needed, which can reduce readability.
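
For example, adding a hypothetical third 'Modern' tier to the scenario below means nesting one IF inside another, while the equivalent CASE expression stays flat and easier to read:

SELECT
song_id,
title,
-- Nested IF: readable with two tiers, harder to follow as tiers are added
IF(release_year >= 2020, 'Recent', IF(release_year >= 2010, 'Modern', 'Older')) AS classification_if,
-- Equivalent CASE stays flat
CASE
WHEN release_year >= 2020 THEN 'Recent'
WHEN release_year >= 2010 THEN 'Modern'
ELSE 'Older'
END AS classification_case
FROM songs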

Scenario
We want to classify songs as "Recent" if they were released in 2020 or later, otherwise classify them
as "Older".
Input Data → songs

song_id title release_year genre

1 Song A 2019 Pop

2 Song B 2020 Classical

Query

SELECT
song_id,
title,
IF(release_year >= 2020, 'Recent', 'Older') AS classification
FROM songs

Output
Songs released in 2020 or later are classified as "Recent", and those released before are classified
as "Older".

song_id title classification

1 Song A Older

2 Song B Recent

Computing Running Totals
Running totals provide a cumulative sum of values, often over time or in a specific sequence.

Caveats
Running totals require ordered data. Ensure you've specified the correct order in your window
function.

When dealing with large datasets, computing running totals can be resource-intensive. Ensure
that your dataset is partitioned appropriately, and consider indexing if performance is an issue.

Scenario
We want to compute the running total of streams for each song over the days.
Input Data → daily_song_streams

date song_name streams

2023-01-01 Song A 100

2023-01-02 Song A 150

2023-01-03 Song A 200

2023-01-04 Song A 250

2023-01-01 Song B 50

2023-01-02 Song B 75

2023-01-03 Song B 80

2023-01-04 Song B 95

Query

SELECT
date,
song_name,
streams,
SUM(streams) OVER (PARTITION BY song_name ORDER BY date ASC) AS running_total_streams
FROM daily_song_streams
ORDER BY song_name, date

Output
For "Song A", the running total starts with 100 streams on 2023-01-01 and accumulates to 700
streams by 2023-01-04. Similarly, for "Song B", the running total progresses from 50 streams on
2023-01-01 to 300 streams by 2023-01-04.

date song_name streams running_total_streams

2023-01-01 Song A 100 100

2023-01-02 Song A 150 250

2023-01-03 Song A 200 450

2023-01-04 Song A 250 700

2023-01-01 Song B 50 50

2023-01-02 Song B 75 125

2023-01-03 Song B 80 205

2023-01-04 Song B 95 300

Computing Moving Averages over 7 days


Moving averages provide a way to compute the average of a set of values over a specific interval.
They're useful for smoothing out fluctuations and highlighting trends.

Caveats
Ensure that the interval (e.g., 7 days) is appropriate for your data. Too short of an interval may not
smooth out fluctuations sufficiently, while too long of an interval might mask important details.
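
Also note that ROWS BETWEEN 6 PRECEDING AND CURRENT ROW counts rows, not calendar days, so the window quietly covers more than 7 days if some dates are missing for a song. A sketch of a true calendar-based window in BigQuery uses a RANGE frame over a numeric date, assuming date is a DATE column in the daily_song_streams table from the previous example:

SELECT
date,
song_name,
streams,
AVG(streams) OVER (PARTITION BY song_name ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_streams_7d
FROM daily_song_streams
ORDER BY song_name, date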

Scenario
We want to compute the 7-day moving average of streams for each song.
Input Data → daily_song_streams table from the previous example.
Query

SELECT
date,
song_name,
streams,
AVG(streams) OVER (PARTITION BY song_name ORDER BY date ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_streams_7d
FROM daily_song_streams
ORDER BY song_name, date

Output
The 7-day moving average for "Song A" starts with 100 streams on 2023-01-01 and averages to 175
streams by 2023-01-04. For "Song B", the moving average starts with 50 streams and progresses to
75 streams by 2023-01-04.

date song_name streams avg_streams_7d

2023-01-01 Song A 100 100

2023-01-02 Song A 150 125

2023-01-03 Song A 200 150

2023-01-04 Song A 250 175

2023-01-01 Song B 50 50

2023-01-02 Song B 75 62.5

2023-01-03 Song B 80 68.33

2023-01-04 Song B 95 75

LOGICAL_OR
LOGICAL_OR is a BigQuery-specific aggregate function that returns TRUE if at least one of the input
values is TRUE, and FALSE otherwise.

Caveats
Ensure that the data type you're working with is boolean. LOGICAL_OR operates on boolean
values and may throw errors if the datatype is incompatible.

Remember that it only checks if at least one value is TRUE. If you have multiple TRUE values,
the result is still TRUE.

Scenario
We want to identify users who played songs on the weekend and then count how many there are.
Input Data → user_song_play (Note: Here, 2023-01-01 is a Sunday and 2023-01-07 is a Saturday,
which are considered as the weekend.)

user_id date song_name

1 2023-01-01 Song A

1 2023-01-02 Song B

2 2023-01-02 Song A

2 2023-01-08 Song C

3 2023-01-03 Song B

4 2023-01-07 Song A

Query

WITH WeekendPlays AS (
SELECT
user_id,
LOGICAL_OR(EXTRACT(DAYOFWEEK FROM date) IN (1, 7)) AS played_on_weekend
FROM user_song_play
GROUP BY user_id
)

SELECT
COUNT(user_id) AS users_who_played_on_weekend
FROM WeekendPlays
WHERE played_on_weekend = TRUE

Output
In the example, users with IDs 1, 2, and 4 played songs during the weekend. So the total count of
users who played on the weekend is 3.

users_who_played_on_weekend

3

COALESCE
COALESCE returns the first non-null value in a list.

Caveats
Ensure the data types of the values you're evaluating with COALESCE are compatible, or you
may encounter errors or unexpected results. For example, COALESCE('Hello', 2023-01-01) mixes a string with a non-string expression and will raise an error.

If all values are NULL, COALESCE will return NULL.

Scenario
Display the subscription price for users, preferring the promotional price if available.
Input Data → subscription

user_id monthly_price promotional_price

1 10 NULL

2 NULL 5

Query

SELECT user_id, COALESCE(promotional_price, monthly_price) as subscription_price
FROM subscription

Output
For user 1, since the promotional price is NULL, the monthly price is taken as the subscription price.
For user 2, the promotional price is used.

user_id subscription_price

1 10

2 5

GENERATE_DATE_ARRAY
GENERATE_DATE_ARRAY produces an array of consecutive dates within a specified date range.

Caveats
Ensure that the start date is earlier than or equal to the end date; otherwise (with the default
positive step) the function returns an empty array.

If you're joining this array with another dataset, be aware of potential Cartesian products if not
used correctly.
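
When the date array is deliberately combined with another dimension (for example, one row per date per song), make the cross join explicit and limit it to the distinct values you actually need. A sketch of that pattern, assuming a daily_song_streams table like the one in the scenario below:

WITH DateRange AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2023-10-01', '2023-10-06')) AS day
),
Songs AS (
SELECT DISTINCT song_id FROM daily_song_streams
)
SELECT
d.day AS date,
so.song_id,
COALESCE(s.streams, 0) AS streams
FROM DateRange d
CROSS JOIN Songs so
LEFT JOIN daily_song_streams s ON d.day = s.date AND so.song_id = s.song_id
ORDER BY so.song_id, d.day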

When is it Useful?
1. Data Gap Filling: If you have missing dates in a time series dataset, you can use this function to
generate a continuous sequence of dates and then LEFT JOIN your data to ensure there are
entries for every day, even if some days have no data. It helps me a lot in A/B testing.

2. Date-Based Iterations: When you need to repeatedly perform calculations or operations for each
date in a specific range, GENERATE_DATE_ARRAY can provide the sequence of dates to iterate over.

3. Visualizations: For plotting data over a continuous date range, ensuring there are no date gaps
can help in producing consistent and accurate visualizations.

4. Analyzing Patterns: If you want to analyze user behavior or other patterns on specific days (e.g.,
weekends, holidays), you can generate a list of those dates within a range and then use it to filter
or aggregate your data based on those dates, as in the sketch after this list.
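
As a sketch of that last point, here is a hypothetical one-month range reduced to weekend dates only (weekend defined here as Sunday and Saturday, i.e. DAYOFWEEK 1 and 7):

SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-01-31')) AS day
WHERE EXTRACT(DAYOFWEEK FROM day) IN (1, 7)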

Scenario
Generate a continuous sequence of dates from 2023-10-01 to 2023-10-06 and join it with the song
stream counts. If there's no data for a particular day, the stream count should be 0.
Input Data → daily_song_streams (Notice that the data for 2023-10-03 and 2023-10-05 are missing)

date song_id streams

2023-10-01 101 5000

2023-10-02 101 5200

2023-10-04 101 5100

2023-10-06 101 5300

Query

WITH DateRange AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2023-10-01', '2023-10-06')) AS day
)

SELECT
d.day AS date,
COALESCE(s.song_id, 101) AS song_id, -- assuming 101 as default song_id
COALESCE(s.streams, 0) AS streams
FROM DateRange d
LEFT JOIN daily_song_streams s ON d.day = s.date
ORDER BY d.day

Output
Using GENERATE_DATE_ARRAY , we've filled the data gaps for the missing dates with stream counts set to 0.

date song_id streams

2023-10-01 101 5000

2023-10-02 101 5200

2023-10-03 101 0

2023-10-04 101 5100

2023-10-05 101 0

2023-10-06 101 5300

ROLLUP
ROLLUP is an extension of the GROUP BY clause, allowing us to create subtotals and grand totals in
our result set.

Caveats
Ensure the columns in the ROLLUP are in a logical hierarchical order for meaningful aggregates.

NULL values in the ROLLUP output represent aggregated data (either subtotals or grand totals).
Consider using the COALESCE function if you want to replace these NULLs with a meaningful
descriptor.

Scenario
We want to get the total streams for songs by both artist and album, as well as subtotals by artist and
a grand total for all songs.
Input Data → song_streams

artist album song_title streams

ArtistA Album1 Song1 5000

ArtistA Album1 Song2 3000

ArtistA Album2 Song3 4000

ArtistB Album3 Song4 6000

ArtistB Album3 Song5 2000

Query

SELECT
artist,
album,
SUM(streams) AS total_streams
FROM song_streams
GROUP BY ROLLUP(artist, album)

Output

The ROLLUP function provides aggregates at multiple levels. Rows with NULL in the album column
represent the total streams for that artist across all albums. The row with both artist and album as
NULL represents the grand total streams for all songs.

artist album total_streams

ArtistA Album1 8000

ArtistA Album2 4000

ArtistB Album3 8000

ArtistA NULL 12000

ArtistB NULL 8000

NULL NULL 20000

Worth mentioning when doing rollups

1. The order of columns in the ROLLUP matters. The aggregation will be hierarchical based on the
order provided.

2. NULL in the output typically represents a subtotal or grand total; the sketch below shows one way to replace these NULLs with readable labels.

3. ROLLUP can be very useful for generating reports that require hierarchical aggregates without
running multiple queries.
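
As mentioned in the caveats, the NULLs produced by ROLLUP can be swapped for readable labels. A sketch using COALESCE, assuming the artist and album columns themselves never contain genuine NULLs:

SELECT
COALESCE(artist, 'All artists') AS artist_label,
COALESCE(album, 'All albums') AS album_label,
SUM(streams) AS total_streams
FROM song_streams
GROUP BY ROLLUP(artist, album)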

SAFE_DIVIDE
SAFE_DIVIDE is a function that safely divides two numbers, returning null if the denominator is zero
to prevent division by zero errors.

Caveats
While SAFE_DIVIDE protects against division by zero errors, ensure you handle or account for null
values in subsequent operations or aggregations.

Scenario
Calculate the ratio of current year's sales to last year's sales for various products.
Input Data → sales

product_id current_year_sales last_year_sales

101 500 250

102 400 0

103 600 300

Query

SELECT product_id, SAFE_DIVIDE(current_year_sales, last_year_sales) AS sales_growth_ratio
FROM sales

Output
For product 102, last year's sales are 0, so SAFE_DIVIDE returns NULL instead of raising a division-by-zero error.

product_id sales_growth_ratio

101 2.0

102 NULL

103 2.0
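
If downstream aggregations need a number rather than NULL, the caveat above can be handled by wrapping SAFE_DIVIDE in COALESCE. A sketch that treats the zero-denominator case as zero growth (an assumption you'd want to validate for your own metric):

SELECT
product_id,
COALESCE(SAFE_DIVIDE(current_year_sales, last_year_sales), 0) AS sales_growth_ratio
FROM sales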
