
The Ultimate SQL Guide

Copyright
© All Rights Reserved

Example Schema

Throughout this canvas, we will be using queries that reference this data. You can find the table relationships in the schema diagram, and you can scroll through the values in each table to the right.

Tables with schema

daily_streams
The global top 200 tracks streamed on Spotify each day in 2017.

columns     definition                           sample
day         day of rank                          2017-10-05
track_id    unique identifier for each track     3gWu8y0TKCIdy2mpTqVnTl
rank        daily rank in top 200 (1 is best)    128
title       track title                          Let You Down
artist      artist name                          NF
streams     total Spotify streams that day       628310

daily_streams.csv (DuckDB, 6 columns · 70.000 rows)

day         track_id                rank  title              artist
2017-04-21  5CtI0qwDJkDQGwXD1H1cLb  1     Despacito - Remix  Luis Fonsi
2017-04-22  5CtI0qwDJkDQGwXD1H1cLb  1     Despacito - Remix  Luis Fonsi
2017-04-23  5CtI0qwDJkDQGwXD1H1cLb  1     Despacito - Remix  Luis Fonsi
2017-04-24  5CtI0qwDJkDQGwXD1H1cLb  1     Despacito - Remix  Luis Fonsi
2017-04-25  5CtI0qwDJkDQGwXD1H1cLb  1     Despacito - Remix  Luis Fonsi
2017-04-26  5CtI0qwDJkDQGwXD1H1cLb  1     Despacito - Remix  Luis Fonsi
2017-04-27  5CtI0qwDJkDQGwXD1H1cLb  1     Despacito - Remix  Luis Fonsi
2017-04-28  5CtI0qwDJkDQGwXD1H1cLb  1     Despacito - Remix  Luis Fonsi

artists
Metadata about each artist featured in daily_streams and tracks.

columns     definition                                   sample
artist_id   unique identifier for each artist            6fOMl44jA4Sp5b9PpYCkzz
name        artist name                                  NF
popularity  measure of the popularity (0-100)            85
followers   number of followers as of updated_at         5850077
updated_at  date followers and popularity were updated   2021-08-20
url         spotify url for artist                       https://open.spotify.com/artist/

Sample rows (DuckDB, 6 columns · 427 rows; the url column is cut off in the display)

artist_id               name  popularity  followers  updated_at
6fOMl44jA4Sp5b9PpYCkzz  NF    85          5850077    2021-08-20
51Blml2LZPmy7TTiAg47vQ  U2    82          8615769    2021-08-20
6s22t5Y3prQHyaHWUN1R1C  AJR   79          1911798    2021-08-20
3Nrfpe0tUJi4K4DXYWgMUX  BTS   98          37314303   2021-08-20
3cjEqqelV9zb4BYE3qDQ4O  EXO   77          8207699    2021-08-20
1bqxdqvUtPWZri43cKHac8  MAX   76          690629     2021-08-20
3cqIsBnzV3BabbPWKz8Txf  Mud   43          64629      2021-08-20
0bdfiayQAKewqEvaU6rXCv  MØ    76          1340546    2021-08-20
7rkW85dBwwrJtlHRDkJDAC  NAV   82          2620150    2021-08-20

tracks
Metadata about each track featured in daily_streams.

columns           definition                                   sample
track_id          unique identifier for each track             3gWu8y0TKCIdy2mpTqVnTl
track_title       track title                                  Let You Down
artist_id         unique identifier for each artist            6fOMl44jA4Sp5b9PpYCkzz
all_artists       array of all artist names on the track       ['NF']
album_name        name of album for each track                 Let You Down
album_type        type of album (album, single, compilation)   single
release_date      date of album release                        2017-09-15
explicit          True | False whether track is explicit       false
acousticness      0-1 measure of how acoustic a track is       0,319
danceability      0-1 measure of how danceable a track is      0,658
duration          track duration (ms)                          212120
energy            0-1 measure of how energetic a track is      0,715
instrumentalness  0-1 measure of how instrumental a track is   0
loudness          loudness in db (-60 to 0)                    -5,688
valence           0-1 measure of how "happy" a track is        0,473

Sample rows (DuckDB, 15 columns · 1.502 rows; artist_id is cut off in the display)

track_id                track_title                     artist_id
3gWu8y0TKCIdy2mpTqVnTl  Let You Down                    6fOMl44jA4Sp5b
52okn5MNA47tk87PeZJLEL  Let You Down                    6fOMl44jA4Sp5b
6mrKP2jyIQmM0rw6fQryjr  Let You Down                    6fOMl44jA4Sp5b
73VrGWg4I2DJiWXtbseyiG  You’re The Best Thing About Me  51Blml2LZPmy7T
3E2Zh20GDCR9B1EYjfXWyv  Weak                            6s22t5Y3prQHya
6OuV4vfyb8v3vGGGksqyKf  Weak                            6s22t5Y3prQHya
02q0ZnV2L4XByzEvWZJqBC  Spring Day                      3Nrfpe0tUJi4K4D
0ffweDHrg4bZqdG4IfGYuo  Pied Piper                      3Nrfpe0tUJi4K4D
2SYa5Lx1uoCvyDIW4oee9b  MIC Drop                        3Nrfpe0tUJi4K4D

: primary key
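To make the table relationships concrete, here is a minimal sketch that rebuilds a cut-down version of the schema in Python's built-in sqlite3 module, used as a stand-in for the DuckDB canvas. Treating artist_id and track_id as the primary keys is an assumption (the key markers did not survive in this copy), and only a few columns and one sample row per table are included.

```python
import sqlite3

# In-memory SQLite database standing in for DuckDB (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed primary keys: artists.artist_id and tracks.track_id.
    CREATE TABLE artists (
        artist_id  TEXT PRIMARY KEY,
        name       TEXT,
        popularity INTEGER,
        followers  INTEGER
    );
    CREATE TABLE tracks (
        track_id    TEXT PRIMARY KEY,
        track_title TEXT,
        artist_id   TEXT REFERENCES artists (artist_id)
    );
    CREATE TABLE daily_streams (
        day      TEXT,
        track_id TEXT REFERENCES tracks (track_id),
        "rank"   INTEGER,
        title    TEXT,
        artist   TEXT,
        streams  INTEGER
    );
    -- One sample row per table, taken from the sample columns above.
    INSERT INTO artists VALUES ('6fOMl44jA4Sp5b9PpYCkzz', 'NF', 85, 5850077);
    INSERT INTO tracks VALUES
        ('3gWu8y0TKCIdy2mpTqVnTl', 'Let You Down', '6fOMl44jA4Sp5b9PpYCkzz');
    INSERT INTO daily_streams VALUES
        ('2017-10-05', '3gWu8y0TKCIdy2mpTqVnTl', 128, 'Let You Down', 'NF', 628310);
""")

# Follow the relationships: daily_streams -> tracks -> artists.
row = conn.execute("""
    SELECT d.day, t.track_title, a.name
    FROM daily_streams AS d
    JOIN tracks  AS t ON t.track_id  = d.track_id
    JOIN artists AS a ON a.artist_id = t.artist_id
""").fetchone()
print(row)
```

The two joins mirror the links in the schema: daily_streams points at tracks through track_id, and tracks points at artists through artist_id.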
SQL vs Excel

SQL for Excel users

TERMINOLOGY:

In Excel:         In SQL:
Worksheet         Tables
Filtering         Query with a WHERE clause
Sorting           Query with an ORDER BY clause
Drop duplicates   Query with DISTINCT keyword
If statements     Query with CASE statement
Pivot tables      Query with an aggregation and GROUP BY statement, or PIVOT
VLookup           Query with a JOIN statement

MINDSET SHIFTS

[1] You don't just have data in a sheet - you have to go get it by writing a query. The biggest shift when going from spreadsheets to SQL is that the data is not sitting in front of you. The data is sitting in a database you can't really see. That means to get that data you need to write the query to return what you want. It takes practice, but over time it gets easy.

[2] There are a lot of rules. With a spreadsheet, you can manipulate data however you want - delete a column, change a value of a row, add formulas, etc. With SQL, there are more strict rules in place about how you query data, which operations are done in which order, etc. This is really a benefit in most cases though (see below).

BENEFITS OF SQL

[1] Performance. If you've ever spent 10 minutes waiting for a spreadsheet to open, you'll care about this one. With SQL, you only return the data you need, and SQL databases (especially modern ones) are designed to run these types of queries very efficiently.

[2] Auditability. When you get sent someone else's spreadsheet it's not uncommon to spend loads of time trying to work out exactly how someone got to the numbers on the page. SQL is a standardized language, so queries are easy to read and understand, and they have a clear execution order (not the case in spreadsheets).

[3] Everyone's looking at the same thing. Since spreadsheets are often disconnected from source data, you don't know how old the data is, which filters were set before it was downloaded, etc. With SQL, you can always re-run a query and get the latest results from the database.

Selecting data

Selecting all columns from a table.

To select all columns from a table you can use the *:

select_all
    select * from daily_streams
    -- when querying a database table, you'll need to specify the schema name:
    -- SELECT * FROM <schema_name>.<table_name>

Result (DuckDB, 6 columns · 70.000 rows):

day         track_id                rank  title                                                  artist
2017-06-09  4mJDfMcT7odIUjWlb2WO4L  97    ...Baby One More Time - Recorded at Spotify Studios NYC  Ed Sheeran
2017-09-03  19WjVVgXjdKXfLXoaDPwTM  10    ...Ready For It?                                       Taylor Swift
2017-09-04  7zgqtptZvhf8GEmdsM2vp2  3     ...Ready For It?                                       Taylor Swift
2017-09-05  7zgqtptZvhf8GEmdsM2vp2  4     ...Ready For It?                                       Taylor Swift
2017-09-06  7zgqtptZvhf8GEmdsM2vp2  4     ...Ready For It?                                       Taylor Swift
2017-09-07  7zgqtptZvhf8GEmdsM2vp2  5     ...Ready For It?                                       Taylor Swift
2017-09-08  7zgqtptZvhf8GEmdsM2vp2  7     ...Ready For It?                                       Taylor Swift
2017-09-09  19WjVVgXjdKXfLXoaDPwTM  10    ...Ready For It?                                       Taylor Swift

A better way to select all columns from a table.

But that's not considered best practice since it can be computationally very expensive to do that. Instead, it's better to list out the columns you want:

better_select_all
    select
        day,
        track_id,
        rank,
        title,
        artist,
        streams
    from
        daily_streams

Filtering data

FILTERING TEXT VALUES

Filtering for an exact match.

When you want to only show rows that match a particular value, we can add a WHERE statement and use a logical filter.

exact_match
    select
        *
    from
        artists
    where
        name = 'Drake'

Filtering for NOT an exact match.

When you want to only show rows that don't match a particular value, we can use <> or != (depending on your dialect of SQL).

not_exact_match
    select
        *
    from
        artists
    where
        name <> 'Drake'

Result (DuckDB, 6 columns · 426 rows):

artist_id               name  popularity  followers  updated_at  url
6fOMl44jA4Sp5b9PpYCkzz  NF    85          5850077    2021-08-20  https://open.spotify.com/artist/6fOMl44jA4Sp5b9PpYCkzz
51Blml2LZPmy7TTiAg47vQ  U2    82          8615769    2021-08-20  https://open.spotify.com/artist/51Blml2LZPmy7TTiAg47vQ
6s22t5Y3prQHyaHWUN1R1C  AJR   79          1911798    2021-08-20  https://open.spotify.com/artist/6s22t5Y3prQHyaHWUN1R1C
3Nrfpe0tUJi4K4DXYWgMUX  BTS   98          37314303   2021-08-20  https://open.spotify.com/artist/3Nrfpe0tUJi4K4DXYWgMUX

Filtering for multiple possible values.

When you want to return values that match more than one possible value, you can use IN and ():

exact_matches
    select
        *
    from
        artists
    where
        name in ('Drake', 'Ariana Grande') -- this is the same as name = 'Drake' OR name = 'Ariana Grande'

FILTERING DATES

Filter for a specific date

To select a specific date you can use = and put the date in the format 'YYYY-MM-DD':

date
    select
        *
    from
        daily_streams
    where
        day = '2017-03-31'

Result (DuckDB, 6 columns · 200 rows):

day         track_id                rank  title                                                           artist
2017-03-31  3NdDpSvN911VPGivFlV5d0  15    I Don’t Wanna Live Forever (Fifty Shades Darker) - From "Fifty…  ZAYN
2017-03-31  0dA2Mk56wEzDgegdC6R17g  6     Stay (with Alessia Cara)                                        Zedd
2017-03-31  4vb4mFvYsr2h6enhjJsq9Y  161   Water Under the Bridge                                          Adele
2017-03-31  0tICYNayWWhH9GPeFrfjfD  173   This Girl (Kungs Vs. Cookin' On 3 Burners) - Kungs Vs. Cookin…   Kungs
2017-03-31  7cGFbx7MP0H23iHZTZpqMM  98    Everybody                                                       Logic

Filter for before/after a specific date:

To return anything before or after a specific date we can use the operators: > , >= , < , <=

before_after_date
    select
        *
    from
        daily_streams
    where
        day >= '2017-03-31'

Result (DuckDB, 6 columns · 52.400 rows):

day         track_id                rank  title         artist  streams
2017-10-05  3gWu8y0TKCIdy2mpTqVnTl  128   Let You Down  NF      628310
2017-10-06  6mrKP2jyIQmM0rw6fQryjr  90    Let You Down  NF      848518
2017-10-07  52okn5MNA47tk87PeZJLEL  86    Let You Down  NF      827855
2017-10-08  52okn5MNA47tk87PeZJLEL  73    Let You Down  NF      846227
2017-10-09  52okn5MNA47tk87PeZJLEL  61    Let You Down  NF      1022044

PIVOT TABLES

Aggregations with GROUP BYs are a simple way to make pivot tables. You can use this with common aggregations like:

- SUM
- COUNT
- MIN
- MAX

GROUP BY will determine your row groupings.

pivot_1
    select
        artist,
        sum(streams) total_streams,
        avg(rank) avg_rank,
        min(day) first_day,
        count(distinct title) tracks_ranked
    from
        daily_streams
    where artist in ('Drake','Taylor Swift','Ed Sheeran')
    group by
        artist

Result (DuckDB, 5 columns · 3 rows):

artist        total_streams  avg_rank      first_day   tracks_ranked
Drake         2204505804     99,626281454  2017-01-01  29
Ed Sheeran    4434290129     89,248064516  2017-01-01  31
Taylor Swift  495223188      92,550135501  2017-06-09  14

Some databases offer true PIVOT functionality. Often there is a PIVOT keyword that lets you truly turn rows to columns and columns to rows.

pivot_2
    -- Count doesn't currently support Pivots in DuckDB

VLOOKUPs

LEFT JOIN is the near-equivalent to VLOOKUPs in Excel. You can specify which columns you want to join to your main table and how you want to match them up.

Syntax:

    SELECT *
    FROM <table_name>
    LEFT JOIN <other_table_name>
    ON <table_name>.<column> = <other_table_name>.<column>

vlookup
    select
        artists.name,
        artists.followers,
        artists.popularity,
        tracks.track_title
    from
        tracks
        left join artists on tracks.artist_id = artists.artist_id
    where
        artists."name" = 'Taylor Swift'

Result (DuckDB, 4 columns · 22 rows):

name          followers  popularity  track_title
Taylor Swift  42336448   96          Gorgeous
Taylor Swift  42336448   96          Bad Blood
Taylor Swift  42336448   96          Getaway Car
Taylor Swift  42336448   96          Shake It Off
Taylor Swift  42336448   96          Call It What You Want

You can join multiple tables together at once. Chain JOINs together to link multiple tables.

    select
        day, title, streams, artists."name", tracks.release_date
    from daily_streams
    left join tracks using (track_id)
    left join artists using (artist_id)
    where "rank" = 1

Note: "name" and "rank" are in quotes here because they are SQL keywords, so they must be escaped in quotes when we are referring to a column with the same name.

Result (DuckDB, 5 columns · 350 rows):

day         title                               streams  name         release_date
2017-09-16  Mi Gente                            4361394  J Balvin     2017-06-30
2017-05-28  Despacito (Featuring Daddy Yankee)  9849173  Luis Fonsi   2017-01-13
2017-07-30  Despacito - Remix                   4086239  Luis Fonsi   2017-04-17
2017-10-10  rockstar                            6260386  Post Malone  2017-09-15
2017-12-31  rockstar                            5273727  Post Malone  2017-09-15
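The VLOOKUP comparison for LEFT JOIN can be made concrete with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for DuckDB, with cut-down tracks and artists tables; the second track and its artist_id 'a9' are hypothetical rows added only to show what happens when the lookup finds no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for DuckDB (assumption)
conn.executescript("""
    CREATE TABLE artists (artist_id TEXT, name TEXT);
    CREATE TABLE tracks (track_id TEXT, track_title TEXT, artist_id TEXT);
    INSERT INTO artists VALUES ('a1', 'NF');
    INSERT INTO tracks VALUES
        ('t1', 'Let You Down', 'a1'),
        ('t2', 'Unknown Song', 'a9');  -- hypothetical: no artist with id 'a9'
""")

# LEFT JOIN keeps every row of the main table (tracks); a failed lookup
# yields NULL (None in Python), much like VLOOKUP returning #N/A.
left_rows = conn.execute("""
    SELECT tracks.track_title, artists.name
    FROM tracks
    LEFT JOIN artists ON tracks.artist_id = artists.artist_id
    ORDER BY tracks.track_id
""").fetchall()

# A plain (inner) JOIN drops the unmatched track instead.
inner_rows = conn.execute("""
    SELECT tracks.track_title, artists.name
    FROM tracks
    JOIN artists ON tracks.artist_id = artists.artist_id
    ORDER BY tracks.track_id
""").fetchall()

print(left_rows)
print(inner_rows)
```

Choosing LEFT JOIN over an inner JOIN is what preserves the "one output row per row of the main table" behavior that spreadsheet users expect from a lookup column.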

WHEN TO USE SPREADSHEETS VS. SQL

Spreadsheets are a great way to explore data, experiment with different scenarios, and do quick calculations. It's important to understand the risks, however.

So, I would advise using SQL for anything in which auditability, accuracy, and performance are important:

- working with large data
- working on a high-stakes project or piece of work
- working with others
- running business processes

I would advise using Excel for:

- quickly validating the results of SQL queries
- prototyping or quick analysis

Drop Duplicates

To return only unique values, you can use the DISTINCT keyword after SELECT:

drop_dups
    select distinct
        album_type
    from
        tracks

Sorting results

The ORDER BY clause determines how your statement will be ordered. If you don't specify an order, then no order will be guaranteed.

Use ASC and DESC keywords to determine how you want to order each column.

If you want to order by a column you have aliased, it's best to use its original definition (not the alias).

sorting
    select
        name, popularity, followers
    from
        artists
    order by
        followers desc,
        popularity asc

Result (DuckDB, 3 columns · 427 rows):

name           popularity  followers
Ed Sheeran     95          83505848
Ariana Grande  95          67401549
Drake          98          56396485
Justin Bieber  98          48708262
Eminem         94          46913506

Result of the exact_matches query, name in ('Drake', 'Ariana Grande') (DuckDB, 6 columns · 2 rows):

artist_id               name           popularity  followers  updated_at  url
3TVXtAsR1Inumwj472S9r4  Drake          98          56396485   2021-08-20  https://open.spotify.com/artist/3TVXtAsR1Inumwj472S9r4
66CXWjxzNUsdJxJ2JdwvnR  Ariana Grande  95          67401549   2021-08-20  https://open.spotify.com/artist/66CXWjxzNUsdJxJ2JdwvnR

Excluding multiple possible values.

When you want to return values that are NOT included in a list of values, you can add a NOT:

not_exact_matches
    select
        *
    from
        artists
    where
        name NOT in ('Drake', 'Ariana Grande')

Result (DuckDB, 6 columns · 425 rows):

artist_id               name         popularity  followers  updated_at  url
3gGUMEwIX6XodWsYEvKSal  YBN Nahmir   71          2597606    2021-08-20  https://open.spotify.com/artist/3gGUMEwIX6XodWsYEvKSal
50co4Is1HCEo8bhOyUWKpn  Young Thug   90          6393024    2021-08-20  https://open.spotify.com/artist/50co4Is1HCEo8bhOyUWKpn
7vk5e3vY1uw9plTHJAMwjN  Alan Walker  85          29730734   2021-08-20  https://open.spotify.com/artist/7vk5e3vY1uw9plTHJAMwjN
6fXEqmGQEt6ONuqVmwrN46  Bag Raiders  59          202525     2021-08-20  https://open.spotify.com/artist/6fXEqmGQEt6ONuqVmwrN46
6ZjFtWeHP9XN7FeKSUe80S  Bing Crosby  59          360276     2021-08-20  https://open.spotify.com/artist/6ZjFtWeHP9XN7FeKSUe80S

Using wildcards.

When we want to filter for a value or set of values but we don't want to specify each specific value, we can use the LIKE keyword with the % wildcard match:

like
    select
        *
    from
        artists
    where
        name like 'James%'

Filter for a date range

To filter for a date range, we can use BETWEEN. The range is inclusive, meaning both sides of the range you specify in the BETWEEN will be included.

date_range
    select
        *
    from
        daily_streams
    where
        day between '2017-01-01' and '2017-01-31' -- All of January

Result (DuckDB, 6 columns · 6.200 rows):

day         track_id                rank  title          artist
2017-01-31  3E2Zh20GDCR9B1EYjfXWyv  40    Weak           AJR
2017-01-31  4qqArAiTPueDxIp7cf87h7  150   Final Song     MØ
2017-01-31  7xHWNBFm6ObGEQPaUxHuKO  81    The Greatest   Sia
2017-01-31  27SdWb2rFzO6GWiYDBTD9j  84    Cheap Thrills  Sia
2017-01-31  1a5Yu5L18qNxVhXx38njON  27    Hear Me Now    Alok

Filter for today or yesterday

If you want to look at data from today or yesterday, we can make use of CURRENT_DATE():

today
    select
        *
    from
        daily_streams
    where
        day = current_date() -- the data is from 2017 so this won't return any results
        -- current_date() - 1 is yesterday

Common Excel Functions Translated to SQL

Text functions:

- Concatenate
- Left/Right/Mid
- Trim
- Len
- Find
- Substitute

text_fns
    select distinct
        artist || ': ' || title as artist_track, -- Concatenate
        left(artist,2) artist_prefix, -- similar for right, and mid
        len(title) track_title_characters, -- string length
        strpos(artist,'k') str_find, -- returns index of first match of searched character
        replace(artist,'a','-') str_replace -- Substitute
    from
        daily_streams
    where
        artist = 'Drake'

Result (DuckDB, 5 columns · 29 rows):

artist_track  artist_prefix  track_title_characters  str_find  str_replace
Drake: KMT    Dr             3                       4         Dr-ke
Drake: 4422   Dr             4                       4         Dr-ke
Drake: Blem   Dr             4                       4         Dr-ke
Drake: Glow   Dr             4                       4         Dr-ke
Drake: Signs  Dr             5                       4         Dr-ke

Every database will have slightly different syntax or even different names for these functions.

Date functions:

- Now
- Year(), Month()
- Adding to dates

date_fns
    select distinct
        current_date() today, -- NOW
        current_timestamp right_now,
        extract('year' FROM DATE '1992-09-20') "year", -- YEAR(), MONTH()
        DATE '1992-03-22' + 7 next_week -- adding days to date
    from daily_streams

Every database will have slightly different syntax or even different names for these functions.
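The remark that every database names its text functions differently can be seen by running the text-function examples on another engine. The sketch below uses Python's built-in sqlite3 module; SQLite's substr, length and instr are swapped in for the left, len and strpos calls shown for DuckDB, and the literal values 'Drake' and 'KMT' come from the first result row of the text functions query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite, not DuckDB (assumption of this sketch)

row = conn.execute("""
    SELECT
        'Drake' || ': ' || 'KMT'    AS artist_track,            -- Concatenate
        substr('Drake', 1, 2)       AS artist_prefix,           -- left(artist, 2) in DuckDB
        length('KMT')               AS track_title_characters,  -- len(title) in DuckDB
        instr('Drake', 'k')         AS str_find,                -- strpos(artist, 'k') in DuckDB
        replace('Drake', 'a', '-')  AS str_replace              -- Substitute
""").fetchone()

print(row)
```

The values line up with the first row of the DuckDB result, which suggests that only the function names differ here, not the behavior.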

Arithmetic

To do simple arithmetic, we can use arithmetic operators. These include:

+, -, *, /, %

arithmetic
    select
        popularity,
        popularity / 100::float popularity_fraction -- ::float just ensures we get a decimal back
    from
        artists

Result (DuckDB, 2 columns · 427 rows):

popularity  popularity_fraction
100         1,000000000
98          0,980000019
98          0,980000019
98          0,980000019

Advanced relative date filtering

If you want to look at data from today, or X days in the past, we can use functions like DATE_DIFF.

The syntax for DATE_DIFF will be different in every SQL dialect so be sure to look yours up before using it.

For DuckDB, the syntax is:

    date_diff('part', startdate, enddate)

recent_ranges
    select
        *
    from
        daily_streams
    where
        date_diff('day',day,current_date()) <=30 -- filtering for the last 30 days
        -- date_diff('day',day,current_date())<=7 would be the last week

Result (DuckDB, 6 columns · 0 rows):

day  track_id  rank  title  artist  streams

Stat functions:

- Average / Averageif
- Median
- Standard Deviation
- Min
- Max

stat_fns
    select
        avg(acousticness) "avg", -- AVG
        avg(
            case
                when release_date <= '2015-01-01' then acousticness
                else 0 -- note: rows that fail the condition count as 0 in this average
            end
        ) avg_if,
        quantile_cont (acousticness, 0.5) median,
        stddev(acousticness) standard_deviation,
        min(acousticness) "min",
        max(acousticness) "max"
    from
        tracks
        left join artists using (artist_id)
    where
        name = 'Taylor Swift'

Result (DuckDB, 6 columns · 1 row):

avg          avg_if       median  standard_deviation  min    max
0,105318182  0,014181818  0,075   0,073646767         0,003  0,216

Every database will have slightly different syntax or even different names for these functions.

FILTERING NUMERICAL VALUES

Filter for ranges of values

For numerical fields we will often want to filter for specific ranges of values. We can use logical operators to do that.

ranges
    select
        *
    from
        artists
    where
        popularity>=90
        and followers>10000000

Result (DuckDB, 6 columns · 35 rows):

artist_id               name           popularity  followers  updated_at  url
4q3ewBCX7sLwd24euuV69X  Bad Bunny      100         36320082   2021-08-20  https://open.spotify.com/artist/4q3ewBCX7sLwd24euuV69X
3Nrfpe0tUJi4K4DXYWgMUX  BTS            98          37314303   2021-08-20  https://open.spotify.com/artist/3Nrfpe0tUJi4K4DXYWgMUX
3TVXtAsR1Inumwj472S9r4  Drake          98          56396485   2021-08-20  https://open.spotify.com/artist/3TVXtAsR1Inumwj472S9r4
1uNFoZAHBGtllmzznpCI3s  Justin Bieber  98          48708262   2021-08-20  https://open.spotify.com/artist/1uNFoZAHBGtllmzznpCI3s
1vyhD5VmyZ7KMfW5gqLgo5  J Balvin       97          29131631   2021-08-20  https://open.spotify.com/artist/1vyhD5VmyZ7KMfW5gqLgo5

FILTERING NULLS

Filter out nulls

To exclude any null values in a column we can use the expression IS NOT NULL in our WHERE clause:

not_null
    select
        *
    from
        tracks
    where
        album_type is not null

To return only null values, we can do WHERE <col> IS NULL.

Result (DuckDB, 15 columns · 1.502 rows):

track_id                track_title                                            artist_id               all_artists
4mJDfMcT7odIUjWlb2WO4L  ...Baby One More Time - Recorded at Spotify Studios NYC  6eUKZXaKkcviH0Ku9w2n3V  ['Ed Sheeran']
19WjVVgXjdKXfLXoaDPwTM  ...Ready For It?                                       06HL4z0CvFAxyc27GXpf02  ['Taylor Swift']
2yLa0QULdQr0qAIvVwN6B5  ...Ready For It?                                       06HL4z0CvFAxyc27GXpf02  ['Taylor Swift']
7zgqtptZvhf8GEmdsM2vp2  ...Ready For It?                                       06HL4z0CvFAxyc27GXpf02  ['Taylor Swift']
6yr8GiTHWvFfi4o6Q5ebdT  'Till I Collapse                                       7dGJo4pcD2V6oG8kP0tJRR  ['Eminem', 'Nate Dogg']

IF statements

In SQL, the equivalent of the IF statement is a CASE statement.

The syntax is:

    CASE
        WHEN <condition> THEN <output if true>
        WHEN <condition> THEN <output>
        ELSE <output>
    END

if_statement
    select
        case
            when release_date>= '2016-01-01' then 'modern'
            when release_date between '2000-01-01' and '2016-01-01' then '2000s'
            when release_date between '1991-01-01' and '2000-01-01' then '90s'
            else 'oldie'
        end track_era,
        count(distinct track_id) tracks
    from
        tracks
    group by
        1

COMBINING FILTERS

To combine several filters, we can use AND and OR.

To mix ANDs and ORs, you can use parentheses.

combining_filters
    select
        *
    from
        daily_streams
    where
        day >= '2017-12-01'
        and rank <=5
        and streams>10000
        and artist like 'Mariah%'

Result (DuckDB, 6 columns · 14 rows):

day         track_id                rank  title                            artist        streams
2017-12-24  0bYg9bo50gSsH3LtXe2SQn  1     All I Want for Christmas Is You  Mariah Carey  8069105
2017-12-25  0bYg9bo50gSsH3LtXe2SQn  1     All I Want for Christmas Is You  Mariah Carey  6467590
2017-12-23  0bYg9bo50gSsH3LtXe2SQn  1     All I Want for Christmas Is You  Mariah Carey  5023033
2017-12-22  0bYg9bo50gSsH3LtXe2SQn  2     All I Want for Christmas Is You  Mariah Carey  4355458
2017-12-21  0bYg9bo50gSsH3LtXe2SQn  4     All I Want for Christmas Is You  Mariah Carey  3675940
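The advice to use parentheses when mixing ANDs and ORs matters because AND binds more tightly than OR in SQL. The sketch below illustrates this with Python's built-in sqlite3 module standing in for DuckDB; the four rows are hypothetical and chosen only so that the two filters return different results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for DuckDB (assumption)
conn.executescript("""
    CREATE TABLE daily_streams (day TEXT, artist TEXT);
    -- Hypothetical rows for illustration only.
    INSERT INTO daily_streams VALUES
        ('2017-12-24', 'Mariah Carey'),
        ('2017-11-20', 'Mariah Carey'),
        ('2017-12-24', 'Post Malone'),
        ('2017-11-20', 'Post Malone');
""")

# Without parentheses, AND is evaluated first, so this reads as:
#   artist LIKE 'Mariah%' OR (artist LIKE 'Post%' AND day >= '2017-12-01')
without_parens = conn.execute("""
    SELECT day, artist FROM daily_streams
    WHERE artist LIKE 'Mariah%'
       OR artist LIKE 'Post%' AND day >= '2017-12-01'
""").fetchall()

# With parentheses, the date filter applies to both artists.
with_parens = conn.execute("""
    SELECT day, artist FROM daily_streams
    WHERE (artist LIKE 'Mariah%' OR artist LIKE 'Post%')
      AND day >= '2017-12-01'
""").fetchall()

print(len(without_parens), len(with_parens))
```

The unparenthesized filter lets the November Mariah Carey row through, which is usually not what was intended.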

SUMIF statements

To do a SUMIF (or similar) you can wrap your CASE statement in an aggregation:

    SUM(CASE
        WHEN <condition> THEN <value if true>
        WHEN <condition> THEN <value>
        ELSE <value>
    END)

sumif
    select
        artist,
        sum(
            case
                when rank = 1 then streams
                else 0
            end
        ) streams_at_no_one
    from
        daily_streams
    group by
        1
    order by
        streams_at_no_one desc

Result (DuckDB, 2 columns · 450 rows):

artist          streams_at_no_one
Ed Sheeran      702186930
Luis Fonsi      607647520
Post Malone     537584454
J Balvin        131422592
Taylor Swift    78703324
Kendrick Lamar  40647687
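The SUM(CASE ...) pattern can be checked by hand on a table small enough to add up mentally. This sketch runs it through Python's built-in sqlite3 module as a stand-in for DuckDB; the five rows and their stream counts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for DuckDB (assumption)
conn.executescript("""
    CREATE TABLE daily_streams (artist TEXT, "rank" INTEGER, streams INTEGER);
    -- Hypothetical rows: small numbers so the sums are easy to verify.
    INSERT INTO daily_streams VALUES
        ('Ed Sheeran', 1, 100),
        ('Ed Sheeran', 2, 50),
        ('Luis Fonsi', 1, 80),
        ('Luis Fonsi', 1, 30),
        ('Drake',      3, 90);
""")

# SUMIF equivalent: only rows at rank 1 contribute their streams.
sumif_rows = conn.execute("""
    SELECT
        artist,
        SUM(CASE WHEN "rank" = 1 THEN streams ELSE 0 END) AS streams_at_no_one
    FROM daily_streams
    GROUP BY artist
    ORDER BY streams_at_no_one DESC
""").fetchall()

print(sumif_rows)
```

Because of the ELSE 0 branch, an artist who never reached rank 1 still appears in the output with a total of 0 rather than being dropped.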
Date formatting

What is it?

In every database you have the ability to customize the format of a date or time (or whatever exact data types your database offers). The function to do that will differ in each database, but some popular options are:

strftime, format_date

The string patterns you give these functions are surprisingly consistent across databases, so these tables show the string patterns you will need when you are re-formatting dates and times.

Date formatting

Date: 12 Jun 2024

The table below uses the strftime function:
strftime(<date>,<specifier>)

Date Part  Specifier  Description                                      Example
Year       %Y         Year with century as a decimal number.           ...
           %-y        Year without century as a decimal number.        ...
           %y         Year without century as a zero-padded            ...
                      decimal number.
Month      %B         Full month name                                  ...
           %b         Abbreviated month name                           ...
           %-m        Month as a decimal number                        ...
           %m         Month as a zero-padded decimal number            ...
Week       %U         Week number of the year (Sunday as first DoW)    ...
           %W         Week number of the year (Monday as first DoW)    ...
Day        %A         Full weekday name                                ...
           %a         Abbreviated weekday name                         ...
           %d         Day of month as zero-padded decimal              ...
           %-d        Day of month as decimal number                   ...
           %j         Day of year as zero-padded decimal               ...
           %-j        Day of year as decimal number                    ...
ISO        %x         ISO date                                         ...

Time formatting

Showing data for the current timestamp: 2024-11-23T11:51:40.888Z

The table below uses the strftime function:
strftime(<timestamp>,<specifier>)

Time Part    Specifier  Description                                               Example
Hour         %H         Hour (24-hour clock) as a zero-padded decimal number.     11
             %-H        Hour (24-hour clock) as a decimal number.                 11
             %I         Hour (12-hour clock) as a zero-padded decimal number.     11
             %-I        Hour (12-hour clock) as a decimal number.                 11
             %p         Locale’s AM or PM.                                        AM
Minute       %M         Minute as a zero-padded decimal number.                   51
             %-M        Minute as a decimal number.                               51
Second       %S         Second as a zero-padded decimal number.                   40
             %-S        Second as a decimal number.                               40
Millisecond  %g         Millisecond as a decimal number, zero-padded on the left. 888
Microsecond  %f         Microsecond as a decimal number, zero-padded on the left. 888000
Timezone     %Z         Time zone name.
             %z         Time offset from UTC in the form ±HH:MM, ±HHMM, or ±HH.   +00
ISO          %c         ISO date and time                                         2024-11-23 11:51:40
             %X         ISO time                                                  11:51:40

These are the formats supported by DuckDB as of June 9, 2023. Always check your database for more options.
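Many of these specifiers are shared with other strftime implementations, which can be tried outside a database. The sketch below uses Python's datetime.strftime (not DuckDB's strftime) on the timestamp shown for the time table. The non-padded `%-` variants and `%g` are left out because their behavior in Python is platform-dependent or different from the table above.

```python
from datetime import datetime

# The timestamp from the time-formatting table: 2024-11-23T11:51:40.888
ts = datetime(2024, 11, 23, 11, 51, 40, 888000)

formatted = {
    "%Y-%m-%d": ts.strftime("%Y-%m-%d"),  # year with century, zero-padded month and day
    "%y": ts.strftime("%y"),              # year without century, zero-padded
    "%H:%M:%S": ts.strftime("%H:%M:%S"),  # 24-hour clock, minute, second
    "%j": ts.strftime("%j"),              # day of year, zero-padded
    "%f": ts.strftime("%f"),              # microseconds, zero-padded on the left
}

for specifier, value in formatted.items():
    print(f"{specifier:10} {value}")
```

Name-based specifiers such as %B, %A and %p depend on the active locale, so they are worth checking in your own environment before relying on them.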
Regular expressions

What is it?

Regular expressions are a notation for describing sets of character strings. Most databases offer functions that let you input regular expressions (regex) in order to better match, extract, and manipulate strings and substrings in your data.

The goal is to create a pattern that matches whatever you are looking for in your data. For example, if we wanted to extract the email domain for users, we know we'd want to look for the "@" and take any characters after that. In regex, that might look like:

    [a-zA-Z0-9]+@([a-zA-Z0-9.]+)

Regex Common Actions

Typically you can do three actions with regex patterns:

Match

This will return True if the pattern is found in the text, and False otherwise. It's useful just to see if a string matches a certain pattern.

regex_match_ex
    select
        regexp_matches('[email protected]', '[a-zA-Z0-9]+@[a-zA-Z0-9.]+') is_email

Result (DuckDB, 1 column · 1 row):

is_email
true

Replace

This allows you to replace the patterns that have matched with a different value. It's a more robust replace than the typical REPLACE you'll see in your text function list.

regex_replace
    select
        regexp_replace('hello', '[lo]', '-') replace_first_l

Result (DuckDB, 1 column · 1 row):

replace_first_l
he-lo

Extract

This allows you to extract the patterns you've matched with your regex pattern. Commonly there will be functions to help with multiple matches - e.g. do you want to select all matches, or just the first one.

regex_extract
    select
        regexp_extract_all ('m25eP13s5s006ag1+5-e', '[a-z]+') hidden_message,
        list_aggr (
            regexp_extract_all ('m25eP13s5s006ag1+5-e', '[a-z]+'),
            'string_agg',
            ''
        ) AS str

Result (DuckDB, 2 columns · 1 row):

hidden_message  str
[m,e,s,s,ag,e]  message

Regex Patterns
How to create the patterns of character strings

Exact match

The easiest regex pattern to create is to look for an exact match. To do that, you can use the character you're looking to match.

However, there are metacharacters in regex that have other meanings, so for those you will need to escape them first with a backslash ( \ ).

Metacharacters: *, +, ?, (, ), |, [, ], {, }

exact_match_ex
    select
        regexp_extract ('Hel-lo', '-') exact_match,
        regexp_extract ('He?lo', '\?') exact_match_meta

Result (DuckDB, 2 columns · 1 row):

exact_match  exact_match_meta
-            ?

Creating Groups

When we want to start matching multiple characters, we can create groups. To do that we can use square brackets: []

groups
    select
        regexp_extract_all ('Hello', '[lo]') exact_match,
        regexp_extract_all ('He?lo', '[\?l]') exact_match_meta

Result (DuckDB, 2 columns · 1 row):

exact_match  exact_match_meta
[l,l,o]      [?,l]

In essence, groups within square brackets act as ORs. They look for any of the characters within the group:

groups_vs_others
    select
        regexp_extract_all ('1556', '15') exact_match, -- just looking for '15'
        regexp_extract_all ('1556', '[15]') group_match, -- looking for any 1 or any 5
        regexp_extract_all ('1556', '(15)') exact_match_parens -- looking for '15'; renamed so all column names are distinct

We can combine groups with other strings to start creating more complex patterns:

group_combos
    select
        regexp_extract_all('DanTanFan','[D]an') only_dan,
        regexp_extract_all('DanTanFan','[DT]an') dan_tan,
        regexp_extract_all('DanTanFan','[DTF]an') all_ans

Result (DuckDB, 3 columns · 1 row):

only_dan  dan_tan    all_ans
[Dan]     [Dan,Tan]  [Dan,Tan,Fan]

Grouping in memory

When we use parentheses () we are telling regex to record matches in memory. This is required when you are working with replacements:

groups_in_memory
    select
        regexp_extract_all('this is test",
            "this is test, and this is not test",
            "they are tests','(?P<testname>this)( is)') naming_groups,
        regexp_replace('this is just one test','(o.e)','my') replacing_groups

Result (DuckDB, 2 columns · 1 row):

naming_groups              replacing_groups
[this is,this is,this is]  this is just my test

Abstracting patterns

Now that we have groups, we can start to abstract common patterns. These can include ranges [a-z] or classes of characters (e.g. any whitespace character).

Patterns and common ranges

Pattern  Description
[a-z]    All lowercase letters from a to z.
[A-Z]    All uppercase letters from A to Z.
[0-9]    All digits from 0 to 9.
\d       All digits from 0 to 9 = [0-9]
\s       whitespace = [\t\n\f\r ]
\w       word characters = [0-9A-Za-z_]

extract_patterns
    select
        regexp_extract_all ('m25eP13s5s006ag1+5-e', '[a-z]') all_lowercase_letters,
        regexp_extract_all ('m25eP13s5s006ag1+5-e', '[\d]') all_numbers,
        regexp_extract_all ('m25eP13s5s006ag1+5-e', '[\w]') all_words,
        regexp_extract_all('DanTanFan','[A-G]an') ranges

Result (DuckDB, 4 columns · 1 row):

all_lowercase_letters  all_numbers            all_words                              ranges
[m,e,s,s,a,g,e]        [2,5,1,3,5,0,0,6,1,5]  [m,2,5,e,P,1,3,s,5,s,0,0,6,a,g,1,5,e]  [Dan,Fan]

Patterns can also be negated using a ^, or specific character classes.

Negated patterns and common ranges

Pattern  Description
[^a-z]   All except lowercase letters from a to z.
[^A-Z]   All except uppercase letters from A to Z.
[^0-9]   All except digits from 0 to 9.
\D       All except digits from 0 to 9 = [^0-9]
\S       All except whitespace = [^\t\n\f\r ]
\W       All except word characters = [^0-9A-Za-z_]

extract_negated_patterns
    select
        regexp_extract_all ('m25eP13s5s006ag1+5-e', '[^a-z]') all_except_lowercase_letters,
        regexp_extract_all ('m25eP13s5s006ag1+5-e', '[\D]') all_except_numbers,
        regexp_extract_all('DanTanFan','[^A-G]an') anti_ranges

Result (DuckDB, 3 columns · 1 row):

all_except_lowercase_letters  all_except_numbers     anti_ranges
[2,5,P,1,3,5,0,0,6,1,+,5,-]   [m,e,P,s,s,a,g,+,-,e]  [Tan]

Repeating patterns

Now that we can create patterns of characters, we can start to combine those patterns in powerful ways. To do that we can make use of repetitions:

Repetitions

Pattern  Description
x*       0 or more x, prefer more
x+       1 or more x, prefer more
x?       0 or 1 x, prefer 1
x{n,m}   n or n+1 or ... or m x, prefer more
x{n,}    n or more x, prefer more
x{n}     exactly n x
x*?      0 or more x, prefer fewer
x+?      1 or more x, prefer fewer
x??      0 or 1 x, prefer 0
x{n,m}?  n or n+1 or ... or m x, prefer fewer
x{n,}?   n or more x, prefer fewer
x{n}?    exactly n x

pattern_repetitions
    select
        regexp_extract_all ('m25eP13s5s006ag1+5-e', '[\d]+') all_numbers,
        regexp_extract_all('hello','[l]+') find_ls,
        regexp_extract_all('a0a1a11','a[0-1]?') only_one_number,
        regexp_extract_all('AlphaBEt','[A-Z][a-z]+') upper_then_lower

Result (DuckDB, 4 columns · 1 row):

all_numbers        find_ls  only_one_number  upper_then_lower
[25,13,5,006,1,5]  [ll]     [a0,a1,a1]       [Alpha,Et]

Pattern placement

We can also specify where in our string we want to look for our patterns by using ^ and $:

^ : Look at the start of the string
$ : Look at the end of the string

pattern_repetitions_1
    select
        regexp_extract_all ('AlphaBEt', '^[A-Z][a-z]+') start_of_string,
        regexp_extract_all ('hi,hi', '[hi]+$') end_of_string

Result (DuckDB, 2 columns · 1 row):

start_of_string  end_of_string
[Alpha]          [hi]

Regex Cheat Sheet
A quick reference of regex syntax. The range, negated range, and repetition tables are listed in the sections above.

Metacharacters

*, +, ?, (, ), |, [, ], {, }

Repetition placement:

^ : start of string
$ : end of string

Grouping

Pattern       Description
(re)          numbered capturing group (submatch)
(?P<name>re)  named & numbered capturing group
(?:re)        non-capturing group
(?flags)      set flags within current group
(?flags:re)   set flags during re

Flags

Pattern  Description
i        case-insensitive (default false)
m        multi-line mode
s        let . match \n (default false)
U        ungreedy: swap meaning of x* and x*?

Common patterns

email domain  @[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
url host      (?:[a-zA-Z]+://)?([a-zA-Z0-9-.]+)/?
url path      (?:[a-zA-Z]+://)?(?:[a-zA-Z0-9-.]+)/{1}([a-zA-Z0-9-./]+)
url query     \?(.*)
url ref       #(.*)
url protocol  ^([a-zA-Z]+)://


DuckDB 72 ms (Just now) 2 columns · 1 row
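The replace, extract, flag, and common-pattern entries above can be combined in one query. The following is a minimal sketch in DuckDB syntax, not part of the original canvas: the URL `https://example.com/docs?page=2` and all column aliases are hypothetical, chosen only for illustration. It assumes DuckDB's optional `'g'` option on `regexp_replace` and the optional group argument on `regexp_extract`.

```sql
select
  -- the 'g' option replaces every match instead of only the first one
  regexp_replace('hello', '[lo]', '-', 'g') as replace_all_matches,          -- he---
  -- regexp_extract returns only the first match ...
  regexp_extract('m25eP13s5s006ag1+5-e', '[a-z]+') as first_match,           -- m
  -- ... while regexp_extract_all returns every match as a list
  regexp_extract_all('m25eP13s5s006ag1+5-e', '[a-z]+') as all_matches,
  -- the third argument selects a capturing group (0 would be the whole match)
  regexp_extract('https://example.com/docs?page=2', '^([a-zA-Z]+)://', 1) as url_protocol,  -- https
  regexp_extract('https://example.com/docs?page=2', '\?(.*)', 1) as url_query,              -- page=2
  -- (?i) switches on the case-insensitive flag for the rest of the pattern
  regexp_extract_all('DanTanFan', '(?i)[a-g]an') as ranges_ignoring_case
```

Requesting group 1 returns only the part inside the parentheses, which is why the protocol comes back without the trailing `://`.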
How to answer practice questions:
1. To make the canvas editable, you will need an account.

For new users: For existing users:

2. Once the canvas is editable, you can type directly into the cells:

3. To run a query, click away from the cell, or hit Shift + Enter on the keyboard.

4. To reveal solutions, you can unhide the Solution frames beneath each question:

More helpful info:


These examples are written in the syntax of DuckDB, a database that runs in your browser. This makes it
very easy for anyone to execute queries without any setup.
You can read the full DuckDB documentation here.
To see how the Count canvas works, you can check out the docs here.
CTEs as cells

In the Count canvas you can reference other cells as if they were CTEs. Under the hood, they are turned into CTEs.
The benefit of breaking them out into cells is that you can:
1. See the results of each one
2. Use it in multiple other queries

Cell approach:

top_tracks

select
  track_id,
  artist,
  title,
  sum(streams) total_streams
from
  daily_streams
group by
  track_id,
  artist,
  title
order by
  sum(streams) desc
limit
  50

track_id                  artist             title                                                              total_streams
7qiZfU4dY1lWllzX7mPBI3    Ed Sheeran         Shape of You                                                       1453582393
5CtI0qwDJkDQGwXD1H1cLb    Luis Fonsi         Despacito - Remix                                                  881385850
4aWmUDTfIPGksMNLV2rQP2    Luis Fonsi         Despacito (Featuring Daddy Yankee)                                 724654078
6RUKPb4LETWmmr3iAEQktW    The Chainsmokers   Something Just Like This                                           671110075
7KXjTSCq5nL1LoYtL7XAwS    Kendrick Lamar     HUMBLE.                                                            633417398
3eR23VReFzcdmS7TYCrhCe    Kygo               It Ain't Me (with Selena Gomez)                                    594663409
0KKkJNfGyhkQ5aFogxQAPU    Bruno Mars         That's What I Like                                                 562700363
3NdDpSvN911VPGivFlV5d0    ZAYN               I Don’t Wanna Live Forever (Fifty Shades Darker) - From "Fifty…    560246014
4iLqG9SeJSnt0cSPICSjxv    Charlie Puth       Attention                                                          526576774
(4 columns · 50 rows, first 9 shown)

dancy_tracks

select
  track_id,
  danceability
from
  tracks
order by
  danceability desc
limit
  50

track_id                  danceability
5hx4FpDZOnDVYlyoG3rMks    0,957
01VvADRZqvXrHfwAKUOS8v    0,954
1sCxVKWImDZSZKvG0U9B23    0,953
7FB8l7UA1HKqnuSLjP9qDc    0,952
76Y0gxtTxN0FyDCYh5qYQj    0,950
6foxplXS0YEq8cO374Lty4    0,950
4at9j7aC7FvDlsllpOvsAC    0,949
5GFDrUTLGJix84sNhjCG0g    0,945
5hTpBe8h35rJ67eAWHQsJx    0,941
52uHpw5tjtvml4jadVpG8X    0,940
4lozYCMRLtEc46exlhoK2Q    0,938
2uSLx5uwRIhiA3sJJraNqX    0,938
43ZyHQITOjhciSUUNPVRHc    0,936
(2 columns · 50 rows, first 13 shown)

cte_cells

select
  *
from
  top_tracks
  inner join dancy_tracks using (track_id)
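For comparison, the same three cells could be written by hand as one query with two CTEs. This is only a sketch of the equivalent SQL, assuming the `daily_streams` and `tracks` tables from the example schema; the exact SQL that the Count canvas generates under the hood is not shown in this guide and may differ.

```sql
with top_tracks as (
  -- 50 most-streamed tracks over the whole period
  select
    track_id,
    artist,
    title,
    sum(streams) as total_streams
  from daily_streams
  group by track_id, artist, title
  order by sum(streams) desc
  limit 50
),
dancy_tracks as (
  -- 50 tracks with the highest danceability
  select
    track_id,
    danceability
  from tracks
  order by danceability desc
  limit 50
)
-- tracks that appear in both lists
select *
from top_tracks
inner join dancy_tracks using (track_id)
```

Written this way, the intermediate results of `top_tracks` and `dancy_tracks` are not visible on their own, and neither CTE can be reused by another query. Separate cells allow both.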
