JOIN Operations
JOIN Operations
References
1. https://courses.cs.washington.edu/courses/cse154/18au/lectures/lec26-sql-joins/?print-pdf#/
undefined. https://www.enterprisedb.com/postgres-tutorials/overview-postgresql-indexes
undefined. https://github.com/morenoh149/postgresDBSamples/tree/master
Introduction
Join operations are essential for working with data that is spread across multiple tables. Join operations allow
us to combine information from multiple tables into a single result set, giving us a more complete picture of the
data we're working with.
Whether you're a database administrator, a software developer, or just someone curious about how databases
work, understanding join operations is crucial for working with data effectively. In this lecture, we will cover the
basics of join operations, including different types of joins, how to write join queries, and common use cases for
join operations. In the end of this lecture, we will also dive in into some best practices and pitfalls to avoid in
writing JOIN operations.
In simple terms, a JOIN operation combines data from two or more tables in a database based on a related
column between them. It's like a powerful tool that lets you merge information from different sources, creating a
unified picture of your data.
For example, let's say you have one table with your friends' names and contact information, and another table
with their favorite hobbies. You could use a JOIN operation to combine the two tables based on the common
column of their names, creating a new table that includes both their contact info and hobbies. Now you can plan
the perfect party that caters to everyone's interests!
JOIN operations are crucial for complex data analysis, where data is spread across multiple tables with different
attributes. They allow you to extract valuable insights by connecting the dots between different data sources.
Here's a conceptual example
Terms:
Primary Key : a column guaranteed to be unique for each record (e.g. Alice's ID 1)
Foreign Key : a column in Table A storing a primary key from table B
In such database structures, there are primary keys and foreign keys which are used to interlinked data between
tables. The customer_id in Orders table resolves into column id in Customer table.
There are at least two standards to write JOIN query: using WHERE or JOIN keyword. We'll take an example
how to connect teachers with its courses.
Types of JOIN
INNER JOIN
This type of JOIN returns only the rows that have matching values in both tables based on a specified join
condition. In other words, it returns the intersection of the two tables. INNER JOIN is the most commonly used
JOIN operation in databases.
LEFT JOIN
This type of JOIN returns all the rows from the left table and the matching rows from the right table based on a
specified join condition. If there are no matching rows in the right table, the result will still include the rows from
the left table with NULL values in the columns from the right table.
RIGHT JOIN
This type of JOIN is similar to LEFT JOIN but returns all the rows from the right table and the matching rows
from the left table based on a specified join condition. If there are no matching rows in the left table, the result
will still include the rows from the right table with NULL values in the columns from the left table.
This type of JOIN returns all the rows from both tables and NULL values in the columns that do not have
matching values in the other table based on a specified join condition. In other words, it returns the union of the
two tables.
CROSS JOIN
This type of JOIN returns the Cartesian product of the two tables, where each row from the first table is
combined with each row from the second table. CROSS JOIN does not require a join condition and can result in
a large number of rows.
Practical example
In [ ]:
df_3 = _deepnote_execute_sql('SELECT "Customer"."CustomerId", "Customer"."FirstName"||\'
\' ||"Customer"."LastName" as CustomerName \n FROM "Invoice" \n JOIN "Customer" ON "Invoice"
."CustomerId" = "Customer"."CustomerId"', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_3
Out[ ]:
CustomerId customername
0 2 Leonie Köhler
1 4 Bjørn Hansen
2 8 Daan Peeters
3 14 Mark Philips
4 23 John Gordon
In [ ]:
df_6 = _deepnote_execute_sql('SELECT "Customer"."CustomerId", "Customer"."FirstName"||\'
\' ||"Customer"."LastName" as CustomerName \n FROM "Invoice" \n JOIN "Customer" ON "Invoice"
."CustomerId" = "Customer"."CustomerId"\n GROUP BY "Customer"."CustomerId"', 'SQL_5C2E9A9B
_B591_4FC4_ADD5_829EB3188444')
df_6
Out[ ]:
CustomerId customername
0 29 Robert Brown
1 54 Steve Murray
2 4 Bjørn Hansen
3 34 João Fernandes
4 51 Joakim Johansson
5 52 Emma Jones
6 10 Eduardo Martins
7 35 Madalena Sampaio
8 45 Ladislav Kovács
9 6 Helena Holý
10 39 Camille Bernard
11 36 Hannah Schneider
12 31 Martha Silk
13 50 Enrique Muñoz
14 14 Mark Philips
15 22 Heather Leacock
16 59 Puja Srivastava
17 13 Fernanda Ramos
18 2 Leonie Köhler
19 16 Frank Harris
20 CustomerId
11 customername
Alexandre Rocha
21 44 Terhi Hämäläinen
22 42 Wyatt Girard
23 41 Marc Dubois
24 46 Hugh O'Reilly
25 40 Dominique Lefebvre
26 43 Isabelle Mercier
27 32 Aaron Mitchell
28 53 Phil Hughes
29 7 Astrid Gruber
30 9 Kara Nielsen
31 38 Niklas Schröder
32 15 Jennifer Peterson
33 26 Richard Cunningham
34 12 Roberto Almeida
36 24 Frank Ralston
37 57 Luis Rojas
38 19 Tim Goyer
39 25 Victor Stevens
40 30 Edward Francis
41 21 Kathy Chase
42 49 Stanislaw Wójcik
43 47 Lucas Mancini
44 3 François Tremblay
45 17 Jack Smith
46 20 Dan Miller
47 28 Julia Barnett
48 37 Fynn Zimmermann
49 33 Ellie Sullivan
50 1 Luís Gonçalves
51 5 Frantiek Wichterlová
52 18 Michelle Brooks
53 55 Mark Taylor
54 27 Patrick Gray
55 23 John Gordon
56 56 Diego Gutiérrez
57 58 Manoj Pareek
58 8 Daan Peeters
In [ ]:
df_5 = _deepnote_execute_sql('SELECT "Customer"."CustomerId", "Customer"."FirstName"||\'
\' ||"Customer"."LastName" as CustomerName, "Track"."Name" \n FROM "Invoice" \n JOIN "Custom
er" ON "Invoice"."CustomerId" = "Customer"."CustomerId"\n JOIN "InvoiceLine" ON "Invoice".
"InvoiceId" = "InvoiceLine"."InvoiceId"\n JOIN "Track" ON "Track"."TrackId" = "InvoiceLine
"."TrackId"', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_5
Out[ ]:
In [ ]:
df_4 = _deepnote_execute_sql('SELECT "Customer"."CustomerId", "Customer"."FirstName"||\'
\' ||"Customer"."LastName" as CustomerName, string_agg("Track"."Name", \' ,\' ) \n FROM "Invo
ice" \n JOIN "Customer" ON "Invoice"."CustomerId" = "Customer"."CustomerId"\n JOIN "Invoice
Line" ON "Invoice"."InvoiceId" = "InvoiceLine"."InvoiceId"\n JOIN "Track" ON "Track"."Trac
kId" = "InvoiceLine"."TrackId"\n GROUP BY "Customer"."CustomerId"', 'SQL_5C2E9A9B_B591_4FC
4_ADD5_829EB3188444')
df_4
Out[ ]:
17 13 Fernanda Ramos Dust N' Bones,Live and Let Die,The Memory Rema...
23 41 Marc Dubois Suite No. 3 in D, BWV 1068: III. Gavotte I & I...
27 32 Aaron Mitchell How Many More Times,What Is And What Should Ne...
30 9 Kara Nielsen The Thing That Should Not Be,Welcome Home (San...
54 27 Patrick Gray These Colours Don't Run,For the Greater Good o...
In [ ]:
df_1 = _deepnote_execute_sql('SELECT "Customer"."FirstName", SUM("Total") as TotalBought
\n FROM "Invoice" \n JOIN "Customer" ON "Invoice"."CustomerId" = "Customer"."CustomerId"\n G
ROUP BY "Customer"."CustomerId" \n ORDER BY TotalBought DESC', 'SQL_5C2E9A9B_B591_4FC4_ADD
5_829EB3188444')
df_1
Out[ ]:
Out[ ]:
FirstName totalbought
0 Helena 49.62
1 Richard 47.62
2 Luis 46.62
3 Ladislav 45.62
4 Hugh 45.62
5 Julia 43.62
6 Fynn 43.62
7 Frank 43.62
8 Astrid 42.62
9 Victor 42.62
10 Terhi 41.62
11 Johannes 40.62
12 Frantiek 40.62
13 Isabelle 40.62
14 François 39.62
15 Bjørn 39.62
16 João 39.62
17 Heather 39.62
18 Wyatt 39.62
19 Jack 39.62
20 Dan 39.62
21 Luís 39.62
22 Joakim 38.62
23 Tim 38.62
24 Dominique 38.62
25 Jennifer 38.62
26 Manoj 38.62
27 Camille 38.62
28 Kara 37.62
29 Niklas 37.62
30 Martha 37.62
31 Roberto 37.62
32 Hannah 37.62
33 Madalena 37.62
34 Eduardo 37.62
35 Edward 37.62
36 Kathy 37.62
37 Stanislaw 37.62
38 Lucas 37.62
39 Robert 37.62
40 Diego 37.62
41 Emma 37.62
42 Fernanda 37.62
43 Marc totalbought
FirstName 37.62
44 Mark 37.62
45 Aaron 37.62
46 Phil 37.62
47 Enrique 37.62
48 John 37.62
49 Ellie 37.62
50 Daan 37.62
51 Steve 37.62
52 Michelle 37.62
53 Mark 37.62
54 Patrick 37.62
55 Leonie 37.62
56 Frank 37.62
57 Alexandre 37.62
58 Puja 36.64
In [ ]:
df_2 = _deepnote_execute_sql('SELECT "CustomerId", SUM("Total") as TotalBought \n FROM "In
voice" \n GROUP BY "CustomerId" \n ORDER BY TotalBought DESC', 'SQL_5C2E9A9B_B591_4FC4_ADD5
_829EB3188444')
df_2
Out[ ]:
CustomerId totalbought
0 6 49.62
1 26 47.62
2 57 46.62
3 45 45.62
4 46 45.62
5 28 43.62
6 37 43.62
7 24 43.62
8 7 42.62
9 25 42.62
10 44 41.62
11 48 40.62
12 5 40.62
13 43 40.62
14 3 39.62
15 4 39.62
16 34 39.62
17 22 39.62
18 42 39.62
19 17 39.62
20 20 39.62
21 1 39.62
21 1 39.62
CustomerId totalbought
22 51 38.62
23 19 38.62
24 40 38.62
25 15 38.62
26 58 38.62
27 39 38.62
28 9 37.62
29 38 37.62
30 31 37.62
31 12 37.62
32 36 37.62
33 35 37.62
34 10 37.62
35 30 37.62
36 21 37.62
37 49 37.62
38 47 37.62
39 29 37.62
40 56 37.62
41 52 37.62
42 13 37.62
43 41 37.62
44 14 37.62
45 32 37.62
46 53 37.62
47 50 37.62
48 23 37.62
49 33 37.62
50 8 37.62
51 54 37.62
52 18 37.62
53 55 37.62
54 27 37.62
55 2 37.62
56 16 37.62
57 11 37.62
58 59 36.64
In [ ]:
df_7 = _deepnote_execute_sql('SELECT "InvoiceDate", COUNT(1) as TotalSold\n FROM "Invoice"
\n GROUP BY "InvoiceDate"\n ORDER BY TotalSold DESC', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB318
8444')
df_7
Out[ ]:
InvoiceDate totalsold
0 InvoiceDate
2012-12-28 totalsold
2
1 2013-07-02 2
2 2009-04-04 2
3 2011-04-18 2
4 2012-01-22 2
349 2010-05-14 1
350 2010-01-13 1
351 2010-11-24 1
352 2010-06-13 1
353 2010-06-30 1
Performance optimization
The EXPLAIN results in PostgreSQL will show the query execution plan, which is a step-by-step breakdown of
how the PostgreSQL query planner will execute the query.
QUERY PLAN
--------------------------------------------------------------------------------
-
Hash Join (cost=31.05..41.43 rows=1000 width=50)
Hash Cond: (orders.customer_id = customers.customer_id)
-> Seq Scan on orders (cost=0.00..8.00 rows=1000 width=26)
-> Hash (cost=20.00..20.00 rows=1000 width=28)
-> Seq Scan on customers (cost=0.00..20.00 rows=1000 width=28)
The query plan shows that PostgreSQL will use a hash join to combine the two tables.
The first step is a sequential scan of the orders table, with an estimated cost of 0.00..8.00. PostgreSQL estimates
that there are 1000 rows in the table, and it will need to read all of them to perform the join.
The second step is also a sequential scan, but this time of the customers table. This table is also estimated to
have 1000 rows, and will also need to be fully scanned to perform the join.
Finally, the query plan shows that PostgreSQL will use a hash join operation to combine the two tables, based on
the join condition orders.customer_id = customers.customer_id. The hash join has an estimated cost of
31.05..41.43, which includes the cost of reading the tables and performing the join operation.
By analyzing the query plan, we can see that the query will perform a full table scan of both the orders and
customers tables, and that a hash join is used to combine the two tables. This query plan provides useful
information for optimizing the query and improving its performance if needed.
Deep Dive
In [ ]:
In [ ]:
df_8 = _deepnote_execute_sql('-- NORMAL QUERY\n EXPLAIN ANALYZE SELECT * FROM "orders_no_
index" WHERE "customer_id" = 733219;', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_8
Out[ ]:
QUERY PLAN
1 Workers Planned: 1
2 Workers Launched: 0
In this example, we can see that the query is doing a parallel sequential scan on the table "orders_no_index",
filtering rows where the "customer_id" column equals '733219'. The estimated cost of this operation is 13722.94,
and the actual execution time is 72.526 ms.
To identify which column should be indexed, we can look at the Filter line in the output. In this case, we can see
that the query is filtering rows based on the "customer_id" column. Since this is a simple equality check, we can
create an index on this column to speed up the query.
In [ ]:
df_9 = _deepnote_execute_sql('EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 73
3219;', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_9
Out[ ]:
QUERY PLAN
As you can see, the EXPLAIN ANALYZE output now shows an "Index Scan" operation on the
orders_customer_id_idx index, with a cost of 3.76. The estimated number of rows returned by the query is 2,
which didn't match the actual number of rows returned. The actual time it took to execute the query is now
much faster, at 0.057 milliseconds, compared to 72.526 milliseconds without the index.
In summary, adding an index on the customer_id column of the orders table significantly improved the
performance of the query by allowing PostgreSQL to scan the index instead of the entire table. This also will
improve JOIN operation.
QUERY
In [ ]:
df_11 = _deepnote_execute_sql('SELECT\n customers.customer_name,\n orders.order_date,\n
orders.total_amount\n FROM\n orders\n JOIN customers ON orders.customer_id = customers.cus
tomer_id\n WHERE customers.customer_id = 733219;', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB31884
44')
df_11
Out[ ]:
WITHOUT INDEX
In [ ]:
Out[ ]:
QUERY PLAN
1 Workers Planned: 1
2 Workers Launched: 0
WITH INDEX
In [ ]:
Out[ ]:
Out[ ]:
QUERY PLAN
Exploration
Install Postgres & load Chinook dataset, then try to answer these questions:
Which are the top 5 most popular genres in the Chinook database? How many tracks are there for each
genre?
Who are the top 10 best-selling artists in the Chinook store? How many tracks have they sold?
What is the average purchase price per invoice for each country in the Chinook database? Which countries
have the highest and lowest average purchase prices?
What is the correlation between the length of a track and its price in the Chinook store? Are longer tracks
generally more expensive or less expensive?
How has the Chinook store's sales performance changed over time? Is it trending upwards, downwards, or
staying the same? Can you identify any patterns in the data?
Supplementary material
In [ ]:
In [ ]:
import pandas as pd
import plotly.express as px
In [ ]:
# calculate the 90th percentile of CLV values
pct_90 = df_14['TotalSpent'].quantile(0.9)
There are 7 high-value customers in the Chinook store (i.e. the top 10%).
In [ ]:
In [ ]:
Created in Deepnote