
JOIN Operations

References

1. https://courses.cs.washington.edu/courses/cse154/18au/lectures/lec26-sql-joins/?print-pdf#/
2. https://www.enterprisedb.com/postgres-tutorials/overview-postgresql-indexes
3. https://github.com/morenoh149/postgresDBSamples/tree/master

Introduction

Join operations are essential for working with data that is spread across multiple tables. They allow
us to combine information from several tables into a single result set, giving us a more complete picture of the
data we're working with.

Whether you're a database administrator, a software developer, or just someone curious about how databases
work, understanding join operations is crucial for working with data effectively. In this lecture, we will cover the
basics of join operations, including the different types of joins, how to write join queries, and common use cases for
join operations. At the end of the lecture, we will also dive into some best practices and pitfalls to avoid when
writing JOIN operations.

What is a JOIN operation?

In simple terms, a JOIN operation combines data from two or more tables in a database based on a related
column between them. It's like a powerful tool that lets you merge information from different sources, creating a
unified picture of your data.

Source: XKCD 1810

For example, let's say you have one table with your friends' names and contact information, and another table
with their favorite hobbies. You could use a JOIN operation to combine the two tables based on the common
column of their names, creating a new table that includes both their contact info and hobbies. Now you can plan
the perfect party that caters to everyone's interests!

JOIN operations are crucial for complex data analysis, where data is spread across multiple tables with different
attributes. They allow you to extract valuable insights by connecting the dots between different data sources.
Here's a conceptual example

Terms:

Primary key: a column guaranteed to be unique for each record (e.g. Alice's ID 1)
Foreign key: a column in table A storing a primary-key value from table B

In such database structures, primary keys and foreign keys are used to interlink data between
tables. For example, the customer_id column in the Orders table references the id column in the Customers table.
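As a minimal sketch of this relationship (using SQLite from Python as a stand-in for the databases used later, with hypothetical table contents), the primary key and foreign key could be declared like this:

```python
import sqlite3

# In-memory database: Customers.id is the primary key,
# Orders.customer_id is a foreign key referencing it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE Orders (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES Customers(id),
    product_name TEXT NOT NULL
);
""")
conn.execute("INSERT INTO Customers VALUES (1, 'Alice')")
conn.execute("INSERT INTO Orders VALUES (100, 1, 'Laptop')")

# The foreign key lets us resolve customer_id back to a customer name.
row = conn.execute("""
    SELECT c.name, o.product_name
    FROM Orders o JOIN Customers c ON o.customer_id = c.id
""").fetchone()
print(row)  # ('Alice', 'Laptop')
```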

There are at least two standard ways to write a JOIN query: using the WHERE clause or the JOIN keyword. As an
example, let's connect customers with their orders.

SELECT c.name AS customer_name, o.product_name
FROM Orders o, Customers c
WHERE o.customer_id = c.id

This is equivalent to:

SELECT c.name AS customer_name, o.product_name
FROM Orders o
JOIN Customers c ON o.customer_id = c.id

Question: must the referenced key always be a primary key?
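A quick way to convince yourself the two styles are equivalent is to run both against the same data. This is a minimal sketch using SQLite from Python, with hypothetical sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER, product_name TEXT);
INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO Orders VALUES (100, 1, 'Laptop'), (101, 2, 'Phone');
""")

# Implicit join: comma-separated FROM list plus a WHERE condition.
where_style = conn.execute("""
    SELECT c.name AS customer_name, o.product_name
    FROM Orders o, Customers c
    WHERE o.customer_id = c.id
""").fetchall()

# Explicit join: the ANSI JOIN ... ON syntax.
join_style = conn.execute("""
    SELECT c.name AS customer_name, o.product_name
    FROM Orders o
    JOIN Customers c ON o.customer_id = c.id
""").fetchall()

# Compare as sorted lists since SQL row order is unspecified without ORDER BY.
print(sorted(where_style) == sorted(join_style))  # True
```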

Types of JOIN

INNER JOIN

This type of JOIN returns only the rows that have matching values in both tables based on a specified join
condition. In other words, it returns the intersection of the two tables. INNER JOIN is the most commonly used
JOIN operation in databases.

SELECT Customers.customer_id, Customers.name, Orders.order_id, Orders.product_name
FROM Customers
INNER JOIN Orders
ON Customers.customer_id = Orders.customer_id;

This query will return the following result:
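The original result table is not reproduced here; as a stand-in, here is a self-contained sketch with small hypothetical Customers/Orders tables (SQLite from Python) showing the intersection behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER, product_name TEXT);
-- Carol (id 3) has no orders; order 102 references a missing customer (9).
INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO Orders VALUES (100, 1, 'Laptop'), (101, 2, 'Phone'),
                          (102, 9, 'Tablet');
""")
rows = conn.execute("""
    SELECT Customers.customer_id, Customers.name,
           Orders.order_id, Orders.product_name
    FROM Customers
    INNER JOIN Orders ON Customers.customer_id = Orders.customer_id
""").fetchall()
# Only matching pairs survive: Carol and order 102 are both dropped.
for r in rows:
    print(r)
```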

LEFT JOIN

This type of JOIN returns all the rows from the left table and the matching rows from the right table based on a
specified join condition. If there are no matching rows in the right table, the result will still include the rows from
the left table with NULL values in the columns from the right table.

SELECT Customers.customer_id, Customers.name, Orders.order_id, Orders.product_name
FROM Customers
LEFT JOIN Orders
ON Customers.customer_id = Orders.customer_id;

This query will return the following result:
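As a stand-in for the missing result table, a self-contained sketch (hypothetical data, SQLite from Python) shows how the unmatched left-side row is kept with NULLs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER, product_name TEXT);
-- Carol (id 3) has no orders.
INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO Orders VALUES (100, 1, 'Laptop'), (101, 2, 'Phone'),
                          (102, 9, 'Tablet');
""")
rows = conn.execute("""
    SELECT Customers.customer_id, Customers.name,
           Orders.order_id, Orders.product_name
    FROM Customers
    LEFT JOIN Orders ON Customers.customer_id = Orders.customer_id
""").fetchall()
# Carol still appears, with None (SQL NULL) in the Orders columns.
for r in rows:
    print(r)
```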

RIGHT JOIN

This type of JOIN is similar to LEFT JOIN but returns all the rows from the right table and the matching rows
from the left table based on a specified join condition. If there are no matching rows in the left table, the result
will still include the rows from the right table with NULL values in the columns from the left table.

SELECT Customers.customer_id, Customers.name, Orders.order_id, Orders.product_name
FROM Customers
RIGHT JOIN Orders
ON Customers.customer_id = Orders.customer_id;

This query will return the following result:
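A hedged sketch with small hypothetical tables: SQLite only gained native RIGHT JOIN in version 3.39, so the example below expresses the same query as a LEFT JOIN with the table order swapped, which is equivalent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER, product_name TEXT);
-- Order 102 references a customer id (9) that does not exist.
INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO Orders VALUES (100, 1, 'Laptop'), (101, 2, 'Phone'),
                          (102, 9, 'Tablet');
""")
# RIGHT JOIN keeps every order; emulated here as a swapped LEFT JOIN.
rows = conn.execute("""
    SELECT Customers.customer_id, Customers.name,
           Orders.order_id, Orders.product_name
    FROM Orders
    LEFT JOIN Customers ON Customers.customer_id = Orders.customer_id
""").fetchall()
# Order 102 survives with None (SQL NULL) in the Customers columns.
for r in rows:
    print(r)
```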

FULL OUTER JOIN

This type of JOIN returns all the rows from both tables and NULL values in the columns that do not have
matching values in the other table based on a specified join condition. In other words, it returns the union of the
two tables.

SELECT Customers.customer_id, Customers.name, Orders.order_id, Orders.product_name
FROM Customers
FULL OUTER JOIN Orders
ON Customers.customer_id = Orders.customer_id;

This query will return the following result:
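As a sketch with hypothetical data: older SQLite versions lack FULL OUTER JOIN, so a common workaround (shown here from Python) is a LEFT JOIN in each direction combined with UNION ALL, keeping only the unmatched rows from the second pass:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER, product_name TEXT);
-- Carol has no orders; order 102 has no matching customer.
INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO Orders VALUES (100, 1, 'Laptop'), (101, 2, 'Phone'),
                          (102, 9, 'Tablet');
""")
rows = conn.execute("""
    SELECT c.customer_id, c.name, o.order_id, o.product_name
    FROM Customers c LEFT JOIN Orders o ON c.customer_id = o.customer_id
    UNION ALL
    SELECT c.customer_id, c.name, o.order_id, o.product_name
    FROM Orders o LEFT JOIN Customers c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()
# Unmatched rows from BOTH sides are kept, padded with None.
for r in rows:
    print(r)
```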

CROSS JOIN

This type of JOIN returns the Cartesian product of the two tables, where each row from the first table is
combined with each row from the second table. CROSS JOIN does not require a join condition and can result in
a large number of rows.

SELECT Customers.customer_id, Customers.name, Orders.order_id, Orders.product_name
FROM Customers
CROSS JOIN Orders;

This query will return the following result:
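The row-count blow-up is easy to demonstrate on hypothetical tables (SQLite from Python): the result size is the product of the two table sizes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER, product_name TEXT);
INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO Orders VALUES (100, 1, 'Laptop'), (101, 2, 'Phone'),
                          (102, 9, 'Tablet');
""")
# No ON condition: every customer is paired with every order.
rows = conn.execute("""
    SELECT Customers.customer_id, Orders.order_id
    FROM Customers
    CROSS JOIN Orders
""").fetchall()
print(len(rows))  # 9 = 3 customers x 3 orders
```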

Practical example

In [ ]:
df_3 = _deepnote_execute_sql('''
SELECT "Customer"."CustomerId",
       "Customer"."FirstName" || ' ' || "Customer"."LastName" AS CustomerName
FROM "Invoice"
JOIN "Customer" ON "Invoice"."CustomerId" = "Customer"."CustomerId"
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
# One row per invoice, so customers with several purchases repeat.
df_3
Out[ ]:

CustomerId customername

0 2 Leonie Köhler

1 4 Bjørn Hansen

2 8 Daan Peeters

3 14 Mark Philips

4 23 John Gordon

... ... ...

407 25 Victor Stevens

408 29 Robert Brown

409 35 Madalena Sampaio

410 44 Terhi Hämäläinen

411 58 Manoj Pareek

412 rows × 2 columns

In [ ]:
df_6 = _deepnote_execute_sql('''
SELECT "Customer"."CustomerId",
       "Customer"."FirstName" || ' ' || "Customer"."LastName" AS CustomerName
FROM "Invoice"
JOIN "Customer" ON "Invoice"."CustomerId" = "Customer"."CustomerId"
GROUP BY "Customer"."CustomerId"
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
# Grouping by the primary key deduplicates customers; PostgreSQL allows
# selecting FirstName/LastName because they are functionally dependent on it.
df_6
Out[ ]:

CustomerId customername

0 29 Robert Brown

1 54 Steve Murray

2 4 Bjørn Hansen

3 34 João Fernandes

4 51 Joakim Johansson

5 52 Emma Jones

6 10 Eduardo Martins

7 35 Madalena Sampaio

8 45 Ladislav Kovács

9 6 Helena Holý

10 39 Camille Bernard

11 36 Hannah Schneider

12 31 Martha Silk

13 50 Enrique Muñoz

14 14 Mark Philips

15 22 Heather Leacock

16 59 Puja Srivastava

17 13 Fernanda Ramos

18 2 Leonie Köhler

19 16 Frank Harris
20 11 Alexandre Rocha

21 44 Terhi Hämäläinen

22 42 Wyatt Girard

23 41 Marc Dubois

24 46 Hugh O'Reilly

25 40 Dominique Lefebvre

26 43 Isabelle Mercier

27 32 Aaron Mitchell

28 53 Phil Hughes

29 7 Astrid Gruber

30 9 Kara Nielsen

31 38 Niklas Schröder

32 15 Jennifer Peterson

33 26 Richard Cunningham

34 12 Roberto Almeida

35 48 Johannes Van der Berg

36 24 Frank Ralston

37 57 Luis Rojas

38 19 Tim Goyer

39 25 Victor Stevens

40 30 Edward Francis

41 21 Kathy Chase

42 49 Stanislaw Wójcik

43 47 Lucas Mancini

44 3 François Tremblay

45 17 Jack Smith

46 20 Dan Miller

47 28 Julia Barnett

48 37 Fynn Zimmermann

49 33 Ellie Sullivan

50 1 Luís Gonçalves

51 5 František Wichterlová

52 18 Michelle Brooks

53 55 Mark Taylor

54 27 Patrick Gray

55 23 John Gordon

56 56 Diego Gutiérrez

57 58 Manoj Pareek

58 8 Daan Peeters

In [ ]:
df_5 = _deepnote_execute_sql('''
SELECT "Customer"."CustomerId",
       "Customer"."FirstName" || ' ' || "Customer"."LastName" AS CustomerName,
       "Track"."Name"
FROM "Invoice"
JOIN "Customer" ON "Invoice"."CustomerId" = "Customer"."CustomerId"
JOIN "InvoiceLine" ON "Invoice"."InvoiceId" = "InvoiceLine"."InvoiceId"
JOIN "Track" ON "Track"."TrackId" = "InvoiceLine"."TrackId"
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_5
Out[ ]:

CustomerId customername Name

0 2 Leonie Köhler Balls to the Wall

1 2 Leonie Köhler Restless and Wild

2 4 Bjørn Hansen Put The Finger On You

3 4 Bjørn Hansen Inject The Venom

4 4 Bjørn Hansen Evil Walks

... ... ... ...

2235 44 Terhi Hämäläinen Looking For Love

2236 44 Terhi Hämäläinen Sweet Lady Luck

2237 44 Terhi Hämäläinen Feirinha da Pavuna/Luz do Repente/Bagaço da La...

2238 44 Terhi Hämäläinen Samba pras moças

2239 58 Manoj Pareek Hot Girl

2240 rows × 3 columns

In [ ]:
df_4 = _deepnote_execute_sql('''
SELECT "Customer"."CustomerId",
       "Customer"."FirstName" || ' ' || "Customer"."LastName" AS CustomerName,
       string_agg("Track"."Name", ',')
FROM "Invoice"
JOIN "Customer" ON "Invoice"."CustomerId" = "Customer"."CustomerId"
JOIN "InvoiceLine" ON "Invoice"."InvoiceId" = "InvoiceLine"."InvoiceId"
JOIN "Track" ON "Track"."TrackId" = "InvoiceLine"."TrackId"
GROUP BY "Customer"."CustomerId"
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_4
Out[ ]:

CustomerId customername string_agg

0 29 Robert Brown Your Time Is Gonna Come,Mensagen De Amor (2000...

1 54 Steve Murray Someday Never Comes,Give Me Novacaine,Extraord...

2 4 Bjørn Hansen Put The Finger On You,Inject The Venom,Evil Wa...

3 34 João Fernandes Helpless,Mouth To Mouth,Thank You,Living Lovin...

4 51 Joakim Johansson Invaders,Run to the Hills,I Don't Know,Flying ...

5 52 Emma Jones Samba Makossa,Lixo Do Mangue,Firmamento,Já Foi...

6 10 Eduardo Martins Admirável Gado Novo,Mis Penas Lloraba Yo (Ao V...

7 35 Madalena Sampaio Wave (Os Cariocas),Garota de Ipanema (Dick Far...

8 45 Ladislav Kovács O Pulso,Nem 5 Minutos Guardados,Dirty Little T...

9 6 Helena Holý When You Gonna Learn (Digeridoo),Whatever It I...

10 39 Camille Bernard Prometheus Overture, Op. 43,Sonata for Solo Vi...

11 36 Hannah Schneider She Loves Me Not,Paths Of Glory,Fear Is The Ke...

12 31 Martha Silk Diga Lá, Coração,Comportamento Geral,Podres Po...

13 50 Enrique Muñoz Hallowed Be Thy Name,Phantom Lord,Seek & Destr...

14 14 Mark Philips Right Through You,Not The Doctor,Bleed The Fre...

15 22 Heather Leacock When Love Comes To Town,Angel Of Harlem,Sozinh...

16 59 Puja Srivastava Cotton Fields,Don't Look Now,Before You Accuse...

17 13 Fernanda Ramos Dust N' Bones,Live and Let Die,The Memory Rema...

18 2 Leonie Köhler Balls to the Wall,Restless and Wild,Lavadeira,...

19 16 Frank Harris Valentino's,Promises,Signe,Ghost Of The Naviga...

20 11 Alexandre Rocha Leper Messiah,Damage Inc.,Green Disease,Why Go...


21 44 Terhi Hämäläinen Dazed And Confused,L'Avventura,Soul Parsifal,Q...

22 42 Wyatt Girard Com Açúcar E Com Afeto,Meu Caro Amigo,Trocando...

23 41 Marc Dubois Suite No. 3 in D, BWV 1068: III. Gavotte I & I...

24 46 Hugh O'Reilly Etnia,Samba Do Lado,Sobremesa,Sangue De Bairro...

25 40 Dominique Lefebvre Morena De Angola,A Banda,União Da Ilha,Put You...

26 43 Isabelle Mercier Prá Dizer Adeus,Família,Act IV, Symphony,Music...

27 32 Aaron Mitchell How Many More Times,What Is And What Should Ne...

28 53 Phil Hughes The Prisoner,Lord Of The Flies,Condição,Aquilo...

29 7 Astrid Gruber Soldier Side - Intro,Revenga,Solitary,Fire + W...

30 9 Kara Nielsen The Thing That Should Not Be,Welcome Home (San...

31 38 Niklas Schröder Atras Da Porta,Tatuagem,Pristina,Caffeine,RV,E...

32 15 Jennifer Peterson Perfect Crime,Bad Obsession,Stone Free,Satch B...

33 26 Richard Cunningham Radio Free Aurope,Perfect Circle,Drowning Man,...

34 12 Roberto Almeida Right Next Door to Hell,In The Evening,Fool In...

35 48 Johannes Van der Berg Underwater Love,Falamansa Song,Avisa,Desaforo,...

36 24 Frank Ralston Sunday Bloody Sunday,New Year's Day,Meet Kevin...

37 57 Luis Rojas Good Golly Miss Molly,Wrote A Song For Everyon...

38 19 Tim Goyer J Squared,Maria,Hey Cisco,Fortuneteller,High B...

39 25 Victor Stevens Stuck With Me,Nice Guys Finish Last,Macy's Day...

40 30 Edward Francis Black Mountain Side,Communication Breakdown,Ca...

41 21 Kathy Chase Longview,Basket Case,She,Geek Stink Breath,Yes...

42 49 Stanislaw Wójcik Romance Ideal,SKA,Finding My Way,Evil Ways,It'...

43 47 Lucas Mancini Me Liga,Quase Um Segundo,Palavras,A Melhor For...

44 3 François Tremblay Pilot,Through the Looking Glass, Pt. 1,Canta, ...

45 17 Jack Smith Believe,As We Sleep,Double Talkin' Jive,The Ga...

46 20 Dan Miller Bem Devagar,Saudosismo,Posso Perder Minha Mulh...

47 28 Julia Barnett So Central Rain,Pretty Persuasion,Can't Stand ...

48 37 Fynn Zimmermann Bye, Bye Brasil,Susie Q,Proud Mary,Over And Ou...

49 33 Ellie Sullivan Naked In Front Of The Computer,Moonchild,Can I...

50 1 Luís Gonçalves Experiment In Terra,Take the Celestra,Shout It...

51 5 František Wichterlová Wet My Bed,Crackerman,#9 Dream,Give Peace a Ch...

52 18 Michelle Brooks Terra,Eclipse Oculto,Hey Hey,Lonely Stranger,L...

53 55 Mark Taylor Walking On The Water,Suzie-Q, Pt. 2,Fortunes O...

54 27 Patrick Gray These Colours Don't Run,For the Greater Good o...

55 23 John Gordon Your Time Has Come,Dandelion,Rock 'N' Roll Mus...

56 56 Diego Gutiérrez Love Gun,Deuce,Wake Me Up When September Ends,...

57 58 Manoj Pareek Shock Me,She,Black Night,Pictures Of Home,Casc...

58 8 Daan Peeters Dog Eat Dog,Overdose,Love In An Elevator,Janie...

In [ ]:
df_1 = _deepnote_execute_sql('''
SELECT "Customer"."FirstName", SUM("Total") AS TotalBought
FROM "Invoice"
JOIN "Customer" ON "Invoice"."CustomerId" = "Customer"."CustomerId"
GROUP BY "Customer"."CustomerId"
ORDER BY TotalBought DESC
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_1
Out[ ]:

FirstName totalbought

0 Helena 49.62

1 Richard 47.62

2 Luis 46.62

3 Ladislav 45.62

4 Hugh 45.62

5 Julia 43.62

6 Fynn 43.62

7 Frank 43.62

8 Astrid 42.62

9 Victor 42.62

10 Terhi 41.62

11 Johannes 40.62

12 František 40.62

13 Isabelle 40.62

14 François 39.62

15 Bjørn 39.62

16 João 39.62

17 Heather 39.62

18 Wyatt 39.62

19 Jack 39.62

20 Dan 39.62

21 Luís 39.62

22 Joakim 38.62

23 Tim 38.62

24 Dominique 38.62

25 Jennifer 38.62

26 Manoj 38.62

27 Camille 38.62

28 Kara 37.62

29 Niklas 37.62

30 Martha 37.62

31 Roberto 37.62

32 Hannah 37.62

33 Madalena 37.62

34 Eduardo 37.62

35 Edward 37.62

36 Kathy 37.62

37 Stanislaw 37.62

38 Lucas 37.62

39 Robert 37.62

40 Diego 37.62

41 Emma 37.62

42 Fernanda 37.62
43 Marc 37.62

44 Mark 37.62

45 Aaron 37.62

46 Phil 37.62

47 Enrique 37.62

48 John 37.62

49 Ellie 37.62

50 Daan 37.62

51 Steve 37.62

52 Michelle 37.62

53 Mark 37.62

54 Patrick 37.62

55 Leonie 37.62

56 Frank 37.62

57 Alexandre 37.62

58 Puja 36.64

In [ ]:
df_2 = _deepnote_execute_sql('''
SELECT "CustomerId", SUM("Total") AS TotalBought
FROM "Invoice"
GROUP BY "CustomerId"
ORDER BY TotalBought DESC
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_2
Out[ ]:

CustomerId totalbought

0 6 49.62

1 26 47.62

2 57 46.62

3 45 45.62

4 46 45.62

5 28 43.62

6 37 43.62

7 24 43.62

8 7 42.62

9 25 42.62

10 44 41.62

11 48 40.62

12 5 40.62

13 43 40.62

14 3 39.62

15 4 39.62

16 34 39.62

17 22 39.62

18 42 39.62

19 17 39.62

20 20 39.62

21 1 39.62
22 51 38.62

23 19 38.62

24 40 38.62

25 15 38.62

26 58 38.62

27 39 38.62

28 9 37.62

29 38 37.62

30 31 37.62

31 12 37.62

32 36 37.62

33 35 37.62

34 10 37.62

35 30 37.62

36 21 37.62

37 49 37.62

38 47 37.62

39 29 37.62

40 56 37.62

41 52 37.62

42 13 37.62

43 41 37.62

44 14 37.62

45 32 37.62

46 53 37.62

47 50 37.62

48 23 37.62

49 33 37.62

50 8 37.62

51 54 37.62

52 18 37.62

53 55 37.62

54 27 37.62

55 2 37.62

56 16 37.62

57 11 37.62

58 59 36.64

In [ ]:
df_7 = _deepnote_execute_sql('''
SELECT "InvoiceDate", COUNT(1) AS TotalSold
FROM "Invoice"
GROUP BY "InvoiceDate"
ORDER BY TotalSold DESC
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_7

Out[ ]:

InvoiceDate totalsold

0 2012-12-28 2

1 2013-07-02 2

2 2009-04-04 2

3 2011-04-18 2

4 2012-01-22 2

... ... ...

349 2010-05-14 1

350 2010-01-13 1

351 2010-11-24 1

352 2010-06-13 1

353 2010-06-30 1

354 rows × 2 columns

Performance optimization

EXPLAIN SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id;

The EXPLAIN results in PostgreSQL will show the query execution plan, which is a step-by-step breakdown of
how the PostgreSQL query planner will execute the query.

Here's an example of what the results might look like:

QUERY PLAN
---------------------------------------------------------------------------------
Hash Join  (cost=31.05..41.43 rows=1000 width=50)
  Hash Cond: (orders.customer_id = customers.customer_id)
  ->  Seq Scan on orders  (cost=0.00..8.00 rows=1000 width=26)
  ->  Hash  (cost=20.00..20.00 rows=1000 width=28)
        ->  Seq Scan on customers  (cost=0.00..20.00 rows=1000 width=28)

The query plan shows that PostgreSQL will use a hash join to combine the two tables.

The first step is a sequential scan of the orders table, with an estimated cost of 0.00..8.00. PostgreSQL estimates
that there are 1000 rows in the table, and it will need to read all of them to perform the join.

The second step is also a sequential scan, but this time of the customers table. This table is also estimated to
have 1000 rows, and will also need to be fully scanned to perform the join.

Finally, the query plan shows that PostgreSQL will use a hash join operation to combine the two tables, based on
the join condition orders.customer_id = customers.customer_id. The hash join has an estimated cost of
31.05..41.43, which includes the cost of reading the tables and performing the join operation.

By analyzing the query plan, we can see that the query will perform a full table scan of both the orders and
customers tables, and that a hash join is used to combine the two tables. This query plan provides useful
information for optimizing the query and improving its performance if needed.

Deep Dive

In [ ]:

df_10 = _deepnote_execute_sql('SELECT * FROM orders WHERE customer_id = 733219;',
                              'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_10
Out[ ]:

order_id customer_id order_date total_amount

0 1000061 733219 2022-12-16 207.69

1 1021143 733219 2022-10-13 357.43

2 1316755 733219 2022-02-06 889.33

In [ ]:
df_8 = _deepnote_execute_sql('''
-- NORMAL QUERY (no index on customer_id)
EXPLAIN ANALYZE SELECT * FROM "orders_no_index" WHERE "customer_id" = 733219;
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_8

Out[ ]:

QUERY PLAN

0 Gather (cost=1000.00..14723.14 rows=2 width=1...

1 Workers Planned: 1

2 Workers Launched: 0

3 -> Parallel Seq Scan on orders_no_index (c...

4 Filter: (customer_id = 733219)

5 Rows Removed by Filter: 999997

6 Planning Time: 1.797 ms

7 Execution Time: 384.361 ms

In this example, we can see that the query performs a parallel sequential scan on the orders_no_index table,
filtering rows where the customer_id column equals 733219 and discarding 999,997 non-matching rows. The plan's
estimated total cost is 14723.14, and the actual execution time is 384.361 ms.

To identify which column should be indexed, we can look at the Filter line in the output. In this case, we can see
that the query is filtering rows based on the "customer_id" column. Since this is a simple equality check, we can
create an index on this column to speed up the query.

Now let's optimize with an index:

CREATE INDEX orders_customer_id_idx ON orders (customer_id);

In [ ]:
df_9 = _deepnote_execute_sql('''
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 733219;
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_9
Out[ ]:

QUERY PLAN

0 Index Scan using orders_customer_id_idx on ord...

1 Index Cond: (customer_id = 733219)

2 Planning Time: 0.065 ms

3 Execution Time: 0.049 ms

As you can see, the EXPLAIN ANALYZE output now shows an Index Scan on the orders_customer_id_idx index.
The planner still estimates 2 matching rows, slightly off from the 3 rows the query actually returns, but the
execution time is now much faster: 0.049 milliseconds, compared to 384.361 milliseconds without the index.

In summary, adding an index on the customer_id column of the orders table significantly improved the
performance of the query by allowing PostgreSQL to scan the index instead of the entire table. The same index
also speeds up JOIN operations that match on customer_id.
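The same before/after effect can be reproduced locally without PostgreSQL. This is a sketch using SQLite's EXPLAIN QUERY PLAN from Python (synthetic data; SQLite reports a table SCAN before the index exists and a SEARCH using the index afterwards):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
             "customer_id INTEGER)")
# Synthetic rows: 10,000 orders spread over 1,000 customers.
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 1000) for i in range(10000)])

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the detail text.
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM orders WHERE customer_id = 733"
before = plan(q)   # full table scan
conn.execute("CREATE INDEX orders_customer_id_idx ON orders (customer_id)")
after = plan(q)    # index search
print(before)
print(after)
```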

QUERY

In [ ]:
df_11 = _deepnote_execute_sql('''
SELECT
    customers.customer_name,
    orders.order_date,
    orders.total_amount
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
WHERE customers.customer_id = 733219;
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_11

Out[ ]:

customer_name order_date total_amount

0 Customer 733219 2022-12-16 207.69

1 Customer 733219 2022-10-13 357.43

2 Customer 733219 2022-02-06 889.33

WITHOUT INDEX

In [ ]:

df_12 = _deepnote_execute_sql('''
EXPLAIN ANALYZE
SELECT
    customers.customer_name,
    orders_no_index.order_date,
    orders_no_index.total_amount
FROM orders_no_index
JOIN customers ON orders_no_index.customer_id = customers.customer_id
WHERE customers.customer_id = 733219;
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_12

Out[ ]:

QUERY PLAN

0 Gather (cost=1000.42..14725.79 rows=2 width=2...

1 Workers Planned: 1

2 Workers Launched: 0

3 -> Nested Loop (cost=0.42..13725.59 rows=1...

4 -> Parallel Seq Scan on orders_no_ind...

5 Filter: (customer_id = 733219)

6 Rows Removed by Filter: 999997

7 -> Index Scan using customers_pkey on...

8 Index Cond: (customer_id = 733219)

9 Planning Time: 0.096 ms

10 Execution Time: 73.538 ms

WITH INDEX

In [ ]:

df_13 = _deepnote_execute_sql('''
EXPLAIN ANALYZE
SELECT
    customers.customer_name,
    orders.order_date,
    orders.total_amount
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
WHERE customers.customer_id = 733219;
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
df_13

Out[ ]:

QUERY PLAN

0 Nested Loop (cost=0.85..6.42 rows=2 width=25)...

1 -> Index Scan using customers_pkey on custo...

2 Index Cond: (customer_id = 733219)

3 -> Index Scan using orders_customer_id_idx ...

4 Index Cond: (customer_id = 733219)

5 Planning Time: 0.096 ms

6 Execution Time: 0.074 ms

Exploration

Install Postgres & load Chinook dataset, then try to answer these questions:

Which are the top 5 most popular genres in the Chinook database? How many tracks are there for each
genre?

Who are the top 10 best-selling artists in the Chinook store? How many tracks have they sold?

What is the average purchase price per invoice for each country in the Chinook database? Which countries
have the highest and lowest average purchase prices?

What is the correlation between the length of a track and its price in the Chinook store? Are longer tracks
generally more expensive or less expensive?

How has the Chinook store's sales performance changed over time? Is it trending upwards, downwards, or
staying the same? Can you identify any patterns in the data?
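As a hedged starting point for the first question, the join-then-aggregate pattern can be tried on a miniature mock-up before running it against the real database. The Chinook schema does have "Genre" and "Track" tables, but the rows below are invented for illustration (SQLite from Python):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Genre (GenreId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Track (TrackId INTEGER PRIMARY KEY, Name TEXT, GenreId INTEGER);
INSERT INTO Genre VALUES (1, 'Rock'), (2, 'Jazz'), (3, 'Metal');
INSERT INTO Track VALUES
    (1, 'a', 1), (2, 'b', 1), (3, 'c', 1),
    (4, 'd', 2), (5, 'e', 3), (6, 'f', 3);
""")
# Join tracks to their genre, count per genre, keep the top 5.
top = conn.execute("""
    SELECT g.Name, COUNT(t.TrackId) AS n_tracks
    FROM Genre g
    JOIN Track t ON t.GenreId = g.GenreId
    GROUP BY g.GenreId
    ORDER BY n_tracks DESC
    LIMIT 5
""").fetchall()
print(top)  # [('Rock', 3), ('Metal', 2), ('Jazz', 1)]
```

The same SELECT, run against the real Chinook database, answers the first exploration question.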

Supplementary material

Find the customers above the 90th percentile

In [ ]:

df_14 = _deepnote_execute_sql('''
SELECT c."CustomerId", c."FirstName", c."LastName",
       SUM(i."Total") AS "TotalSpent",
       COUNT(DISTINCT i."InvoiceId") AS "NumPurchases",
       COUNT(DISTINCT il."TrackId") AS "NumTracksPurchased"
FROM "Customer" c
JOIN "Invoice" i ON i."CustomerId" = c."CustomerId"
JOIN "InvoiceLine" il ON il."InvoiceId" = i."InvoiceId"
GROUP BY c."CustomerId"
ORDER BY "TotalSpent" DESC
''', 'SQL_5C2E9A9B_B591_4FC4_ADD5_829EB3188444')
# Caveat: joining "InvoiceLine" fans each invoice out into one row per line
# item, so SUM(i."Total") counts an invoice's total once per line item.
df_14
Out[ ]:

CustomerId FirstName LastName TotalSpent NumPurchases NumTracksPurchased

0 6 Helena Holý 502.62 7 38

1 26 Richard Cunningham 474.62 7 38

2 45 Ladislav Kovács 446.62 7 38

3 46 Hugh O'Reilly 446.62 7 38

4 57 Luis Rojas 415.62 7 38

5 25 Victor Stevens 404.62 7 38


6 CustomerId
7 Astrid
FirstName Gruber TotalSpent
LastName 404.62 NumPurchases
7 NumTracksPurchased
38

7 37 Fynn Zimmermann 388.62 7 38

8 24 Frank Ralston 378.62 7 38

9 5 František Wichterlová 376.62 7 38

10 43 Isabelle Mercier 376.62 7 38

11 28 Julia Barnett 370.62 7 38

12 4 Bjørn Hansen 362.62 7 38

13 17 Jack Smith 352.62 7 38

14 34 João Fernandes 352.62 7 38

15 48 Johannes Van der Berg 352.62 7 38

16 44 Terhi Hämäläinen 350.62 7 38

17 15 Jennifer Peterson 343.62 7 38

18 51 Joakim Johansson 340.62 7 38

19 42 Wyatt Girard 338.62 7 38

20 3 François Tremblay 338.62 7 38

21 1 Luís Gonçalves 338.62 7 38

22 22 Heather Leacock 338.62 7 38

23 20 Dan Miller 338.62 7 38

24 40 Dominique Lefebvre 336.62 7 38

25 58 Manoj Pareek 335.62 7 38

26 39 Camille Bernard 335.62 7 38

27 19 Tim Goyer 335.62 7 38

28 55 Mark Taylor 334.62 7 38

29 56 Diego Gutiérrez 334.62 7 38

30 52 Emma Jones 334.62 7 38

31 2 Leonie Köhler 334.62 7 38

32 8 Daan Peeters 334.62 7 38

33 9 Kara Nielsen 334.62 7 38

34 10 Eduardo Martins 334.62 7 38

35 11 Alexandre Rocha 334.62 7 38

36 12 Roberto Almeida 334.62 7 38

37 13 Fernanda Ramos 334.62 7 38

38 14 Mark Philips 334.62 7 38

39 16 Frank Harris 334.62 7 38

40 18 Michelle Brooks 334.62 7 38

41 21 Kathy Chase 334.62 7 38

42 23 John Gordon 334.62 7 38

43 27 Patrick Gray 334.62 7 38

44 29 Robert Brown 334.62 7 38

45 30 Edward Francis 334.62 7 38

46 31 Martha Silk 334.62 7 38

47 32 Aaron Mitchell 334.62 7 38

48 33 Ellie Sullivan 334.62 7 38

49 35 Madalena Sampaio 334.62 7 38

50 36 Hannah Schneider 334.62 7 38


51 CustomerId
38 Niklas
FirstName Schröder TotalSpent
LastName 334.62 NumPurchases
7 NumTracksPurchased
38

52 41 Marc Dubois 334.62 7 38

53 47 Lucas Mancini 334.62 7 38

54 49 Stanislaw Wójcik 334.62 7 38

55 50 Enrique Muñoz 334.62 7 38

56 53 Phil Hughes 334.62 7 38

57 54 Steve Murray 334.62 7 38

58 59 Puja Srivastava 331.66 6 36

In [ ]:

import pandas as pd
import plotly.express as px

# create a histogram of CLV (TotalSpent) values
fig = px.histogram(df_14, x='TotalSpent', nbins=20)
fig.show()

In [ ]:
# calculate the 90th percentile of CLV values
pct_90 = df_14['TotalSpent'].quantile(0.9)

# count the number of high-value customers
num_high_value_customers = (df_14['TotalSpent'] >= pct_90).sum()

print(f"There are {num_high_value_customers} high-value customers "
      f"in the Chinook store (i.e. the top 10%).")

There are 7 high-value customers in the Chinook store (i.e. the top 10%).

In [ ]:

df_14.loc[df_14['TotalSpent'] >= pct_90]


Out[ ]:

CustomerId FirstName LastName TotalSpent NumPurchases NumTracksPurchased

0 6 Helena Holý 502.62 7 38

1 26 Richard Cunningham 474.62 7 38

2 45 Ladislav Kovács 446.62 7 38

3 46 Hugh O'Reilly 446.62 7 38

4 57 Luis Rojas 415.62 7 38

5 25 Victor Stevens 404.62 7 38

6 7 Astrid Gruber 404.62 7 38
