Lecture 7 - code_notes_improved1
Introduction to JOINS
So far we have worked with one table at a time, but the real power of
SQL comes from working with data across multiple tables at once. The
term relational database refers to the fact that the tables within it
relate to one another: they contain common identifiers (e.g. primary
and foreign keys) that allow information from multiple tables to be
combined easily. In this lesson, we’ll see how to use SQL to link tables
together with what are called JOINS.
To understand what JOINS are and why they’re helpful, let’s think
about Parch & Posey’s orders table. Looking back at this table, we
notice that none of the orders say the name of the client, which is a
very useful piece of information. Instead, the table refers to customers
by numerical values in the account_id column. We’ll need to join
with another table, in order to connect the orders data to the
customer names.
But why isn’t the customer’s name in the orders table in the first
place? There are several reasons why relational databases are like this
and are split into multiple tables. Let’s focus on two of the most
important ones:
1. Objects of different nature, such as orders and accounts, are
easier to organize if kept separate.
Take a look at the Entity Relationship Diagram of Parch and Posey that
is illustrated next, to familiarize yourselves with how primary and
foreign keys help connect different tables.
SQL JOINS
A JOIN clause is used to combine rows from two or more tables, based
on a related column between them. The rows are combined
horizontally: two rows (one from each table), with possibly different
columns, are placed next to each other in one common row that
includes columns from both tables. This concatenation happens only if
the two rows have matching values in their corresponding columns, i.e.
columns that represent the same entity. For example, if the
account_id in one row of the orders table is equal to the id in a
row of the accounts table, the two rows get concatenated into
one in the final result. Examples will help us understand this better.
INNER JOINS
The INNER JOIN selects rows that have matching values in the
corresponding columns of both tables. A row from either table appears
in the output only if it has a match in the other table. If there
are rows in one table that do not have matches in the other table, then
these rows will not be shown in the output.
--The simplest possible join combines all columns from both tables
SELECT * --Selects all columns from both tables
FROM orders
INNER JOIN accounts --Defines which table to join orders with
ON accounts.id = orders.account_id;
--The ON clause defines which columns the data should be matched
--through; id and account_id both represent the id of the customer,
--so they are the corresponding columns between the two tables.
--We can select only a subset of the columns. The following query
--produces an output with all the orders, similar to the orders
--table, but with the addition of the client’s name for every order.
SELECT accounts.name, orders.*
FROM orders
INNER JOIN accounts
ON accounts.id = orders.account_id;
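To check the matching behaviour by hand, here is a minimal sketch on two tiny tables. The tables demo_accounts and demo_orders and all their rows are hypothetical, made up for illustration; they are not part of the Parch & Posey database.

```sql
-- Hypothetical toy tables (assumption: not part of the Parch & Posey schema).
CREATE TEMPORARY TABLE demo_accounts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TEMPORARY TABLE demo_orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);

INSERT INTO demo_accounts (id, name)
VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');  -- Initech has no orders

INSERT INTO demo_orders (id, account_id, total)
VALUES (10, 1, 100), (11, 1, 250), (12, 2, 75),
       (13, 9, 40);                                 -- account 9 does not exist

-- Each order row is placed next to the account row whose id matches.
SELECT demo_accounts.name, demo_orders.id AS order_id, demo_orders.total
FROM demo_orders
INNER JOIN demo_accounts
ON demo_accounts.id = demo_orders.account_id
ORDER BY demo_orders.id;
-- Returns 3 rows: (Acme, 10, 100), (Acme, 11, 250), (Globex, 12, 75).
-- Initech and order 13 are absent, because neither has a match.
```

Acme appears twice because it matches two orders: an inner join produces one output row per matching pair, not one per account.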
_________________________________________________________
NAMING AMBIGUITY ERRORS: It’s important to specify the table name in
front of each column, because we get an ambiguity error when both
tables have a column with the same name and we don’t say which one
we mean. Try the following query to see the ambiguity error:
SELECT id, * FROM ORDERS
INNER JOIN accounts
on accounts.id=orders.account_id
If we run a JOIN query with *, i.e. asking for all the columns, there
is no ambiguity error, and we may end up with two columns that have
the same name but completely different content. Try the following
query to see this.
select * from accounts a
join sales_reps r
on r.id=a.sales_rep_id;
--In this case, an error shows up only when we try to
--choose one of these two same-named columns specifically.
--The following query, for example, gives such an error.
select id from
(select * from accounts a
join sales_reps r
on r.id=a.sales_rep_id) sub
--The alias sub is there because many databases require
--a subquery in FROM to be given a name.
--Notice that here we also used our first
--SUBQUERY, i.e. a query inside another query!!
In such a case we may want to use aliases for the columns, or not select
both these columns, if they are not both needed, but there should be
no ambiguity in the end.
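One way to apply column aliases is sketched below. The tables demo_reps and demo_clients, the alias names, and the data are all hypothetical and only mirror the shape of sales_reps and accounts (both have an id and a name column).

```sql
-- Hypothetical toy tables (assumption: they only imitate sales_reps and accounts).
CREATE TEMPORARY TABLE demo_reps (id INTEGER PRIMARY KEY, name TEXT);
CREATE TEMPORARY TABLE demo_clients (id INTEGER PRIMARY KEY, name TEXT, sales_rep_id INTEGER);

INSERT INTO demo_reps (id, name) VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO demo_clients (id, name, sales_rep_id)
VALUES (100, 'Acme', 1), (101, 'Globex', 2);

-- Give each clashing column its own alias inside the subquery,
-- so the outer query can refer to every column unambiguously.
SELECT client_id, client_name, rep_name
FROM (
    SELECT c.id   AS client_id,
           c.name AS client_name,
           r.id   AS rep_id,
           r.name AS rep_name
    FROM demo_clients c
    JOIN demo_reps r
    ON r.id = c.sales_rep_id
) AS joined
ORDER BY client_id;
-- Returns (100, Acme, Ana) and (101, Globex, Ben).
```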
_________________________________________________________
OUTER JOINS
In the previous section, we talked about INNER JOINS, which link two or
more tables to find all the rows that match, and only those. The OUTER
JOIN returns not only the rows that match on the criteria we specify,
but also the unmatched rows from one or both of the tables that we want
to join.
There are three types of OUTER JOINS: LEFT JOIN, RIGHT JOIN, and FULL JOIN.
LEFT and RIGHT OUTER JOINS
The LEFT JOIN keyword returns all records from the left table,
and only the matched records from the right table. For left-table
records with no match, the columns of the right table are filled with
NULL (refer to figure below, 2nd Venn diagram).
What’s the difference between the LEFT and RIGHT JOIN?
When we begin building a query using LEFT JOIN, the first table we
name in the FROM clause is the table considered on the left, and the
second table, in the JOIN clause, is considered on the right. For
example, if we want all the rows from the first table and only the
matching rows from the second table, we will use a LEFT OUTER JOIN.
Conversely, if we want all the rows from the second table and any
matching rows from the first table, we will use a RIGHT OUTER JOIN.
The word OUTER is not necessary in the queries. LEFT JOIN and RIGHT
JOIN are equivalent to LEFT OUTER JOIN and RIGHT OUTER JOIN
respectively.
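The mirror relationship between the two can be seen in a small sketch. The tables demo_shops and demo_managers and their rows are hypothetical. RIGHT JOIN is assumed to be available, as it is in PostgreSQL (SQLite supports it only from version 3.39).

```sql
-- Hypothetical toy tables (assumption: made up for illustration).
CREATE TEMPORARY TABLE demo_shops (id INTEGER PRIMARY KEY, city TEXT);
CREATE TEMPORARY TABLE demo_managers (id INTEGER PRIMARY KEY, name TEXT, shop_id INTEGER);

INSERT INTO demo_shops (id, city)
VALUES (1, 'Athens'), (2, 'Boston'), (3, 'Cairo');  -- Cairo has no manager
INSERT INTO demo_managers (id, name, shop_id)
VALUES (10, 'Ana', 1), (11, 'Ben', 2);

-- demo_shops is named first (in FROM), so it is the left table:
-- every shop is kept, matched or not.
SELECT s.city, m.name
FROM demo_shops s
LEFT JOIN demo_managers m
ON m.shop_id = s.id
ORDER BY s.id;

-- Swapping the two tables and using RIGHT JOIN keeps the same rows.
SELECT s.city, m.name
FROM demo_managers m
RIGHT JOIN demo_shops s
ON m.shop_id = s.id
ORDER BY s.id;
-- Both queries return (Athens, Ana), (Boston, Ben), (Cairo, NULL).
```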
Let’s run some queries! The existing tables don’t have any data
mismatches, so they cannot showcase the value of OUTER JOINS. For the
next examples, we therefore add some more rows to two of the tables of
the Parch database. Let’s assume that the company is planning to
expand, so it hired some new sales reps and also opened some new sales
regions. Let’s add this new data to our tables:
INSERT INTO sales_reps(id, name, region_id)
values (321991, 'Alina Shein', 5),
       (321992, 'Alberto Quin', 8);
INSERT INTO region(id, name)
values (5, 'International'),
       (6, 'South'),
       (7, 'North');
NOTE 1:
____________________________________________________
Pay attention to how the output changes when we replace
count(o.id) with count(*)! count(*) gives num_orders=1
even for regions that have no sales reps assigned and hence no
orders whatsoever. This happens because count(*) counts every row
of the joined result, including the rows in which o.id (the id of an
order placed in that region) is NULL. A row with a NULL o.id carries
no order, so it should be left out of the count. count(o.id) does
exactly that, because it skips NULLs, which is why it is more
accurate here than count(*).
select r.name, count(distinct sr.id) sales_reps,
count(*) num_orders from orders o
join accounts a
on o.account_id = a.id
right join sales_reps sr
on sr.id = a.sales_rep_id
right join region r
on r.id =sr.region_id
group by r.name
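Versus the same query with count(o.id) in place of count(*):
select r.name, count(distinct sr.id) sales_reps,
count(o.id) num_orders from orders o
join accounts a
on o.account_id = a.id
right join sales_reps sr
on sr.id = a.sales_rep_id
right join region r
on r.id =sr.region_id
group by r.name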
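The difference between count(*) and a count over a column can also be checked on a tiny case. This is a sketch under assumptions: demo_zones and demo_sales are hypothetical tables, and a LEFT JOIN stands in for the chain of joins used in the lecture query.

```sql
-- Hypothetical toy tables (assumption: made up for illustration).
CREATE TEMPORARY TABLE demo_zones (id INTEGER PRIMARY KEY, name TEXT);
CREATE TEMPORARY TABLE demo_sales (id INTEGER PRIMARY KEY, zone_id INTEGER);

INSERT INTO demo_zones (id, name) VALUES (1, 'East'), (2, 'West');  -- West has no sales
INSERT INTO demo_sales (id, zone_id) VALUES (100, 1), (101, 1);

SELECT z.name,
       COUNT(*)    AS rows_counted,   -- counts every joined row, NULL-padded ones included
       COUNT(s.id) AS sales_counted   -- counts only rows where s.id is not NULL
FROM demo_zones z
LEFT JOIN demo_sales s
ON s.zone_id = z.id
GROUP BY z.name
ORDER BY z.name;
-- Returns (East, 2, 2) and (West, 1, 0).
```

West still produces one joined row (padded with NULLs), which is why COUNT(*) reports 1 for a zone that has no sales at all.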
NOTE 2:
____________________________________________________
NOTE 3:
____________________________________________________
In order to build successful queries with multiple joins, run the
query after every join you add and check that you are getting the
output you expected. This matters most when the query needs outer
joins: print the intermediate result after every join to make sure
that you chose the correct outer join and that all the data you
expected is still there. In the query above, for example, only a
RIGHT JOIN between sales_reps and the accounts/orders result of the
previous join keeps all the sales representatives in the final
output. With an INNER or LEFT JOIN instead, only the reps with
accounts assigned to them show up, and the reps with no accounts are
eliminated during the join!
Should we use filtering conditions in the ON clause instead of the
WHERE clause? When and why?
Let’s use some new data for this example. In your Parch database build
the following table, a shorter version of the familiar sales_reps table
Takeaways:
It is better to avoid filtering conditions in the
ON clause when we have outer joins, because the
result is most likely not the intended one.
With inner joins, it is fine to move filtering conditions
into the ON clause: it gives us the output that we
want, and it may save computational time, since the
rows that we want to filter out can be excluded before
the join even happens.
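How the placement of a filter changes the result of an outer join can be seen in the sketch below. The tables demo_authors and demo_books, their columns, and their rows are hypothetical and chosen only to keep the output small.

```sql
-- Hypothetical toy tables (assumption: made up for illustration).
CREATE TEMPORARY TABLE demo_authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TEMPORARY TABLE demo_books (id INTEGER PRIMARY KEY, author_id INTEGER, pub_year INTEGER);

INSERT INTO demo_authors (id, name)
VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cleo');         -- Cleo has no books
INSERT INTO demo_books (id, author_id, pub_year)
VALUES (10, 1, 2020), (11, 2, 2010);

-- Filter in WHERE: applied after the join, so it also removes
-- the NULL-padded rows that the LEFT JOIN had preserved.
SELECT a.name, b.pub_year
FROM demo_authors a
LEFT JOIN demo_books b
ON b.author_id = a.id
WHERE b.pub_year >= 2015
ORDER BY a.id;
-- Returns only (Ana, 2020).

-- Filter in ON: it only restricts which books may match;
-- every author is still kept by the LEFT JOIN.
SELECT a.name, b.pub_year
FROM demo_authors a
LEFT JOIN demo_books b
ON b.author_id = a.id AND b.pub_year >= 2015
ORDER BY a.id;
-- Returns (Ana, 2020), (Ben, NULL), (Cleo, NULL).
```

Neither result is wrong in itself: the two queries answer different questions, so the filter must go where the question at hand needs it.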
To summarize:
Here are the different types of JOINs in SQL:
(INNER) JOIN: Returns records that have matching values in both tables.
LEFT (OUTER) JOIN: Returns ALL records from the left table, and only
the matched records from the right table, IF ANY. For the records of the
left table that have no match in the right table, the columns of the
right table are filled with NULL values.
RIGHT (OUTER) JOIN: Returns ALL records from the right table, and
only the matched records from the left table, IF ANY. For the records of
the right table that have no match in the left table, the columns of the
left table are filled with NULL values.
FULL (OUTER) JOIN: Returns all records from both tables, whether or not
they have a match. For any record that has no match, the columns of the
other table are filled with NULL values.
Please, check out JOIN_concept.csv for more explanation with some toy
examples.
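Since FULL JOIN has no query of its own in these notes, here is a minimal sketch of it. The tables demo_clubs and demo_students and their rows are hypothetical. FULL JOIN is assumed to be supported, as it is in PostgreSQL; not every database has it (MySQL, for instance, does not).

```sql
-- Hypothetical toy tables (assumption: made up for illustration).
CREATE TEMPORARY TABLE demo_clubs (id INTEGER PRIMARY KEY, title TEXT);
CREATE TEMPORARY TABLE demo_students (id INTEGER PRIMARY KEY, name TEXT, club_id INTEGER);

INSERT INTO demo_clubs (id, title)
VALUES (1, 'Chess'), (2, 'Drama');                  -- Drama has no members
INSERT INTO demo_students (id, name, club_id)
VALUES (10, 'Ana', 1), (11, 'Ben', NULL);           -- Ben is in no club

-- Unmatched rows from BOTH sides are kept and padded with NULLs.
SELECT s.name, c.title
FROM demo_students s
FULL JOIN demo_clubs c
ON c.id = s.club_id;
-- Returns three rows, in no guaranteed order:
-- (Ana, Chess), (Ben, NULL), (NULL, Drama).
```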
A last note…
It is important to be intentional about the type of JOIN that you
use. Instead of, for example, always using a FULL JOIN and removing the
NULLs afterwards, choose deliberately which join is appropriate. If
your data is big, you don’t want to join all the rows (i.e. FULL
JOIN), needed or not, and then decide by eye which join to use or
which rows to remove. For very voluminous data, joining everything
may be so costly that the query never finishes! So you have to learn
to pick the correct join up front, rather than joining everything
just because you can and deciding what to do next.
You also need to keep in mind that some data may give the correct
output for the wrong join, just by coincidence, while the same query
would be incorrect on different data. For example, when run on the
old vs new Parch database (i.e. before and after we added the new
data), a query can give the correct answer on the old data, but a
wrong answer on the new data.
Run these queries on both old and new data:
In the new data, the second query will include all the sales reps,
whether they are assigned to a region or not, but the first query will
only include the assigned sales reps.
If the question we are asking is: find all sales reps and their regions
then the first query is wrong, even though on the old data both
queries give the same, correct output: if the data were different,
the first query would omit the sales reps that have no assigned
region!
So for a query with JOINS to be considered correct, it is not enough
for the output to look correct. The query should give the correct
answer even if the data were a bit different!
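A pair of queries of the kind described here can be sketched on toy data. Everything below is an assumption made for illustration: demo_areas and demo_agents are hypothetical stand-ins for region and sales_reps, and the "old" and "new" data are simulated by inserting one extra row.

```sql
-- Hypothetical toy tables (assumption: stand-ins for region and sales_reps).
CREATE TEMPORARY TABLE demo_areas (id INTEGER PRIMARY KEY, name TEXT);
CREATE TEMPORARY TABLE demo_agents (id INTEGER PRIMARY KEY, name TEXT, area_id INTEGER);

INSERT INTO demo_areas (id, name) VALUES (1, 'Northeast'), (2, 'Midwest');
-- "Old" data: every agent points at an existing area.
INSERT INTO demo_agents (id, name, area_id) VALUES (10, 'Ana', 1), (11, 'Ben', 2);

-- On the old data the INNER JOIN and the LEFT JOIN agree: 2 rows each.
SELECT COUNT(*) AS inner_rows
FROM demo_agents ag JOIN demo_areas ar ON ar.id = ag.area_id;
SELECT COUNT(*) AS left_rows
FROM demo_agents ag LEFT JOIN demo_areas ar ON ar.id = ag.area_id;

-- "New" data: an agent whose area_id (8) matches no area.
INSERT INTO demo_agents (id, name, area_id) VALUES (12, 'Cleo', 8);

-- Now the two joins disagree: the INNER JOIN silently drops Cleo.
SELECT COUNT(*) AS inner_rows
FROM demo_agents ag JOIN demo_areas ar ON ar.id = ag.area_id;       -- 2
SELECT COUNT(*) AS left_rows
FROM demo_agents ag LEFT JOIN demo_areas ar ON ar.id = ag.area_id;  -- 3
```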
UNION
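Union combines rows vertically. So, in a sense, it takes two tables (or
sets of rows) and places one underneath the other. The two tables or
sets of rows should have the same number of columns, with compatible
data types in the same order, otherwise the union will fail! The column
names do not have to match; the output takes its column names from the
first query.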
-- UNION
-- all distinct rows selected by either query
select name, region_id from sales_reps
union
select name, region_id from sales_reps_new
order by name;
-- UNION ALL
-- all rows selected by either query, duplicates included
select name, region_id from sales_reps
union all
select name, region_id from sales_reps_new
order by name;
-- INTERSECT
-- returns rows that are common to both queries
select name, region_id from sales_reps
intersect
select name, region_id from sales_reps_new
order by name;
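The three operators can be compared side by side on two tiny tables. The tables demo_team_a and demo_team_b and their rows are hypothetical; they only imitate the shape of sales_reps and sales_reps_new.

```sql
-- Hypothetical toy tables (assumption: made up for illustration).
CREATE TEMPORARY TABLE demo_team_a (name TEXT, region_id INTEGER);
CREATE TEMPORARY TABLE demo_team_b (name TEXT, region_id INTEGER);

INSERT INTO demo_team_a (name, region_id) VALUES ('Ana', 1), ('Ben', 2);
INSERT INTO demo_team_b (name, region_id) VALUES ('Ben', 2), ('Cleo', 3);  -- Ben is in both

-- UNION removes the duplicate (Ben, 2).
SELECT name, region_id FROM demo_team_a
UNION
SELECT name, region_id FROM demo_team_b
ORDER BY name;
-- 3 rows: (Ana, 1), (Ben, 2), (Cleo, 3)

-- UNION ALL keeps it.
SELECT name, region_id FROM demo_team_a
UNION ALL
SELECT name, region_id FROM demo_team_b
ORDER BY name;
-- 4 rows: (Ana, 1), (Ben, 2), (Ben, 2), (Cleo, 3)

-- INTERSECT keeps only the rows present in both.
SELECT name, region_id FROM demo_team_a
INTERSECT
SELECT name, region_id FROM demo_team_b
ORDER BY name;
-- 1 row: (Ben, 2)
```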
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT