Lab: INNER JOIN, GROUP BY, and HAVING Clauses
Contents
Chapter 3. 3rd Lab: INNER JOIN, GROUP BY, and HAVING clauses
3.1 Introduction
3.2 Exercises in Access
3.3 Exercises in Oracle
3.4 Best practice rules
3.5 Homework
3.1 Introduction
Please recall that:
Whenever ambiguities are possible in SQL statements, you should prefix column
names with the corresponding table instance names.
Even when no ambiguities are possible, prefixing all column names in ON sub-clauses
with their corresponding table instance names is compulsory.
In ORDER BY, ASC is implicit.
Constants should be hard-coded in SQL only exceptionally; generally, use parameters
instead (e.g. P3.1b below generalizes billions of queries of the type P3.1a below).
The SQL GROUP BY c1, …, cn clause partitions the set of records computed by the
corresponding FROM clause and filtered by the corresponding WHERE clause (if any),
according to the equivalence relation $\ker(c_1 \bullet \dots \bullet c_n)$, where
c1, …, cn are columns of the table instances of that FROM clause.
Recall that, for any function product $f \bullet g$, $\ker(f \bullet g) = \ker(f) \cap \ker(g)$, and that, for any
function $f : A \to B$, $\ker(f) = \{(x, y) \in A^2 \mid f(x) = f(y)\} \subseteq A^2$ (called the kernel or nucleus of
f).
If you do not name your SQL SELECT clause expressions, RDBMSs automatically assign
generated names to them, generally of the type Expr1, Expr2, …
Although the set of available aggregate functions varies between RDBMSs, most of them
offer at least the following most frequently used ones: COUNT (for computing set
cardinals), SUM, AVG (for computing arithmetic means), MIN(imum), and MAX(imum).
You cannot compose two SQL aggregate functions (although you can compose an
aggregate function with other library functions).
The so-called GROUP BY golden rule states that, in the presence of a GROUP BY clause,
the corresponding SELECT clause can only contain columns/expressions listed in the GROUP
BY clause and/or expressions based on other columns of the corresponding FROM
clause, provided that these are arguments of aggregate functions.
The order of the clauses of SQL SELECT statements is immutable not only for syntactic reasons, but
for conceptual ones too: from FROM onwards (FROM, WHERE, GROUP BY, HAVING, ORDER BY), it is
exactly the order in which RDBMSs conceptually evaluate these queries, with the SELECT list
itself being computed after HAVING.
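The partitioning performed by GROUP BY can be illustrated outside SQL. The following is a minimal Python sketch, not part of the lab: the rows and the helper names partition_by and kernel are hypothetical. It groups rows into the equivalence classes of the kernel of a function product and checks the identity ker(f • g) = ker(f) ∩ ker(g) on that sample.

```python
from collections import defaultdict

# Hypothetical rows (city, state, country); illustrative only, not the lab's data.
rows = [
    ("A1", "S1", "X"),
    ("A2", "S1", "X"),
    ("B1", "S2", "X"),
    ("C1", "S3", "Y"),
]

def state(row):
    return row[1]

def country(row):
    return row[2]

def product(row):
    # The function product country • state maps a row to a pair of values.
    return (country(row), state(row))

def partition_by(rows, *key_funcs):
    """Split rows into the equivalence classes of ker(f1 • ... • fn),
    which is what GROUP BY f1, ..., fn does conceptually."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(f(row) for f in key_funcs)
        classes[key].append(row)
    return dict(classes)

def kernel(rows, f):
    """ker(f) = set of pairs (x, y) with f(x) == f(y)."""
    return {(x, y) for x in rows for y in rows if f(x) == f(y)}

by_country = partition_by(rows, country)       # like GROUP BY Country
by_both = partition_by(rows, country, state)   # like GROUP BY Country, State

# The kernel of the product is the intersection of the kernels.
assert kernel(rows, product) == kernel(rows, country) & kernel(rows, state)
```

Grouping on more columns can only refine the partition: here two country classes become three (country, state) classes.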
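3.2 Exercises in Access
P3.1 a. Compute the set of cities (name, state, country, population) having at least 1,000,000
inhabitants, in descending order of population and then ascending on country, state, and city names.
b. Parameterize a. above and compute results for both 1,000,000 and 500,000.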
Solution:
a.
Inspecting corresponding data instances, obviously, only three cities qualify for the result (in this
order): New York, London, and Bucharest.
Data needed to link these three tables’ instances: CITIES.State = STATES.x and
STATES.Country = COUNTRIES.x
SQL solution:
SELECT City, STATES.State, COUNTRIES.Country,
CITIES.Population
FROM (CITIES INNER JOIN STATES ON CITIES.State = STATES.x)
INNER JOIN COUNTRIES ON STATES.Country = COUNTRIES.x
WHERE CITIES.Population >= 1000000
ORDER BY CITIES.Population DESC, COUNTRIES.Country,
STATES.State, City;
Figure 3.2 Entering actual parameter value for P3.1b
P3.2 a. Compute the set of countries (name, population, sum of corresponding cities population,
unaccounted cities population), in the descending order of unaccounted cities population, sum of
corresponding cities population, stored countries population, and then ascending on country
names.
b. Same as a. above, but only for countries for which the sum of cities population is at least
equal to a parameter value; run it for 7,000,000 people.
c. Same as b. above, but only for countries whose names start with ‘R’; run it for 2,500,000
people.
Solution:
Inspecting corresponding data instances, obviously, all four countries qualify for the result
(in this order): U.S.A., U.K., Romania, and Moldavia.
Result columns: Country, Population, SumCityPop, UnaccCityPop
Data needed to link these three tables’ instances: CITIES.State = STATES.x and
STATES.Country = COUNTRIES.x
SQL solution:
Both conceptually and from the RDBMS performance point of view, it is preferable to split
complex problems into smaller and simpler sub-problems and then combine their
solutions.
Consequently, let us first solve the sub-problem of computing the sum of city populations per
country.
Using the SQL aggregate function SUM without any grouping, the following query computes the sum
of all city populations in the world (see figure 3.4 for its result):
For computing total city populations per country, we need to partition the cities into groups
per country, so that SUM computes totals per country instead of the worldwide one:
SELECT Sum(CITIES.Population) AS CityPopSum, Country
FROM STATES INNER JOIN CITIES
ON STATES.x = CITIES.State
GROUP BY Country;
Running this query, saved as P3-2-0, against the current lab’s db instance computes the
following result:
The second sub-problem is to use the results of the previous one for computing final results;
obviously, a join of query P3-2-0 with the COUNTRIES table is needed in order to get both country
names and populations:
SELECT COUNTRIES.Country, COUNTRIES.Population, CityPopSum,
[Population]-[CityPopSum] AS UnaccPop
FROM [P3-2-0] INNER JOIN COUNTRIES
ON [P3-2-0].Country = COUNTRIES.x
ORDER BY [Population]-[CityPopSum] DESC, CityPopSum DESC,
Population DESC, COUNTRIES.Country;
Figure 3.6 Result of P3.2a
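The two-step approach above (a named grouping query, then a join with COUNTRIES on the surrogate key) can be tried end to end with the following minimal sketch. It uses Python's sqlite3 module as a stand-in for Access or Oracle. The table and column names follow the lab's queries, but the rows ('Xland', 'Yland', and so on) are hypothetical, and the grouping query is stored as an SQLite view named P3_2_0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical miniature of the lab schema; x is the surrogate key of each table.
CREATE TABLE COUNTRIES (x INTEGER PRIMARY KEY, Country TEXT, Population INTEGER);
CREATE TABLE STATES (x INTEGER PRIMARY KEY, State TEXT,
                     Country INTEGER REFERENCES COUNTRIES(x));
CREATE TABLE CITIES (x INTEGER PRIMARY KEY, City TEXT,
                     State INTEGER REFERENCES STATES(x), Population INTEGER);

INSERT INTO COUNTRIES VALUES (1, 'Xland', 1000), (2, 'Yland', 500);
INSERT INTO STATES VALUES (1, 'X-North', 1), (2, 'X-South', 1), (3, 'Y-Only', 2);
INSERT INTO CITIES VALUES (1, 'Xa', 1, 300), (2, 'Xb', 2, 200), (3, 'Ya', 3, 100);

-- Step 1: sum of city populations per country (grouped on the surrogate key).
CREATE VIEW P3_2_0 AS
SELECT SUM(CITIES.Population) AS CityPopSum, STATES.Country AS Country
FROM STATES INNER JOIN CITIES ON STATES.x = CITIES.State
GROUP BY STATES.Country;
""")

# Step 2: join the view with COUNTRIES to get names and stored populations.
result = conn.execute("""
SELECT COUNTRIES.Country, COUNTRIES.Population, CityPopSum,
       COUNTRIES.Population - CityPopSum AS UnaccPop
FROM P3_2_0 INNER JOIN COUNTRIES ON P3_2_0.Country = COUNTRIES.x
ORDER BY UnaccPop DESC, CityPopSum DESC,
         COUNTRIES.Population DESC, COUNTRIES.Country
""").fetchall()

for row in result:
    print(row)
```

With this sample data, the view yields one row per country key, and the final join only adds the country name and stored population to each of those rows.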
Note that, unfortunately, many programmers would actually come up with the following
equivalent, but suboptimal, solution:
P3-2-0Bis:
P3-2aBis:
SELECT COUNTRIES.Country, COUNTRIES.Population, CityPopSum,
[Population]-[CityPopSum] AS UnaccPop
FROM [P3-2-0Bis] INNER JOIN COUNTRIES
ON [P3-2-0Bis].Country = COUNTRIES.Country
ORDER BY [Population]-[CityPopSum] DESC, CityPopSum DESC,
Population DESC, COUNTRIES.Country;
Note that P3-2-0Bis already takes more time and more memory and disk space, as it makes an
additional join and computes country names (which, on average, have some 32 ASCII chars) instead
of surrogate key values (which need 4 bytes).
Much worse is P3-2aBis, which joins not on surrogate key values (each comparison being a single
integer comparison), like P3-2a, but on ASCII strings (each comparison possibly having to
examine, character by character, up to some 32 chars).
b.
Obviously, the only thing that has to be done is to add a HAVING clause to P3-2-0:
P3-2-0b:
SELECT Sum(CITIES.Population) AS CityPopSum, STATES.Country
FROM STATES INNER JOIN CITIES ON STATES.x = CITIES.State
GROUP BY STATES.Country
HAVING Sum(CITIES.Population) >=
[Please enter desired minimum cities total population per country:];
Figure 3.7 shows Access’ actual parameter value input window, figure 3.8 the corresponding result
of P3-2-0b for 7,000,000, and figure 3.9 the one for the corresponding P3-2b:
Figure 3.9 Result of P3.2b for 7,000,000 people
Note that the same result may be obtained with a single statement, by using a subquery (but,
depending on the RDBMS, subqueries may be evaluated more slowly than query hierarchies):
SELECT COUNTRIES.Country, COUNTRIES.Population, CityPopSum,
       [Population]-[CityPopSum] AS UnaccPop
FROM (SELECT Sum(CITIES.Population) AS CityPopSum, STATES.Country
      FROM STATES INNER JOIN CITIES ON STATES.x = CITIES.State
      GROUP BY STATES.Country
      HAVING Sum(CITIES.Population) >=
        [Please enter desired minimum cities total population per country:])
     AS [P3-2-0b]
     INNER JOIN COUNTRIES ON [P3-2-0b].Country = COUNTRIES.x
ORDER BY [Population]-[CityPopSum] DESC, CityPopSum DESC,
         COUNTRIES.Population DESC, COUNTRIES.Country;
c.
Even if not that obvious, the best thing to do is to add a corresponding WHERE filter to P3-2-0b:
P3-2-0c:
SELECT Sum(CITIES.Population) AS CityPopSum, STATES.Country
FROM COUNTRIES INNER JOIN (STATES INNER JOIN CITIES
ON STATES.x = CITIES.State)
ON STATES.Country = COUNTRIES.x
WHERE COUNTRIES.Country Like "R*"
GROUP BY STATES.Country
HAVING Sum(CITIES.Population) >=
[Please enter desired minimum cities total population per country:];
Figure 3.10 shows Access’ actual parameter value input window, figure 3.11 the corresponding
result of P3-2-0c for 2,500,000, and figure 3.12 the one for the corresponding P3-2c:
Figure 3.11 Result of P3.2-0c for 2,500,000 people
Please note again that, unfortunately, some programmers would rather come up with one of the
following equivalent, but far from optimal, solutions:
P3-2-0cBis:
SELECT Sum(CITIES.Population) AS CityPopSum, STATES.Country
FROM COUNTRIES INNER JOIN (STATES INNER JOIN CITIES
ON STATES.x = CITIES.State)
ON STATES.Country = COUNTRIES.x
GROUP BY STATES.Country, COUNTRIES.Country
HAVING COUNTRIES.Country Like "R*" AND
Sum(CITIES.Population) >=
[Please enter desired minimum cities total population per country:];
P3-2cBis:
SELECT COUNTRIES.Country, COUNTRIES.Population, CityPopSum,
[Population]-[CityPopSum] AS UnaccPop
FROM [P3-2-0cBis] INNER JOIN COUNTRIES
ON [P3-2-0cBis].Country = COUNTRIES.x
WHERE COUNTRIES.Country Like "R*"
ORDER BY [Population]-[CityPopSum] DESC, CityPopSum DESC,
COUNTRIES.Population DESC, COUNTRIES.Country;
P3-2-0c only computes, in this particular case, one group (for Romania) and, generally,
about a dozen groups (for Romania, Russia, Rwanda, etc.), whereas
both P3-2-0cBis and P3-2cBis still compute all groups (four in this particular case,
but some 250 for full countries’ data) and then throw away the vast majority of their
computation results (three groups in this particular case, but some 238 for full countries’
data).
Generally, note that we cannot get rid of HAVING clauses, as this is the only place where we can
add filters on data computed (generally through aggregation) after grouping; dually, we could
sometimes get rid of WHERE clauses (not always, as –see figure 3.4 above– sometimes we might
need filters on global applications of aggregate functions) and only use HAVING ones, but this
would be a stupid thing to do, both conceptually and, especially, performance-wise.
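The difference between filtering in WHERE and in HAVING can be observed on a toy instance. The following is a minimal sketch using Python's sqlite3 module as a stand-in RDBMS. The rows ('Rland', 'Xland', 'Yland') and the parameter name min_pop are hypothetical, and the joins are written without nesting for simplicity. Both placements return the same rows, but the WHERE version forms fewer groups before HAVING is applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical miniature of the lab schema; x is the surrogate key of each table.
CREATE TABLE COUNTRIES (x INTEGER PRIMARY KEY, Country TEXT, Population INTEGER);
CREATE TABLE STATES (x INTEGER PRIMARY KEY, State TEXT, Country INTEGER);
CREATE TABLE CITIES (x INTEGER PRIMARY KEY, City TEXT, State INTEGER,
                     Population INTEGER);
INSERT INTO COUNTRIES VALUES (1, 'Rland', 1000), (2, 'Xland', 900), (3, 'Yland', 800);
INSERT INTO STATES VALUES (1, 'R1', 1), (2, 'X1', 2), (3, 'Y1', 3);
INSERT INTO CITIES VALUES (1, 'Ra', 1, 300), (2, 'Rb', 1, 200),
                          (3, 'Xa', 2, 700), (4, 'Ya', 3, 50);
""")

JOINS = """
FROM COUNTRIES
     INNER JOIN STATES ON STATES.Country = COUNTRIES.x
     INNER JOIN CITIES ON CITIES.State = STATES.x
"""

# Name filter in WHERE: rows are discarded before any group is built.
filter_in_where = f"""
SELECT SUM(CITIES.Population) AS CityPopSum, STATES.Country
{JOINS}
WHERE COUNTRIES.Country LIKE 'R%'
GROUP BY STATES.Country
HAVING SUM(CITIES.Population) >= :min_pop
"""

# Name filter in HAVING: every group is built first, then most are discarded.
filter_in_having = f"""
SELECT SUM(CITIES.Population) AS CityPopSum, STATES.Country
{JOINS}
GROUP BY STATES.Country, COUNTRIES.Country
HAVING COUNTRIES.Country LIKE 'R%'
   AND SUM(CITIES.Population) >= :min_pop
"""

params = {"min_pop": 400}
rows_where = conn.execute(filter_in_where, params).fetchall()
rows_having = conn.execute(filter_in_having, params).fetchall()

# Number of groups each variant has to build before HAVING is applied.
groups_where = conn.execute(
    f"SELECT COUNT(*) FROM (SELECT STATES.Country {JOINS} "
    "WHERE COUNTRIES.Country LIKE 'R%' GROUP BY STATES.Country)"
).fetchone()[0]
groups_all = conn.execute(
    f"SELECT COUNT(*) FROM (SELECT STATES.Country {JOINS} GROUP BY STATES.Country)"
).fetchone()[0]

print(rows_where, rows_having, groups_where, groups_all)
```

On this sample, one group is built when the name filter sits in WHERE and three when it sits in HAVING, while the final result is identical.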
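3.3 Exercises in Oracle
P3.1 a. Compute the set of cities (name, state, country, population) having at least 1,000,000
inhabitants, in descending order of population and then ascending on country, state, and city names.
b. Parameterize a. above and compute results for both 1,000,000 and 500,000.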
Solution:
a.
Inspecting corresponding data instances, obviously, only three cities qualify for the result (in this
order): New York, London, and Bucharest.
Data needed to link these three tables’ instances: CITIES.State = STATES.x and
STATES.Country = COUNTRIES.x
SQL solution:
SELECT City, STATES.State, COUNTRIES.Country,
CITIES.Population
FROM (CITIES INNER JOIN STATES ON CITIES.State = STATES.x)
INNER JOIN COUNTRIES ON STATES.Country = COUNTRIES.x
WHERE CITIES.Population >= 1000000
ORDER BY CITIES.Population DESC, COUNTRIES.Country,
STATES.State, City;
Figure 3.13 Result of P3.1a
b.
The only difference with respect to the above query is replacing the hard-coded constant
1000000 with a parameter:
SELECT City, STATES.State, COUNTRIES.Country,
CITIES.Population
FROM (CITIES INNER JOIN STATES ON CITIES.State = STATES.x)
INNER JOIN COUNTRIES ON STATES.Country = COUNTRIES.x
WHERE CITIES.Population >=
:Minimum_city_population
ORDER BY CITIES.Population DESC, COUNTRIES.Country,
STATES.State, City;
Obviously, the result of running it against the lab’s db instance with the actual parameter value
1000000 is the same as the one in figure 3.13 above.
Running it against the lab’s db instance with the actual parameter value 500000 (figure
3.14) additionally selects Chișinău, Memphis, and Washington (figure 3.15). Note that Oracle variable
names can have at most 30 chars and cannot contain spaces.
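The same idea of replacing a constant with a named parameter can be exercised programmatically. The following is a minimal sketch using Python's sqlite3 module, which also accepts the :name placeholder syntax. The reduced CITIES table, its rows, and the helper cities_at_least are hypothetical and only illustrate how one parameterized statement replaces many hard-coded ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, reduced CITIES table (no STATES/COUNTRIES links).
CREATE TABLE CITIES (x INTEGER PRIMARY KEY, City TEXT, Population INTEGER);
INSERT INTO CITIES VALUES (1, 'Alpha', 2000000), (2, 'Beta', 800000),
                          (3, 'Gamma', 300000);
""")

# One statement; the threshold is supplied at run time, never concatenated in.
QUERY = """
SELECT City, Population
FROM CITIES
WHERE Population >= :Minimum_city_population
ORDER BY Population DESC, City
"""

def cities_at_least(minimum):
    """Run the parameterized query for a given minimum population."""
    return conn.execute(QUERY, {"Minimum_city_population": minimum}).fetchall()

big = cities_at_least(1000000)
medium = cities_at_least(500000)
print(big)
print(medium)
```

Lowering the actual parameter value can only add rows to the result, mirroring the 1,000,000 versus 500,000 runs above.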
Figure 3.15 Result of P3.1b for 500,000
P3.2 a. Compute the set of countries (name, population, sum of corresponding cities population,
unaccounted cities population), in the descending order of unaccounted cities population, sum of
corresponding cities population, stored countries population, and then ascending on country
names.
b. Same as a. above, but only for countries for which the sum of cities population is at least
equal to a parameter value; run it for 7,000,000 people.
c. Same as b. above, but only for countries whose names start with ‘R’; run it for 2,500,000
people.
Solution:
Inspecting corresponding data instances, obviously, all four countries qualify for the result
(in this order): U.S.A., U.K., Romania, and Moldavia.
Result columns: Country, Population, SumCityPop, UnaccCityPop
Data needed to link these three tables’ instances: CITIES.State = STATES.x and
STATES.Country = COUNTRIES.x
SQL solution:
Both conceptually and from the RDBMS performance point of view, it is preferable to split
complex problems into smaller and simpler sub-problems and then combine their
solutions.
Consequently, let us first solve the sub-problem of computing the sum of city populations per
country.
Using the SQL aggregate function SUM without any grouping, the following query computes the sum
of all city populations in the world (see figure 3.16 for its result):
Figure 3.16 The sum of all cities’ populations
For computing total city populations per country, we need to partition the cities into groups
per country, so that SUM computes totals per country instead of the worldwide one:
Running this query against the current lab’s db instance computes the following result:
Figure 3.17 The sum of all cities’ populations per country
In order to make use of it in the final step, you should save this query as view P3_2_0. Right-click
the Views node of LAB_DB, then click on New View (figure 3.18). In the Create View window that
pops up (figure 3.19), enter the Name of the view (which should be distinct from the names of all other
tables and views of LAB_DB) and copy the statement into the SQL Query text box. Click on the
Check Syntax button: the message “SQL Parse Results: No errors found in SQL” should be
displayed in the bottom-left corner of the window. Click on the Test Query button: the Test Query
window that pops up (figure 3.20) should display the “Query executed successfully” Result. Click
on Close and then on the OK button of the Create View window (figure 3.19): your view is saved and ready
to be used.
Figure 3.18 Creating a new view in Oracle SQL Developer
Figure 3.20 Testing a view in Oracle SQL Developer
Note that the same result could have been obtained by running the following DDL statement:
--------------------------------------------------------
-- DDL for View P3_2_0
--------------------------------------------------------
CREATE OR REPLACE FORCE VIEW "LAB_DB"."P3_2_0"
("CITYPOPSUM", "COUNTRY") AS
SELECT SUM(CITIES.POPULATION) AS CityPopSum, COUNTRY
FROM STATES INNER JOIN CITIES ON STATES.X = CITIES.STATE
GROUP BY COUNTRY;
The second sub-problem is to use the results of the previous one for computing final results;
obviously, a join of view P3_2_0 with the COUNTRIES table is needed in order to get both country
names and populations:
SELECT COUNTRIES.Country, COUNTRIES.Population, CityPopSum,
Population - CityPopSum AS UnaccPop
FROM P3_2_0 INNER JOIN COUNTRIES
ON P3_2_0.Country = COUNTRIES.x
ORDER BY Population - CityPopSum DESC, CityPopSum DESC,
Population DESC, COUNTRIES.Country;
The result of running it against the lab’s db instance is the following:
Figure 3.21 Result of P3.2a
Note that, unfortunately, many programmers would actually come up with the following
equivalent, but suboptimal, solution:
P3_2_0Bis:
P3_2aBis:
SELECT COUNTRIES.Country, COUNTRIES.Population, CityPopSum,
Population - CityPopSum AS UnaccPop
FROM P3_2_0Bis INNER JOIN COUNTRIES
ON P3_2_0Bis.Country = COUNTRIES.Country
ORDER BY Population - CityPopSum DESC, CityPopSum DESC,
Population DESC, COUNTRIES.Country;
Note that P3_2_0Bis already takes more time and more memory and disk space, as it makes an
additional join and computes country names (which, on average, have some 32 ASCII chars) instead
of surrogate key values (which need 4 bytes).
Much worse is P3_2aBis, which joins not on surrogate key values (each comparison being a single
integer comparison), like P3_2a, but on ASCII strings (each comparison possibly having to
examine, character by character, up to some 32 chars).
b.
Obviously, the only thing that has to be done is to add a HAVING clause to P3_2_0; here is the
corresponding P3_2_0b:
SELECT Sum(CITIES.Population) AS CityPopSum, STATES.Country
FROM STATES INNER JOIN CITIES ON STATES.x = CITIES.State
GROUP BY STATES.Country
HAVING Sum(CITIES.Population) >= :Min_city_tot_pop_per_country;
Figure 3.22 shows Oracle’s actual parameter value input window, figure 3.23 the corresponding
result of P3_2_0b for 7,000,000, and figure 3.24 the one for the corresponding P3_2b (with P3_2_0b
as a subquery):
Figure 3.23 Result of P3_2_0b for 7,000,000 people
c.
Even if not that obvious, the best thing to do is to add a corresponding WHERE filter to P3_2_0b:
P3_2_0c:
SELECT Sum(CITIES.Population) AS CityPopSum, STATES.Country
FROM COUNTRIES INNER JOIN (STATES INNER JOIN CITIES
ON STATES.x = CITIES.State)
ON STATES.Country = COUNTRIES.x
WHERE COUNTRIES.Country Like 'R%'
GROUP BY STATES.Country
HAVING Sum(CITIES.Population) >=
:Min_city_tot_pop_per_country;
Figure 3.25 shows Oracle’s actual parameter value input window, figure 3.26 the corresponding
result of P3_2_0c for 2,500,000, and figure 3.27 the one for the corresponding P3_2c:
Unfortunately, Oracle does not accept parameterized views; consequently, the way to store
and run parameterized queries is through PL/SQL stored procedures. As it is best to group all such
procedures addressing the same functional specifications in a PL/SQL package, let us create such
a package. Right-click the Packages node of LAB_DB and then click on New Package:
Figure 3.28 Creating a new PL/SQL package
In the Create PL/SQL Package window that pops up, enter desired package name:
In the header of the newly created package, replace the comment /* To do … */ with the following
two declarations (see figure 3.30):
TYPE GenericCursorType IS REF CURSOR;
procedure p3_2c (min_city_tot_pop_per_country number,
rc OUT GenericCursorType);
Figure 3.30 The header of the LAB_DB_SQL PL/SQL package
For creating the package body, right-click on the package’s name and then click on Create Body…:
In the newly created body, enter procedure’s P3_2c definition (see figure 3.32):
procedure p3_2c
(
  Min_city_tot_pop_per_country in number,
  rc out GenericCursorType
) is
begin
  open rc for
    SELECT COUNTRIES.Country, COUNTRIES.Population,
           CityPopSum, Population - CityPopSum AS UnaccPop
    FROM (SELECT Sum(CITIES.Population) AS CityPopSum,
                 STATES.Country
          FROM COUNTRIES INNER JOIN (STATES INNER JOIN CITIES
                 ON STATES.x = CITIES.State)
               ON STATES.Country = COUNTRIES.x
          WHERE COUNTRIES.Country Like 'R%'
          GROUP BY STATES.Country
          HAVING Sum(CITIES.Population) >=
                 Min_city_tot_pop_per_country)
         P3_2_0c INNER JOIN COUNTRIES
           ON P3_2_0c.Country = COUNTRIES.x
    ORDER BY Population - CityPopSum DESC, CityPopSum DESC,
             COUNTRIES.Population DESC, COUNTRIES.Country;
end p3_2c;
In order to run this packaged procedure with desired parameters, enter in a LAB_DB SQL tab the
following statements:
var c refcursor;
exec lab_db_pl_sql.p3_2c(2500000, :c);
print c;
Running them (see figure 3.33), you get the same results as in figure 3.26 above; the main advantage
of this approach is that, from now on, you can also obtain and process this result
programmatically (e.g. in VBA, Java, .NET, etc.).
Figure 3.33 Running the LAB_DB_SQL.P3_2C PL/SQL packaged procedure
Please note again that, unfortunately, some programmers would rather come up with one of the
following equivalent, but far from optimal, solutions:
P3_2_0cBis:
SELECT Sum(CITIES.Population) AS CityPopSum, STATES.Country
FROM COUNTRIES INNER JOIN (STATES INNER JOIN CITIES
ON STATES.x = CITIES.State)
ON STATES.Country = COUNTRIES.x
GROUP BY STATES.Country, COUNTRIES.Country
HAVING COUNTRIES.Country Like 'R%' AND
Sum(CITIES.Population) >= :Min_city_tot_pop_per_country;
P3_2cBis:
SELECT COUNTRIES.Country, COUNTRIES.Population, CityPopSum,
Population - CityPopSum AS UnaccPop
FROM (SELECT Sum(CITIES.Population) AS CityPopSum,
STATES.Country
FROM COUNTRIES INNER JOIN (STATES INNER JOIN CITIES
ON STATES.x = CITIES.State)
ON STATES.Country = COUNTRIES.x
GROUP BY STATES.Country
HAVING Sum(CITIES.Population) >=
:Min_city_tot_pop_per_country)
P3_2_0c INNER JOIN COUNTRIES
ON P3_2_0c.Country = COUNTRIES.x
WHERE COUNTRIES.Country Like 'R%'
ORDER BY Population - CityPopSum DESC, CityPopSum DESC,
COUNTRIES.Population DESC, COUNTRIES.Country;
P3_2_0c only computes, in this particular case, one group (for Romania) and, generally,
about a dozen groups (for Romania, Russia, Rwanda, etc.), whereas
both P3_2_0cBis and P3_2cBis still compute all groups (four in this particular
case, but some 250 for full countries’ data) and then throw away the vast majority
of their computation results (three groups in this particular case, but some 238 for full
countries’ data).
Generally, note that we cannot get rid of HAVING clauses, as this is the only place where we can
add filters on data computed (generally through aggregation) after grouping; dually, we could
sometimes get rid of WHERE clauses (not always, as –see figure 3.4 above– sometimes we might
need filters on global applications of aggregate functions) and only use HAVING ones, but this
would be a stupid thing to do, both conceptually and, especially, performance-wise.
3.4 Best practice rules
BPR3.1 Both conceptually and from the RDBMS performance point of view, it is preferable to
split complex problems into smaller and simpler sub-problems and then combine
their solutions.
BPR3.2 Always use only necessary data (dually: never use unnecessary tables and/or
columns) in your queries.
BPR3.3 When possible, always join table instances on smallest numerical keys (generally,
primary surrogate ones), instead of any other existing equivalent keys.
BPR3.4 Always use WHERE for filtering as much as possible before grouping.
BPR3.5 Use HAVING only for filtering on data computed after grouping (dually: never use
HAVING for filters that can be placed in WHERE!).
BPR3.6 Never present users with unordered results, except for cases when they are explicitly
asking for it.
BPR3.7 Always order results intelligently, so as to maximize users’ experience with your
application.
BPR3.8 Never order data more than once: order only in the final querying step.
BPR3.9 Never order on more columns/expressions than needed: ordering costs a lot!
BPR3.10 Never order by using column positions! For example, always use SELECT x, y …
ORDER BY x, y; and never SELECT x, y … ORDER BY 1, 2; because, one day,
when you have to change the latter to SELECT y, x … ORDER BY 1, 2;, you will also have to
change the ordering (to SELECT y, x … ORDER BY 2, 1;).
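The pitfall behind BPR3.10 can be reproduced on a tiny table. The following is a minimal sketch using Python's sqlite3 module; the table T and its rows are hypothetical. After swapping the SELECT columns, the query ordered by name keeps its ordering, while the one ordered by position silently switches to ordering on the other column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical two-column table, chosen so that x order and y order differ.
CREATE TABLE T (x INTEGER, y INTEGER);
INSERT INTO T VALUES (1, 30), (2, 10), (3, 20);
""")

# Ordering by name: swapping the SELECT columns does not change the order.
by_name_before = conn.execute("SELECT x, y FROM T ORDER BY x").fetchall()
by_name_after = conn.execute("SELECT y, x FROM T ORDER BY x").fetchall()

# Ordering by position: after the swap, position 1 is y, so the order changes.
by_pos_before = conn.execute("SELECT x, y FROM T ORDER BY 1").fetchall()
by_pos_after = conn.execute("SELECT y, x FROM T ORDER BY 1").fetchall()

print(by_name_after)
print(by_pos_after)
```

No error is raised in the positional case, which is what makes this kind of change easy to miss.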
3.5 Homework
H3.0 Prove that:
H3.1 Prove that there is no SQL solution for P3.2 above without subqueries or query hierarchies.
Hint: consider both the “GROUP BY golden rule” and the restriction that aggregate functions
cannot be composed with one another.
H3.2 Compute the set of countries having at least k states, each of which has at least n cities (k and
n being natural parameters), for which the unaccounted states and cities population per countries
are at least equal to other two distinct parameters, respectively, in the descending order of the
unaccounted states population per country, city population per country, corresponding accounted
ones, stored countries population, and then ascending on country name.
P.S. It is highly likely that an exercise of this type, generally simpler from the arithmetic point
of view, will be the main oral examination subject at the end of this semester!
H3.3 a. Add to the COUNTRIES table data for Hungary, Serbia, Bulgaria, Greece, Malta, and
Ukraine.
b. Add to your lab db a table for storing the NEIGHBORS binary relation defined over
COUNTRIES: $\text{NEIGHBORS} = \{(x, y) \in \text{COUNTRIES}^2 \mid x \text{ is neighbor to } y\}$ and populate it with
actual data for all countries in COUNTRIES.
H3.4 Translate into relational algebra and optimize all the SELECT statements from these first
three DB labs.