Data Manipulation in SQL - Edited
• Using joins
Before taking this course, you should be comfortable working with introductory SQL topics,
such as selecting data from a database using arithmetic functions, GROUP BY statements, and
WHERE clauses to filter data. In short, the query on top should look pretty familiar to you. You
should also be familiar with joining data with a LEFT JOIN, RIGHT JOIN, INNER JOIN, and
OUTER JOIN. In this course, we will use and build upon these topics to interact with our
database. Alright, let's get started!
1.3.Selecting from the European Soccer Database
For this course, we will be using the European Soccer Database -- a relational database that
contains data about over 25,000 matches, 300 teams, and 10,000 players in Europe between 2008
and 2016. The data is contained within 4 tables -- country, league, team, and match. Selecting
from tables in this database is pretty simple. The query you see here gives you the number of
matches played in each of the 11 leagues listed in the "League" table.
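The query referred to above is not reproduced in these notes. A minimal sketch of what such a query could look like follows, written with PostgreSQL in mind. The column names (league.country_id, match.country_id) and the toy rows are assumptions for illustration, not the real database contents.

```sql
-- Toy stand-ins for the league and match tables (schema is an assumption).
CREATE TEMPORARY TABLE league (id INTEGER, country_id INTEGER, name TEXT);
CREATE TEMPORARY TABLE match (id INTEGER, country_id INTEGER);

INSERT INTO league VALUES (1, 1, 'League A'), (2, 2, 'League B');
INSERT INTO match VALUES (10, 1), (11, 2), (12, 2);

-- Count the matches played in each league.
SELECT
    l.name AS league,
    COUNT(m.id) AS matches
FROM league AS l
LEFT JOIN match AS m
    ON l.country_id = m.country_id
GROUP BY l.name;
-- With the toy rows: League A -> 1 match, League B -> 2 matches
```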
Let's say we want to compare the number of home team wins, away team wins, and ties in the
2013/2014 season. The "Match" table has two relevant columns -- home_goal and away_goal.
We could add filters to the WHERE clause to select wins, losses, and ties in separate
queries, but that's not very efficient if you want to compare these outcomes in a single
data set.
1.4.CASE statements
• Contains a WHEN, THEN, and ELSE statement, finished with END
This is where the CASE statement comes in. Case statements are SQL's version of an "IF this
THEN that" statement. Case statements have three parts -- a WHEN clause, a THEN clause, and
an ELSE clause. The first part -- the WHEN clause -- tests a given condition, say, x = 1. If this
condition is TRUE, it returns the item you specify after your THEN clause. You can create
multiple conditions by listing WHEN and THEN statements within the same CASE statement.
The CASE statement is then ended with an ELSE clause that returns a specified value if none of
your WHEN conditions are true. When you have completed your statement, be sure to include
the term END and give it an alias. The completed CASE statement will evaluate to one column
in your SQL query.
1.4.1. CASE WHEN
In this example, we use a CASE statement to create a new variable that identifies matches as
home team wins, away team wins, or ties. A new column is created with the appropriate text for
each match given the outcome.
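The example described above is not shown in these notes, so here is a minimal sketch of the idea. The toy rows and the exact label text are assumptions; only home_goal, away_goal, and the 2013/2014 season come from the text.

```sql
-- Toy stand-in for the match table (three made-up rows).
CREATE TEMPORARY TABLE match (id INTEGER, season TEXT, home_goal INTEGER, away_goal INTEGER);
INSERT INTO match VALUES
    (1, '2013/2014', 2, 1),
    (2, '2013/2014', 0, 3),
    (3, '2013/2014', 1, 1);

-- Label each match; the whole CASE expression becomes one output column.
SELECT
    id,
    home_goal,
    away_goal,
    CASE WHEN home_goal > away_goal THEN 'Home team win'
         WHEN home_goal < away_goal THEN 'Away team win'
         ELSE 'Tie' END AS outcome
FROM match
WHERE season = '2013/2014';
-- With the toy rows: 1 -> 'Home team win', 2 -> 'Away team win', 3 -> 'Tie'
```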
Exercise
Basic CASE statements
What is your favorite team?
The European Soccer Database contains data about 12,800 matches from 11 countries played
between 2011-2015! Throughout this course, you will be shown filtered versions of the tables in
this database in order to better explore their contents.
In this exercise, you will identify matches played between FC Schalke 04 and FC Bayern
Munich. Each match identifies 2 teams in the hometeam_id and awayteam_id
columns, available to you in the filtered matches_germany table. Each ID can be joined to the
team_api_id column in the teams_germany table, but you cannot perform a join on both at the
same time.
However, you can perform this operation using a CASE statement once you've identified the
team_api_id associated with each team!
Instructions 1/2:
• Select the team's long name and API id from the teams_germany table.
• Filter the query for FC Schalke 04 and FC Bayern Munich using IN, giving you the
team_api_id values needed for the next step.
SELECT
-- Select the team long name and team API id
team_long_name,
team_api_id
FROM teams_germany
-- Only include FC Schalke 04 and FC Bayern Munich
WHERE team_long_name IN ('FC Schalke 04', 'FC Bayern Munich');
Instructions 2/2:
• Create a CASE statement that identifies whether a match in Germany included FC
Bayern Munich, FC Schalke 04, or neither as the home team.
• Group the query by the CASE statement alias, home_team.
-- Identify the home team as Bayern Munich, Schalke 04, or neither
SELECT
CASE WHEN hometeam_id = 10189 THEN 'FC Schalke 04'
WHEN hometeam_id = 9823 THEN 'FC Bayern Munich'
ELSE 'Other' END AS home_team,
COUNT(id) AS total_matches
FROM matches_germany
-- Group by the CASE statement alias
GROUP BY home_team;
Exercise
CASE statements comparing column values
Barcelona is considered one of the strongest teams in Spain's soccer league.
In this exercise, you will be creating a list of matches in the 2011/2012 season where Barcelona
was the home team. You will do this using a CASE statement that compares the values of two
columns to create a new group -- wins, losses, and ties.
In 3 steps, you will build a query that identifies a match's winner, retrieves the identity of the
opponent, and finally filters for Barcelona as the home team. Completing a query in this order
will allow you to watch your results take shape with each new piece of information.
The matches_spain table currently contains Barcelona's matches from the 2011/2012 season, and
has two key columns, hometeam_id and awayteam_id, that can be joined with the teams_spain
table. However, you can only join teams_spain to one column at a time.
Instructions 1/3:
• Select the date of the match and create a CASE statement to identify matches as home
wins, home losses, or ties.
SELECT
-- Select the date of the match
date,
-- Identify home wins, losses, or ties
CASE WHEN home_goal > away_goal THEN 'Home win!'
WHEN home_goal < away_goal THEN 'Home loss :('
ELSE 'Tie' END AS outcome
FROM matches_spain;
Instructions 2/3:
• Left join the teams_spain table team_api_id column to the matches_spain table
awayteam_id. This allows us to retrieve the away team's identity.
• Select team_long_name from teams_spain as opponent and complete the CASE statement
from Step 1.
SELECT
m.date,
-- Select the team long name column and call it 'opponent'
t.team_long_name AS opponent,
-- Complete the CASE statement with an alias
CASE WHEN m.home_goal > m.away_goal THEN 'Home win!'
WHEN m.home_goal < m.away_goal THEN 'Home loss :('
ELSE 'Tie' END AS outcome
FROM matches_spain AS m
-- Left join teams_spain onto matches_spain
LEFT JOIN teams_spain AS t
ON m.awayteam_id = t.team_api_id;
Instructions 3/3:
• Complete the same CASE statement as the previous steps.
• Filter for matches where the home team is FC Barcelona (id = 8634).
SELECT
m.date,
t.team_long_name AS opponent,
-- Complete the CASE statement with an alias
CASE WHEN m.home_goal > m.away_goal THEN 'Barcelona win!'
WHEN m.home_goal < m.away_goal THEN 'Barcelona loss :('
ELSE 'Tie' END AS outcome
FROM matches_spain AS m
LEFT JOIN teams_spain AS t
ON m.awayteam_id = t.team_api_id
-- Filter for Barcelona as the home team
WHERE m.hometeam_id = 8634;
Exercise
CASE statements comparing two column values part 2
Similar to the previous exercise, you will construct a query to determine the outcome of
Barcelona's matches where they played as the away team. You will learn how to combine these
two queries in chapters 2 and 3.
Did their performance differ from the matches where they were the home team?
Instructions
• Complete the CASE statement to identify Barcelona's away team games (id = 8634) as
wins, losses, or ties.
• Left join the teams_spain table team_api_id column on the matches_spain table
hometeam_id column. This retrieves the identity of the home team opponent.
• Filter the query to only include matches where Barcelona was the away team.
-- Select matches where Barcelona was the away team
SELECT
m.date,
t.team_long_name AS opponent,
CASE WHEN m.home_goal < m.away_goal THEN 'Barcelona win!'
WHEN m.home_goal > m.away_goal THEN 'Barcelona loss :('
ELSE 'Tie' END AS outcome
FROM matches_spain AS m
-- Join teams_spain to matches_spain
LEFT JOIN teams_spain AS t
ON m.hometeam_id = t.team_api_id
WHERE m.awayteam_id = 8634;
2. In CASE things get more complex
2.1.Reviewing CASE WHEN
Previously, we covered CASE statements with one logical test in a WHEN statement, returning
outcomes based on whether that test is TRUE or FALSE. This example tests whether home or
away goals were higher, and identifies them as wins for the team that had a higher score.
Everything ELSE is categorized as a tie. The resulting table has one column identifying matches
as one of 3 possible outcomes.
2.2.CASE WHEN … AND then some
• Add multiple logical conditions to your WHEN clause!
If you want to test multiple logical conditions in a CASE statement, you can use AND inside
your WHEN clause. For example, let's see if each match was played, and won, by the team
Chelsea. Let's look at the CASE statement in this query. Each WHEN clause contains two logical
tests -- the first tests if the hometeam_id identifies Chelsea, AND then it tests if the home team
scored higher than the away team. If both conditions are TRUE, the new column output returns
the phrase "Chelsea home win!". The opposite set of conditions is included in a second WHEN
clause -- if the awayteam_id belongs to Chelsea, AND the away team scored higher, then the
output returns "Chelsea away win!". All other matches are categorized as a loss or tie. Here's
the resulting table.
2.3.What ELSE is being excluded?
• What’s in your ELSE clause?
When testing logical conditions, it's important to carefully consider which rows of your data are
part of your ELSE clause, and if they're categorized correctly. Here's the same CASE statement
from the previous slide, but the WHERE filter has been removed. Without this filter, your ELSE
clause will categorize ALL matches, played by any team, that don't meet these first two conditions
as "Loss or tie :(". Here are the results of this query. A quick look at it shows that the first few
matches are all categorized as "Loss or tie", but neither the hometeam_id nor the awayteam_id
belongs to Chelsea.
2.4.Correctly categorize your data with CASE
The easiest way to correct for this is to add specific filters in the WHERE clause that
exclude all matches in which Chelsea did not play. Here, we specify this by using an OR condition in
WHERE, which retrieves only results where the id 8455 is present in the hometeam_id or
awayteam_id columns. The resulting table, with the team IDs in bold here, clearly
specifies whether Chelsea was the home or away team.
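The slides with the Chelsea query are not reproduced here, so the following is a minimal sketch of the pattern. The id 8455 and the output phrases come from the text. The opponent ids (9999, 7777), the dates, and the scores are made up.

```sql
-- Toy stand-in for the match table.
CREATE TEMPORARY TABLE match (
    id INTEGER, date TEXT, hometeam_id INTEGER, awayteam_id INTEGER,
    home_goal INTEGER, away_goal INTEGER
);
INSERT INTO match VALUES
    (1, '2011-08-14', 8455, 9999, 2, 0),  -- Chelsea home win
    (2, '2011-08-20', 9999, 8455, 1, 3),  -- Chelsea away win
    (3, '2011-08-27', 8455, 7777, 1, 1),  -- tie
    (4, '2011-09-10', 9999, 7777, 4, 0);  -- Chelsea did not play

SELECT
    date,
    hometeam_id,
    awayteam_id,
    -- Each WHEN clause combines two tests with AND
    CASE WHEN hometeam_id = 8455 AND home_goal > away_goal THEN 'Chelsea home win!'
         WHEN awayteam_id = 8455 AND home_goal < away_goal THEN 'Chelsea away win!'
         ELSE 'Loss or tie :(' END AS outcome
FROM match
-- Without this filter, match 4 would also be labelled 'Loss or tie :('
WHERE hometeam_id = 8455 OR awayteam_id = 8455;
```

Dropping the WHERE clause from this sketch reproduces the miscategorization described in section 2.3: match 4 falls into the ELSE branch even though Chelsea never played in it.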
2.5.What’s NULL
It's also important to consider what your ELSE clause is doing. These two queries here are
identical, except for the ELSE NULL statement specified in the second. They both return
identical results -- a table with quite a few null results. But what if you want to exclude them?
2.6.What are your NULL values doing?
Let's say we're only interested in viewing the results of games where Chelsea won, and we don't
care if they lose or tie. Just like in the previous example, simply removing the ELSE clause will
still retrieve those results -- and a lot of NULL values.
2.7.Where to place your CASE?
To correct for this, you can treat the entire CASE statement as a column to filter by in your
WHERE clause, just like any other column. In order to filter a query by a CASE statement,
you include the entire CASE statement, except its alias, in WHERE. You then specify what you
want to include, or exclude. For this query, I want to keep all rows where this CASE statement
IS NOT NULL. My resulting table now only includes Chelsea's home and away wins -- and I
don't need to filter by their team ID anymore!
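As a minimal sketch of this filter, the query below repeats the CASE statement in WHERE, without an alias and without an ELSE clause, so that every non-win evaluates to NULL and is dropped. The toy rows and opponent ids are made up; 8455 is Chelsea's id from the text.

```sql
-- Toy stand-in for the match table.
CREATE TEMPORARY TABLE match (
    id INTEGER, date TEXT, hometeam_id INTEGER, awayteam_id INTEGER,
    home_goal INTEGER, away_goal INTEGER
);
INSERT INTO match VALUES
    (1, '2011-08-14', 8455, 9999, 2, 0),  -- Chelsea home win
    (2, '2011-08-20', 9999, 8455, 1, 3),  -- Chelsea away win
    (3, '2011-08-27', 8455, 7777, 1, 1),  -- tie
    (4, '2011-09-10', 9999, 7777, 4, 0);  -- Chelsea did not play

SELECT
    date,
    CASE WHEN hometeam_id = 8455 AND home_goal > away_goal THEN 'Chelsea home win!'
         WHEN awayteam_id = 8455 AND home_goal < away_goal THEN 'Chelsea away win!'
         END AS outcome
FROM match
-- Same CASE statement, no alias, used as a filtering column
WHERE CASE WHEN hometeam_id = 8455 AND home_goal > away_goal THEN 'Chelsea home win!'
           WHEN awayteam_id = 8455 AND home_goal < away_goal THEN 'Chelsea away win!'
           END IS NOT NULL;
-- With the toy rows, only matches 1 and 2 are returned
```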
Exercise
In CASE of rivalry
Barcelona and Real Madrid have been rival teams for more than 80 years. Matches between
these two teams are given the name El Clásico (The Classic). In this exercise, you will query a
list of matches played between these two rivals.
You will notice in Step 2 that when you have multiple logical conditions in a CASE statement,
you may quickly end up with a large number of WHEN clauses to logically test every outcome
you are interested in. It's important to make sure you don't accidentally exclude key information
in your ELSE clause.
In this exercise, you will retrieve information about matches played between Barcelona (id =
8634) and Real Madrid (id = 8633). Note that the query you are provided with already identifies
the Clásico matches using a filter in the WHERE clause.
Instructions 1/2:
• Complete the first CASE statement, identifying Barcelona or Real Madrid as the home
team using the hometeam_id column.
• Complete the second CASE statement in the same way, using awayteam_id.
SELECT
date,
-- Identify the home team as Barcelona or Real Madrid
CASE WHEN hometeam_id = 8634 THEN 'FC Barcelona'
ELSE 'Real Madrid CF' END AS home,
-- Identify the away team as Barcelona or Real Madrid
CASE WHEN awayteam_id = 8634 THEN 'FC Barcelona'
ELSE 'Real Madrid CF' END AS away
FROM matches_spain
WHERE (awayteam_id = 8634 OR hometeam_id = 8634)
AND (awayteam_id = 8633 OR hometeam_id = 8633);
Instructions 2/2:
• Construct the final CASE statement identifying who won each match. Note there are 3
possible outcomes, but 5 conditions that you need to identify.
• Fill in the logical operators to identify Barcelona or Real Madrid as the winner.
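No solution for this step appears in these notes. One possible way to write the final CASE statement is sketched below: four WHEN clauses plus an ELSE give the 5 conditions for the 3 outcomes. The output labels and the toy rows are assumptions; the ids 8634 and 8633 come from the text.

```sql
-- Toy stand-in for matches_spain with three made-up Clásico results.
CREATE TEMPORARY TABLE matches_spain (
    date TEXT, hometeam_id INTEGER, awayteam_id INTEGER,
    home_goal INTEGER, away_goal INTEGER
);
INSERT INTO matches_spain VALUES
    ('2011-12-10', 8633, 8634, 1, 3),  -- Barcelona wins away
    ('2012-04-21', 8634, 8633, 1, 2),  -- Real Madrid wins away
    ('2012-10-07', 8634, 8633, 2, 2);  -- tie

SELECT
    date,
    CASE WHEN hometeam_id = 8634 THEN 'FC Barcelona'
         ELSE 'Real Madrid CF' END AS home,
    CASE WHEN awayteam_id = 8634 THEN 'FC Barcelona'
         ELSE 'Real Madrid CF' END AS away,
    -- Identify the winner of each match
    CASE WHEN home_goal > away_goal AND hometeam_id = 8634 THEN 'Barcelona win!'
         WHEN home_goal > away_goal AND hometeam_id = 8633 THEN 'Real Madrid win!'
         WHEN home_goal < away_goal AND awayteam_id = 8634 THEN 'Barcelona win!'
         WHEN home_goal < away_goal AND awayteam_id = 8633 THEN 'Real Madrid win!'
         ELSE 'Tie!' END AS outcome
FROM matches_spain
WHERE (awayteam_id = 8634 OR hometeam_id = 8634)
  AND (awayteam_id = 8633 OR hometeam_id = 8633);
```

Because the WHERE clause already restricts the rows to matches between the two teams, the ELSE branch can only ever catch ties.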
Exercise
Filtering your CASE statement
Let's generate a list of matches won by Italy's Bologna team! There are quite a few additional
teams in the two tables, so a key part of generating a usable query will be using your CASE
statement as a filter in the WHERE clause.
CASE statements allow you to categorize data that you're interested in -- and exclude data you're
not interested in. In order to do this, you can use a CASE statement as a filter in the WHERE
statement to remove output you don't want to see.
SELECT *
FROM table
WHERE
CASE WHEN a > 5 THEN 'Keep'
WHEN a <= 5 THEN 'Exclude' END = 'Keep';
In essence, you can use the CASE statement as a filtering column like any other column in your
database. The only difference is that you don't alias the statement in WHERE.
Instructions 1/3:
• Identify Bologna's team ID listed in the teams_italy table by selecting the
team_long_name and team_api_id.
-- Select team_long_name and team_api_id from team
SELECT
team_long_name,
team_api_id
FROM teams_italy
-- Filter for team long name
WHERE team_long_name = 'Bologna';
Instructions 2/3:
• Select the season and date that a match was played.
• Complete the CASE statement so that only Bologna's home and away wins are identified.
-- Select the season and date columns
SELECT
season,
date,
-- Identify when Bologna won a match
CASE WHEN hometeam_id = 9857
AND home_goal > away_goal
THEN 'Bologna Win'
WHEN awayteam_id = 9857
AND away_goal > home_goal
THEN 'Bologna Win'
END AS outcome
FROM matches_italy;
Instructions 3/3:
• Select the home_goal and away_goal for each match.
• Use the CASE statement in the WHERE clause to filter all NULL values generated by
the statement in the previous step.
-- Select the season, date, home_goal, and away_goal columns
SELECT
season,
date,
home_goal,
away_goal
FROM matches_italy
WHERE
-- Exclude games not won by Bologna
CASE WHEN hometeam_id = 9857 AND home_goal > away_goal THEN 'Bologna Win'
WHEN awayteam_id = 9857 AND away_goal > home_goal THEN 'Bologna Win'
END IS NOT NULL;
3. CASE WHEN with aggregate functions
3.1.In CASE you need to aggregate
• CASE statements are great for
o Categorizing data
o Filtering data
o Aggregating data
CASE statements can be used to create columns for categorizing data, and to filter your data in
the WHERE clause. You can also use CASE statements to aggregate data based on the result of a
logical test.
3.2.COUNTing CASES
• How many home and away games did Liverpool win in each season?
Let's say you wanted to prepare a summary table counting the number of home and away games
that Liverpool won in each season. If you've created summary tables in Spreadsheets, you can
probably visualize the final table, here -- but how do you get a count of Liverpool's wins in each
season?
3.3.CASE WHEN with COUNT
You guessed it -- a CASE statement. CASE statements are like any other column in your query,
so you can include them inside an aggregate function. Take a look at the CASE statement. The
WHEN clause includes a similar logical test to the previous lesson -- did Liverpool play as the
home team, AND did the home team score higher than the away team? The difference begins in
your THEN clause. Instead of returning a string of text, you return the column identifying the
unique match id. When this CASE statement is inside the COUNT function, it COUNTS every
id returned by this CASE statement.
You then add a second CASE statement for the away team, and group the query by the season.
When counting information in a CASE statement, you can return anything you'd like --
a number, a string of text, or any column in the table. SQL is COUNTing the number of non-NULL
values returned by the CASE statement.
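A minimal sketch of this counting pattern follows. Liverpool's id is not given in these notes, so 8650 is an assumed value, and the opponent id and match rows are made up.

```sql
-- Toy stand-in for the match table; 8650 is an ASSUMED id for Liverpool.
CREATE TEMPORARY TABLE match (
    id INTEGER, season TEXT, hometeam_id INTEGER, awayteam_id INTEGER,
    home_goal INTEGER, away_goal INTEGER
);
INSERT INTO match VALUES
    (1, '2011/2012', 8650, 9999, 2, 0),  -- home win
    (2, '2011/2012', 9999, 8650, 0, 1),  -- away win
    (3, '2011/2012', 8650, 9999, 1, 1),  -- tie
    (4, '2012/2013', 8650, 9999, 3, 1),  -- home win
    (5, '2012/2013', 9999, 8650, 2, 0);  -- away loss

SELECT
    season,
    -- The CASE returns the match id for wins and NULL otherwise; COUNT skips NULLs
    COUNT(CASE WHEN hometeam_id = 8650 AND home_goal > away_goal
               THEN id END) AS home_wins,
    COUNT(CASE WHEN awayteam_id = 8650 AND away_goal > home_goal
               THEN id END) AS away_wins
FROM match
GROUP BY season;
-- With the toy rows: 2011/2012 -> 1 home win, 1 away win; 2012/2013 -> 1 home win, 0 away wins
```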
3.4.CASE WHEN with SUM
Similarly, you can use the SUM function to calculate a total of any value. Let's say we're
interested in the number of home and away goals that Liverpool scored in each season. This is
fairly simple to set up -- if the hometeam_id is Liverpool's, return the home_goal value. The
ELSE condition is assumed to be NULL, so the query returns the total home_goals scored by
Liverpool in each season.
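The same structure works with SUM, as in this minimal sketch. As before, 8650 is an assumed id for Liverpool and the rows are made up.

```sql
-- Toy stand-in for the match table; 8650 is an ASSUMED id for Liverpool.
CREATE TEMPORARY TABLE match (
    id INTEGER, season TEXT, hometeam_id INTEGER, awayteam_id INTEGER,
    home_goal INTEGER, away_goal INTEGER
);
INSERT INTO match VALUES
    (1, '2011/2012', 8650, 9999, 2, 0),
    (2, '2011/2012', 9999, 8650, 0, 1),
    (3, '2011/2012', 8650, 9999, 1, 1),
    (4, '2012/2013', 8650, 9999, 3, 1),
    (5, '2012/2013', 9999, 8650, 2, 0);

SELECT
    season,
    -- No ELSE clause: rows that fail the test yield NULL, which SUM ignores
    SUM(CASE WHEN hometeam_id = 8650 THEN home_goal END) AS home_goals,
    SUM(CASE WHEN awayteam_id = 8650 THEN away_goal END) AS away_goals
FROM match
GROUP BY season;
-- With the toy rows: 2011/2012 -> 3 home goals, 1 away goal; 2012/2013 -> 3 home goals, 0 away goals
```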
3.5.A ROUNDed AVG
You can make the results easier to read using ROUND. ROUND takes 2 arguments -- a
numerical value, and the number of decimal places to round the value to.
Place it outside your aggregate CASE statement, and include the number of decimal places at the
end. There, that's much easier to read!
3.6.Percentages with CASE and AVG
The second key application of CASE with AVG is in the calculation of percentages. This
requires a specific structure in order for your calculation to be accurate. The question we're
answering here is, "What percentage of Liverpool's games did they win in each season?" The
first component of this CASE statement is a WHEN clause identifying what you're calculating a
percentage of -- in this case, how many games did they win? This is tested in the same way as
previous slides, and your THEN clause returns a 1. The second component identifies Liverpool's
games that they LOST, and returns the value 0. All other matches -- ties, games not involving
Liverpool -- are excluded as NULLs. Here are the results of this query ...
... and here's the ROUNDed, more readable version of the results.
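The percentage structure described above can be sketched as follows, written with PostgreSQL in mind. The id 8650 for Liverpool is an assumption and the rows are made up. Wins return 1, losses return 0, and everything else (ties, matches without Liverpool) stays NULL so that AVG ignores it.

```sql
-- Toy stand-in for the match table; 8650 is an ASSUMED id for Liverpool.
CREATE TEMPORARY TABLE match (
    id INTEGER, season TEXT, hometeam_id INTEGER, awayteam_id INTEGER,
    home_goal INTEGER, away_goal INTEGER
);
INSERT INTO match VALUES
    (1, '2011/2012', 8650, 9999, 2, 0),  -- home win
    (2, '2011/2012', 9999, 8650, 0, 1),  -- away win
    (3, '2011/2012', 8650, 9999, 0, 1),  -- home loss
    (4, '2011/2012', 8650, 9999, 1, 1),  -- tie (excluded as NULL)
    (5, '2012/2013', 8650, 9999, 3, 1),  -- home win
    (6, '2012/2013', 9999, 8650, 2, 0),  -- away loss
    (7, '2012/2013', 9999, 7777, 1, 0);  -- Liverpool did not play (excluded as NULL)

SELECT
    season,
    ROUND(AVG(CASE WHEN hometeam_id = 8650 AND home_goal > away_goal THEN 1
                   WHEN hometeam_id = 8650 AND home_goal < away_goal THEN 0
                   END), 2) AS pct_home_wins,
    ROUND(AVG(CASE WHEN awayteam_id = 8650 AND away_goal > home_goal THEN 1
                   WHEN awayteam_id = 8650 AND away_goal < home_goal THEN 0
                   END), 2) AS pct_away_wins
FROM match
GROUP BY season;
-- With the toy rows: 2011/2012 -> 0.50 home, 1.00 away; 2012/2013 -> 1.00 home, 0.00 away
```

Note that the tie in match 4 does not pull the home percentage down, because it never enters the average. If ties should count against the win percentage, they would need an explicit THEN 0.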
Exercise
COUNT using CASE WHEN
Does the number of soccer matches played in a given European country differ across seasons? We
will use the European Soccer Database to answer this question.
You will examine the number of matches played in 3 seasons within each country listed in the
database. This is much easier to explore with each season's matches in separate columns. Using
the country and unfiltered match table, you will count the number of matches played in each
country during the 2012/2013, 2013/2014, and 2014/2015 match seasons.
Instructions 1/2:
• Create a CASE statement that identifies the id of matches played in the 2012/2013
season. Specify that you want ELSE values to be NULL.
• Wrap the CASE statement in a COUNT function and group the query by the country
alias.
SELECT
c.name AS country,
-- Count games from the 2012/2013 season
COUNT(CASE WHEN m.season = '2012/2013'
THEN m.id ELSE NULL END) AS matches_2012_2013
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;
Instructions 2/2:
• Create 3 CASE WHEN statements counting the matches played in each country across
the 3 seasons.
• END your CASE statement without an ELSE clause.
SELECT
c.name AS country,
-- Count matches in each of the 3 seasons
COUNT(CASE WHEN m.season = '2012/2013' THEN m.id END) AS matches_2012_2013,
COUNT(CASE WHEN m.season = '2013/2014' THEN m.id END) AS matches_2013_2014,
COUNT(CASE WHEN m.season = '2014/2015' THEN m.id END) AS matches_2014_2015
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;
Exercise
COUNT and CASE WHEN with multiple conditions
In R or Python, you have the ability to calculate a SUM of logical values (i.e., TRUE/FALSE)
directly. In SQL, you have to convert these values into 1 and 0 before calculating a sum. This
can be done using a CASE statement.
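There's one key difference when using SUM to aggregate logical values compared to using
COUNT in the previous exercise: COUNT tallies every non-NULL value, so rows to exclude must
return NULL, whereas SUM adds up the returned values, so rows to include return 1 and rows to
exclude return 0.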
Your goal here is to use the country and match table to determine the total number of matches
won by the home team in each country during the 2012/2013, 2013/2014, and 2014/2015
seasons.
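To make the difference between the two aggregates concrete, here is a minimal sketch on three made-up rows. It also shows the common mistake of keeping ELSE 0 inside COUNT, which counts every row because 0 is not NULL.

```sql
-- Toy stand-in for the match table: one home win, one tie, one away win.
CREATE TEMPORARY TABLE match (id INTEGER, home_goal INTEGER, away_goal INTEGER);
INSERT INTO match VALUES (1, 2, 0), (2, 1, 1), (3, 0, 3);

SELECT
    -- COUNT ignores NULLs, so only the home win is counted: 1
    COUNT(CASE WHEN home_goal > away_goal THEN id END) AS count_home_wins,
    -- SUM adds 1 + 0 + 0: 1
    SUM(CASE WHEN home_goal > away_goal THEN 1 ELSE 0 END) AS sum_home_wins,
    -- Pitfall: 0 is a non-NULL value, so COUNT sees all three rows: 3
    COUNT(CASE WHEN home_goal > away_goal THEN 1 ELSE 0 END) AS count_with_else_0
FROM match;
```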
Instructions:
• Create 3 CASE statements to "count" matches in the '2012/2013', '2013/2014', and
'2014/2015' seasons, respectively.
• Have each CASE statement return a 1 for every match you want to include, and a 0 for
every match to exclude.
• Wrap the CASE statement in a SUM to return the total matches won by the home team in each season.
• Group the query by the country name alias.
SELECT
c.name AS country,
-- Sum the total records in each season where the home team won
SUM(CASE WHEN m.season = '2012/2013' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches_2012_2013,
SUM(CASE WHEN m.season = '2013/2014' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches_2013_2014,
SUM(CASE WHEN m.season = '2014/2015' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches_2014_2015
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;
Exercise
Calculating percent with CASE and AVG
CASE statements will return any value you specify in your THEN clause. This is an incredibly
powerful tool for robust calculations and data manipulation when used in conjunction with an
aggregate statement. One key task you can perform is using CASE inside an AVG function to
calculate a percentage of information in your database.
With this approach, it's important to accurately specify which records count as 0, otherwise your
calculations may not be correct!
Your task is to examine the number of wins, losses, and ties in each country. The matches table
is filtered to include all matches from the 2013/2014 and 2014/2015 seasons.
Instructions 1/3:
• Create 3 CASE statements to COUNT the total number of home team wins, away team
wins, and ties, which will allow you to examine the total number of records.
SELECT
c.name AS country,
-- Count the home wins, away wins, and ties in each country
COUNT(CASE WHEN m.home_goal > m.away_goal THEN m.id
END) AS home_wins,
COUNT(CASE WHEN m.home_goal < m.away_goal THEN m.id
END) AS away_wins,
COUNT(CASE WHEN m.home_goal = m.away_goal THEN m.id
END) AS ties
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;
Instructions 2/3:
• Calculate the percentage of matches tied using a CASE statement inside AVG.
• Fill in the logical operators for each statement. Alias your columns as ties_2013_2014
and ties_2014_2015, respectively.
SELECT
c.name AS country,
-- Calculate the percentage of tied games in each season
AVG(CASE WHEN m.season='2013/2014' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2013/2014' AND m.home_goal != m.away_goal THEN 0
END) AS ties_2013_2014,
AVG(CASE WHEN m.season='2014/2015' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2014/2015' AND m.home_goal != m.away_goal THEN 0
END) AS ties_2014_2015
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;
Instructions 3/3:
• The previous "ties" columns returned values with 14 decimal places, which is not easy to
interpret. Use the ROUND function to round to 2 decimal places.
SELECT
c.name AS country,
-- Round the percentage of tied games to 2 decimal points
ROUND(AVG(CASE WHEN m.season='2013/2014' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2013/2014' AND m.home_goal != m.away_goal THEN 0
END),2) AS pct_ties_2013_2014,
ROUND(AVG(CASE WHEN m.season='2014/2015' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2014/2015' AND m.home_goal != m.away_goal THEN 0
END),2) AS pct_ties_2014_2015
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;
4. WHERE are the Subqueries?
4.1.What is a subquery?
4.3.Why subqueries?
• Comparing groups to summarized values
o How did Liverpool compare to the English Premier League’s average
performance for that year?
• Reshaping data
o What is the highest monthly average of goals scored in the Bundesliga?
• Combining data that cannot be joined
o How do you get both the home and the away team names into a table of match
results?
So why might you need to use a subquery? Subqueries allow you to compare summarized values
to detailed data. For example, compare Liverpool's performance to the entire English Premier
League. Subqueries also allow you to better structure or reshape your data for multiple purposes,
such as determining the highest monthly average of goals scored in the Bundesliga. Finally,
subqueries allow you to combine data from tables where you are unable to perform a join, such
as getting both the home and away team names into your results table. We'll discuss all of these
questions in the coming lessons.
4.4.Simple subqueries
• Can be evaluated independently from the outer query
Let's start with the definition of a simple subquery. A simple subquery is a query, nested inside
another query, that can be run on its own. The example you see here has a subquery in the
WHERE clause -- if you copy the entire inner query, "SELECT the average home goal FROM
the match table", you can run it on its own and get a result.
• Is only processed once in the entire statement
A simple subquery is also evaluated once in the entire query. This means that SQL first
processes the information inside the subquery, gets the information it needs, and then moves on
to processing information in the OUTER query. Here is the same query you see above. The
subquery in WHERE is processed first, generating the overall average of home goals scored.
SQL then moves onto the main query, treating the subquery like the single, aggregate value it
just generated.
4.5.Subqueries in the WHERE clause
• In which matches in the 2012/2013 season were more home goals scored than the overall average?
The first type of simple subquery we'll explore is the subquery in the WHERE clause. These are
useful for filtering results based on information you'd have to calculate separately beforehand.
Let's generate a list of matches in the 2012/2013 season where the number of home goals scored
was higher than the overall average. You could calculate the average, and then include that number
in the main query...
...or you could put the query directly into the WHERE clause, inside parentheses. This way, you
have one less manual step to perform before getting the results you need.
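A minimal sketch of this kind of WHERE subquery follows. The four rows are made up. The inner query can be run on its own and is evaluated once, before the outer filter is applied.

```sql
-- Toy stand-in for the match table.
CREATE TEMPORARY TABLE match (id INTEGER, season TEXT, home_goal INTEGER, away_goal INTEGER);
INSERT INTO match VALUES
    (1, '2011/2012', 1, 0),
    (2, '2012/2013', 3, 1),
    (3, '2012/2013', 1, 2),
    (4, '2012/2013', 4, 4);

-- The overall average of home_goal across all four rows is 2.25
SELECT id, home_goal, away_goal
FROM match
WHERE season = '2012/2013'
  AND home_goal > (SELECT AVG(home_goal)
                   FROM match);
-- With the toy rows, matches 2 and 4 are returned
```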
4.6.Subquery filtering lists with IN
• Which teams are part of Poland’s league?
Subqueries are also useful for generating a filtering list. This query answers the question, "Which
teams are part of Poland's league?" The "team" table doesn't have the country IDs, but the
"match" table has both country and team IDs. By querying a list of hometeam_id values from match
where the country_id is 15722, which indicates "Poland", you can generate a list to compare to
the team_api_id column using IN in the WHERE clause.
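The filtering-list pattern can be sketched as below. The country_id 15722 comes from the text. The team names, the other ids, and the second country_id are placeholders.

```sql
-- Toy stand-ins for the team and match tables.
CREATE TEMPORARY TABLE team (team_api_id INTEGER, team_long_name TEXT);
CREATE TEMPORARY TABLE match (id INTEGER, country_id INTEGER, hometeam_id INTEGER, awayteam_id INTEGER);

INSERT INTO team VALUES (1, 'Team A'), (2, 'Team B'), (3, 'Team C'), (4, 'Team D');
INSERT INTO match VALUES
    (10, 15722, 1, 2),  -- played in Poland
    (11, 15722, 2, 1),  -- played in Poland
    (12, 1729, 3, 4);   -- played elsewhere

SELECT team_long_name, team_api_id
FROM team
-- The subquery returns one column of ids, used as the list for IN
WHERE team_api_id IN
    (SELECT hometeam_id
     FROM match
     WHERE country_id = 15722);
-- With the toy rows, Team A and Team B are returned
```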
Exercise
Filtering using scalar subqueries
Subqueries are incredibly powerful for performing complex filters and transformations. You can
filter data based on single, scalar values using a subquery in ways you cannot by using WHERE
statements or joins. Subqueries can also be used for more advanced manipulation of your data
set. You will likely encounter subqueries in any real-world setting that uses relational databases.
In this exercise, you will generate a list of matches where the total goals scored (for both teams
in total) is more than 3 times the average for games in the matches_2013_2014 table, which
includes all games played in the 2013/2014 season.
Instructions 1/2:
• Calculate triple the average home + away goals scored across all matches. This will
become your subquery in the next step. Note that this column does not have an alias, so it
will be called ?column? in your results.
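The queries for this exercise are not reproduced in these notes. A minimal sketch of both steps, on five made-up rows, could look like this. The selected columns in the second query are an assumption.

```sql
-- Toy stand-in for the matches_2013_2014 table.
CREATE TEMPORARY TABLE matches_2013_2014 (id INTEGER, home_goal INTEGER, away_goal INTEGER);
INSERT INTO matches_2013_2014 VALUES
    (1, 1, 0), (2, 0, 1), (3, 0, 0), (4, 0, 0), (5, 5, 3);

-- Step 1: triple the average of total goals (average 2, so the result is 6)
SELECT 3 * AVG(home_goal + away_goal)
FROM matches_2013_2014;

-- Step 2: use step 1 as a scalar subquery in WHERE
SELECT id, home_goal, away_goal
FROM matches_2013_2014
WHERE (home_goal + away_goal) >
    (SELECT 3 * AVG(home_goal + away_goal)
     FROM matches_2013_2014);
-- With the toy rows, only match 5 (8 goals in total) is returned
```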
In addition to filtering using a single-value (scalar) subquery, you can create a list of values in a
subquery to filter data based on a complex set of conditions. This type of subquery generates a
one column reference list for the main query. As long as the values in your list match a column
in your main query's table, you don't need to use a join -- even if the list is from a separate table.
Instructions:
• Create a subquery in the WHERE clause that retrieves all unique hometeam_ID values
from the match table.
• Select the team_long_name and team_short_name from the team table. Exclude all values
from the subquery in the main query.
SELECT
-- Select the team long and short names
team_long_name,
team_short_name
FROM team
-- Exclude all values from the subquery
WHERE team_api_id NOT IN
(SELECT DISTINCT hometeam_id FROM match);
Exercise
Filtering with more complex subquery conditions
In the previous exercise, you generated a list of teams that have no home matches listed in the
soccer database using a subquery in WHERE. Let's do some further exploration in this database
by creating a list of teams that scored 8 or more goals in a home match.
In order to do this, you will construct a subquery in the WHERE statement with its own filtering
condition.
Instructions
• Create a subquery in the WHERE clause that retrieves all hometeam_id values from match
with a home_goal score greater than or equal to 8.
• Select the team_long_name and team_short_name from the team table. Include all values
from the subquery in the main query.
SELECT
-- Select the team long and short names
team_long_name,
team_short_name
FROM team
-- Filter for teams with 8 or more home goals
WHERE team_api_id IN
(SELECT hometeam_ID
FROM match
WHERE home_goal >= 8);
5. Subqueries in FROM
• Restructure and transform your data
o Transforming data from long to wide before selecting
o Prefiltering data
• Calculating aggregates of aggregates
o Which 3 teams have the highest average of home goals scored?
▪ Calculate the AVG for each team
▪ Get the 3 highest of the AVG values
You probably noticed that subqueries in WHERE can only return a single column. But what if
you want to return a more complex set of results? Subqueries in the FROM statement are a
robust tool for restructuring and transforming your data. Often, the data you need to answer a
question is not yet in the format necessary to query it directly, and requires some additional
processing to prepare for analysis. For example, you may want to transform your data into a
different shape, or pre-filter it before making calculations. Subqueries in a FROM statement are a
common way of preparing that data. Subqueries in FROM are also useful when calculating
aggregates of aggregate information. Let's say you're interested in getting the top 3 teams who
scored the highest number of home_goals on average in the 2011/2012 season. You would first
calculate the average for each team in the league, and THEN calculate the max value for any
team overall. This can be easily accomplished with a subquery in FROM.
5.1.FROM subqueries
Let's examine the home_goal average for every team in the database. First, you will create the
query that will become your subquery. This query here selects the team's long name from the
"team" table, and the AVG of home_goal column from the "match" table. The team table is left
joined onto the "match" table using hometeam_id, which will give you the identity of the home
team. The query is then filtered by season and grouped by team. The results look like this -- an
average value calculated for each team in the table.
5.1.1. … to main queries!
In order to get only the top team as a final result, place this ENTIRE query, without the
semicolon, inside the FROM statement of an outer query and give it an alias. The outer query can
then select from the subquery as if it were a table.
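As a rough sketch of that wrapping step (again assuming PostgreSQL, made-up sample rows, and the hypothetical column names team_api_id and team_long_name), the per-team averages become a subquery in FROM, and the outer query keeps the 3 highest values. The alias team_avgs is an arbitrary choice; use LIMIT 1 if you only want the top team.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real tables
CREATE TEMP TABLE team (team_api_id INT, team_long_name TEXT);
CREATE TEMP TABLE match (id INT, season TEXT, hometeam_id INT, home_goal INT);
INSERT INTO team VALUES (1, 'Team A'), (2, 'Team B'), (3, 'Team C'), (4, 'Team D');
INSERT INTO match VALUES
    (1, '2011/2012', 1, 3), (2, '2011/2012', 1, 1),
    (3, '2011/2012', 2, 0), (4, '2011/2012', 3, 4),
    (5, '2011/2012', 4, 1);

-- Outer query: keep the 3 highest averages
SELECT team, home_avg
FROM (SELECT
          t.team_long_name AS team,
          AVG(m.home_goal) AS home_avg
      FROM match AS m
      LEFT JOIN team AS t
          ON m.hometeam_id = t.team_api_id
      WHERE m.season = '2011/2012'
      GROUP BY t.team_long_name) AS team_avgs   -- the subquery needs an alias
ORDER BY home_avg DESC
LIMIT 3;
ROLLBACK;   -- discard the sample tables
```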
If you're interested in filtering data from one of these tables, you can also create a subquery from
one of the tables, and then join it to an existing table in the database. A subquery in FROM is an
effective way of answering detailed questions that require filtering or transforming data before
including it in your final results.
Your goal in this exercise is to generate a subquery using the match table, and then join that
subquery to the country table to calculate information about matches with 10 or more goals in
total!
Instructions 1/2:
• Create the subquery to be used in the next step, which selects the country ID and match
ID (id) from the match table.
• Filter the query for matches with greater than or equal to 10 goals.
SELECT
-- Select the country ID and match ID
country_id,
id
FROM match
-- Filter for matches with 10 or more goals in total
WHERE (home_goal + away_goal) >= 10;
Instructions 2/2:
• Construct a subquery that selects only matches with 10 or more total goals.
• Inner join the subquery onto country in the main query.
• Select name from country and count the id column from match.
SELECT
    -- Select country name and the count of match IDs
    c.name AS country_name,
    COUNT(sub.id) AS matches
FROM country AS c
-- Inner join the subquery onto country
-- Select the country id and match id columns
INNER JOIN (SELECT country_id, id
            FROM match
            -- Filter the subquery by matches with 10+ goals
            WHERE (home_goal + away_goal) >= 10) AS sub
    ON c.id = sub.country_id
GROUP BY country_name;
Exercise
Building on Subqueries in FROM
In the previous exercise, you found that England, Netherlands, Germany and Spain were the only
countries that had matches in the database where 10 or more goals were scored overall. Let's find
out some more details about those matches -- when they were played, during which seasons, and
how many of the goals were home versus away goals.
You'll notice that in this exercise, the table alias is excluded for every column selected in the
main query. This is because the main query is extracting data from the subquery, which is treated
as a single table.
Instructions:
• Complete the subquery inside the FROM clause. Select the country name from the
country table, along with the date, the home goal, the away goal, and the total goals
columns from the match table.
• Create a column in the subquery that adds home and away goals, called total_goals. This
will be used to filter the main query.
• Select the country, date, home goals, and away goals in the main query.
• Filter the main query for games with 10 or more total goals.
SELECT
    -- Select country, date, home, and away goals from the subquery
    country,
    date,
    home_goal,
    away_goal
FROM
    -- Select country name, date, home_goal, away_goal, and total goals in the subquery
    (SELECT c.name AS country,
            m.date,
            m.home_goal,
            m.away_goal,
            (m.home_goal + m.away_goal) AS total_goals
     FROM match AS m
     LEFT JOIN country AS c
         ON m.country_id = c.id) AS subq
-- Filter by total goals scored in the main query
WHERE total_goals >= 10;
6. Subqueries in SELECT
6.1.SELECTing what?
• Returns a single value
o Include aggregate values to compare to individual values
• Used in mathematical calculations
o Deviation from the average
Subqueries in SELECT are used to return a single, aggregate value. This can be fairly useful,
since, as you'll recall, you cannot include an aggregate value in an ungrouped SQL query.
Subqueries in SELECT are one way to get around that. Subqueries in SELECT are also useful
when performing complex mathematical calculations on information in your data set. For
example, you may want to see how much an individual score deviates from an average -- say,
how much higher than the average is this individual score?
6.2.Subqueries in SELECT
Including a subquery in SELECT is fairly simple, and is set up the same way you set up
subqueries in the WHERE and FROM clauses. Let's say we want to create a column to compare
the total number of matches played in each season to the total number of matches played
OVERALL. We can first calculate the overall count of matches across all seasons, which is
12,837.
We can then add that single number to the SELECT statement, which yields the following
results...
...or, we can skip that step, and add the subquery directly to the SELECT statement to get
identical results.
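Here is a small sketch of the second option, assuming PostgreSQL and a made-up three-row match table in place of the real one. The subquery in SELECT returns the overall count once, and that single value is repeated next to each season's own count.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real match table
CREATE TEMP TABLE match (id INT, season TEXT);
INSERT INTO match VALUES
    (1, '2011/2012'),
    (2, '2011/2012'),
    (3, '2012/2013');

SELECT
    season,
    COUNT(id) AS matches,
    -- Scalar subquery: overall count across all seasons
    (SELECT COUNT(id) FROM match) AS total_matches
FROM match
GROUP BY season;
ROLLBACK;   -- discard the sample table
```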
6.3.SELECT subqueries for mathematical calculations
Subqueries in SELECT are also incredibly useful for calculations with the data you are querying.
The single value returned by a subquery in select can be used to calculate information based on
existing information in a database. For example, the overall average number of goals scored in a
match across all seasons is 2.72. If you want to calculate the difference from the average in any
given match, you can either calculate this number ahead of time in a separate query, and input
the value into the SELECT statement...
...or you can use a subquery that calculates this value for you in your SELECT statement, and
subtract it from the total goals in that match. Overall, this second option can save you a lot of
time and errors in your work, and the results you see here, are identical to calculating the result
manually.
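A minimal sketch of that subtraction, assuming PostgreSQL and a few made-up rows in place of the real match table. The subquery computes the overall average, and the main query subtracts it from each match's total goals.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real match table
CREATE TEMP TABLE match (id INT, date DATE, home_goal INT, away_goal INT);
INSERT INTO match VALUES
    (1, '2011-08-13', 2, 1),
    (2, '2011-08-14', 0, 0),
    (3, '2011-08-20', 4, 2);

SELECT
    date,
    (home_goal + away_goal) AS goals,
    -- Difference between this match and the overall average
    (home_goal + away_goal) -
        (SELECT AVG(home_goal + away_goal)
         FROM match) AS diff
FROM match;
ROLLBACK;   -- discard the sample table
```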
6.4.SELECT subqueries – things to keep in mind
• Need to return a SINGLE value
o Will generate an error otherwise
• Make sure you have all filters in the right place
o You will probably need to filter both the main query and the subquery
There are a few unique considerations when working with subqueries in SELECT. The first is
that the subquery needs to return a single value. If your subquery result returns multiple rows,
your entire query will generate an error. This is because the information retrieved in a SELECT
query is applied identically to each row in the data set -- and that's not possible if there's more
than one unit of information. The second thing to keep an eye on is the correct placement of
your data's filters in both the main query and the subquery. Here is the query from the previous
slide. Since the subquery is processed before the main query, you'll need to include relevant
filters in the subquery as well as the main query. Without the WHERE clause you see here in the
subquery, the number returned would have been the overall average across all seasons rather
than in the 2011/2012 season.
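The slide's query is not included in these notes, so the following is only a sketch of the filter placement being described, assuming PostgreSQL and made-up sample rows. The season filter appears twice: once in the subquery and once in the main query.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real match table
CREATE TEMP TABLE match (id INT, season TEXT, home_goal INT, away_goal INT);
INSERT INTO match VALUES
    (1, '2011/2012', 2, 1),
    (2, '2011/2012', 0, 1),
    (3, '2012/2013', 5, 4);   -- must not influence the 2011/2012 average

SELECT
    id,
    (home_goal + away_goal) AS goals,
    (home_goal + away_goal) -
        (SELECT AVG(home_goal + away_goal)
         FROM match
         WHERE season = '2011/2012') AS diff   -- filter in the subquery
FROM match
WHERE season = '2011/2012';                    -- same filter in the main query
ROLLBACK;   -- discard the sample table
```

Dropping the inner WHERE would silently compare each 2011/2012 match to the average of all seasons instead.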
Exercise
Add a subquery to the SELECT clause
Subqueries in SELECT statements generate a single value that allows you to pass an aggregate
value to every row of your results. This is useful for performing calculations on data within your database.
In the following exercise, you will construct a query that calculates the average number of goals
per match in each country's league.
Instructions:
• In the subquery, select the average total goals by adding home_goal and away_goal.
• Filter the results so that only the average of goals in the 2013/2014 season is calculated.
• In the main query, select the average total goals by adding home_goal and away_goal.
This calculates the average goals for each league.
• Filter the results in the main query the same way you filtered the subquery. Group the
query by the league name.
SELECT
    l.name AS league,
    -- Select and round the league's total goals
    ROUND(AVG(m.home_goal + m.away_goal), 2) AS avg_goals,
    -- Select & round the average total goals for the season
    (SELECT ROUND(AVG(home_goal + away_goal), 2)
     FROM match
     WHERE season = '2013/2014') AS overall_avg
FROM league AS l
LEFT JOIN match AS m
    ON l.country_id = m.country_id
-- Filter for the 2013/2014 season
WHERE season = '2013/2014'
GROUP BY league;
Exercise
Subqueries in Select for Calculations
Subqueries in SELECT are a useful way to create calculated columns in a query. A subquery
in SELECT can be treated as a single numeric value to use in your calculations. When writing
subqueries in SELECT, it's important to remember that filtering the main query does not filter the
subquery -- and vice versa.
In the previous exercise, you created a column to compare each league's average total goals to
the overall average goals in the 2013/2014 season. In this exercise, you will add a column that
directly compares these values by subtracting the overall average, calculated in the subquery,
from each league's average.
Instructions
• Select the average goals scored in a match for each league in the main query.
• Select the average goals scored in a match overall for the 2013/2014 season in the
subquery.
• Subtract the subquery from the average number of goals calculated for each league.
• Filter the main query so that only games from the 2013/2014 season are included.
SELECT
-- Select the league name and average goals scored
l.name AS league,
ROUND(AVG(m.home_goal + m.away_goal),2) AS avg_goals,
-- Subtract the overall average from the league average
ROUND(AVG(m.home_goal + m.away_goal) -
(SELECT AVG(home_goal + away_goal)
FROM match
WHERE season = '2013/2014'),2) AS diff
FROM league AS l
LEFT JOIN match AS m
ON l.country_id = m.country_id
-- Only include 2013/2014 results
WHERE season = '2013/2014'
GROUP BY l.name;
7. Subqueries everywhere! And best practices!
7.1.As many subqueries as you want …
7.2.Format your queries
The best practice you can start early on in your SQL journey is properly formatting your queries.
It's important to properly line up your SELECT, FROM, GROUP BY, and WHERE statements,
and all of the information contained in them. This way, you and others you work with can return
to a saved query and easily tell if these statements are part of a main query, or a subquery.
7.3.Annotate your queries
It's also considered best practice to annotate your queries with comments that tell the reader
what the query does. You can use a multi-line comment, which opens with a forward slash and a
star (/*) and closes with a star and a forward slash (*/).
You can also use in-line comments, which start with two dashes (--). Everything after the two
dashes on that line is treated as text, even if it's a recognized SQL command.
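To make both comment styles concrete, here is a short sketch, assuming PostgreSQL and a made-up match table. The commented-out SELECT on the last lines is never executed.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real match table
CREATE TEMP TABLE match (id INT, season TEXT);
INSERT INTO match VALUES (1, '2011/2012'), (2, '2012/2013');

/* Multi-line comment:
   this query returns the number of matches
   played in each season. */
SELECT
    season,              -- in-line comment: one row per season
    COUNT(id) AS matches
FROM match
GROUP BY season;

-- The next line is ignored, even though it is valid SQL:
-- SELECT * FROM match;
ROLLBACK;   -- discard the sample table
```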
7.4.Indent your queries
• Indent your subqueries!
Additionally, make sure that you properly indent all information contained within a subquery.
That way, you can easily return to the query and understand what information is being processed
first, where you need to apply changes, such as to a range of dates, and what you can expect from
your results if you make those changes.
Make sure that you clearly indent all information that's part of a single column, such as a long
CASE statement, or a complicated subquery in SELECT. In order to best keep track of all the
conditions necessary to set up each WHEN clause, each THEN clause, and how they create the
column outcome, it's important to clearly indent each piece of information in the statement.
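As an illustration of that layout, here is a sketch of an indented CASE statement, assuming PostgreSQL and a few made-up rows. Each WHEN ... THEN pair sits on its own line, indented inside CASE ... END, so the conditions that build the outcome column are easy to scan.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real match table
CREATE TEMP TABLE match (id INT, home_goal INT, away_goal INT);
INSERT INTO match VALUES (1, 2, 1), (2, 0, 3), (3, 1, 1);

SELECT
    id,
    CASE
        WHEN home_goal > away_goal THEN 'Home win'
        WHEN home_goal < away_goal THEN 'Away win'
        ELSE 'Tie'
    END AS outcome
FROM match;
ROLLBACK;   -- discard the sample table
```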
Overall, I highly recommend you read Holywell's SQL Style Guide to get a sense of all the
formatting conventions when working with SQL queries.
7.5.Is that subquery necessary?
• Subqueries require computing power
o How big is your database?
o How big is the table you’re querying from?
• Is the subquery actually necessary?
When deciding whether or not you need a subquery, it's important to know that each subquery
you add requires additional computing power to generate your results. Depending on the size of
your database and the number of records you extract in your query, you may significantly
increase the amount of time it takes to run your query. So it's always worth asking whether or not
a specific subquery is necessary to get the results you need.
7.6.Properly filter each subquery!
• Watch your filters!
Finally, when constructing a main query with multiple subqueries, make sure that your filters are
properly placed in every subquery, and the main query, in order to generate accurate results. The
query here, for example, filters for the 2013/2014 season in 3 places -- once in the SELECT
subquery, once in the WHERE subquery, and once in the main query. This ensures that all data
returned is only about matches from the 2013/2014 season.
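The query from the slide is not shown in these notes. As a simplified sketch of the same idea (assuming PostgreSQL and made-up sample rows), the query below repeats the 2013/2014 filter in the SELECT subquery, in the WHERE subquery, and in the main query.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real match table
CREATE TEMP TABLE match (id INT, season TEXT, home_goal INT, away_goal INT);
INSERT INTO match VALUES
    (1, '2013/2014', 3, 2),
    (2, '2013/2014', 0, 1),
    (3, '2012/2013', 6, 5);   -- other season, excluded everywhere

SELECT
    id,
    (home_goal + away_goal) AS goals,
    (SELECT AVG(home_goal + away_goal)
     FROM match
     WHERE season = '2013/2014') AS season_avg    -- filter 1: SELECT subquery
FROM match
WHERE season = '2013/2014'                        -- filter 2: main query
  AND (home_goal + away_goal) >
      (SELECT AVG(home_goal + away_goal)
       FROM match
       WHERE season = '2013/2014');               -- filter 3: WHERE subquery
ROLLBACK;   -- discard the sample table
```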
Exercise
ALL the subqueries EVERYWHERE
In soccer leagues, games are played at different stages. Winning teams progress from one stage to the
next, until they reach the final stage. In each stage, the stakes become higher than the previous one. The
match table includes data about the different stages that each match took place in.
In this lesson, you will build a final query across 3 exercises that will contain three subqueries -- one in
the SELECT clause, one in the FROM clause, and one in the WHERE clause. In the final exercise, your
query will extract data examining the average goals scored in each stage of a match. Does the average
number of goals scored change as the stakes get higher from one stage to the next?
Instructions
• Extract the average number of home and away team goals in two SELECT subqueries.
• Calculate the average home and away goals for the specific stage in the main query.
• Filter both subqueries and the main query so that only data from the 2012/2013 season is
included.
• Group the query by the m.stage column.
SELECT
-- Select the stage and average goals for each stage
m.stage,
ROUND(AVG(m.home_goal + m.away_goal),2) AS avg_goals,
-- Select the average overall goals for the 2012/2013 season
ROUND((SELECT AVG(home_goal + away_goal)
FROM match
WHERE season = '2012/2013'),2) AS overall
FROM match AS m
-- Filter for the 2012/2013 season
WHERE m.season = '2012/2013'
-- Group by stage
GROUP BY m.stage;
Exercise
Add a subquery in FROM
In the previous exercise, you created a data set listing the average home and away goals in each
match stage of the 2012/2013 match season.
In this next step, you will turn the main query into a subquery to extract a list of stages where the
average home goals in a stage is higher than the overall average for home goals in a match.
Instructions:
• Calculate the average home goals and average away goals from the match table for each
stage in the FROM clause subquery.
• Add a subquery to the WHERE clause that calculates the overall average home goals.
• Filter the main query for stages where the average home goals is higher than the overall
average.
• Select the stage and avg_goals columns from the s subquery into the main query.
SELECT
    -- Select the stage and average goals from the subquery
    s.stage,
    ROUND(s.avg_goals,2) AS avg_goals
FROM
    -- Select the stage and average goals in 2012/2013
    (SELECT
         stage,
         AVG(home_goal + away_goal) AS avg_goals
     FROM match
     WHERE season = '2012/2013'
     GROUP BY stage) AS s
WHERE
    -- Filter the main query using the subquery
    s.avg_goals > (SELECT AVG(home_goal + away_goal)
                   FROM match WHERE season = '2012/2013');
Exercise
Add a subquery in SELECT
In the previous exercise, you added a subquery to the FROM statement and selected the stages
where the number of average goals in a stage exceeded the overall average number of goals in
the 2012/2013 match season. In this final step, you will add a subquery in SELECT to compare
the average number of goals scored in each stage to the total.
Instructions:
• Create a subquery in SELECT that yields the average goals scored in the 2012/2013
season. Name the new column overall_avg.
• Create a subquery in FROM that calculates the average goals scored in each stage during
the 2012/2013 season.
• Filter the main query for stages where the average goals exceeds the overall average in
2012/2013.
SELECT
    -- Select the stage and average goals from s
    s.stage,
    ROUND(s.avg_goals,2) AS avg_goal,
    -- Select the overall average for 2012/2013
    (SELECT AVG(home_goal + away_goal)
     FROM match
     WHERE season = '2012/2013') AS overall_avg
FROM
    -- Select the stage and average goals in 2012/2013 from match
    (SELECT
         stage,
         AVG(home_goal + away_goal) AS avg_goals
     FROM match
     WHERE season = '2012/2013'
     GROUP BY stage) AS s
WHERE
    -- Filter the main query using the subquery
    s.avg_goals > (SELECT AVG(home_goal + away_goal)
                   FROM match WHERE season = '2012/2013');
8. Correlated subqueries
• Uses values from the outer query to generate a result
• Re-run for every row generated in the final data set
• Used for advanced joining, filtering, and evaluating data
Correlated subqueries are a special kind of subquery that uses values from the outer query in order
to generate the final results. The subquery is re-executed each time a new row in the final data
set is returned, in order to properly generate each new piece of information. Correlated
subqueries are used for special types of calculations, such as advanced joining, filtering, and
evaluating of data in the database.
8.1.A simple example
• Which match stages tend to have a higher-than-average number of goals scored?
You achieved this using 3 simple subqueries in the SELECT, FROM, and WHERE statements.
However, the same output can also be produced with a correlated subquery. Let's focus on the
subquery in the WHERE statement.
8.2.A correlated example
This query has only one difference -- instead of including a filter by season, the WHERE clause
filters for data where the outer table's match stage, pulled from the subquery in FROM, is
HIGHER than the overall average generated in the WHERE subquery. The entire WHERE
statement is saying, in essence, "return stages where the values in the subquery are higher than
the average."
Here are the results generated by this query. This may seem a bit complicated, but with a few
more examples and a bit of practice, you will start to get the hang of how useful correlated
subqueries can be.
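The correlated query itself is not reproduced in these notes, so the sketch below is only one reading of the description above, assuming PostgreSQL and made-up sample rows. The assumption is that the WHERE subquery drops the season filter and instead compares the outer stage (s.stage, from the subquery in FROM) with the stage of each row in the inner match table.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real match table
CREATE TEMP TABLE match (id INT, season TEXT, stage INT, home_goal INT, away_goal INT);
INSERT INTO match VALUES
    (1, '2012/2013', 1, 1, 0),
    (2, '2012/2013', 2, 2, 2),
    (3, '2012/2013', 3, 3, 2);

SELECT
    s.stage,
    ROUND(s.avg_goals, 2) AS avg_goals
FROM (SELECT
          stage,
          AVG(home_goal + away_goal) AS avg_goals
      FROM match
      WHERE season = '2012/2013'
      GROUP BY stage) AS s
WHERE s.avg_goals > (SELECT AVG(home_goal + away_goal)
                     FROM match AS m
                     -- Correlation: s.stage comes from the outer query,
                     -- so this subquery is re-evaluated for each row of s
                     WHERE s.stage > m.stage);
ROLLBACK;   -- discard the sample table
```

Note that the inner query cannot run on its own, because s.stage only exists in the outer query. For the lowest stage there are no earlier stages to average, so the subquery returns NULL and that row is filtered out.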
8.3.Simple vs. correlated subqueries
Simple Subquery
• Can be run independently from the main query
• Evaluated once in the whole query
Correlated Subquery
• Dependent on the main query to execute
• Evaluated in loops
o Significantly slows down query runtime
Let's quickly walk through some key differences between simple and correlated subqueries.
Simple subqueries can be used in extracting, structuring or filtering information, and can run
independent of the main query. In contrast, a correlated subquery cannot be executed on its own
because it's dependent on values in the main query. Additionally, a simple subquery is evaluated
once in the entire statement. A correlated subquery is evaluated in loops -- once for each row
generated by the data set. This means that adding correlated subqueries will slow down your
query performance, since your query is recalculating information over and over. Be careful not to
include too many correlated subqueries -- or your query may take a long time to run!
8.4.Correlated subqueries
• What is the average number of goals scored in each country?
Here's another, smaller example of a query in which you can use a correlated subquery. Let's
answer the question, "What is the average number of goals scored in each country across all
match seasons?" This is an easy enough question, right? You simply join the match table to
the country table on the country's id, and extract the country's name, take an average of the goals
scored, and group the entire query by the country's name, yielding one row with an average value
per country.
A correlated subquery can be used here in lieu of a join. Take a look at the outer query first. The
name of the country is selected from the country table, aliased as "c". The second column
selected is a scalar subquery, selecting the average total goals scored across all seasons from the
match table. You'll notice that the WHERE clause asks SQL to return values where the inner,
match table's country_id column matches the c.id column in the outer query's country table. This
way, the entire join is replaced, and the results are identical.
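Here is a sketch of the two versions side by side, assuming PostgreSQL and small made-up country and match tables. The first query uses the join; the second replaces it with a correlated scalar subquery that matches m.country_id to c.id.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real tables
CREATE TEMP TABLE country (id INT, name TEXT);
CREATE TEMP TABLE match (id INT, country_id INT, home_goal INT, away_goal INT);
INSERT INTO country VALUES (1, 'England'), (2, 'Spain');
INSERT INTO match VALUES
    (1, 1, 2, 1),
    (2, 1, 0, 1),
    (3, 2, 3, 3);

-- Version 1: join, then group by country
SELECT
    c.name AS country,
    AVG(m.home_goal + m.away_goal) AS avg_goals
FROM country AS c
LEFT JOIN match AS m
    ON c.id = m.country_id
GROUP BY c.name;

-- Version 2: correlated subquery in SELECT, no join needed
SELECT
    c.name AS country,
    (SELECT AVG(m.home_goal + m.away_goal)
     FROM match AS m
     WHERE m.country_id = c.id) AS avg_goals   -- refers to the outer row
FROM country AS c;
ROLLBACK;   -- discard the sample tables
```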
Exercise
Basic Correlated Subqueries
Correlated subqueries are subqueries that reference one or more columns in the main query.
Correlated subqueries depend on information in the main query to run, and thus, cannot be
executed on their own.
Correlated subqueries are evaluated in SQL once per row of data retrieved -- a process that takes
a lot more computing power and time than a simple subquery.
In this exercise, you will practice using correlated subqueries to examine matches with scores
that are extreme outliers for each country -- above 3 times the average score!
Instructions:
• Select the country_id, date, home_goal, and away_goal columns in the main query.
• Complete the AVG value in the subquery.
• Complete the subquery column references, so that country_id is matched in the main and
subquery.
SELECT
    -- Select country ID, date, home, and away goals from match
    main.country_id,
    main.date,
    main.home_goal,
    main.away_goal
FROM match AS main
WHERE
    -- Filter the main query by the subquery
    (home_goal + away_goal) >
        (SELECT AVG((sub.home_goal + sub.away_goal) * 3)
         FROM match AS sub
         -- Join the main query to the subquery in WHERE
         WHERE main.country_id = sub.country_id);
Exercise
Correlated subquery with multiple conditions
Correlated subqueries are useful for matching data across multiple columns. In the previous
exercise, you generated a list of matches with extremely high scores for each country. In this
exercise, you're going to add an additional column for matching to answer the question -- what
was the highest scoring match for each country, in each season?
*Note: this query may take a while to load.
Instructions:
• Select the country_id, date, home_goal, and away_goal columns in the main query.
• Complete the subquery: Select the matches with the highest number of total goals.
• Match the subquery to the main query using country_id and season.
• Fill in the correct logical operator so that total goals equals the max goals recorded in the
subquery.
SELECT
    -- Select country ID, date, home, and away goals from match
    main.country_id,
    main.date,
    main.home_goal,
    main.away_goal
FROM match AS main
WHERE
    -- Filter for matches with the highest number of goals scored
    (home_goal + away_goal) =
        (SELECT MAX(sub.home_goal + sub.away_goal)
         FROM match AS sub
         WHERE main.country_id = sub.country_id
               AND main.season = sub.season);
9. Nested subqueries
9.1.Nested subqueries?
• Subquery inside another subquery
• Perform multiple layers of transformation
Nested subqueries are exactly as they sound -- subqueries nested inside other subqueries. As you
saw in the previous chapter, information in a database is often not in the format you need to
answer a question. Some types of questions you answer may require multiple layers of
transformation and filtering of data before you extract it into the main query.
9.2.A subquery…
• How much did each country’s average differ from the overall average?
Let's start with an example. The query you see here is similar to a previous lesson where we
selected the average number of goals scored in a match within each country, and compared it to
the overall average using a subquery in SELECT. This third column calculates the difference
between each country's average and the overall average.
The resulting table looks like this, with one row for each country, and one column for each of the
two calculations.
9.3.…inside a subquery!
• How does each month’s total goals differ from the average monthly total of goals scored?
Let's answer a similar question with an additional layer -- How does each month's total goals
differ from the monthly average of goals scored? The query here, similar to the previous one,
answers this question. Let's take some time to walk through the necessary steps to get this result.
9.4.Inner subquery
The subquery logic reads like this -- first, select the sum of goals scored in each month. The
month is queried using the EXTRACT function, FROM the date. Here are the results of that first,
inner subquery, which includes results for months 1 through 12.
9.5.Outer subquery
Next, you can place the subquery into the second, outer subquery to calculate an average of the
values generated in the previous table, giving you the average monthly goals scored. Since this
result is a scalar subquery, you can now place it in the main query for calculating the final data
set.
Finally, you can place the entire nested subquery in the SELECT statement, giving you a scalar
value to compare to the SUM of goals scored in each month. Here are the first 4 rows of the final
query, which generates a sum of goals scored in the month, and a column showing the difference
between that sum and the overall monthly average.
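Putting the three steps together could look like the sketch below, assuming PostgreSQL (for EXTRACT) and a handful of made-up rows. The aliases match_month, goals, and monthly are arbitrary choices for this illustration.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real match table
CREATE TEMP TABLE match (id INT, date DATE, home_goal INT, away_goal INT);
INSERT INTO match VALUES
    (1, '2012-01-10', 2, 1),
    (2, '2012-01-20', 0, 0),
    (3, '2012-02-05', 3, 2),
    (4, '2012-03-15', 1, 0);

SELECT
    EXTRACT(MONTH FROM date) AS match_month,
    SUM(home_goal + away_goal) AS total_goals,
    SUM(home_goal + away_goal) -
        -- Outer subquery: average of the monthly totals (a scalar value)
        (SELECT AVG(goals)
         FROM (
             -- Inner subquery: total goals per month
             SELECT
                 EXTRACT(MONTH FROM date) AS match_month,
                 SUM(home_goal + away_goal) AS goals
             FROM match
             GROUP BY match_month) AS monthly) AS diff
FROM match
GROUP BY match_month;
ROLLBACK;   -- discard the sample table
```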
It has a second, nested subquery inside the SELECT statement, and the outer subquery has a
statement correlating with the main query.
Exercise
Nested simple subqueries
Nested subqueries can be either simple or correlated.
Just like an unnested subquery, a nested subquery's components can be executed independently
of the outer query, while a correlated subquery requires both the outer and inner subquery to run
and produce results.
In this exercise, you will practice creating a nested subquery to examine the highest total number
of goals in each season, overall, and during July across all seasons.
Instructions:
• Complete the main query to select the season and the max total goals in a match for each
season. Name this max_goals.
• Complete the first simple subquery to select the max total goals in a match across all
seasons. Name this overall_max_goals.
• Complete the nested subquery to select the maximum total goals in a match played in
July across all seasons.
• Select the maximum total goals in the outer subquery. Name this entire subquery
july_max_goals.
SELECT
    -- Select the season and max goals scored in a match
    season,
    MAX(home_goal + away_goal) AS max_goals,
    -- Select the overall max goals scored in a match
    (SELECT MAX(home_goal + away_goal) FROM match) AS overall_max_goals,
    -- Select the max number of goals scored in any match in July
    (SELECT MAX(home_goal + away_goal)
     FROM match
     WHERE id IN (
         SELECT id FROM match WHERE EXTRACT(MONTH FROM date) = 07)) AS july_max_goals
FROM match
GROUP BY season;
Exercise
Nest a subquery in FROM
What's the average number of matches per season where a team scored 5 or more goals? How
does this differ by country?
Let's use a nested subquery in FROM to perform this operation. In the real world, you will
probably find that nesting multiple subqueries is a task you don't have to perform often. In some
cases, however, you may find yourself struggling to properly group by the column you want, or
to calculate information requiring multiple mathematical transformations (i.e., an AVG of a
COUNT).
Nesting subqueries and performing your transformations one step at a time, adding it to a
subquery, and then performing the next set of transformations is often the easiest way to yield
accurate information about your data. Let's get to it!
Instructions 1/3:
• Generate a list of matches where at least one team scored 5 or more goals.
-- Select matches where a team scored 5+ goals
SELECT
    country_id,
    season,
    id
FROM match
WHERE home_goal >= 5 OR away_goal >= 5;
Instructions 2/3:
• Turn the query from the previous step into a subquery in the FROM statement.
• COUNT the match ids generated in the previous step, and group the query by country_id
and season.
-- Count match ids
SELECT
    country_id,
    season,
    COUNT(subquery.id) AS matches
-- Set up and alias the subquery
FROM (
    SELECT
        country_id,
        season,
        id
    FROM match
    WHERE home_goal >= 5 OR away_goal >= 5)
    AS subquery
-- Group by country_id and season
GROUP BY country_id, season;
Instructions 3/3:
• Finally, declare the same query from step 2 as a subquery in FROM with the alias
outer_s.
• Left join it to the country table using the outer query's country_id column.
• Calculate an AVG of high scoring matches per country in the main query.
SELECT
    c.name AS country,
    -- Calculate the average matches per season
    AVG(outer_s.matches) AS avg_seasonal_high_scores
FROM country AS c
-- Left join outer_s to country
LEFT JOIN (
    SELECT country_id, season,
           COUNT(id) AS matches
    FROM (
        SELECT country_id, season, id
        FROM match
        WHERE home_goal >= 5 OR away_goal >= 5) AS inner_s
    -- Close parentheses and alias the subquery
    GROUP BY country_id, season) AS outer_s
    ON c.id = outer_s.country_id
GROUP BY country;
10. Common Table Expressions
10.1. When adding subqueries…
• Query complexity increases quickly!
o Information can be difficult to keep track of
Solution: Common Table Expressions!
As you probably noticed, the queries we have been setting up are quickly becoming long and
complex. It can become difficult to clearly keep track of each piece of your query, why you need
it, and whether or not it's necessary. In this lesson, we'll cover a common method for improving
readability and accessibility of information in subqueries -- the common table expression.
Setting up CTEs
Common table expressions, or CTEs, are a special type of subquery that is declared ahead of your
main query, just like you see here. Instead of wrapping a subquery inside, say, the FROM
statement, you name it using the WITH statement, and then reference it by name later in the
FROM statement as if it were any other table in your database.
10.2. Take a subquery in FROM
Let's rewrite a query from an exercise that you completed in chapter 2, by using a CTE. The
query you see here uses a subquery, s, in the FROM statement to generate a list of country id's
and match IDs that meet certain criteria -- specifically, we only wanted matches with 10 or
more goals scored in total. This subquery is then joined to the country table, and the number of
matches in the subquery is counted in the main query. Here are the results of that query -- a short
list of countries with very few high-scoring matches.
10.3. Place it at the beginning
In order to rewrite this query using a common table expression to represent the subquery, simply
take the subquery out of the FROM clause, place it at the beginning of your query, declare it
using the syntax WITH, followed by a CTE name, and AS. So, here we're starting our CTE, s, by
stating WITH s AS, and then placing the subquery inside parentheses. It's now a common table
expression!
10.4. Show me the CTE
Finally, complete the rest of the query the same way you would if the CTE were an existing table
in the database. You select the country name from the country table, count the number of
matches in the CTE "s", JOIN "s" to the country table, and then group the results by the country
name's alias.
The results -- you guessed it -- are identical to the previous query setup!
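The rewritten query is not reproduced in these notes; a sketch of it, assuming PostgreSQL and small made-up country and match tables, could look like this. The subquery moves out of FROM into a WITH clause named s, and the main query joins to s as if it were a table.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real tables
CREATE TEMP TABLE country (id INT, name TEXT);
CREATE TEMP TABLE match (id INT, country_id INT, home_goal INT, away_goal INT);
INSERT INTO country VALUES (1, 'England'), (2, 'Spain');
INSERT INTO match VALUES
    (1, 1, 7, 3),    -- 10 goals in total
    (2, 1, 1, 0),
    (3, 2, 8, 4);    -- 12 goals in total

-- Declare the CTE ahead of the main query
WITH s AS (
    SELECT country_id, id
    FROM match
    WHERE (home_goal + away_goal) >= 10
)
-- Use the CTE like any other table
SELECT
    c.name AS country,
    COUNT(s.id) AS matches
FROM country AS c
INNER JOIN s
    ON c.id = s.country_id
GROUP BY c.name;
ROLLBACK;   -- discard the sample tables
```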
If you have multiple subqueries that you want to turn into a common table expression, you can
simply list them one after another, with a comma in between each CTE, and NO comma after the
last one. You can then retrieve the information you need into the main query -- just make sure
you properly join this second CTE as well!
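As a sketch of the comma rule, assuming PostgreSQL and made-up sample rows, the query below declares two CTEs. The names high_scoring and all_matches are hypothetical; the second CTE is joined into the main query just like the first.

```sql
BEGIN;
-- Hypothetical sample data standing in for the real tables
CREATE TEMP TABLE country (id INT, name TEXT);
CREATE TEMP TABLE match (id INT, country_id INT, home_goal INT, away_goal INT);
INSERT INTO country VALUES (1, 'England'), (2, 'Spain');
INSERT INTO match VALUES
    (1, 1, 7, 3),
    (2, 1, 1, 0),
    (3, 2, 8, 4);

WITH high_scoring AS (
    SELECT country_id, id
    FROM match
    WHERE (home_goal + away_goal) >= 10
),                                   -- comma between CTEs
all_matches AS (
    SELECT country_id, COUNT(id) AS total
    FROM match
    GROUP BY country_id
)                                    -- NO comma after the last CTE
SELECT
    c.name AS country,
    COUNT(h.id) AS high_scoring_matches,
    a.total AS total_matches
FROM country AS c
INNER JOIN high_scoring AS h
    ON c.id = h.country_id
LEFT JOIN all_matches AS a           -- join the second CTE as well
    ON c.id = a.country_id
GROUP BY c.name, a.total;
ROLLBACK;   -- discard the sample tables
```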
10.5. Why use CTEs?
• Executed once
o CTE is then stored in memory
o Improves query performance
• Improving organization of queries
• Referencing other CTEs
• Referencing itself (SELF JOIN)
So why are we learning yet another method of producing the same result in a SQL query?
Common table expressions have numerous benefits over a subquery written inside your main
query. First, in many database systems the CTE is run only once and then stored in memory, so it
often leads to an improvement in the amount of time it takes to run your query. Second, CTEs are an excellent
tool for organizing long and complex queries. You can declare as many CTEs as you need, one
after another. You can also reference information in CTEs declared earlier. For example, if you
have 3 CTEs in a query, your third CTE can retrieve information from the first and second CTE.
Finally, a CTE can reference itself in a special kind of table called a recursive CTE. We'll briefly
discuss some more advanced applications of CTEs in the next lesson.
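As a small illustration of one CTE reading from an earlier one, the sketch below declares two CTEs separated by a comma. The CTE names (totals, season_avg) and the four sample rows are hypothetical, and the query runs against an in-memory SQLite table through Python's sqlite3 module rather than the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match (id INTEGER, season TEXT,
                        home_goal INTEGER, away_goal INTEGER);
    INSERT INTO match VALUES
        (1, '2011/2012', 2, 1), (2, '2011/2012', 0, 0),
        (3, '2012/2013', 4, 1), (4, '2012/2013', 1, 1);
""")

chained_sql = """
    WITH totals AS (
        -- First CTE: total goals per match
        SELECT id, season, (home_goal + away_goal) AS goals
        FROM match),
    season_avg AS (
        -- Second CTE: reads from the first CTE declared above
        SELECT season, AVG(goals) AS avg_goals
        FROM totals
        GROUP BY season)
    -- Main query: joins both CTEs as if they were tables
    SELECT t.id, t.goals, s.avg_goals
    FROM totals AS t
    INNER JOIN season_avg AS s ON t.season = s.season
    ORDER BY t.id;
"""

chained_rows = conn.execute(chained_sql).fetchall()
for row in chained_rows:
    print(row)
```

Note the comma after the first CTE and the absence of a comma after the last one, exactly as described above.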
Exercise
Clean up with CTEs
In chapter 2, you generated a list of countries and the number of matches in each country with
10 or more total goals. The query in that exercise utilized a subquery in the FROM statement
in order to filter the matches before counting them in the main query. Below is the query you
created:
SELECT
c.name AS country,
COUNT(sub.id) AS matches
FROM country AS c
INNER JOIN (
SELECT country_id, id
FROM match
WHERE (home_goal + away_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY country;
You can list one (or more) subqueries as common table expressions (CTEs) by declaring them
ahead of your main query, which is an excellent tool for organizing information and placing it in
a logical order.
This time, let's expand on the exercise by looking at details about matches with very high scores
using CTEs. Just like a subquery in FROM, you can join tables inside a CTE.
Instructions:
• Declare your CTE, where you create a list of all matches with the league name.
• Select the league, date, home, and away goals from the CTE.
• Filter the main query for matches with 10 or more goals.
-- Set up your CTE
WITH match_list AS (
    -- Select the league, date, home, and away goals
    SELECT
        l.name AS league,
        m.date,
        m.home_goal,
        m.away_goal,
        (m.home_goal + m.away_goal) AS total_goals
    FROM match AS m
    LEFT JOIN league AS l ON m.country_id = l.id)
-- Select the league, date, home, and away goals from the CTE
SELECT league, date, home_goal, away_goal
FROM match_list
-- Filter by total goals
WHERE total_goals >= 10;
Exercise
CTEs with nested subqueries
If you find yourself listing multiple subqueries in the FROM clause with nested statements, your
query will likely become long, complex, and difficult to read.
Since many queries are written with the intention of being saved and re-run in the future, proper
organization is key to a seamless workflow. Arranging subqueries as CTEs will save you time,
space, and confusion in the long run!
Instructions:
• Declare a CTE that calculates the total goals from matches in August of the 2013/2014
season.
• Left join the CTE onto the league table using country_id from the match_list CTE.
• Filter the list on the inner subquery to only select matches in August of the 2013/2014
season.
-- Set up your CTE
WITH match_list AS (
    SELECT
        country_id,
        (home_goal + away_goal) AS goals
    FROM match
    -- Create a list of match IDs to filter data in the CTE
    WHERE id IN (
        SELECT id
        FROM match
        WHERE season = '2013/2014' AND EXTRACT(MONTH FROM date) = 8))
-- Select the league name and average of goals in the CTE
SELECT
    l.name,
    AVG(match_list.goals)
FROM league AS l
-- Join the CTE onto the league table
LEFT JOIN match_list ON l.id = match_list.country_id
GROUP BY l.name;
11. Deciding on techniques to use
11.1. Different names for the same thing?
• Considerable overlap…
Out of the 4 techniques we just discussed, getting both the home and away team names into one
final query result can be performed using subqueries, correlated subqueries, and CTEs. Let's
practice creating similar result sets using each of these 3 methods over the next 3 exercises,
starting with subqueries in FROM.
Instructions 1/2:
• Create a query that left joins team to match in order to get the identity of the home team.
This becomes the subquery in the next step.
SELECT
    m.id,
    t.team_long_name AS hometeam
-- Left join team to match
FROM match AS m
LEFT JOIN team AS t
ON m.hometeam_id = t.team_api_id;
Instructions 2/2:
• Add a second subquery to the FROM statement to get the away team name, changing
only the hometeam_id. Left join both subqueries to the match table on the id column.
Warning: if your code is timing out, you have probably made a mistake in the JOIN and tried to
join on the wrong fields which caused the table to be too big! Read the provided code and
comments carefully, and check your ON conditions!
SELECT
m.date,
-- Get the home and away team names
home.hometeam,
away.awayteam,
m.home_goal,
m.away_goal
FROM match AS m
Getting both team names can also easily be performed using correlated subqueries. But how might that impact the
performance of your query? Complete the following steps and let's find out!
Please note that your query will run more slowly than the previous exercise!
Instructions 1/2:
• Using a correlated subquery in the SELECT statement, match the team_api_id column
from team to the hometeam_id from match.
SELECT
    m.date,
    (SELECT team_long_name
     FROM team AS t
     -- Connect the team to the match table
     WHERE t.team_api_id = m.hometeam_id) AS hometeam
FROM match AS m;
Instructions 2/2:
• Create a second correlated subquery in SELECT, yielding the away team's name.
• Select the home and away goal columns from match in the main query.
SELECT
    m.date,
    (SELECT team_long_name
     FROM team AS t
     WHERE t.team_api_id = m.hometeam_id) AS hometeam,
    -- Connect the team to the match table
    (SELECT team_long_name
     FROM team AS t
     WHERE t.team_api_id = m.awayteam_id) AS awayteam,
    -- Select home and away goals
    home_goal,
    away_goal
FROM match AS m;
Exercise
Get team names with CTEs
You've now explored two methods for answering the question, How do you get both the home
and away team names into one final query result?
Let's explore the final method - common table expressions. Common table expressions are
similar to the subquery method for generating results, mainly differing in syntax and the order in
which information is processed.
Instructions 1/3:
• Select id from match and team_long_name from team. Join these two tables together on
hometeam_id in match and team_api_id in team.
SELECT
    -- Select match id and team long name
    m.id,
    t.team_long_name AS hometeam
FROM match AS m
-- Join team to match using team_api_id and hometeam_id
LEFT JOIN team AS t
ON m.hometeam_id = t.team_api_id;
Instructions 2/3:
• Declare the query from the previous step as a common table expression. SELECT
everything from the CTE into the main query. Your results will not change at this step!
-- Declare the home CTE
WITH home AS (
    SELECT m.id, t.team_long_name AS hometeam
    FROM match AS m
    LEFT JOIN team AS t
    ON m.hometeam_id = t.team_api_id)
-- Select everything from home
SELECT *
FROM home;
Instructions 3/3
• Let's declare the second CTE, away. Join it to the first CTE on the id column.
• The date, home_goal, and away_goal columns have been added to the CTEs. SELECT
them into the main query.
WITH home AS (
    SELECT m.id, m.date,
        t.team_long_name AS hometeam, m.home_goal
    FROM match AS m
    LEFT JOIN team AS t
    ON m.hometeam_id = t.team_api_id),
-- Declare and set up the away CTE
away AS (
    SELECT m.id, m.date,
        t.team_long_name AS awayteam, m.away_goal
    FROM match AS m
    LEFT JOIN team AS t
    ON m.awayteam_id = t.team_api_id)
-- Select date, home_goal, and away_goal
SELECT
    home.date,
    home.hometeam,
    away.awayteam,
    home.home_goal,
    away.away_goal
-- Join away and home on the id column
FROM home
INNER JOIN away
ON home.id = away.id;
12. It’s OVER
12.1. Working with aggregate values
• Requires you to use GROUP BY with all non-aggregate columns
Let's tackle another limitation you've likely encountered in SQL -- the fact that you have to
group results when using aggregate functions. If you try to retrieve additional information
without grouping by every single non-aggregate value, your query will return an error. Thus, you
can't compare aggregate values to non-aggregate data.
So what's a window function? How do you use it? Let's start with a query from chapter 2, where
we answered the question, "how many goals were scored in each match in 2011/2012, and how
did that compare to the average?" This query selects two columns from the match table, and then
uses a subquery in SELECT to pass the overall average along the data set without aggregating
the results.
The same results can be generated using the clause common to all window functions -- the
OVER clause. Instead of writing a subquery, calculate the AVG of home_goal and away_goal,
and follow it with the OVER clause. This clause tells SQL to "pass this aggregate value over this
existing result set." The results are identical to the previous statement that used a subquery in
SELECT, with a simpler syntax and faster processing time.
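The equivalence between the two approaches can be checked with a short sketch. It assumes the SQLite library bundled with your Python is version 3.25 or newer (the first release with window functions) and uses a made-up four-row match table, so the numbers are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match (id INTEGER, date TEXT, season TEXT,
                        home_goal INTEGER, away_goal INTEGER);
    INSERT INTO match VALUES
        (1, '2011-08-13', '2011/2012', 2, 1),
        (2, '2011-08-20', '2011/2012', 1, 0),
        (3, '2012-05-13', '2011/2012', 3, 2),
        (4, '2012-08-18', '2012/2013', 4, 4);
""")

# Overall average passed along with a subquery in SELECT.
subquery_sql = """
    SELECT date,
           (home_goal + away_goal) AS goals,
           (SELECT AVG(home_goal + away_goal)
            FROM match
            WHERE season = '2011/2012') AS overall_avg
    FROM match
    WHERE season = '2011/2012'
    ORDER BY id;
"""

# The same average passed along with the OVER clause.
window_sql = """
    SELECT date,
           (home_goal + away_goal) AS goals,
           AVG(home_goal + away_goal) OVER() AS overall_avg
    FROM match
    WHERE season = '2011/2012'
    ORDER BY id;
"""

subquery_avg_rows = conn.execute(subquery_sql).fetchall()
window_avg_rows = conn.execute(window_sql).fetchall()
print(window_avg_rows)
```

Because the window function works on the result set, the WHERE filter only has to be written once in the OVER version, whereas the subquery version has to repeat it.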
12.4. Generate a RANK
• What is the rank of matches based on number of goals scored?
Another simple type of column you can generate with a window function is a RANK. A RANK
simply creates a column numbering your data set from highest to lowest, or lowest to highest,
based on a column that you specify. Let's take the same query as the previous example, without
the window function, and use it to answer the question -- what is the RANK of matches based on
the number of goals scored?
We can answer this using the RANK window function. In order to set this up, let's add a new
column in SELECT as you see here. To create the rank, you start with the RANK function, using
parentheses, followed by the OVER clause. Inside the OVER clause, include the ORDER BY
clause, and the column or columns you want to use to generate the rank. By default, the RANK
function orders the results and ranking from smallest to largest values. In the case of our data set
here, this isn't particularly informative.
You can easily correct this by adding the DESC keyword to reverse the order of the rank, just as
you would if you were using ORDER BY at the end of your query. You'll notice that the RANK
function automatically ties identical values, such as the first 2 results, and then skips the next
value in the rank.
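The tie-and-skip behavior is easy to see on a handful of rows. The sketch below is an illustration under stated assumptions: a hypothetical four-row match table, the alias goals_rank chosen for this example, and a SQLite build of version 3.25 or newer running through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match (id INTEGER, home_goal INTEGER, away_goal INTEGER);
    INSERT INTO match VALUES (1, 3, 2), (2, 4, 1), (3, 2, 1), (4, 1, 0);
""")

rank_sql = """
    SELECT id,
           (home_goal + away_goal) AS goals,
           -- DESC ranks the highest-scoring matches first
           RANK() OVER(ORDER BY (home_goal + away_goal) DESC) AS goals_rank
    FROM match
    ORDER BY goals_rank, id;
"""

rank_rows = conn.execute(rank_sql).fetchall()
for row in rank_rows:
    print(row)
```

Matches 1 and 2 both have 5 goals, so they share rank 1; the next match is ranked 3, not 2.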
12.5. Key differences
• Processed after every part of query except ORDER BY
o Uses information in result set rather than database
• Available in PostgreSQL, Oracle, MySQL, SQL Server…
o …and SQLite from version 3.25 (older versions lack them)
There are a few key considerations when using window functions. First, window functions are
processed after the entire query except the final ORDER BY statement. Thus, the window
function uses the result set to calculate information, as opposed to using the database directly.
Second, it's important to know that window functions are available in PostgreSQL, Oracle,
MySQL (version 8.0 and later), and SQL Server, and in SQLite only from version 3.25 onward.
Exercise
The match is OVER
The OVER() clause allows you to pass an aggregate function down a data set, similar to
subqueries in SELECT. The OVER() clause offers significant benefits over subqueries in SELECT
-- namely, your queries will run faster, and the OVER() clause has a wide range of additional
functions and clauses you can include with it that we will cover later on in this chapter.
In this exercise, you will revise some queries from previous chapters using the OVER() clause.
Instructions:
• Select the match ID, country name, season, home, and away goals from the match and
country tables.
• Complete the query that calculates the average number of goals scored overall and then
includes the aggregate value in each row using a window function.
SELECT
    -- Select the id, country name, season, home, and away goals
    m.id,
    c.name AS country,
    m.season,
    m.home_goal,
    m.away_goal,
    -- Use a window to include the aggregate average in each row
    AVG(m.home_goal + m.away_goal) OVER() AS overall_avg
FROM match AS m
LEFT JOIN country AS c ON m.country_id = c.id;
Exercise
What's OVER here?
Window functions allow you to create a RANK of information according to any variable you
want to use to sort your data. When setting this up, you will need to specify what
column/calculation you want to use to calculate your rank. This is done by including an ORDER
BY clause inside the OVER() clause. Below is an example:
SELECT
id,
RANK() OVER(ORDER BY home_goal) AS rank
FROM match;
In this exercise, you will create a data set of leagues ranked according to which ones, on
average, score the most goals in a match.
Instructions:
• Select the league name and average total goals scored from league and match.
• Complete the window function so it calculates the rank of average goals scored across all
leagues in the database.
• Order the rank by the average total of home and away goals scored.
SELECT
    -- Select the league name and average goals scored
    l.name AS league,
    AVG(m.home_goal + m.away_goal) AS avg_goals,
    -- Rank each league according to the average goals
    RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal)) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
-- Order the query by the rank you created
ORDER BY league_rank DESC;
Exercise
Flip OVER your results
In the last exercise, the rank generated in your query was organized from smallest to largest. By
adding DESC to your window function, you can create a rank sorted from largest to smallest.
SELECT
id,
RANK() OVER(ORDER BY home_goal DESC) AS rank
FROM match;
Instructions
• Complete the same parts of the query as the previous exercise.
• Complete the window function to rank each league from highest to lowest average goals
scored.
• Order the main query by the rank you just created.
SELECT
    -- Select the league name and average goals scored
    l.name AS league,
    AVG(m.home_goal + m.away_goal) AS avg_goals,
    -- Rank leagues in descending order by average goals
    RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal) DESC) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
-- Order the query by the rank you created
ORDER BY league_rank;
13. OVER with a PARTITION
• Calculate separate values for different categories
• Calculate different calculations in the same column
One important statement you can add to your OVER clause is PARTITION BY. A partition
allows you to calculate separate values for different categories established in a partition. This is
one way to calculate different aggregate values within one column of data, and pass them down a
data set, instead of having to calculate them in different columns. The syntax for a partition is
fairly simple. Just like before, use an aggregate function to compute a calculation, such as the
AVG of the home_goal column. You then add the OVER clause afterward, and inside the
parentheses, state PARTITION BY, followed by the column you want to partition the average
by, such as season. This then returns a separate average for, or PARTITIONed BY, each season.
13.1. Partition your data
• How many goals were scored in each match, and how did that compare to the overall
average?
Let's take a look at how this works in a query. This is the example query from the previous
lesson, answering the question, "How many goals were scored in each match, and how did that
compare to the overall average?" This is accomplished using the OVER clause, and the query
returns the date, goals scored, and overall average.
• How many goals were scored in each match, and how did that compare to the season’s
average?
Let's expand on the previous question, and instead ask, "How many goals were scored in each
match, and how did that compare to the season's average?" We can do this by adding a
PARTITION BY clause to the OVER clause from the previous slide. Specifying, "PARTITION
BY season" returns each season's average on each row, according to the season that each
record belongs to. As you can see, rows 1 and 2 are matches played in the 2011/2012 season, and
the season_avg column contains the 2011/2012 season average. Rows 3 and 4 are part of the
2012/2013 season, and return the 2012/2013 season average.
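Here is a minimal sketch of the difference between an empty OVER clause and one with PARTITION BY season. It assumes SQLite 3.25 or newer behind Python's sqlite3 module, and the four matches are invented for illustration rather than drawn from the course data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match (id INTEGER, season TEXT,
                        home_goal INTEGER, away_goal INTEGER);
    INSERT INTO match VALUES
        (1, '2011/2012', 2, 1), (2, '2011/2012', 1, 0),
        (3, '2012/2013', 3, 3), (4, '2012/2013', 1, 1);
""")

partition_sql = """
    SELECT id,
           season,
           (home_goal + away_goal) AS goals,
           -- One average for the whole result set
           AVG(home_goal + away_goal) OVER() AS overall_avg,
           -- A separate average for each season
           AVG(home_goal + away_goal) OVER(PARTITION BY season) AS season_avg
    FROM match
    ORDER BY id;
"""

partition_rows = conn.execute(partition_sql).fetchall()
for row in partition_rows:
    print(row)
```

Every row carries the same overall_avg, while season_avg changes between the 2011/2012 rows and the 2012/2013 rows.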
13.2. PARTITION by Multiple Columns
You can also use PARTITION to calculate values broken out by multiple columns. In the query
you see here, the OVER clause contains two columns to partition the AVG goals scored--season,
and country. The result set returns the average goals scored broken out by season and country. In
row 1, a match was played in Belgium in the 2011/2012 season, and had 1 goal scored
throughout the match. This is compared to 2.88, which is the average goals scored in
Belgium in the 2011/2012 season.
13.3. PARTITION BY considerations
• Can partition data by 1 or more columns
• Can partition aggregate calculations, ranks, etc
PARTITION BY is a pretty straightforward addition to the OVER clause. You can partition
calculations by 1 or more columns as necessary to answer a question you may have.
Additionally, you can use a PARTITION with any kind of window function -- calculation, rank,
or others that we will discuss further in the following lesson.
Exercise
PARTITION BY a column
The PARTITION BY clause allows you to calculate separate "windows" based on the columns by
which you want to divide your results. For example, you can create a single column that
calculates an overall average of goals scored for each season.
In this exercise, you will be creating a data set of games played by Legia Warszawa (Warsaw
League), the top ranked team in Poland, and comparing their individual game performance to the
overall average for that season.
Where do you see more outliers? Are they Legia Warszawa's home or away games?
Instructions
• Complete the two window functions that calculate the home and away goal averages.
Partition the window functions by season to calculate separate averages for each season.
• Filter the query to only include matches played by Legia Warszawa, id = 8673.
SELECT
    date,
    season,
    home_goal,
    away_goal,
    CASE WHEN hometeam_id = 8673 THEN 'home'
         ELSE 'away' END AS warsaw_location,
    -- Calculate the average goals scored partitioned by season
    AVG(home_goal) OVER(PARTITION BY season) AS season_homeavg,
    AVG(away_goal) OVER(PARTITION BY season) AS season_awayavg
FROM match
-- Filter the data set for Legia Warszawa matches only
WHERE
    hometeam_id = 8673
    OR awayteam_id = 8673
ORDER BY (home_goal + away_goal) DESC;
Exercise
PARTITION BY multiple columns
The PARTITION BY clause can be used to break out window averages by multiple data points
(columns). You can even calculate the information you want to use to partition your data! For
example, you can calculate average goals scored by season and by country, or by the calendar
year (taken from the date column).
In this exercise, you will calculate the average number of home and away goals scored by Legia
Warszawa, and their opponents, partitioned by the month in each season.
Instructions:
• Construct two window functions partitioning the average of home and away goals by
season and month.
• Filter the dataset by Legia Warszawa's team ID (8673) so that the window calculation
only includes matches involving them.
SELECT
    date,
    season,
    home_goal,
    away_goal,
    CASE WHEN hometeam_id = 8673 THEN 'home'
         ELSE 'away' END AS warsaw_location,
    -- Calculate average goals partitioned by season and month
    AVG(home_goal) OVER(PARTITION BY season,
        EXTRACT(MONTH FROM date)) AS season_mo_home,
    AVG(away_goal) OVER(PARTITION BY season,
        EXTRACT(MONTH FROM date)) AS season_mo_away
FROM match
WHERE
    hometeam_id = 8673
    OR awayteam_id = 8673
ORDER BY (home_goal + away_goal) DESC;
14. Sliding windows
In addition to calculating aggregate and rank information, window functions can also be used to
calculate information that changes with each subsequent row in a data set.
• Perform calculations relative to the current row
• Can be used to calculate running totals, sums, averages, etc
• Can be partitioned by one or more columns
These types of window functions are called sliding windows. Sliding windows are functions that
perform calculations relative to the current row of a data set. You can use sliding windows to
calculate a wide variety of information that aggregates one row at a time down your data set --
running totals, sums, counts, and averages in any order you need. A sliding window calculation
can also be partitioned by one or more columns, just like a non-sliding window.
14.1. Sliding window keywords
A sliding window function contains specific keywords within the OVER clause to specify the
data you want to use in your calculations. The general syntax looks like this -- you use the phrase
ROWS BETWEEN to indicate that you plan on slicing information in your window function for
each row in the data set, and then you specify the starting and finishing point of the calculation.
For the start and finish in your ROWS BETWEEN statement, you can specify a number of
keywords as shown here. PRECEDING and FOLLOWING are used to specify the number of
rows before, or after, the current row that you want to include in a calculation. UNBOUNDED
PRECEDING and UNBOUNDED FOLLOWING tell SQL that you want to include every row
from the beginning, or through the end, of the data set in your calculations. Finally, CURRENT ROW
tells SQL that you want to start or stop your calculation at the current row.
14.1.1. Example
For example, the sliding window in this query includes several key pieces of information in its
calculation. It first states that the goal is to calculate a sum of goals scored when Manchester City
played as the home team during the 2011/2012 season. It then tells you that you want to turn this
calculation into a running total, ordered by the date of the match from oldest to most recent and
calculated from the beginning of the data set to the current row. Your resulting data set looks like
this, with a column calculating the total number of goals scored across the season, with a final
total listed in the last row.
14.2. Sliding window frame
Using the PRECEDING statement, you also have the ability to calculate sliding windows with a
more limited frame. For example, the query you see here is similar to the previous one, with a
slightly modified sliding window. The phrase UNBOUNDED PRECEDING is replaced here
with the phrase 1 PRECEDING, which calculates the sum of Manchester City's goals in the
current and previous match. As you see in the data set here, the two rows in red are used to
calculate the sum on the second row, and the two rows in green are used to calculate the sum on
the third row.
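The two frames discussed above can be compared in one query. The following is a sketch only: the dates and goal counts are hypothetical (not Manchester City's actual results), the alias last_two is invented for this example, and it assumes SQLite 3.25 or newer through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match (date TEXT, home_goal INTEGER);
    INSERT INTO match VALUES
        ('2011-08-15', 4), ('2011-09-10', 3),
        ('2011-09-24', 2), ('2011-10-15', 4);
""")

sliding_sql = """
    SELECT date,
           home_goal,
           -- Running total: everything from the first row to this one
           SUM(home_goal) OVER(ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
           -- Limited frame: only the previous row and this one
           SUM(home_goal) OVER(ORDER BY date
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS last_two
    FROM match
    ORDER BY date;
"""

sliding_rows = conn.execute(sliding_sql).fetchall()
for row in sliding_rows:
    print(row)
```

The running_total column keeps growing down the data set, while last_two only ever adds the current and previous match; on the first row both frames contain a single match.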
Exercise
Slide to the left
Sliding windows allow you to create running calculations between any two points in a window
using keywords such as PRECEDING, FOLLOWING, and CURRENT ROW. You can calculate
running counts, sums, averages, and other aggregate functions between any two points you
specify in the data set.
In this exercise, you will expand on the examples discussed in the video, calculating the running
total of goals scored by FC Utrecht when they were the home team during the 2011/2012
season. Do they score more goals at the end of the season as the home or away team?
Instructions:
• Complete the window function by:
o Assessing the running total of home goals scored by FC Utrecht.
o Assessing the running average of home goals scored.
o Ordering both the running average and running total by date.
SELECT
    date,
    home_goal,
    away_goal,
    -- Create a running total and running average of home goals
    SUM(home_goal) OVER(ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
    AVG(home_goal) OVER(ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
FROM match
WHERE
    hometeam_id = 9908
    AND season = '2011/2012';
Exercise
Slide to the right
Now let's see how FC Utrecht performs when they're the away team. You'll notice that the total
for the season is at the bottom of the data set you queried. Depending on your results, this could
be pretty long, and scrolling down is not very helpful.
In this exercise, you will slightly modify the query from the previous exercise by sorting the data
set in reverse order and calculating a backward running total from the CURRENT ROW to the
end of the data set (earliest record).
Instructions:
• Complete the window function by:
o Assessing the running total of home goals scored by FC Utrecht.
o Assessing the running average of home goals scored.
o Ordering both the running average and running total by date, descending.
SELECT
    -- Select the date, home goal, and away goals
    date,
    home_goal,
    away_goal,
    -- Create a running total and running average of home goals
    SUM(home_goal) OVER(ORDER BY date DESC
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total,
    AVG(home_goal) OVER(ORDER BY date DESC
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_avg
FROM match
WHERE
    awayteam_id = 9908
    AND season = '2011/2012';
15. Bringing it all together
15.1. What you’ve learned so far
• CASE statements
• Simple subqueries
• Nested and correlated subqueries
• Common table expressions
• Window functions
Throughout the course we've covered a wide variety of methods for transforming, manipulating,
and calculating data to answer a wide variety of questions in SQL. Specifically, you've learned
how to use CASE statements for categorizing, aggregating, and calculating information, and how
to use simple subqueries in SELECT, FROM, and WHERE clauses. You also learned how to use
nested and correlated subqueries, and common table expressions to extract, match, and organize
large amounts of data in order to generate a final table. Finally, you learned how to use some of
the many window functions available to you in SQL.
15.2. Let’s do a case study!
Who defeated Manchester United in the 2013/2014 season?
15.3. Steps to construct the query
• Get team names with CTEs
• Get match outcome with CASE statements
• Determine how badly they lost with a window function
In the following exercises, you will generate a data set that tackles one of the issues we've
explored during this course -- namely, that it's difficult to retrieve the names of teams who
played in a given match. Since this isn't feasible with joins, we will accomplish it with common
table expressions. We'll also be using CASE statements to categorize the outcomes of matches
based on whether or not Manchester United won a particular match. Finally, we'll be ranking
matches by the number of goals by which they lost, using a window function.
15.4. Getting the database for yourself
If Manchester United happens to be a team that you favor, or if there are other European teams
you consider a rival to your favorite team, I encourage you to explore the European Soccer
Database for yourself and create similar, or completely different queries to answer your
questions.
Exercise
Setting up the home team CTE
In this course, we've covered ways in which you can use CASE statements, subqueries, common
table expressions, and window functions in your queries to structure a data set that best meets
your needs. For this exercise, you will be using all of these concepts to generate a list of matches
in which Manchester United was defeated during the 2014/2015 English Premier League season.
Your first task is to create the first query that filters for matches where Manchester United played
as the home team. This will become a common table expression in a later exercise.
Instructions:
• Create a CASE statement that identifies each match as a win, loss, or tie for Manchester
United.
• Fill out the logical operators for each WHEN clause in the CASE statement (equals,
greater than, less than).
• Join the tables on home team ID from match, and team_api_id from team.
• Filter the query to only include games from the 2014/2015 season where Manchester
United was the home team.
SELECT
    m.id,
    t.team_long_name,
    -- Identify matches as home/away wins or ties
    CASE WHEN m.home_goal > m.away_goal THEN 'MU Win'
         WHEN m.home_goal < m.away_goal THEN 'MU Loss'
         ELSE 'Tie' END AS outcome
FROM match AS m
-- Left join team on the home team ID and team API id
LEFT JOIN team AS t
ON m.hometeam_id = t.team_api_id
WHERE
    -- Filter for 2014/2015 and Manchester United as the home team
    m.season = '2014/2015'
    AND t.team_long_name = 'Manchester United';
Exercise
Setting up the away team CTE
Great job! Now that you have a query identifying the home team in a match, you will perform a
similar set of steps to identify the away team. Just like the previous step, you will join the match
and team tables. Each of these two queries will be declared as a Common Table Expression in the
following step.
The primary difference in this query is that you will be joining the tables on awayteam_id, and
reversing the match outcomes in the CASE statement.
When altering CASE statement logic in your own work, you can reverse either the logical
condition (i.e., home_goal > away_goal) or the outcome in THEN -- just make sure you only
reverse one of the two!
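To make the "reverse only one of the two" advice concrete, the sketch below labels the same away matches in two ways: once by swapping the THEN outcomes and once by swapping the comparison operators. The team ID 1 and the three matches are hypothetical stand-ins, and the queries run on an in-memory SQLite table through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match (id INTEGER, hometeam_id INTEGER, awayteam_id INTEGER,
                        home_goal INTEGER, away_goal INTEGER);
    -- Team 1 (standing in for Manchester United) is the away team every time
    INSERT INTO match VALUES (1, 2, 1, 3, 1), (2, 3, 1, 0, 2), (3, 4, 1, 1, 1);
""")

# Option A: keep the home-team conditions, reverse the outcomes in THEN.
reversed_outcome_sql = """
    SELECT id,
           CASE WHEN home_goal > away_goal THEN 'MU Loss'
                WHEN home_goal < away_goal THEN 'MU Win'
                ELSE 'Tie' END AS outcome
    FROM match
    WHERE awayteam_id = 1
    ORDER BY id;
"""

# Option B: keep the home-team outcomes, reverse the logical conditions.
reversed_condition_sql = """
    SELECT id,
           CASE WHEN home_goal < away_goal THEN 'MU Win'
                WHEN home_goal > away_goal THEN 'MU Loss'
                ELSE 'Tie' END AS outcome
    FROM match
    WHERE awayteam_id = 1
    ORDER BY id;
"""

outcome_rows = conn.execute(reversed_outcome_sql).fetchall()
condition_rows = conn.execute(reversed_condition_sql).fetchall()
print(outcome_rows)
```

Either change on its own gives the correct labels; applying both at once would bring back the home-team logic and mislabel every away win as a loss.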
Instructions:
• Complete the CASE statement syntax.
• Fill out the logical operators identifying each match as a win, loss, or tie for Manchester
United.
• Join the table on awayteam_id, and team_api_id.
SELECT
    m.id,
    t.team_long_name,
    -- Identify matches as home/away wins or ties
    CASE WHEN m.home_goal > m.away_goal THEN 'MU Loss'
         WHEN m.home_goal < m.away_goal THEN 'MU Win'
         ELSE 'Tie' END AS outcome
-- Join team table to the match table
FROM match AS m
LEFT JOIN team AS t
ON m.awayteam_id = t.team_api_id
WHERE
    -- Filter for 2014/2015 and Manchester United as the away team
    m.season = '2014/2015'
    AND t.team_long_name = 'Manchester United';
Exercise
Putting the CTEs together
Now that you've created the two subqueries identifying the home and away team opponents, it's
time to rearrange your query with the home and away subqueries as Common Table Expressions
(CTEs). You'll notice that the main query includes the phrase, SELECT DISTINCT. Without
identifying only DISTINCT matches, you will return a duplicate record for each game played.
Continue building the query to extract all matches played by Manchester United in the
2014/2015 season.
Instructions:
• Declare the home and away CTEs before your main query.
• Join your CTEs to the match table using a LEFT JOIN.
• Select the relevant data from the CTEs into the main query.
• Select the date from match, team names from the CTEs, and home/ away goals from
match in the main query.
-- Set up the home team CTE
WITH home AS (
    SELECT m.id, t.team_long_name,
        CASE WHEN m.home_goal > m.away_goal THEN 'MU Win'
             WHEN m.home_goal < m.away_goal THEN 'MU Loss'
Exercise
Add a window function
Fantastic! You now have a result set that retrieves the match date, home team, away team, and
the goals scored by each team. You have one final component of the question left -- how badly
did Manchester United lose in each match?
In order to determine this, let's add a window function to the main query that ranks matches by
the absolute value of the difference between home_goal and away_goal. This allows us to
directly compare the difference in scores without having to consider whether Manchester United
played as the home or away team!
The equation is complete for you -- all you need to do is properly complete the window function!
Instructions:
• Set up the CTEs so that the home and away teams each have a name, ID, and score
associated with them.
• Select the date, home team name, away team name, home goal, and away goals scored in
the main query.
• Rank the matches and order by the difference in scores in descending order.