GROUP BY and HAVING in SQL
Aggregation means summarizing your data points into a single value,
for example by calculating the mean or the minimum. Sometimes, aggregating all your
data at once will result in a value that isn't useful.
For example, if you are exploring buying behavior in your store, and the people who come in
are a mix of poor students and rich professionals, it will be more informative to calculate the
mean spend for those groups separately. That is, you need to aggregate the amount spent,
grouped by different customer segments.
GROUP BY is a SQL command commonly used to aggregate the data to get insights from it.
There are three phases when you group data:
Split: The dataset is split into chunks of rows based on the values of the
variables we have chosen for the aggregation.
Apply: An aggregate function, such as average, minimum, or maximum, is computed
on each chunk, returning a single value per chunk.
Combine: The resulting outputs are combined into a single table. In this way,
we have one value for each distinct value of the grouping variable.
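The three phases can be sketched in plain Python to make the mechanics concrete. This is only an illustration: the customer segments and amounts are made-up sample values, and a GROUP BY with AVG does the equivalent work inside the database.

```python
from collections import defaultdict

# Hypothetical sample rows: (customer_segment, amount_spent)
purchases = [
    ("student", 10.0),
    ("professional", 80.0),
    ("student", 20.0),
    ("professional", 120.0),
]

# Split: collect the amounts into one chunk per segment
chunks = defaultdict(list)
for segment, amount in purchases:
    chunks[segment].append(amount)

# Apply: compute one aggregate value (the mean) for each chunk
# Combine: gather the per-group results into a single mapping
mean_spend = {
    segment: sum(amounts) / len(amounts)
    for segment, amounts in chunks.items()
}

print(mean_spend)
```

Each key of `mean_spend` plays the role of one output row of a grouped query.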
SQL GROUP BY Example 1
We can begin by showing a simple example of GROUP BY. Suppose we want to find the top
ten countries with the highest number of Unicorn companies.
FROM companies
GROUP BY country
LIMIT 10
Here we have the results. You will probably not be surprised to find the US, China, and India
in the ranking. Let’s explain the decision behind this query:
First, notice that we used COUNT(*) to count the rows in each group, with one
group per country. We also used a SQL alias to give the resulting column a more
descriptive name. This is done with the keyword AS,
followed by the new name. COUNT is covered in more depth in the COUNT() SQL
FUNCTION tutorial.
The fields were selected from the table companies, where each row corresponds to
a Unicorn company.
Next, we specify the column name after GROUP BY to aggregate the data
by country.
ORDER BY is needed to sort the countries from the highest
number of companies to the lowest.
We limit the results to 10 using LIMIT, which is followed by the number of rows
you want in the results.
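Putting the points above together, the full query might look like the sketch below. It runs against an in-memory SQLite database through Python's sqlite3 module so that it is self-contained. The alias `n_companies`, the `company` column, and the sample rows are assumptions made for illustration, not the tutorial's actual data.

```python
import sqlite3

# In-memory database with a tiny, made-up companies table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (company TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [
        ("A", "United States"), ("B", "United States"), ("C", "United States"),
        ("D", "China"), ("E", "China"),
        ("F", "India"),
    ],
)

query = """
SELECT country, COUNT(*) AS n_companies  -- alias name is an assumption
FROM companies
GROUP BY country
ORDER BY n_companies DESC                -- highest count first
LIMIT 10
"""
top_countries = conn.execute(query).fetchall()
print(top_countries)
```

With this sample data, the United States comes first with three rows, followed by China and India.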
SQL GROUP BY Example 2
Now, we will analyze the table with the sales. For each order number, we have the type of
client, the product line, the quantity, the unit price, the total, etc.
This time, we are interested in finding the average price per unit, the total number of orders,
and the total gain for each product line:
SELECT
product_line,
AVG(unit_price) AS avg_price,
SUM(quantity) AS tot_pieces,
SUM(total) AS total_gain
FROM sales
GROUP BY product_line
Let’s take the previous example again. Now, we want to add a condition to the query: we only
want to keep the product lines where the total number of orders is higher than 40,000. Let's try the WHERE clause:
SELECT
product_line,
AVG(unit_price) AS avg_price,
SUM(quantity) AS tot_pieces,
SUM(total) AS total_gain
FROM sales
GROUP BY product_line
This query returns an error: it is not possible to use aggregate functions in the WHERE clause. We need a new
command to solve this issue.
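The failure is easy to reproduce. The sketch below uses Python's sqlite3 module with an empty `sales` table. The condition `SUM(total) > 40000` is an assumption about what the filter looked like, and the exact error message differs between database engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales "
    "(product_line TEXT, unit_price REAL, quantity INTEGER, total REAL)"
)

bad_query = """
SELECT product_line, SUM(total) AS total_gain
FROM sales
WHERE SUM(total) > 40000   -- assumed condition: an aggregate inside WHERE
GROUP BY product_line
"""

try:
    conn.execute(bad_query)
    where_error = None
except sqlite3.OperationalError as exc:
    # SQLite rejects the statement before reading any rows
    where_error = str(exc)

print(where_error)
```

The statement is rejected while it is being prepared, so the error appears even though the table is empty.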
Like WHERE, the HAVING clause filters rows. Whereas WHERE filters the individual rows of the
table before they are grouped, HAVING filters the groups produced by GROUP BY, based on their aggregated values.
Here's the previous example again, replacing the word WHERE with HAVING.
SELECT
product_line,
AVG(unit_price) AS avg_price,
SUM(quantity) AS tot_pieces,
SUM(total) AS total_gain
FROM sales
GROUP BY product_line
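As a runnable sketch of the same idea, the snippet below applies a HAVING condition to a small in-memory SQLite table. The sample rows, the product line names, and the choice of `SUM(total) > 40000` as the condition are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales "
    "(product_line TEXT, unit_price REAL, quantity INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Classic Cars", 100.0, 300, 30000.0),
        ("Classic Cars", 120.0, 200, 24000.0),
        ("Trains", 50.0, 100, 5000.0),
        ("Trains", 70.0, 100, 7000.0),
    ],
)

query = """
SELECT
    product_line,
    AVG(unit_price) AS avg_price,
    SUM(quantity) AS tot_pieces,
    SUM(total) AS total_gain
FROM sales
GROUP BY product_line
HAVING SUM(total) > 40000   -- assumed condition, applied to each group
"""
having_result = conn.execute(query).fetchall()
print(having_result)
```

Only the group whose summed total exceeds the threshold survives; the other product line is dropped as a whole.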
What else do you notice from the query? We didn’t pass the column alias to HAVING, but the
aggregation of the original field. Are you asking yourself why? You’ll unravel the mystery in
the next example.
As the last example, we will use the table called product_emissions, which contains the
emissions of the products provided by the companies.
This time, we are interested in showing the average product carbon footprint (pcf) for each
company that belongs to the industry group “Technology Hardware & Equipment.”
Moreover, it would be helpful to see the number of products for each company to understand
if there is some relationship between the number of products and the carbon footprint. We
again use HAVING to keep only the companies with an average carbon footprint over 100.
FROM product_emissions AS pe
HAVING avg_carbon_footprint_pcf > 100
ORDER BY n_products
Using the alias produces an error. When the HAVING clause is evaluated, the new column’s name
doesn’t exist yet, so the clause can’t filter on it. Let’s correct the query:
FROM product_emissions AS pe
HAVING AVG(carbon_footprint_pcf) > 100
ORDER BY n_products
This time, the condition works, and we can see the results. We just
learned that column aliases can’t be used in HAVING because this condition is applied before
SELECT, so the new names are not defined yet. Some engines, such as MySQL and SQLite, accept
aliases in HAVING as an extension, but repeating the aggregate expression is the portable choice.
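A complete version of this kind of query might look like the sketch below, run against an in-memory SQLite database. Only `product_emissions`, `carbon_footprint_pcf`, `n_products`, and `avg_carbon_footprint_pcf` come from the text; the `company` and `industry_group` column names and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_emissions "
    "(company TEXT, industry_group TEXT, carbon_footprint_pcf REAL)"
)
tech = "Technology Hardware & Equipment"
conn.executemany(
    "INSERT INTO product_emissions VALUES (?, ?, ?)",
    [
        ("Alpha", tech, 150.0),
        ("Alpha", tech, 250.0),
        ("Beta", tech, 40.0),
        ("Beta", tech, 60.0),
        ("Beta", tech, 80.0),
        ("Gamma", "Food & Beverage", 900.0),
    ],
)

query = """
SELECT
    pe.company,                                   -- hypothetical column name
    COUNT(*) AS n_products,
    AVG(pe.carbon_footprint_pcf) AS avg_carbon_footprint_pcf
FROM product_emissions AS pe
WHERE pe.industry_group = ?                       -- hypothetical column name
GROUP BY pe.company
HAVING AVG(pe.carbon_footprint_pcf) > 100         -- aggregate, not the alias
ORDER BY n_products
"""
emissions_result = conn.execute(query, (tech,)).fetchall()
print(emissions_result)
```

The WHERE clause removes the rows from other industry groups first, and HAVING then discards the companies whose average is 100 or lower.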
SQL Order of Execution
The clauses of a query are written in this order:
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
But there is a question you need to ask yourself. In what order do SQL commands execute?
As humans, we often take for granted that the computer reads and interprets SQL from top to
bottom. But the reality is different from what it might look like. This is the actual order of
execution:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
LIMIT
So, the query processor doesn’t start from SELECT; it begins by selecting which tables to include,
and SELECT is executed after HAVING. This explains why HAVING doesn’t allow the use of aliases,
while ORDER BY has no problem with them. This order of execution also clarifies
why HAVING, which runs after GROUP BY, can apply conditions on aggregated data,
while WHERE, which runs before the grouping, cannot.
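The practical effect of this ordering can be seen by applying a similar threshold in WHERE and in HAVING. The sketch below uses an in-memory SQLite table with made-up rows; the table layout and values are assumptions chosen so that the two queries give different answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_line TEXT, total REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("Ships", 10.0), ("Ships", 90.0),
        ("Planes", 60.0), ("Planes", 70.0),
        ("Trains", 20.0),
    ],
)

# WHERE runs before GROUP BY: single rows with total <= 50 are dropped,
# and only the remaining rows are summed.
where_rows = conn.execute("""
SELECT product_line, SUM(total) AS total_gain
FROM sales
WHERE total > 50
GROUP BY product_line
ORDER BY total_gain DESC   -- the alias is available to ORDER BY
""").fetchall()

# HAVING runs after GROUP BY: every row is summed first,
# then whole groups with a sum <= 50 are dropped.
having_rows = conn.execute("""
SELECT product_line, SUM(total) AS total_gain
FROM sales
GROUP BY product_line
HAVING SUM(total) > 50
ORDER BY total_gain DESC
""").fetchall()

print(where_rows)
print(having_rows)
```

The Ships total differs between the two results because WHERE discarded one of its rows before the sum was computed, while HAVING only looked at the finished sum.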