Top Most Common SQL Coding Errors in Data Science
Fail fast, make mistakes, and learn quickly, but never repeat the same error. As a data
science beginner, you are probably excited to start working with SQL. However, there are
some common SQL coding errors that we have seen many beginners make on our
platform.
We will look at some of these mistakes along with an example from the StrataScratch
platform where available so that you never repeat the same mistakes again.
At StrataScratch, we have over 1000 coding questions on SQL and Python, with
thousands of users solving these questions monthly. Due to such a strong community of
users, many different approaches are available to solve each question. We have
analyzed some of the solutions that our users posted and identified patterns in the
common coding errors you guys make.
This article will be helpful for people who are starting out in data science and have
only recently begun coding. It covers things you should avoid when writing SQL
queries and some best practices to follow. Before moving into the topic,
let’s quickly review the logical order in which the clauses of a generic SQL query are
executed: FROM (including any JOINs), WHERE, GROUP BY, HAVING, SELECT, DISTINCT,
ORDER BY, and finally LIMIT.
Once you understand the above order of execution, you can avoid making some typical
SQL errors in your code. Now let’s focus on some of the most common SQL coding
errors beginner data science folks commit.
SQL Coding Error #1: Using a Reserved Keyword as an Alias Without AS
Link: https://platform.stratascratch.com/coding/10295-most-active-users-on-messenger
Dataset: fb_messages
If you look at the dataset closely, you can see there are id, date, user1, user2, and
msg_count columns. While working on this question, we observed that a lot of
beginners make a similar coding error of using the SQL keyword “user” in the query.
Code With Error
SELECT id,
user1 user
FROM fb_messages;
In the above example, the candidate is trying to select the id and user1 columns from
the table, using user as an alias for user1. However, user is a reserved keyword in many
SQL dialects and can’t be used as a bare alias like that. To avoid this error, we can use
the query below.
SELECT id,
user1 AS user
FROM fb_messages;
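In the above query, we have used the AS keyword to give an alias to the column
user1. With a reserved keyword, AS is mandatory; the shorthand without it is not
accepted. A cleaner option is to avoid the keyword altogether and pick a different
alias, such as username.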
SELECT id,
user1 AS username
FROM fb_messages;
In short, if you want to use a reserved keyword as an alias, you must write AS
explicitly; there are no shortcuts.
SQL Coding Error #2: Using a Reserved Keyword as a Column Name
Suppose you have a table named it_problems. It’s a list of IT problems categorized as
internal or external.
id  problem              int
1   Forgotten password   Internal
2   Slow performance     Internal
3   Slow performance     External
4   Application crashes  External
5   Printer problems     Internal
6   USB problems         Internal
SELECT int,
COUNT(id) AS problem_count
FROM it_problems
GROUP BY int;
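In MySQL, this would return an error because int is a reserved keyword there. However,
the query runs without a problem in PostgreSQL and returns this output.
int       problem_count
External  2
Internal  4
To make the query work in MySQL, you can enclose the keyword in backticks.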
SELECT `int`,
COUNT(id) AS problem_count
FROM it_problems
GROUP BY `int`;
Or you could name the table before the reserved keyword column name.
SELECT it_problems.int,
COUNT(id) AS problem_count
FROM it_problems
GROUP BY it_problems.int;
Each database has its own list of reserved keywords; you can find them in the documentation.
- PostgreSQL: https://www.postgresql.org/docs/current/sql-keywords-appendix.html
- MySQL: https://dev.mysql.com/doc/refman/8.0/en/keywords.html
- MS SQL Server: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/reserved-keywords-transact-sql?view=sql-server-ver16
SQL Coding Error #3: Data De-Duplication
From the vast amount of solutions we have on our platform, we identified one of the
most common SQL coding errors while using a DISTINCT keyword in SQL queries.
Some questions usually ask to output a user or a product based on a certain condition
where a user/product might appear in multiple rows. Many users do not use the
DISTINCT keyword in the query, which results in duplicated user/product in the output.
An example is shown below:
Link: https://platform.stratascratch.com/coding/10322-finding-user-purchases
Dataset: amazon_transactions
From the above question, let’s imagine you need to find the user IDs that either buy milk
or bread or both and output the user IDs in ascending order.
Code With Error (Semantic - Duplication in the Data)
SELECT user_id
FROM amazon_transactions
WHERE item IN ('milk','bread')
ORDER BY user_id;
The output of the above query will contain repeated user IDs, since one user might
have bought both milk and bread, or bought the same item more than once. To
de-duplicate the data, we need to use the DISTINCT keyword, that is, write
SELECT DISTINCT user_id instead of SELECT user_id.
When writing your SQL queries, it’s critical to think about the question and understand
whether the results need to be de-duplicated or not. If you think yes, then use the
DISTINCT clause to avoid any duplicates in your output.
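To try the de-duplication above without access to the platform, here is a minimal sketch using Python's built-in sqlite3 module. The table name and columns follow the query in this section, but the rows are made up for illustration, and SQLite is assumed to behave like the platform's database for this simple query.

```python
import sqlite3

# In-memory database with a few hypothetical purchases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amazon_transactions (user_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO amazon_transactions VALUES (?, ?)",
    [(1, "milk"), (1, "bread"), (2, "bread"), (3, "eggs"), (2, "bread")],
)

# Query from the article: one output row per matching purchase.
without_distinct = [row[0] for row in conn.execute(
    "SELECT user_id FROM amazon_transactions "
    "WHERE item IN ('milk', 'bread') ORDER BY user_id"
)]

# Same query with DISTINCT: one output row per user.
with_distinct = [row[0] for row in conn.execute(
    "SELECT DISTINCT user_id FROM amazon_transactions "
    "WHERE item IN ('milk', 'bread') ORDER BY user_id"
)]

print(without_distinct)  # [1, 1, 2, 2]
print(with_distinct)     # [1, 2]
```

User 1 is repeated because they bought both items, and user 2 because they bought bread twice; DISTINCT collapses both cases.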
SQL Coding Error #4: Wrong Understanding of the DISTINCT Clause
In the above section, we saw the importance of using the DISTINCT keyword in cases
where we don’t want any duplication. A common misunderstanding is that DISTINCT can
be applied to a single column while other columns are selected alongside it; in fact, it
always applies to all the columns that the query selects.
Find matching hosts and guests in a way that they are both of the same
gender and nationality
Find matching hosts and guests pairs in a way that they are both of
the same gender and nationality. Output the host id and the guest id
of matched pair.
Link: https://platform.stratascratch.com/coding/10078-find-matching-hosts-and-guests-in-a-way-that-they-are-both-of-the-same-gender-and-nationality/discussion
Dataset:
airbnb_hosts
airbnb_guests
The question asks us to find the host and guest pairs where both are of the
same gender and nationality.
Below is an example of an incorrect query.
SELECT h.host_id,
DISTINCT g.guest_id
FROM airbnb_hosts h
INNER JOIN airbnb_guests g ON h.nationality = g.nationality
AND h.gender = g.gender;
In the above example, the code will result in an error. The DISTINCT clause should be
used at the beginning of listing the columns in the select query. Also, DISTINCT cannot
be applied only to one column, but it automatically applies to all the columns listed in the
select statement.
If we instead place DISTINCT right after SELECT, as in SELECT DISTINCT h.host_id,
g.guest_id with the same FROM and JOIN clauses, the code will run successfully. This
doesn’t mean that DISTINCT is applied only to the first column; by default, it applies to
all the columns in the select statement (host_id and guest_id in this case).
Thus, the result of the above query will give unique values of the columns host_id and
guest_id, i.e., the unique combinations. If you want one column to be only unique
values and the other columns to be all values, you need two different
outputs/queries/results.
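The behavior described above can be checked with a small sketch in Python's built-in sqlite3 module. The table and column names come from the query in this section; the rows are hypothetical (host 0 is deliberately listed twice), and SQLite stands in for the platform's database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airbnb_hosts (host_id INTEGER, nationality TEXT, gender TEXT);
    CREATE TABLE airbnb_guests (guest_id INTEGER, nationality TEXT, gender TEXT);
    -- Made-up rows; host 0 appears twice on purpose.
    INSERT INTO airbnb_hosts VALUES (0, 'USA', 'M'), (0, 'USA', 'M'), (1, 'USA', 'F');
    INSERT INTO airbnb_guests VALUES (5, 'USA', 'M'), (6, 'USA', 'F'), (7, 'USA', 'M');
""")

join_clause = """
    FROM airbnb_hosts h
    INNER JOIN airbnb_guests g
        ON h.nationality = g.nationality AND h.gender = g.gender
"""

# DISTINCT in the middle of the column list is rejected as a syntax error.
try:
    conn.execute("SELECT h.host_id, DISTINCT g.guest_id" + join_clause)
    misplaced_distinct_fails = False
except sqlite3.OperationalError:
    misplaced_distinct_fails = True

# Without DISTINCT, the duplicated host row doubles its pairs.
all_pairs = conn.execute("SELECT h.host_id, g.guest_id" + join_clause).fetchall()

# DISTINCT right after SELECT de-duplicates (host_id, guest_id) combinations.
unique_pairs = conn.execute(
    "SELECT DISTINCT h.host_id, g.guest_id" + join_clause + " ORDER BY 1, 2"
).fetchall()

print(misplaced_distinct_fails)  # True
print(len(all_pairs))            # 5
print(unique_pairs)              # [(0, 5), (0, 7), (1, 6)]
```

Note that host 0 still appears twice in the result: DISTINCT removed duplicate pairs, not duplicate host IDs.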
SQL Coding Error #5: Using LIMIT When There Are Ties
Link: https://platform.stratascratch.com/coding/10161-ranking-hosts-by-beds
Dataset: airbnb_apartments
Now let’s change this question slightly to understand this common SQL coding error. So
the new question would be: List the top 5 host IDs based on the number of beds. If
there are multiple hosts with the same number of beds, then display all host IDs.
SELECT host_id,
SUM(n_beds) AS number_of_beds
FROM airbnb_apartments
GROUP BY host_id
ORDER BY number_of_beds desc
LIMIT 5;
From the output, you can see the top 5 host IDs based on the number of beds. But, in
reality, there are more hosts with 4 beds, and in the solution, we can only see 1 at the
5th position. Thus, we need to use DENSE_RANK() to rank all the hosts and then filter
based on the rank for each host ID.
SELECT *
FROM (
    SELECT host_id,
           SUM(n_beds) AS number_of_beds,
           -- Tied hosts receive the same rank
           DENSE_RANK() OVER (ORDER BY SUM(n_beds) DESC) AS bed_rank
    FROM airbnb_apartments
    GROUP BY host_id
) a
WHERE bed_rank <= 5
ORDER BY number_of_beds DESC;
From the output of the above query, we get a total of 7 rows because there are 3 hosts
with rank 5 in the dataset. Thus, using LIMIT would give us wrong results, and thus, it
should be used very carefully in such questions.
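The difference between LIMIT and DENSE_RANK() can be reproduced on a tiny made-up table. The sketch below uses Python's built-in sqlite3 module and asks for the top 2 hosts rather than the top 5 to keep the data small. It assumes the bundled SQLite is version 3.25 or newer (required for window functions), and it computes the totals in a CTE first, which is a slightly different but equivalent formulation of the ranking query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airbnb_apartments (host_id INTEGER, n_beds INTEGER)")
# Hypothetical rows: host 1 has 6 beds in total, hosts 2 and 3 tie with 4.
conn.executemany(
    "INSERT INTO airbnb_apartments VALUES (?, ?)",
    [(1, 5), (1, 1), (2, 4), (3, 4), (4, 2)],
)

# LIMIT cuts the result at exactly two rows, silently dropping one tied host.
limit_rows = conn.execute("""
    SELECT host_id, SUM(n_beds) AS number_of_beds
    FROM airbnb_apartments
    GROUP BY host_id
    ORDER BY number_of_beds DESC
    LIMIT 2
""").fetchall()

# DENSE_RANK keeps every host whose total falls in the top two distinct totals.
ranked_rows = conn.execute("""
    WITH beds AS (
        SELECT host_id, SUM(n_beds) AS number_of_beds
        FROM airbnb_apartments
        GROUP BY host_id
    ),
    ranked AS (
        SELECT host_id,
               number_of_beds,
               DENSE_RANK() OVER (ORDER BY number_of_beds DESC) AS bed_rank
        FROM beds
    )
    SELECT host_id, number_of_beds
    FROM ranked
    WHERE bed_rank <= 2
    ORDER BY number_of_beds DESC, host_id
""").fetchall()

print(len(limit_rows))  # 2
print(ranked_rows)      # [(1, 6), (2, 4), (3, 4)]
```

Which of the two tied hosts survives the LIMIT is not guaranteed, which is exactly why LIMIT is risky when ties matter.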
RANK() and DENSE_RANK() are window functions. It might be a good idea to
familiarize yourself with them in our ultimate guide to SQL window functions.
SQL Coding Error #6: Using WHERE Instead of HAVING
The WHERE clause is used to filter specific rows, while the HAVING clause is used to
filter specific groups. The HAVING clause is used when you need to filter based on a
certain aggregation in the data.
Now let’s look at an example where many users try to filter based on the aggregation
using the WHERE clause, but instead, they should be using a HAVING clause.
Positive Ad Channels
Find the advertising channel with the smallest maximum yearly spending
that still brings in more than 1500 customers each year.
Link: https://platform.stratascratch.com/coding/10013-positive-ad-channels
Dataset: uber_advertising
The query below finds the advertising channels that acquired more than 1,500 customers
in every year, which means the smallest yearly value of customers_acquired for a
channel must exceed 1,500.
SELECT advertising_channel
FROM uber_advertising
GROUP BY advertising_channel
HAVING MIN(customers_acquired) > 1500;
The above code will run successfully.
Here we have used the aggregate function MIN() in the HAVING clause rather than
the WHERE clause, because WHERE is evaluated before the rows are grouped and cannot
contain aggregate functions. It's crucial to read the question carefully and deduce whether
the condition needs to be satisfied on every occasion or only once. With practice, beginners
in data science will get familiar with the WHERE and HAVING clauses and when to use
each.
You can find more on this topic in our database interview questions article here →
https://www.stratascratch.com/blog/database-interview-questions/.
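To see the WHERE versus HAVING distinction in action, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the query in this section, the year column and all rows are assumptions made for illustration, and the exact error raised for an aggregate in WHERE is specific to SQLite (other databases report it differently).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uber_advertising "
    "(year INTEGER, advertising_channel TEXT, customers_acquired INTEGER)"
)
# Hypothetical yearly figures: tv exceeds 1,500 in only one of its two years.
conn.executemany(
    "INSERT INTO uber_advertising VALUES (?, ?, ?)",
    [
        (2017, "radio", 1600), (2018, "radio", 2000),
        (2017, "tv", 1400), (2018, "tv", 3000),
        (2017, "billboard", 1700), (2018, "billboard", 1501),
    ],
)

# An aggregate function inside WHERE is rejected by the database.
try:
    conn.execute(
        "SELECT advertising_channel FROM uber_advertising "
        "WHERE MIN(customers_acquired) > 1500 GROUP BY advertising_channel"
    )
    aggregate_in_where_fails = False
except sqlite3.OperationalError:
    aggregate_in_where_fails = True

# Row-level WHERE: channels that exceeded 1,500 at least once.
at_least_once = [row[0] for row in conn.execute(
    "SELECT DISTINCT advertising_channel FROM uber_advertising "
    "WHERE customers_acquired > 1500 ORDER BY advertising_channel"
)]

# Group-level HAVING: channels that exceeded 1,500 in every year.
every_year = [row[0] for row in conn.execute(
    "SELECT advertising_channel FROM uber_advertising "
    "GROUP BY advertising_channel "
    "HAVING MIN(customers_acquired) > 1500 ORDER BY advertising_channel"
)]

print(aggregate_in_where_fails)  # True
print(at_least_once)             # ['billboard', 'radio', 'tv']
print(every_year)                # ['billboard', 'radio']
```

The tv channel passes the row-level filter because of its strong 2018 figure, but it is dropped by HAVING because its weakest year falls below the threshold.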
SQL Coding Error #7: Float Division
This is another common SQL coding error beginners make when computing a division
between two integer values. Let’s take an example.
Consider you have three columns in the table sales_table – date, sales, and
orders.
You need to calculate a derived column sales_per_order. The columns sales and
orders are integers, but sales_per_order should be of floating type.
SELECT date,
sales,
orders,
sales/orders AS sales_per_order
FROM sales_table;
The above query will run and give an output, but the new column generated will have an
integer rather than a floating value.
SELECT date,
sales,
orders,
CAST(sales AS FLOAT)/orders AS sales_per_order
FROM sales_table;
From the above query, we will get the correct result for the derived column. Here, we
first converted the sales column into a float using the CAST() function and then divided it
by the orders column. As long as at least one operand is a float, the result of the
division is a float. Another way to convert an integer column to a float is to
multiply it by 1.0, which has a similar effect to casting.
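The three variants of the division can be compared side by side with a short sketch in Python's built-in sqlite3 module. The table follows the sales_table example in this section with a single made-up row; SQLite is assumed here, where integer division truncates just as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_table (date TEXT, sales INTEGER, orders INTEGER)")
# One hypothetical day with 7 units of sales across 2 orders.
conn.execute("INSERT INTO sales_table VALUES ('2023-01-01', 7, 2)")

int_division, cast_division, times_one_division = conn.execute("""
    SELECT sales / orders,                  -- integer / integer truncates
           CAST(sales AS FLOAT) / orders,   -- one float operand is enough
           sales * 1.0 / orders             -- multiplying by 1.0 also works
    FROM sales_table
""").fetchone()

print(int_division)        # 3
print(cast_division)       # 3.5
print(times_one_division)  # 3.5
```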
SQL Coding Error #8: Using AND to Match Values Across Rows
Let’s look at an example where we want to find the user IDs that have at least one
‘Refinance’ and one ‘InSchool’ submission. In other words, they need to have
submissions in both categories.
Submission Types
Write a query that returns the user ID of all users that have created
at least one ‘Refinance’ submission and at least one ‘InSchool’
submission.
Link: https://platform.stratascratch.com/coding/2002-submission-types/discussion
Dataset: loans
Code With Error
SELECT user_id
FROM loans
WHERE type = 'Refinance' AND type = 'InSchool';
This code won’t return an error per se, but the output will be empty. Why? The WHERE
clause is evaluated one row at a time and doesn’t have the context of other rows. The AND
operator says the type has to be both ‘Refinance’ and ‘InSchool’ at once, but this is never
true, because no single row holds both values. One way to fix this is to select the users
of each type separately and keep only the user IDs that appear in both results using
INTERSECT.
SELECT user_id
FROM loans
WHERE type = 'Refinance'
INTERSECT
SELECT user_id
FROM loans
WHERE type = 'InSchool';
The other approach could be to write two CTEs: one that returns user IDs with a
‘Refinance’ submission, the other with an ‘InSchool’ submission.
Then write the SELECT statement where you JOIN the two CTEs and return the distinct
user IDs to remove duplicates.
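WITH refinance AS (
    SELECT user_id
    FROM loans
    WHERE type = 'Refinance'
),
inschool AS (
    SELECT user_id
    FROM loans
    WHERE type = 'InSchool'
)
-- Keep only the users that appear in both CTEs
SELECT DISTINCT r.user_id
FROM refinance r
JOIN inschool i ON r.user_id = i.user_id;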
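The three approaches from this section can be compared on a small made-up loans table. The sketch below uses Python's built-in sqlite3 module; the rows are hypothetical, and the final SELECT over the two CTEs is one possible way to write the join that the text describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (user_id INTEGER, type TEXT)")
# Hypothetical submissions: only user 100 has both types.
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [(100, "Refinance"), (100, "InSchool"), (100, "Refinance"),
     (101, "Refinance"), (102, "InSchool")],
)

# AND is checked within a single row, so nothing can match.
and_result = conn.execute(
    "SELECT user_id FROM loans WHERE type = 'Refinance' AND type = 'InSchool'"
).fetchall()

# INTERSECT keeps the user IDs present in both result sets.
intersect_result = [row[0] for row in conn.execute("""
    SELECT user_id FROM loans WHERE type = 'Refinance'
    INTERSECT
    SELECT user_id FROM loans WHERE type = 'InSchool'
""")]

# Joining two CTEs gives the same users; DISTINCT removes repeats.
cte_result = [row[0] for row in conn.execute("""
    WITH refinance AS (
        SELECT user_id FROM loans WHERE type = 'Refinance'
    ),
    inschool AS (
        SELECT user_id FROM loans WHERE type = 'InSchool'
    )
    SELECT DISTINCT r.user_id
    FROM refinance r
    JOIN inschool i ON r.user_id = i.user_id
""")]

print(and_result)        # []
print(intersect_result)  # [100]
print(cte_result)        # [100]
```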
Summary
In this article, we covered the top SQL coding errors that Data Science beginners make
by analyzing the solutions submitted to our platform. This will help the readers
understand what these mistakes are, how to avoid them in the future, or what the
workarounds can be. Also, we discussed the order of execution of SQL queries to help
beginners understand what part executes first and what part executes last. We hope
this article will help you in your journey to become a data scientist.
Don’t be overwhelmed with the topics that we discussed today. Remember, Rome
wasn't built in one day, so stick with StrataScratch, and slowly and steadily you will get
to your desired position. All the best.