SQL Left Join
Attempter Specifications
CONFIDENTIAL INFORMATION
This document contains confidential and proprietary information intended solely for the use of the individual or entity to whom it
is disclosed.
Project Overview
Video Sessions
Task Attempt Workflow
Task flow
Step 1: Review the Natural Language question and the corresponding SQL query
Step 2: Verify if the SQL query executes, returns at least one result, and uses a left join statement in an intuitive way.
Inappropriate Left Join Usage
Appropriate Left Join Usage
Step 3: Modify and paste the results of the modified query
Step 4: Determine whether or not the modified SQL query aligns with the provided natural language question
Troubleshooting SQL query
Question alignment rubric
Appendix
1. Good vs Bad - NL questions + SQL Code Alignments
2. 😱 Sample task walkthroughs
FAQs
Nov 20, 2024 (Customer):
The use of LIMIT 1000 is not allowed. See “Important Note About Rows Limitation” in Step 2.
Added notes about SELECT * and pdt format in “About SQL Syntax, Aliases, SELECT *” in Step 2.
Project Overview
The goal of this project is to assist in the research and training of large language models (LLMs) to
improve their ability to work with SQL. Specifically, you will analyze a SQL query and its
Natural Language question and decide what changes the SQL code needs so that it
executes correctly, returns at least one row, and uses a left join intuitively. Also, if the Natural
Language question is not aligned with the SQL query, you will rewrite the question so that the two align.
Video Sessions
Onboarding Course: Onboarding SQL Left Join Course.mp4
Onboarding Session in English: SQL Left Join Tasks - Onboarding 10_30_24 (EN).mp4
Onboarding Session in Spanish: SQL Left Join Tasks - Onboarding 10_30_24 - LCs.mp4
How to check if the LEFT JOIN is ok:
https://www.loom.com/share/7a00c51b499e403aa0ba792cc3ccd6b7?sid=9c03d040-0640-4310-bd08-d4ef2e4cfb4f
Task flow
⚠️Important Note:
SQL Code Requirements: Ensure each SQL question results in a solution using a LEFT
JOIN intuitively. Use the table below to distinguish between intuitive and non-intuitive LEFT
JOIN usage.
Identify Alignment: Determine whether the SQL query accurately aligns with the intent of
the natural language question.
Check if the SQL code is executable: As a preliminary check, be sure the code runs
correctly and returns at least one row.
Provide Justification for Misalignment: If the SQL query does not fully meet the
question’s intent:
Alternative Justification for Question Misalignment: If the SQL code is correct but the
question does not match the SQL code requirements:
Explain how the question fails to represent the SQL query accurately.
Highlight areas in the question that are ambiguous, incomplete, or misaligned with
the SQL code’s intent.
Step 1: Review the Natural Language question and the corresponding SQL
query
You will be given a natural language question paired with a SQL query. Analyze this question
alongside the SQL code, which may reference multiple tables, and ensure that the SQL code aligns
accurately with the intent of the question.
Once you fix the SQL code in step 2, you will need to make sure that the code aligns with the intent of
the natural language question. So consider where the natural language question and the query may
diverge.
Step 2: Verify if the SQL query executes, returns at least one result, and uses a
left join statement in an intuitive way.
⚠️About SQL Query Execution:
Avoid using execution results to compare INNER JOIN and LEFT JOIN to determine the
appropriate join type. However, always ensure that the SQL queries are executable and
return non-empty results.
1. If you are testing the query in the Sphere Engine IDE, use the 1,000-row limit for testing
purposes only, and write the modified SQL query without that limit.
If the query defines its own LIMIT, the value must be meaningful. For example, you cannot use
LIMIT 70 when asking for all the US states (50 in total).
This part of the task should be completed within the IDE environment in the task.
Check whether the SQL executes and returns at least one row.
If it does not, there are several possible causes. See the Troubleshooting SQL query section
of these instructions.
Once the SQL executes and returns at least one row, the next step is to ensure that it uses left
join functionality in an intuitive way. Refer to the table below to understand what counts as
an intuitive use of left join functionality and what counts as a non-intuitive one.
You can reference the table taxonomies in this sheet
1. For example:
2. Even when the original SQL is correct, you must write a modified one, with backticks
around each of the project, dataset, and table names if necessary.
Inappropriate Left Join Usage
Avoid using a LEFT JOIN on a table if you do not reference any columns from that table in your
query. In such cases, the LEFT JOIN is unnecessary and should be removed to simplify the query
and improve performance.
Example below:
-- Incorrect:
```sql
SELECT
TableA.column1,
TableA.column2
FROM
TableA
LEFT JOIN
TableB
ON
TableA.id = TableB.id;
```
TableB is joined, but none of its columns are used in the SELECT statement. This makes
the LEFT JOIN redundant. As long as each TableA row matches at most one TableB row,
removing the join produces the same result without the extra work.
-- Correct:
```sql
SELECT
TableA.column1,
TableA.column2
FROM
TableA;
```
Since no columns from TableB are needed,
the join is removed, streamlining the query.
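The equivalence described above can be checked on toy data. The following is a minimal sketch in Python using the standard-library sqlite3 module rather than BigQuery; the rows in TableA and TableB and the `note` column are hypothetical. It also illustrates a caveat: the two queries return the same rows only when TableB holds at most one row per id.

```python
import sqlite3

# In-memory database with hypothetical rows; TableB.id is unique at first.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (id INTEGER, column1 TEXT, column2 TEXT);
    CREATE TABLE TableB (id INTEGER, note TEXT);
    INSERT INTO TableA VALUES (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z');
    INSERT INTO TableB VALUES (1, 'n1'), (2, 'n2');
""")

# The "Incorrect" shape: TableB is joined but none of its columns are selected.
WITH_JOIN = """
    SELECT TableA.column1, TableA.column2
    FROM TableA
    LEFT JOIN TableB ON TableA.id = TableB.id
    ORDER BY TableA.id
"""
# The "Correct" shape: the unused join is removed.
WITHOUT_JOIN = """
    SELECT TableA.column1, TableA.column2
    FROM TableA
    ORDER BY TableA.id
"""

rows_with_join = conn.execute(WITH_JOIN).fetchall()
rows_without_join = conn.execute(WITHOUT_JOIN).fetchall()
same_when_unique = rows_with_join == rows_without_join

# Caveat: a second TableB row for id 1 makes the joined query repeat a row.
conn.execute("INSERT INTO TableB VALUES (1, 'n1-duplicate')")
rows_with_duplicate_key = conn.execute(WITH_JOIN).fetchall()

print(same_when_unique, len(rows_without_join), len(rows_with_duplicate_key))
```

If the joined table can contain several matches per key, removing the join changes the row count, so check for that before treating the join as purely redundant.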
Appropriate Left Join Usage
Retrieving all rows from the "left" table, regardless of matches: This is the core purpose of a
LEFT JOIN. You want everything from the first table, and optionally the matching information
from the second.
Prioritizing the left table's data: When the left table holds the primary information, and the
right table provides supplementary details.
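As an illustration of the "all rows from the left table" case, here is a small sketch in Python with the standard-library sqlite3 module standing in for BigQuery. The `countries` and `debts` tables and their rows are hypothetical. The only difference between the two queries is the join type.

```python
import sqlite3

# Hypothetical data: Chile has no rows in the debts table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE debts (country_code TEXT, description TEXT);
    INSERT INTO countries VALUES ('AR', 'Argentina'), ('BR', 'Brazil'), ('CL', 'Chile');
    INSERT INTO debts VALUES ('AR', 'bond'), ('AR', 'loan'), ('BR', 'bond');
""")

QUERY = """
    SELECT c.name, COUNT(DISTINCT d.description) AS unique_descriptions
    FROM countries AS c
    {join_type} JOIN debts AS d ON c.code = d.country_code
    GROUP BY c.name
    ORDER BY unique_descriptions DESC, c.name
"""

# LEFT JOIN keeps every country; unmatched ones get a count of 0.
left_rows = conn.execute(QUERY.format(join_type="LEFT")).fetchall()
# INNER JOIN silently drops countries with no matching debt rows.
inner_rows = conn.execute(QUERY.format(join_type="INNER")).fetchall()

print(left_rows)
print(inner_rows)
```

The left join returns three countries, including the one with no debt descriptions, while the inner join returns only two. A question such as "list all countries with their count of unique debt descriptions" therefore calls for the left join.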
Step 3: Edit the query in the QUERY_TO_RUN variable. DO NOT CHANGE ANY OTHER CODE.
Step 4: Run and view the output. Remember to limit results to 1K rows.
Step 5: Before closing, be sure to copy your (potentially edited) SQL query so you can paste it back
into the task.
Modify the SQL code to meet the natural language question. At this step, remember to
apply the previous table to distinguish between intuitive and non-intuitive LEFT JOIN usage.
Copy the modified query into the provided space.
Step 4: Determine whether or not the modified SQL query aligns with the
provided natural language question
1. The NLQ can be answered by a single SQL query that uses, for example, only one
table, or there is no need for a Left Join to answer it.
2. You have 2 tables and a valid Left Join code structure, but the columns of your
tables do not match.
3. You have one or more tables with empty data, generating an empty output when
you run the SQL query.
2. Ask in natural language to retrieve all rows from the "left" table, regardless of matches.
Example: “Retrieve the list of all countries along with the count of unique debt
descriptions, ordered by the unique description count in descending order” instead
of “Retrieve the list of countries along with the count of unique debt descriptions for each
country. Ensure that all countries are included in the results, even those that may not have
any associated descriptions, and order the results by the unique description count in
descending order”.
3. We aim to write natural language questions that feel authentic, as if from a real user, rather
than detailed, step-by-step SQL instructions. Describe the task naturally and
conversationally, without emphasizing technical SQL terms like `LEFT JOIN` vs `INNER
JOIN`. There's no need to specify "include all records even though...", a `LEFT JOIN` will
be implied unless the question explicitly says to exclude records with `NULL` values or
similar constraints.
4. If the natural language question includes technical SQL terms, rewrite it to sound more
conversational and user-friendly. Replace technical words like `LEFT JOIN` or `NULL` with
everyday language that describes the task without SQL jargon. Focus on what the user
wants to achieve, rather than specifying how to do it in SQL.
5. Avoid using execution results to decide whether INNER JOIN or LEFT JOIN is more
appropriate in the SQL query. The natural language question does not need to include
phrases like "include all even if..." because real users would not typically phrase requests
this way. By default, assume that LEFT JOIN is appropriate unless the question explicitly
implies INNER JOIN. Ensure that the NL question is clear and does not unintentionally
suggest the need for an INNER JOIN.
This justification should detail why the SQL query does not meet the question's requirements,
highlighting any missing or misinterpreted elements. Alternatively, if the SQL code is correct
and the question itself does not meet the requirements of the code, explain how the question
falls short in representing the SQL query accurately.
Syntax errors:
If the query does not execute due to a syntax error, then check the provided error log
and correct any issues present.
Refer to the provided public table taxonomies below to determine the most likely
column name that the SQL query should be referencing.
In the case where the SQL executes but there are no results returned, check that the
data returned in the filtered or joined columns matches the format of the filters or
joins.
Example: the following query would return no results.
SELECT
top_rising_terms.term,
top_rising_terms.score
FROM
`bigquery-public-data`.`google_trends`.`international_top_rising_terms` AS top_rising_terms
LEFT JOIN
`bigquery-public-data`.`google_trends`.`international_top_terms` AS top_terms
ON top_rising_terms.term = top_terms.term
WHERE
top_terms.country_code = 'il'
AND top_rising_terms.score IS NOT NULL
ORDER BY
top_rising_terms.score DESC;
You can confirm this by running a quick check such as
`SELECT DISTINCT country_code FROM bigquery-public-data.google_trends.international_top_rising_terms LIMIT 5`,
which returns a sample of what the data in that column looks like.
You can replicate this for any of the tables in the database by
using `SELECT DISTINCT [col_name] FROM bigquery-public-data.[dataset_name].[table_name]`
to find the distinct values in [table_name].[col_name] and filter appropriately.
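The distinct-values check can be tried on toy data. The sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery. The `top_terms` table and its rows are hypothetical, and the example assumes the filter failed because of letter case, which is only one possible cause of an empty result.

```python
import sqlite3

# Hypothetical table whose country codes are stored in upper case.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE top_terms (term TEXT, country_code TEXT);
    INSERT INTO top_terms VALUES
        ('weather', 'IL'), ('news', 'IL'), ('football', 'US');
""")

def count_rows(code):
    """Count rows whose country_code equals `code` exactly (case-sensitive)."""
    cursor = conn.execute(
        "SELECT COUNT(*) FROM top_terms WHERE country_code = ?", (code,)
    )
    return cursor.fetchone()[0]

# A lower-case filter value matches nothing, so the query comes back empty.
rows_lowercase = count_rows("il")

# Inspect what the column actually contains before choosing a filter value.
distinct_codes = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT country_code FROM top_terms ORDER BY country_code LIMIT 5"
    )
]

# Filtering with a value copied from the distinct list returns rows.
rows_uppercase = count_rows("IL")

print(rows_lowercase, distinct_codes, rows_uppercase)
```

Copying the filter value from the distinct list, rather than typing it from memory, avoids mismatches in case, spelling, or format.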
IMPORTANT: Rewrites that improve the quality of the prompt are extremely
valuable. Marking bad prompts as “Perfect” to avoid rewrites will result in removal from
the project.
Be sure to assess whether the date range in the query is properly referenced in the
prompt: use this resource to determine how to refer to different date ranges.
Query: SELECT Product_line, AVG(cogs) from t0 WHERE Date BETWEEN DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 2 MONTH) AND DATE_SUB(DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 1 MONTH), INTERVAL 1 DAY) GROUP BY Product_line;
Prompt: What is the average cogs for each product line in the last two months?
Issue: This date filter looks at the entirety of 2 months ago, not “the last two months”, which implies that the previous month should be included. This can be rewritten as “two months ago”.
Query: SELECT AVG(cogs) from t0 WHERE Date BETWEEN DATE_SUB(DATE_TRUNC(CURRENT_DATE(), YEAR), INTERVAL 1 DAY) AND DATE_ADD(DATE_TRUNC(CURRENT_DATE(), YEAR), INTERVAL 2 YEAR);
Prompt: What is the average cogs for the past two years?
Issue: The date filter does not look at the past two years. This query looks from the last day of the previous year (inclusive) to the first day of the year after next (inclusive). It can be referred to as “the last day of last year to the first day of the year after next” in the rewrite.
Query: SELECT AVG(Total) from t0 WHERE Date BETWEEN DATE_SUB(DATE_TRUNC(CURRENT_DATE(), WEEK), INTERVAL 1 WEEK) AND DATE_SUB(DATE_TRUNC(CURRENT_DATE(), WEEK), INTERVAL 1 DAY);
Prompt: What is the average total for the last week excluding today?
Issue: The date range in the prompt should be referred to as “last week” as opposed to “the last week excluding today”. Stating “excluding today” is unnecessary/irrelevant, and that wording implies the last seven days excluding today.
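To see which calendar dates filters like these cover, the date arithmetic can be reproduced outside BigQuery. The sketch below is plain Python with hypothetical helper names. It assumes BigQuery's documented behavior that DATE_TRUNC with WEEK goes back to the preceding Sunday, and it mirrors the three BETWEEN ranges for a fixed example date.

```python
from datetime import date, timedelta


def shift_months(first_of_month, months):
    """Move a first-of-month date back by `months` whole months."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def two_months_ago(today):
    """Mirror of the MONTH filter: the whole calendar month two months back."""
    month_start = today.replace(day=1)  # DATE_TRUNC(CURRENT_DATE(), MONTH)
    start = shift_months(month_start, 2)
    end = shift_months(month_start, 1) - timedelta(days=1)
    return start, end


def year_filter(today):
    """Mirror of the YEAR filter: Dec 31 of last year to Jan 1 two years on."""
    year_start = date(today.year, 1, 1)  # DATE_TRUNC(CURRENT_DATE(), YEAR)
    return year_start - timedelta(days=1), date(today.year + 2, 1, 1)


def last_week(today):
    """Mirror of the WEEK filter, assuming weeks start on Sunday."""
    days_since_sunday = (today.weekday() + 1) % 7  # Python: Monday == 0
    week_start = today - timedelta(days=days_since_sunday)
    return week_start - timedelta(weeks=1), week_start - timedelta(days=1)


example_day = date(2024, 11, 20)  # arbitrary example date
for label, fn in [("month", two_months_ago), ("year", year_filter), ("week", last_week)]:
    start, end = fn(example_day)
    print(label, start.isoformat(), end.isoformat())
```

For 20 November 2024 the first range runs from 1 to 30 September 2024, which is why “two months ago” describes it better than “the last two months”.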
Appendix
1. Good vs Bad - NL questions + SQL Code Alignments
Example Task One - NL question does not match with the SQL code - SQL is not
executable:
SQL Query:
SELECT
INTERNATIONAL_TOP_RISING_TERMS.term, INTERNATIONAL_TOP_RISING_TERMS.score
FROM
`bigquery-public-data`.`google_trends`.`international_top_rising_terms`
LEFT JOIN
`bigquery-public-data`.`google_trends`.`international_top_terms`
ON
INTERNATIONAL_TOP_RISING_TERMS.term = INTERNATIONAL_TOP_TERMS.term
WHERE
INTERNATIONAL_TOP_TERMS.country_code = 'US'
NL Question:
Top international rising terms for a specific country
Step 1: Review the Natural Language question and the corresponding SQL query.
Step 2: Verify if the SQL query executes, returns at least one result, and uses a left
join statement in an intuitive way.
The original query does not return any data when executed.
Using the tips provided in the troubleshooting SQL query section of the instructions, we can
modify the statement to be:
SELECT
top_rising_terms.term,
top_rising_terms.score
FROM
`bigquery-public-data`.`google_trends`.`international_top_rising_terms` AS top_rising_terms
LEFT JOIN
`bigquery-public-data`.`google_trends`.`international_top_terms` AS top_terms
ON top_rising_terms.term = top_terms.term
WHERE
top_terms.country_code = 'IL'
AND top_rising_terms.score IS NOT NULL
ORDER BY
top_rising_terms.score DESC;
Specific changes employed:
Checked other country codes for data present and changed the filter
to top_terms.country_code = 'IL'
Copy the results and paste them into the provided space (the following image shows the
last 6 rows as an extract of the complete result).
Step 4: Determine whether or not the modified SQL query aligns with the provided
natural language question
The original question was, “Top international rising terms for a specific country”.
Remember: This question needs to relate to the MODIFIED SQL query produced in step 2.
The issues with this question are:
Example Task Two - NL question does not match with the SQL code - SQL is not
executable:
SQL Query:
SELECT
TOP_TERMS.term,
INTERNATIONAL_TOP_TERMS.region_name
FROM
bigquery-public-data.google_trends.top_terms
LEFT JOIN
bigquery-public-data.google_trends.international_top_terms
ON
TOP_TERMS.term = INTERNATIONAL_TOP_TERMS.term
WHERE
TOP_TERMS.week = '2023-03-05'
LIMIT 1000;
NL Question:
The most popular search terms in a given week, along with their region
Step 1: Review the Natural Language question and the corresponding SQL query.
Consider whether or not the filter conditions, return columns, etc. align between the query
and the question.
REMINDER: No specific action is needed at this step.
Step 2: Verify if the SQL query executes, returns at least one result, and uses a left
join statement in an intuitive way.
The original query executes but it does not rank terms by popularity within each region (the
following image shows the last 6 rows as an extract of the complete result).
Using the tips provided in the troubleshooting SQL query section of the instructions, we can
modify the statement to be:
SELECT
x.term,
x.region_name,
COUNT(*) AS popular
FROM (
SELECT
top_terms.term,
international_top_terms.region_name
FROM
`bigquery-public-data`.`google_trends`.`top_terms`
LEFT JOIN
`bigquery-public-data`.`google_trends`.`international_top_terms`
ON
top_terms.term = international_top_terms.term
WHERE
top_terms.week = '2023-03-05'
) AS x
GROUP BY
x.term,
x.region_name
ORDER BY
popular DESC
LIMIT 100;
The corrected query introduces two main changes to address the issues in the original
query, so that it not only retrieves data but also ranks terms by popularity within each region:
Copy the results and paste them into the provided space (the following image shows the
last 5 rows as an extract of the complete result).
Step 4: Determine whether or not the modified SQL query aligns with the provided
natural language question
The original question was, “The most popular search terms in a given week, along with
their region”
Remember: This question needs to relate to the MODIFIED SQL query produced in step 2.
The issues with this question are:
FAQs
A quick-reference guide addressing common concerns and issues reviewers or recorders may face,
with clear instructions on how to resolve them.
Question: Do I need to select all the columns of a table as follows if I want to show all of them in the results?
SELECT
assignment.rf_id,
assignment.file_id,
assignment.cname,
assignment.caddress_1,
assignment.caddress_2,
assignment.caddress_3,
assignment.caddress_4,
assignment.reel_no,
assignment.frame_no,
assignment.convey_text,
assignment.record_dt,
assignment.last_update_dt,
assignment.page_count,
assignment.purge_in,
conveyance.convey_ty,
conveyance.employer_assign
FROM
bigquery-public-data.uspto_oce_assignment.assignment AS assignment
LEFT JOIN
bigquery-public-data.uspto_oce_assignment.assignment_conveyance AS conveyance
ON
assignment.rf_id = conveyance.rf_id
LIMIT 10;
Answer: No. Please use "SELECT *" instead, as follows. This is especially critical for training data. Teaching the model to list too many columns might cause truncation of results.
SELECT
assignment.*,
conveyance.convey_ty,
conveyance.employer_assign
FROM
`bigquery-public-data`.`uspto_oce_assignment`.`assignment` AS assignment
LEFT JOIN
`bigquery-public-data`.`uspto_oce_assignment`.`assignment_conveyance` AS conveyance
ON
assignment.rf_id = conveyance.rf_id
LIMIT 10;