
E-SQL: Direct Schema Linking via Question Enrichment in

Text-to-SQL
Hasan Alp Caferoğlu Özgür Ulusoy
[email protected] [email protected]
Bilkent University Bilkent University
Ankara, Turkey Ankara, Turkey

arXiv:2409.16751v2 [cs.CL] 28 Jan 2025

Abstract

Translating Natural Language Queries into Structured Query Language (Text-to-SQL or NLQ-to-SQL) is a critical task extensively studied by both the natural language processing and database communities, aimed at providing a natural language interface to databases (NLIDB) and lowering the barrier for non-experts. Despite recent advancements made through the use of Large Language Models (LLMs), significant challenges remain. These include handling complex database schemas, resolving ambiguity in user queries, and generating SQL queries with intricate structures that accurately reflect the user's intent. In this work, we introduce E-SQL, a novel pipeline specifically designed to address these challenges through direct schema linking and candidate predicate augmentation. E-SQL enhances the natural language query by incorporating relevant database items (i.e., tables, columns, and values) and conditions directly into the question and SQL construction plan, bridging the gap between the query and the database structure. The pipeline leverages candidate predicate augmentation to mitigate erroneous or incomplete predicates in generated SQLs. Comprehensive evaluations on the BIRD benchmark illustrate that E-SQL achieves competitive performance, particularly excelling in complex queries with a 66.29% execution accuracy on the test set. A further observation from our experiments reveals that incorporating schema filtering into the translation pipeline does not have a positive impact on performance when the most advanced proprietary LLMs are used. Additionally, our experiments with small LLMs highlight the importance and positive impact of enriched questions on their performance. Without fine-tuning, single-prompt SQL generation using enriched questions with DeepSeek Coder 7B Instruct 1.5v achieves 56.45% execution accuracy on the BIRD development set.

Keywords

Text-to-SQL, Large Language Model, Schema Linking, Question Enrichment

1 Introduction

The task of translating natural language queries into SQL (Text-to-SQL) has garnered considerable attention due to its potential to lower the technical barrier for non-experts and enhance the performance of querying or recommendation systems. Situated at the intersection of natural language processing (NLP) and database management, this task aims to enable users to interact with databases through simple, natural language queries, without requiring extensive knowledge of SQL syntax or database schema structures. Despite advancements in utilizing large language models (LLMs) for Text-to-SQL, a significant performance gap of around 20% still remains between the best-performing models and human-level accuracy, underscoring that even the most sophisticated pipelines are not yet suitable for real-world deployment as a natural language interface to databases [28].

Prior to the emergence of LLMs [1–4, 7, 29, 33, 34, 41, 42], a wide range of studies [6, 12, 16, 23, 30, 43, 44, 46, 54] focused on building encoder-decoder based neural network architectures utilizing recurrent neural networks [5, 17] and various pre-trained language models [8, 9, 38, 51]. These early approaches established a foundation but were often limited in handling complex queries or schemas.

LLMs have shown substantial potential in the Text-to-SQL task, yielding impressive results across various benchmarks. To further enhance the reasoning capabilities of LLMs, a variety of in-context learning (ICL) techniques have been introduced, including chain-of-thought (CoT) prompting [48], question decomposition [20, 55], self-consistency [47], and others [18, 32, 49, 53]. Although many of these strategies have been successfully applied to Text-to-SQL translation pipelines [13, 26, 35, 37, 40], improving LLM reasoning specifically from the perspective of question refinement remains relatively underexplored.

Beyond in-context learning, LLM performance can also be improved through fine-tuning or training from scratch. However, these techniques are resource-intensive, requiring significant computational resources and large volumes of task-specific annotated data. While proprietary models are less frequently fine-tuned for Text-to-SQL [31], promising results have been achieved through the fine-tuning of numerous open-source models [13, 26, 35, 36, 40, 45].

Figure 1 demonstrates the general pipeline and modules used in prior works. A critical component of the Text-to-SQL task is schema linking, which involves connecting the natural language query to the database schema. Although various methods have been proposed to enhance schema linking, it remains a core challenge. Schema filtering, a technique commonly used to eliminate irrelevant database items, has been widely adopted to reduce noise for downstream tasks. While both neural network-based [25, 26] and LLM-based [10, 22, 35, 37, 40] schema filtering techniques have been explored, our findings align with those of Maamari et al. [31], indicating that schema filtering can result in performance degradation when the latest generation of LLMs is employed. Additionally, several studies [13, 31, 35–37, 40, 45] reveal that providing database-related information in response to a query significantly enhances performance.

In this work, we introduce E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL [1], a novel pipeline designed to directly address the schema linking challenge through question enrichment and candidate predicate augmentation.

[1] The complete code required to reproduce the reported results is publicly available on our GitHub repository: https://anonymous.4open.science/r/E-SQL_Direct_Schema_Linking

We explore the improvement of both LLM reasoning and schema linking from the perspective of question enrichment. E-SQL enhances the natural language query representation by incorporating relevant database elements—such as tables, columns, and values—directly into the query, along with an associated SQL construction plan. This approach is augmented by generating candidate predicates, which reduce the likelihood of erroneous or incomplete SQL predicates. This methodology differs from traditional schema filtering techniques, which have been commonly used to simplify the schema presented to the model. The E-SQL pipeline consists of four main modules: Candidate SQL Generation (CSG), Candidate Predicate Generation (CPG), Question Enrichment (QE), and SQL Refinement (SR).

In the Candidate SQL Generation (CSG) module, an initial SQL query is generated. This query is then parsed to extract values and operations from its predicates. The Candidate Predicate Generation (CPG) module uses these extracted elements to find similar values from the database and constructs candidate predicates. Using the candidate predicates, the Question Enrichment (QE) module instructs the LLM to incorporate relevant database items and possible predicates into the natural language question, while concurrently formulating SQL construction steps as its reasoning process. These steps are then utilized to produce a fully enriched query. Simultaneously, the candidate SQL query is executed to identify potential execution errors. Finally, in the SQL Refinement (SR) module, the candidate SQL query is either refined or a new SQL query is generated, utilizing the enriched question, candidate predicates, and any identified execution errors.

The impact of schema filtering, a widely adopted technique in previous research, is also explored on our pipeline. We incorporate an additional schema filtering module into our pipeline, where the LLM is instructed to select only the database tables and columns relevant to the query while eliminating others. Following this, the Filtered Schema Correction technique is applied to resolve any inconsistencies between the filtered schema and the original database schema. Our experiments demonstrate that schema filtering can negatively affect performance when applied in conjunction with the most advanced proprietary LLMs. Instead, direct schema linking through question enrichment and candidate predicate augmentation provides a more reliable strategy for accurate SQL generation, particularly in complex cases.

Through an ablation study, we illustrate the effectiveness of each module within the pipeline for the Text-to-SQL task. In particular, our question enrichment module significantly improves performance on challenging questions, yielding nearly a 5% increase in accuracy.

We evaluate E-SQL on the Spider [52] and BIRD [28] benchmarks, well-known standard datasets for the Text-to-SQL task, and demonstrate its ability to handle complex queries while maintaining competitive performance with state-of-the-art methods. Our findings suggest that the integration of enriched questions, SQL generation steps, and candidate predicates leads to more accurate SQL generation, particularly for complex queries involving multiple conditions and joins. Therefore, our approach establishes a new paradigm for schema linking and prompt augmentation by leveraging question enrichment and candidate predicate augmentation in the context of Text-to-SQL translation.

The key contributions of our work can be summarized as follows:
• We propose a new paradigm for schema linking through question enrichment, which leads to direct schema linking by incorporating related database items and potential conditions into the natural language question. Fully enriched queries further guide LLMs in SQL construction by providing explicit logical steps.
• To the best of our knowledge, we are the first to enhance both LLM reasoning capabilities and schema linking performance in the Text-to-SQL translation task through the use of a question enrichment module, which composes database-integrated questions.
• We propose a candidate predicate generation technique leveraging the LIKE operator and demonstrate its positive impact by augmenting prompts with candidate predicates.
• Our experiments also confirm the potential drawbacks of traditional schema filtering techniques when integrated into a Text-to-SQL translation pipeline like ours, which leverages the most advanced proprietary LLMs.
• We demonstrate the importance and positive impact of database-integrated questions, including logical SQL construction steps, on the performance of small LLMs in the task of Text-to-SQL translation, achieved without requiring fine-tuning.

2 Related Work

Before the emergence of LLMs, supervised fine-tuning approaches in Text-to-SQL translation focused on encoder-decoder architectures that utilized recurrent neural networks (RNNs) [16, 43, 54], pre-trained language models (PLMs) [12, 23, 30], convolutional neural networks (CNNs) [6], and graph neural networks (GNNs) [46]. These methods encoded natural language questions alongside the database schema to establish schema linking and generated SQL queries through sequence generation [30, 54], grammar-based methods [6, 16, 46], or sketch-based slot-filling strategies. These approaches provided a foundational understanding for Text-to-SQL tasks, paving the way for more advanced solutions.

The emergence of both proprietary and open-source LLMs [3, 4, 15, 19, 34, 41] has marked a significant shift in the field. Thanks to their advanced reasoning and comprehension capabilities, the research community has increasingly focused on harnessing the power of these models for Text-to-SQL tasks.

2.1 LLM Reasoning

Advanced reasoning techniques are crucial for improving LLM performance on complex tasks. While prompt design is important, methods that enhance intrinsic reasoning, such as breaking down problems or refining formulations, have led to significant advancements in LLM capabilities.

The Chain-of-Thought (CoT) approach [48] significantly improves performance on multi-step reasoning tasks by guiding LLMs to generate intermediate reasoning steps. Kojima et al. [21] explored a simple yet effective prompt, "Let's think step by step", to uncover the reasoning ability of language models in zero-shot mode.

Figure 1: Overview of the general pipeline for the Text-to-SQL translation task, highlighting the key modules: Schema Filtering, Question Decomposition, Entity Retrieval, and Query Generation. The modular design allows for variation in the usage of these components, depending on the preferred pipeline configuration.

Figure 2: Overview of the proposed E-SQL pipeline with the candidate predicate generation, question enrichment, and SQL refinement modules, and without a schema filtering module.

step", to uncover the reasoning ability of the language models in structure of the database. This process is facilitated by schema
zero-shot mode. Decomposing complex questions into simpler sub- linking, which bridges the gap between the query and the database
questions has been explored [11, 20, 55]. Techniques like Self Con- schema, ensuring that words or phrases in the query are accurately
sistency [47] apply majority voting to select consistent answers, matched to the relevant database elements, such as tables, columns,
while Self-Improve [18] and Self-Refine [32] iteratively refine re- or values.
sponses through self-generated data and feedback, respectively. Previous works have shown that removing irrelevant database
Moving beyond answer generation, Xi et al. [49] demonstrated sig- elements can improve schema linking and enhance model perfor-
nificant gains by refining problem contexts and questions through mance in Text-to-SQL tasks by reducing the likelihood of schema-
the Self-Polish technique. based hallucinations. This process, known as schema filtering or
To the best of our knowledge, the refinement and generation of schema pruning, has been extensively studied. RESDSQL [25] and
high-quality user queries, particularly with embedded reasoning CodeS [26] address this by classifying schema items based on their
that facilitates the construction of correct SQL queries, has not relevance to the natural language query and then filtering them
been explored in the literature for the Text-to-SQL translation task. according to their classification probabilities after ranking. While
Our work addresses this gap by proposing a methodology that DIN-SQL [35] uses a single step to select only the question rele-
produces clear, schema-aware queries enriched with reasoning vant database tables and columns, C3 [10], CHESS [40], and MCS-
elements, explicitly guiding the generation of accurate SQL queries SQL [22] first filter database tables (table linking) and then select
and overcoming schema linking challenges. the most appropriate columns (column linking) from the previously
filtered tables, achieving better schema filtering. For schema link-
2.2 Schema Linking and Filtering ing and filtering, the TASL module of TA-SQL [37] generates a list
Translating a natural language query into SQL requires a clear of schema entities (tables and columns) from an LLM-generated
understanding of both the language used in the question and the

Our approach also leverages the initially generated candidate SQL query; however, rather than using it for schema filtering, we extract possible database values from its conditions to generate potential predicates, which are then used in downstream tasks as part of data augmentation, as explained in Section 3.3. Gao et al. [13] explored the impact of different schema representations and demonstrated that representing the database schema as code, instead of as natural language, leads to a better understanding of the question-to-SQL mapping, thereby improving schema linking. While previous works have demonstrated the positive impact of schema filtering on schema linking, Maamari et al. [31] argue that explicit schema filtering remains useful for less capable large language models (LLMs), but it becomes unnecessary with the latest, more advanced LLMs such as GPT-4o. Our experiments with single-step schema filtering implemented in the E-SQL pipeline corroborate this finding, indicating that explicit schema filtering can be redundant and can result in performance degradation when applied within pipelines leveraging advanced LLMs such as GPT-4o and GPT-4o-mini [33].

2.3 Data Augmentation

One essential aspect of data augmentation involves providing relevant content to the LLM through prompts. Augmenting prompts with items pertinent to the query is crucial for improving Text-to-SQL translation performance. Depending on the sub-task, prompts are typically enriched with explanations of database items, database values selected based on similarity, sub-task specific examples, and the database schema—represented either as code or natural language, filtered or unfiltered—along with decomposed queries and candidate SQL queries [13, 31, 35–37, 40, 45]. In our work, we incorporate database item explanations and database values similar to the query into the prompt. Additionally, we include execution errors for candidate queries to provide valuable feedback during query refinement. Unlike prior work, we propose a novel approach to further reduce predicate errors and enhance LLM performance by introducing candidate conditions with exact database values, effectively bridging the gap between the natural language query and the database.

2.4 SQL Query Generation and Correction

The final step in Text-to-SQL tasks is generating an SQL query that accurately answers the user's natural language question. While some approaches, like C3 [10], employ zero-shot generation, few-shot prompting techniques are more commonly used to enhance performance. Methods such as self-consistency [47] and multi-choice selection are employed by models like C3, DAIL-SQL, CHESS, and MSC-SQL [10, 13, 22, 31, 40]. However, these techniques lead to high computational costs due to the large number of generation steps for a single query. To address missing or redundant keywords in generated SQLs, Pourreza and Rafiei [35] introduced a self-correction module. With MAC-SQL, Wang et al. [45] proposed a refiner agent that evaluates the syntactic correctness and executability of the generated SQL, ensuring non-empty result generation. In our work, we propose an SQL refinement module similar to that of MAC-SQL [45], where we refine the SQL after enriching the question and augmenting the prompt with candidate predicates.

3 Methodology

Our work focuses on bridging the gap between the database schema and users' natural language queries by enriching the questions with database items through keyword mappings and expanding them with SQL generation steps and reasoning. Our approach can be characterized as direct schema linking via query enrichment with relevant database elements. To further minimize errors such as incorrect table, column, or missing value usage in predicates, we augment the prompt with all possible predicates after extracting them.

Our proposed method consists of four main modules, as illustrated in Figure 2: Candidate SQL Generation (CSG), Candidate Predicate Generation (CPG), Question Enrichment (QE), and SQL Refinement (SR). Additionally, recognizing the mixed outcomes of database schema filtering observed in prior work, we include a Schema Filtering and Filtered Schema Correction module (SF). However, our experiments reveal that schema filtering can lead to performance degradation when integrated into Text-to-SQL pipelines with the most advanced large language models (LLMs), corroborating the findings of Maamari et al. [31]. Detailed explanations of each module are provided in the following subsections.

3.1 Database Description and Value Selection

In real-world databases, description files serve as a valuable resource, offering detailed information about database items. Large-scale databases often contain numerous description files with content that can be extensive. Another crucial piece of information for LLMs is the actual data values within the database. However, augmenting the prompt with all available descriptions and database values is impractical due to the context window limitations of LLMs and the associated computational costs. An effective perspective on prompt augmentation therefore involves carefully selecting relevant database-related information while filtering out content unrelated to the query, thereby minimizing noise and enhancing model performance.

Li et al. [27], Scholak et al. [39], and Li et al. [25] utilize the Longest Common Substring (LCS) algorithm to identify database values related to the query. However, the time complexity of the LCS algorithm can significantly slow down the retrieval process. To address this issue, Li et al. [26] propose a "coarse-to-fine" matching framework, which initially extracts hundreds of potentially relevant values from the entire database using BM25, followed by the application of the LCS algorithm to retrieve the most query-relevant values. Talaei et al. [40] employ a different approach by first extracting keywords from the natural language query with the assistance of an LLM, and then retrieving similar values from the database using a Locality Sensitive Hashing (LSH) based hierarchical retrieval strategy. DIN-SQL [35] takes a more focused approach by extracting possible cell values directly from the query with the help of an LLM, similar to the keyword extraction phase of CHESS [40]. In our work, we employ the BM25 ranking algorithm not only to retrieve the most relevant database values but also to identify the most pertinent database descriptions.

We augment prompts with the 10 most relevant database values for each column and the 20 most relevant database description sentences, as determined by the BM25 ranking algorithm. Additionally, when providing database values for each column, we ensure that any columns containing "NULL" values also include "NULL" as part of the selected values. This guarantees that the LLM is aware of potential null entries, allowing it to incorporate conditions such as "IS NOT NULL" when needed.
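The following sketch illustrates how such a BM25-based selection can be implemented with the open-source rank_bm25 package. The sample data, function name, and corpus layout are illustrative assumptions, not E-SQL's actual identifiers.

from rank_bm25 import BM25Okapi

def top_k_by_bm25(question: str, corpus: list[str], k: int) -> list[str]:
    # Score every snippet against the question and keep the k best.
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(zip(scores, corpus), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:k]]

question = "Please list the zip code of all the charter schools in Fresno"

# Distinct values per column (toy data); the paper keeps the 10 most relevant
# values per column and always surfaces NULL when a column contains one.
column_values = {
    "schools.Zip": ["93727", "93606", None],
    "frpm.District Name": ["Fresno County Office of Education", "Alameda Unified"],
}
for column, values in column_values.items():
    selected = top_k_by_bm25(question, [v for v in values if v is not None], k=10)
    if None in values:
        selected.append("NULL")  # keep the LLM aware of possible null entries
    print(column, selected)

# The 20 most relevant description sentences are chosen the same way.
descriptions = [
    "Zip stores the postal code of the school.",
    "District Name is the full name of the school district.",
]
print(top_k_by_bm25(question, descriptions, k=20))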
3.2 Candidate SQL Generation Module (CSG)

When provided with appropriate information, LLMs can effectively establish strong connections between the database schema and the natural language query, resulting in the generation of mostly accurate SQL queries. Although these SQL queries may occasionally contain errors, such as incorrect table or column usage or missing values in predicates, they typically avoid incorporating completely irrelevant database items. To leverage this capability, we extract potentially useful information from the initially generated SQL queries in subsequent steps to enhance data quality and, consequently, improve model performance. Qu et al. [37] generate and employ a dummy SQL query for extracting schema entities. Further details on how the dummy SQL is utilized in our approach are provided in Section 3.3.

Our candidate SQL generation module incorporates three randomly selected samples, each from a different database than the one associated with the current query, across various difficulty levels from the few-shot data. These samples are ordered to present simpler question-SQL pairs first, followed by more challenging pairs. After providing the database schema code representation as suggested in DAIL-SQL [13], the prompt is further augmented with selected database descriptions and relevant database values for each column, as described in Section 3.1. To enhance the LLM's reasoning capabilities, we use the phrase "Let's think step by step" and instruct the LLM to generate reasoning steps, as proposed by Kojima et al. [21] and Wei et al. [48].
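A compact sketch of how this prompt could be assembled is given below. The section markers, field names, and few-shot pool layout are assumptions for illustration; the actual prompt skeletons are available in the authors' repository.

import random

def build_csg_prompt(schema_code: str, descriptions: str, column_values: str,
                     question: dict, few_shot_pool: list[dict]) -> str:
    # Three examples per difficulty level, each drawn from a database other
    # than the one the current question targets, ordered simple -> challenging.
    examples = []
    for level in ("simple", "moderate", "challenging"):
        pool = [ex for ex in few_shot_pool
                if ex["difficulty"] == level and ex["db_id"] != question["db_id"]]
        examples += random.sample(pool, k=3)
    shots = "\n\n".join(f"Q: {ex['question']}\nSQL: {ex['sql']}" for ex in examples)
    return (f"{shots}\n\n"
            f"Database schema (as code):\n{schema_code}\n"
            f"Selected descriptions:\n{descriptions}\n"
            f"Relevant column values:\n{column_values}\n"
            f"Q: {question['text']}\n"
            "Let's think step by step.")  # zero-shot CoT cue [21, 48]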
3.3 Candidate Predicate Generation (CPG)

Determining the correct predicates to use in the SQL query is a crucial step in Text-to-SQL translation. Successfully establishing the relationship between database items and the query is essential. However, even the most advanced LLMs sometimes struggle to generate accurate predicates. For a predicate in an LLM-generated SQL query, assuming the correct operation is used, there are six possible cases:

(1) The predicate is correct syntactically, schematically, and semantically. In other words, executing the "SELECT * FROM <table> WHERE <table>.<column> <operation> <value>" query would yield results without any execution errors or an empty set.
(2) The table and column are correct, but the value used in the predicate is incomplete or contains extraneous characters or words. As shown in Figure 3, the generated predicate uses "Fresno" instead of "Fresno County Office of Education" as a value, leading to an incorrect SQL despite the absence of execution errors.
(3) The table and value are correct, but the wrong column is selected in the predicate. In other words, another column in the selected table contains the value. Figure 4 shows wrong column usage while the selected table and value are correct.
(4) The correct table is selected, but both the column and the value are incorrect or contain missing or extraneous parts. As shown in Figure 5, "T1.`County Name` = 'Fresno'" is generated as part of the predicate, whereas it should be "T1.`District Name` = 'Fresno County Office of Education'".
(5) The value is generated correctly, but the wrong table and column are selected, meaning the generated value should be used but belongs to another table and one of its columns.
(6) The table, column, and value are wrong and totally irrelevant to the question.

To address the possible errors outlined in cases (2) to (5), we propose the Candidate Predicate Generation (CPG) module, as seen in Figure 2, to enhance downstream tasks by augmenting prompts, enabling the LLM to be aware of possible predicates and generate correct ones. In the CPG module, we first parse the candidate SQL query to extract the values and operations used in predicates. We then utilize the LIKE operator to retrieve all potential values from the database by executing the following query:

SELECT DISTINCT <COLUMN>
FROM <TABLE>
WHERE <COLUMN> LIKE '%<VALUE>%';

Here, <VALUE> represents a token extracted from the values in the candidate SQL query. This process results in a list of possible predicates, formatted as "<table>.<column> <operation> <value>", which is used in downstream tasks.

Question:
Please list the zip code of all the charter schools in Fresno County Office of Education.

Predicted SQL:
SELECT T2.Zip
FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode
WHERE T1.`District Name` = 'Fresno' AND T1.`Charter School (Y/N)` = 1

Gold SQL:
SELECT T2.Zip
FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode
WHERE T1.`District Name` = 'Fresno County Office of Education' AND T1.`Charter School (Y/N)` = 1

Figure 3: Example of the generation of an incomplete value in the predicate, explained in Section 3.3, case (2).
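A simplified sketch of this retrieval step is shown below using Python's built-in sqlite3 module. The value parsing is deliberately reduced to whitespace tokenization, and the function name is illustrative; the actual module parses the candidate SQL's predicates properly.

import sqlite3

def candidate_predicates(db_path: str, table: str, column: str,
                         value: str, operation: str = "=") -> list[str]:
    # Every stored value containing one of the tokens of the predicted value
    # yields a candidate "<table>.<column> <operation> <value>" predicate.
    predicates = set()
    with sqlite3.connect(db_path) as conn:
        for token in value.split():
            rows = conn.execute(
                f'SELECT DISTINCT "{column}" FROM "{table}" '
                f'WHERE "{column}" LIKE ?',
                (f"%{token}%",),
            ).fetchall()
            predicates.update(
                f"{table}.{column} {operation} '{row[0]}'" for row in rows)
    return sorted(predicates)

# For the incomplete value of Figure 3, the token "Fresno" would recover
# predicates such as:
# frpm.District Name = 'Fresno County Office of Education'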
3.4 Schema Filtering and Filtered Schema Correction Module (SF)

In prior works, schema filtering—eliminating database tables and columns irrelevant to the query—has been shown to improve model performance by reducing the likelihood of generating irrelevant schema items in SQL queries. One approach involves instructing the LLM to first select the relevant database tables and then choose the most relevant columns from those tables [10, 22, 40].

Another approach filters the database schema by considering both tables and columns simultaneously in a single step [35]. Additionally, some methods leverage dummy SQL queries to extract relevant database entities for schema filtering [37]. Li et al. propose CodeS [26] and train a schema classifier to predict relevance scores following RESDSQL [25]. In our work, we adopt a single-step schema filtering approach, extracting only the relevant tables and their associated columns.

However, we observed that the filtered schema may not always be compatible with the original schema: a selected column might belong to a different table according to the original database schema. When such issues arise, they can negatively impact the SQL generation process and lead to a decline in performance. To address this, we propose a filtered schema correction strategy, which checks for mismatches between the filtered schema and the original schema and subsequently corrects them. While previous research has demonstrated the benefits of schema filtering for schema linking and overall Text-to-SQL translation performance, our experiments show that incorporating schema filtering into our pipeline results in a performance decrease when used with the latest proprietary LLMs, a finding that aligns with the work of Maamari et al. [31]. Detailed experimental results are provided in Section 4.3. Consequently, we have chosen not to include a schema filtering module in our E-SQL pipeline.
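A minimal sketch of this correction strategy is given below, assuming both schemas are represented as table-to-column mappings; the data structures and function name are illustrative.

def correct_filtered_schema(filtered: dict[str, list[str]],
                            original: dict[str, list[str]]) -> dict[str, list[str]]:
    corrected: dict[str, list[str]] = {}
    for table, columns in filtered.items():
        for column in columns:
            if column in original.get(table, []):
                corrected.setdefault(table, []).append(column)  # already consistent
            else:
                # Move the column to the table that actually owns it.
                for owner, owner_columns in original.items():
                    if column in owner_columns and column not in corrected.get(owner, []):
                        corrected.setdefault(owner, []).append(column)
    return corrected

original = {"frpm": ["CDSCode", "District Name"], "schools": ["CDSCode", "Zip"]}
filtered = {"frpm": ["CDSCode", "Zip"]}  # "Zip" was assigned to the wrong table
print(correct_filtered_schema(filtered, original))
# {'frpm': ['CDSCode'], 'schools': ['Zip']}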
Question:
Please list the zip code of all the charter schools in Fresno County Office of Education.

Predicted SQL:
SELECT T2.Zip
FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode
WHERE T1.`County Name` = 'Fresno County Office of Education' AND T1.`Charter School (Y/N)` = 1

Gold SQL:
SELECT T2.Zip
FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode
WHERE T1.`District Name` = 'Fresno County Office of Education' AND T1.`Charter School (Y/N)` = 1

Figure 4: Example of a correct table and value but wrong column in the predicate, explained in Section 3.3, case (3).

Question:
Please list the zip code of all the charter schools in Fresno County Office of Education.

Predicted SQL:
SELECT T2.Zip
FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode
WHERE T1.`County Name` = 'Fresno' AND T1.`Charter School (Y/N)` = 1

Gold SQL:
SELECT T2.Zip
FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode
WHERE T1.`District Name` = 'Fresno County Office of Education' AND T1.`Charter School (Y/N)` = 1

Figure 5: Example of a correct table but wrong column and value selection in the predicate, explained in Section 3.3, case (4).
3.5 Question Enrichment Module (QE)

To enhance schema linking, various LLM reasoning improvement techniques, such as chain-of-thought (CoT) [48], question decomposition [11, 20, 55], and self-consistency [47], have been applied to Text-to-SQL translation tasks. Almost all prior works leveraging LLMs utilize chain-of-thought reasoning. Question decomposition has been applied to Text-to-SQL tasks by Pourreza et al. [35] and Wang et al. [45]. Self-consistency and multi-choice selection techniques have been employed by models like C3 [10], DAIL-SQL [13], CHESS [40], and MSC-SQL [22]. Another key approach is schema filtering, which eliminates irrelevant database items and augments the prompt with database values related to the query, thereby narrowing the gap between the query and the database schema [10, 22, 37, 40]. Previous paradigms have largely overlooked enhancing the reasoning and schema linking capabilities of LLMs through question reformulation. Focusing on this aspect, we propose a novel question enrichment strategy that directly links natural language questions to the database schema by expanding the question with relevant database items and incorporating logical steps to generate accurate SQL queries, as shown in Figure 6.

In the Question Enrichment module (QE), we instruct the LLM to refine the original question by incorporating relevant database items (tables, columns, and values) and conditions. This process makes the question more understandable, coherent, and well-aligned with the database schema. Additionally, an SQL construction plan, generated during the question-database integration as part of the reasoning process, is appended to the enriched question. Together, this plan and the enriched question form the fully enriched question, which explicitly incorporates the necessary SQL components and logical steps, guiding the LLM to generate accurate SQL queries. The creation process of a fully enriched question, which combines the original question, the enriched question, and the enrichment reasoning, is illustrated in Figure 6b. A few-shot strategy is applied to generate refined questions, using randomly sampled question-SQL pairs from the development set that have been manually annotated and made publicly available. We annotate 12 questions for each difficulty level: simple, moderate, and challenging. Since the manually annotated questions are taken from the development set, we ensure that the 3 randomly selected examples from each difficulty level provided in the question enrichment prompt are related to a database different from the one associated with the current query. This approach prevents enriched question examples directly related to the database of the considered query from being included in the prompt. In the few-shot examples, we include both the enriched question and the enrichment reasoning, manually prepared following the chain-of-thought technique, to fully leverage the reasoning capabilities of the model and explicitly outline the SQL logical steps. The question enrichment prompt also includes the database schema, relevant database descriptions and values, and candidate predicates generated by the Candidate Predicate Generation (CPG) module.

The fully enriched question is generated in a single prompt, without iterative refinement. Iterative enrichment is left as a potential direction for future work.

(a) Question enrichment example. (b) Concatenation of the original question, enrichment reasoning, and enriched question.
Figure 6: (a) Question enrichment example and (b) fully enriched question construction.
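The sketch below shows one way the concatenation of Figure 6b could be realized; the exact formatting and the sample texts are illustrative assumptions, not the pipeline's actual template.

def build_fully_enriched_question(original: str, reasoning: str, enriched: str) -> str:
    # Original question + enrichment reasoning (the SQL construction plan)
    # + enriched question = fully enriched question.
    return (f"Original question: {original}\n"
            f"Enrichment reasoning: {reasoning}\n"
            f"Enriched question: {enriched}")

print(build_fully_enriched_question(
    "Please list the zip code of all the charter schools in Fresno County "
    "Office of Education.",
    "Join frpm with schools on CDSCode; filter frpm.`District Name` = "
    "'Fresno County Office of Education' and frpm.`Charter School (Y/N)` = 1; "
    "select schools.Zip.",
    "List schools.Zip of charter schools (frpm.`Charter School (Y/N)` = 1) "
    "whose frpm.`District Name` is 'Fresno County Office of Education'."))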
3.6 Predicate and Error-Aware SQL Refinement Module (SR)

Since the generated SQL queries may contain minor mistakes, a common approach to address these is to use a correction module, as shown in Figure 1. Pourreza et al. [35] proposed a novel self-correction module that relies heavily on LLMs to identify and correct potential errors without executing the SQL queries. Another approach, employed by C3 [10], DAIL-SQL [13], CHESS [40], and MCS-SQL [22], is to generate multiple SQL queries for a natural language question and select the most consistent one as the final predicted SQL, rather than relying solely on the greedily generated SQL. The refiner agent proposed by Wang et al. [45] verifies the syntactic correctness and executability of the generated SQL, triggering a correction operation if the SQL fails these checks. The refiner agent performs these actions iteratively until the result is correct or the maximum number of iterations is reached. Although this design enhances performance, it also increases both the cost and time required to produce the final SQL query. In our work, as illustrated in Figure 2, we execute the candidate SQL query and detect any execution errors. Using this error information, along with the candidate predicates, the enriched question, and the database schema, we instruct the LLM to either generate a new SQL query or refine the existing candidate SQL query.

The prompt skeletons used for each module in our pipeline are provided in our GitHub repository. Furthermore, a detailed example illustrating the complete workflow of the pipeline for a sample user query is also provided for reference.
correct potential errors without executing the SQL queries. Another 
 1.25 if 𝑦ˆ is correct and 𝜏 ≥ 2
approach employed by C3 [10], DAIL-SQL [13], CHESS [40], and

1 if 𝑦ˆ is correct and 1 ≤ 𝜏 < 2


MCS-SQL [22] is to generate multiple SQL queries for a natural


 0.75 if 𝑦ˆ is correct and 0.5 ≤ 𝜏 < 1


language question and select the most consistent one as the final R-VES =

predicted SQL, rather than relying solely on the greedily gener- 

 0.5 if 𝑦ˆ is correct and 0.25 ≤ 𝜏 < 0.5
ated SQL. The refiner agent proposed by Wang et al. [45] verifies  0.25 if 𝑦ˆ is correct and 𝜏 < 0.25



the syntactic correctness and executability of the generated SQL,

0 if 𝑦ˆ is incorrect

triggering a correction operation if the SQL fails these checks. The

Where:
refiner agent performs these actions iteratively until the result is
• 𝑦ˆ represents the predicted SQL.
correct or the maximum number of iterations is reached. Although Ground truth SQL run time
this design enhances performance, it also increases both the cost • 𝜏 = Predicted SQL run time represents the time ratio. 𝜏 is
and time required to produce the final SQL query. In our work, as calculated by running the SQL 100 times, taking the average,
illustrated in Figure 2, we execute the candidate SQL query and de- and dropping any outliers.
tect any execution errors. Using this error information, along with Moreover, the BIRD benchmark introduced the Soft F1-score
candidate predicates, the enriched question, and database schema, metric, designed to address the limitations of evaluation metrics
we instruct the LLM to either generate a new SQL query or refine such as EX. The Soft F1-score allows for a more lenient assessment
the existing candidate SQL query. by mitigating the impact of minor discrepancies, such as column

4.1.3 Models. In this work, we employed the latest proprietary models (GPT-4o-mini and GPT-4o) and small open-source LLMs (Qwen2.5 Coder [19], DeepSeek Coder [15]) without fine-tuning as the backbone for our experiments. Exploring other models or fine-tuning open-source LLMs is left as future work. For the proprietary models, the majority of the experiments were conducted using GPT-4o-mini, which is approximately 30% more cost-effective than GPT-4o. Despite its lower cost, GPT-4o-mini demonstrated robust capabilities, making it an excellent choice for balancing performance and budget constraints.

4.1.4 Hyperparameters. We standardized the experimental settings across all tests. The temperature was set to 0.0 to promote deterministic outputs, while the top_p parameter, which refers to the nucleus sampling technique, was set to 1.0. The number of chat completion choices was fixed at 1; however, this could be increased in future work by employing techniques like multiple-choice selection or self-consistency. The maximum token limit was set to 2048. For each module of the E-SQL pipeline, 9-shot examples were provided, with 3 randomly selected examples for each difficulty level. Regarding database column values, the 10 distinct values most relevant to the question were selected and used. Additionally, the 20 most relevant database description sentences were identified and incorporated.
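For reference, the sketch below shows a chat-completion call under these settings using the OpenAI Python SDK; the message content is a placeholder for the module prompts described in Section 3.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "<module prompt with 9-shot examples>"}],
    temperature=0.0,  # deterministic outputs
    top_p=1.0,        # nucleus sampling parameter
    n=1,              # a single chat completion choice
    max_tokens=2048,  # maximum token limit
)
print(response.choices[0].message.content)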
4.2 Results

We evaluate the proposed E-SQL pipeline primarily on the Spider test set and the BIRD development set, as the BIRD test set is not publicly available. We conduct most of our experiments using GPT-4o-mini due to its cost-effectiveness. We also experiment with small open-source models without fine-tuning to evaluate the impact of the pipeline and highlight the importance of database-aligned questions. Table 1 compares the performance of E-SQL and several baseline models on the BIRD dataset, demonstrating competitive results. Additionally, we evaluate the performance of our method across various difficulty levels on the BIRD dataset. Table 2 provides insights into the performance of E-SQL utilizing cost-effective proprietary LLMs and small open-source LLMs on the Spider test dataset.

E-SQL shows a notable improvement, particularly for challenging questions in the BIRD dataset, largely due to the incorporation of the Question Enrichment (QE) and SQL Refinement (SR) modules. With GPT-4o, the pipeline achieved an Execution Accuracy (EX) of 66.29 and a Soft F1-score of 67.93 on the BIRD test set. When using the more cost-effective model, GPT-4o-mini, E-SQL achieved 59.81 EX and a 61.59 Soft F1-score on the BIRD test set. Additionally, using the small open-source LLM Qwen2.5 Coder 7B, E-SQL achieved 53.59 EX on the BIRD development set. On the Spider test set, E-SQL achieved a 74.75 EX score using GPT-4o-mini and a 58.64 EX score using Qwen2.5 Coder 7B Instruct. The detailed breakdown of performance across different datasets and query complexity levels can be found in Table 3 and Table 6. Further insights into the impact of question enrichment and candidate predicate augmentation are provided in Sections 4.4 and 4.5. These results highlight the efficacy of question enrichment, candidate predicate augmentation, and schema refinement techniques, especially in handling complex queries.

Table 1: Performance of E-SQL on the BIRD development and test sets. E-SQL consists of the CSG, CPG, QE, and SR modules, as illustrated in Figure 2.

Method | Dev EX | Test EX | Test R-VES
Undisclosed
OpenSearch-SQL, v2 + GPT-4o | 69.30 | 72.28 | 69.36
Distillery + GPT-4o | 67.21 | 71.83 | 67.41
ExSL + granite-34b-code | 67.47 | 70.37 | 68.79
Insights AI | 72.16 | 70.26 | 66.39
PURPLE + RED + GPT-4o | 68.12 | 70.21 | 65.62
ByteBrain | 65.45 | 68.87 | -
ExSL + granite-20b-code | 65.38 | 67.86 | 66.25
Arcwise + GPT-4o | 67.99 | 66.21 | 63.68
SCL-SQL | 64.73 | 65.23 | 61.28
OpenSearch-SQL, v1 + GPT-4 | 61.34 | 64.95 | -
PB-SQL, v1 | 60.50 | 64.84 | 60.36
Published
CHESS | 65.00 | 66.69 | 62.77
MCS-SQL + GPT-4 [22] | 63.36 | 65.45 | 61.23
SuperSQL [24] | 58.50 | 62.66 | -
SFT CodeS-15B [26] | 58.47 | 60.37 | 61.37
DTS-SQL + DeepSeek 7B [36] | 55.80 | 60.31 | -
MAC-SQL + GPT-4 [45] | 57.56 | 59.59 | 57.60
SFT CodeS-7B [26] | 57.17 | 59.25 | 55.69
TA-SQL + GPT-4 [37] | 56.19 | 59.14 | -
DAIL-SQL + GPT-4 [13] | 54.76 | 57.41 | 54.02
E-SQL + GPT-4o (Ours) | 65.58 | 66.29 | 62.43
E-SQL + GPT-4o-mini (Ours) | 61.60 | 59.81 | 55.64
E-SQL + Qwen2.5 Coder 7B Instruct (Ours) | 53.59 | - | -

Table 2: Performance of E-SQL on the Spider test set. E-SQL consists of the CSG, CPG, QE, and SR modules, as illustrated in Figure 2. The † symbol denotes methods utilizing fine-tuned LLMs.

Method | Model | Test EX
Proprietary Models
DAIL-SQL [13] | GPT-4 | 86.6
DIN-SQL [35] | GPT-4 | 85.3
TA-SQL [37] | GPT-4 | 85.0
MAC-SQL [45] | GPT-3.5-Turbo | 75.5
- | GPT-4 | 74.0
E-SQL (ours) | GPT-4o-mini | 74.75
Small Open-Source Models
DTS-SQL† [36] | Mistral-7B | 77.1
MSc-SQL† [14] | Gemma-2-9B | 69.30
E-SQL (ours) | Qwen2.5 Coder 7B Instruct | 58.64

4.3 Schema Filtering

Schema filtering is a technique aimed at minimizing the model's reliance on irrelevant database schema elements by removing such items, and it has been applied in numerous previous works.

Table 3: Detailed performance of E-SQL on the BIRD test set across query complexity levels.

Pipeline | Overall EX / Soft F1 / R-VES | Simple EX / Soft F1 / R-VES | Moderate EX / Soft F1 / R-VES | Challenging EX / Soft F1 / R-VES
E-SQL (GPT-4o) | 66.29 / 67.93 / 62.43 | 73.02 / 73.91 / 68.68 | 64.14 / 66.17 / 60.46 | 48.07 / 51.45 / 54.48
E-SQL (GPT-4o-mini) | 59.81 / 61.59 / 55.64 | 67.44 / 68.80 / 62.53 | 56.94 / 58.77 / 53.11 | 40.00 / 43.04 / 37.60

Table 4: Performance of different pipelines on the BIRD development set. SF represents the Schema Filtering module, QE represents the Basic Question Enrichment module, and G represents the Basic SQL Generation module. The arrows indicate performance improvement (↑) or decline (↓) compared to the G baseline. Experiments are conducted using GPT-4o-mini.

Pipeline | Overall Dev EX | Simple EX | Moderate EX | Challenging EX
SF-G | 49.48 (↓ 8.21) | 58.16 (↓ 6.38) | 36.85 (↓ 12.28) | 34.48 (↓ 8.96)
SF-QE-G | 55.34 (↓ 2.35) | 62.27 (↓ 2.27) | 46.12 (↓ 3.01) | 40.68 (↓ 2.76)
QE-G | 58.80 (↑ 1.11) | 64.43 (↓ 0.11) | 51.07 (↑ 1.94) | 47.58 (↑ 4.14)
G | 57.69 | 64.54 | 49.13 | 43.44

Table 5: Ablation study using GPT-4o-mini with EX and Soft F1 metrics on the BIRD development set. The arrows indicate performance improvement (↑) or decline (↓) compared to the base E-SQL.

Pipeline | Overall EX / Soft F1 | Simple EX / Soft F1 | Moderate EX / Soft F1 | Challenging EX / Soft F1
E-SQL | 61.60 / 65.61 | 68.00 / 71.54 | 53.23 / 58.34 | 47.59 / 51.02
w/o QE | 59.71 (↓ 1.89) / 63.84 (↓ 1.77) | 66.05 (↓ 1.95) / 69.86 (↓ 1.68) | 52.37 (↓ 0.86) / 57.27 (↓ 1.07) | 42.75 (↓ 4.84) / 46.52 (↓ 4.50)
w/o CPG | 59.58 (↓ 2.02) / 63.61 (↓ 2.00) | 65.51 (↓ 2.49) / 69.16 (↓ 2.38) | 51.29 (↓ 1.94) / 56.27 (↓ 2.07) | 48.27 (↑ 0.68) / 51.68 (↑ 0.66)
w/o QE & CPG | 58.34 (↓ 3.26) / 62.41 (↓ 3.20) | 64.22 (↓ 3.78) / 67.91 (↓ 3.63) | 51.29 (↓ 1.94) / 55.66 (↓ 2.68) | 43.45 (↓ 4.14) / 48.89 (↓ 2.13)
w/o SR (w/o QE & CPG & SR) | 58.03 (↓ 3.57) / 61.88 (↓ 3.73) | 63.89 (↓ 4.11) / 67.33 (↓ 4.21) | 50.86 (↓ 2.37) / 55.13 (↓ 3.21) | 44.13 (↓ 3.46) / 48.71 (↓ 2.31)
w/ SF | 56.06 (↓ 5.54) / 59.93 (↓ 5.68) | 62.70 (↓ 5.30) / 66.53 (↓ 5.01) | 47.63 (↓ 5.60) / 51.55 (↓ 6.79) | 40.68 (↓ 6.91) / 44.62 (↓ 6.40)

To evaluate the impact of schema filtering on a basic pipeline utilizing the most advanced large language models, we conducted various experiments both with and without a schema filtering module, which includes a filtered schema correction step. The results of these experiments are shown in Table 4. As demonstrated in Table 4, the effect of schema filtering varies depending on its placement within the pipeline. Nevertheless, regardless of its position, incorporating the schema filtering module in a basic pipeline consistently led to a performance decline across all difficulty levels. Detailed results of the question enrichment module are discussed in the next section.

The inclusion of schema filtering resulted in an overall drop of up to 8.21% in Execution Accuracy (EX) on the development set. Specifically, the performance decreased by 6.38%, 12.28%, and 8.96% for simple, moderate, and challenging questions, respectively, as shown in Table 4. Although the E-SQL pipeline does not inherently include a schema filtering module, we integrated it into the pipeline for our ablation study, as presented in Table 5. This integration resulted in a 5.54 drop in EX and a 5.68 decrease in Soft F1-score on the development set. These findings align with previous research [31], which suggests that advanced LLMs can manage schema linking effectively without requiring explicit filtering. Consequently, schema filtering was excluded from the final E-SQL pipeline, as its negative impact outweighed the potential benefits.

4.4 Ablation Study

We conducted an ablation study to analyze the contributions of the Question Enrichment, Candidate Predicate Augmentation, and SQL Refinement modules. The results of this study are summarized in Table 5, showing the impact of individual modules from the E-SQL pipeline.

Table 6: Effect of the QE and CPG modules in the E-SQL pipeline with small open-source LLMs.

Pipeline | Model | Spider Test EX | BIRD Dev EX
E-SQL | Qwen 2.5 Coder 7B | 58.64 | 53.52
w/o QE | Qwen 2.5 Coder 7B | 55.84 (↓ 2.80) | 50.78 (↓ 2.74)
w/o CPG | Qwen 2.5 Coder 7B | 57.10 (↓ 1.54) | 50.72 (↓ 2.80)
w/o SR | Qwen 2.5 Coder 7B | 57.14 (↓ 1.50) | 48.24 (↓ 5.28)

4.4.1 Question Enrichment. The Question Enrichment module, which facilitates direct schema linking by injecting database items, SQL components, conditions, and SQL generation steps into the question, improved performance on both advanced proprietary and small open-source LLMs, as shown in Table 5 and Table 6. Its impact was particularly significant on challenging questions.

The absence of the question enrichment technique, especially when combined with the removal of the Candidate Predicate Augmentation module (CPG), led to a further decrease in overall performance. These results demonstrate that direct schema linking, achieved through question reformulation, effectively bridges the gap between the natural language query and the database schema, resulting in more accurate SQL generation.

4.4.2 Possible Predicate Augmentation. The Candidate Predicate Augmentation (CPG) module enhances the pipeline by augmenting prompts with potential predicates extracted from the database with the help of the LIKE operator and the candidate SQL query. As shown in Table 5 and Table 6, removing the CPG module resulted in a nearly 2% drop in overall model performance. However, its removal slightly improved performance on the challenging questions of the BIRD development set, suggesting that the CPG module may introduce unnecessary complexity in some cases. This slight negative effect on challenging questions is negligible, as the module substantially enhances overall performance, especially when compared to the significant gains achieved through the Question Enrichment module.

4.4.3 SQL Refinement. The SQL Refinement (SR) module plays a crucial role in correcting minor errors in the generated SQL queries. Without SR, we observed a 3.57 drop in EX and a 3.73 decrease in Soft F1 across the BIRD development set. This demonstrates that the SQL refinement step significantly boosts final query accuracy by detecting and correcting SQL execution errors.

To further evaluate the impact of the SR module within the E-SQL pipeline, we conducted the following analyses, with results presented in Table 7 (a sketch of the computation is given after the table):
• The proportion of initially generated candidate SQL queries that were altered by the SR module. A candidate SQL query is counted as changed if the final predicted SQL query differs from the original candidate SQL.
• The proportion of initially non-executable candidate SQL queries that were modified by the SR module to become executable.
• The proportion of non-executable candidate SQL queries that were corrected to executable and accurate SQL queries by the SR module.
• The proportion of incorrect candidate SQL queries, including non-executable ones, that were corrected to accurate SQL queries by the SR module.

As shown in Table 7, our detailed analysis of the SQL Refinement (SR) module demonstrates that 5.35% and 1.83% of the initially incorrect SQL queries generated by GPT-4o-mini and GPT-4o, respectively, were successfully corrected by the SR module. While the SQL refinement technique positively impacts both models, the effect is more noticeable on less capable models like GPT-4o-mini. These results highlight the module's ability to enhance query accuracy, especially for models with lower initial performance, making SQL refinement a critical component for improving overall system robustness.

Table 7: Analysis of the SQL Refinement (SR) module on the BIRD development set.

Metric | E-SQL SR (GPT-4o-mini) | E-SQL SR (GPT-4o)
Changed Queries (%) | 49.48 | 23.20
Non-Executable to Executable (%) | 6.58 | 0.39
Non-Executable to Correct (%) | 3.19 | 0.13
Wrong to Correct (%) | 5.35 | 1.83
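A sketch of how the four statistics above could be computed from per-question records follows; the record layout (candidate/final SQL plus executability and correctness flags) is an assumption for illustration.

def sr_statistics(records: list[dict]) -> dict[str, float]:
    n = len(records)
    changed = sum(r["candidate_sql"] != r["final_sql"] for r in records)
    to_executable = sum(not r["candidate_executable"] and r["final_executable"]
                        for r in records)
    to_correct = sum(not r["candidate_executable"] and r["final_correct"]
                     for r in records)
    wrong_to_correct = sum(not r["candidate_correct"] and r["final_correct"]
                           for r in records)
    return {
        "changed_pct": 100 * changed / n,
        "non_executable_to_executable_pct": 100 * to_executable / n,
        "non_executable_to_correct_pct": 100 * to_correct / n,
        "wrong_to_correct_pct": 100 * wrong_to_correct / n,
    }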
4.5 Impact of Enriched Questions on Small Large Language Models

To evaluate the performance impact of database-integrated questions on small LLMs [15, 19, 50], we conducted an experiment comparing their performance on default questions versus enriched questions. In this experiment, SQL queries for a given question were generated using a single-prompt approach. The prompt template used in this step is similar to that of the Candidate SQL Generation (CSG) module in the E-SQL pipeline, excluding data augmentation components such as database descriptions, value samples, and few-shot examples. This approach allows us to isolate the effect of enriched questions, as the single prompt relies solely on instructions and questions without additional context. Enriched questions were extracted from the outputs of the Question Enrichment Module (QE) of the E-SQL pipeline run with GPT-4o to ensure that they incorporated database items and SQL construction plans.

The results, presented in Table 8, indicate that even small LLMs can achieve competitive performance without task-specific fine-tuning when provided with high-quality, database-integrated natural language queries. These findings underscore the critical role of database-integrated enriched questions, which include logical SQL construction steps.

4.6 Computational Cost Analysis

Understanding computational expenses is critical for assessing the practical scalability and applicability of the framework, especially given the reliance on large language models (LLMs) and their resource demands. Table 9 highlights the impact of question enrichment on token count, while Table 10 provides details of average token usage for both the prompt and completion stages of key pipeline components on the BIRD development set. The average number of tokens in prompts is inherently high due to well-defined instructions and data augmentations, including few-shot examples, database descriptions, and database values, as commonly employed in most Text-to-SQL approaches. Consequently, the increase in question token count due to enrichment is relatively insignificant compared to the total number of prompt tokens. Despite the computational overhead introduced by question enrichment, which increases the average token count of natural language questions and their reasoning, E-SQL demonstrates computational superiority over methods [10, 13, 22, 31, 40] that generate multiple SQL queries for a single user question. These methods incur significant computational costs due to repeated SQL generation, the subsequent correction of these SQL queries, and the selection process. Executing each E-SQL module only once minimizes the computational costs associated with repeated calls, ensuring greater efficiency. This balance between cost and performance underscores the scalability and efficiency of our pipeline for large-scale deployment.

Table 8: Effect of Enriched Questions on the performance In our pipeline, the initially generated candidate SQL query is
of small open-source LLMs without fine-tuning on BIRD de- executed to identify execution errors and enhance the large lan-
velopment set. Enriched questions, generated using GPT-4o, guage model’s error awareness. While this step contributes to the
were utilized to evaluate the impact of high-quality question overall pipeline latency, with an average execution time of 49.936
enrichment on the performance of small open-source mod- milliseconds per query, it plays a crucial role in ensuring accurate
els. † symbol denotes methods utilizing fine-tuned LLMs. SQL refinement by providing valuable feedback on execution errors.
It is important to note that the overall pipeline response time varies
Model Level
Dev EX based on several factors, including the complexity of the natural
Default Enriched language question, the length of the prompt, and the API response
Overall 20.92 50.84(↑ 29.92) time of proprietary LLMs, which is influenced by server load and
DeepSeek Coder 1.3B Instruct
Simple 28.43 62.38(↑ 33.95) volume. These factors collectively contribute to the latency of the
Moderate 10.34 36.42(↑ 26.08) system.
Challenging 6.90 23.45(↑ 16.55)
Overall 11.21 36.90 (↑ 25.69)
Simple 14.27 44.10(↑ 29.83)
Qwen2.5 Coder 1.5B Instruct
Moderate 6.68 28.23(↑ 21.55)
Challenging 6.20 18.62(↑ 12.42) 5 Discussion and Limitations
Overall 37.02 56.45(↑ 19.43) The results from our experiments highlight the significant influence
DeepSeek Coder 7B Instruct 1.5v
Simple 44.65 64.64(↑ 19.99) of question enrichment and candidate predicate augmentation on
Moderate 26.52 45.47(↑ 18.95) the performance of the E-SQL pipeline. The question enrichment
Challenging 22.06 39.31(↑ 17.25)
module, which bridges the gap between the natural language query
Overall 31.25 40.22(↑ 8.97)
and the database schema, was pivotal in improving query accuracy,
Simple 40.43 50.70(↑ 10.27)
Qwen2.5 Coder 7B Instruct particularly for challenging questions. By enriching the natural
Moderate 17.88 26.07(↑ 8.19)
Challenging 15.17 18.62(↑ 3.45) language question with database items, conditions, and SQL gener-
ExSL + granite-20b-code Overall 51.69 - ation steps, the module enhanced direct schema linking, ensuring
DTS-SQL + DeepSeek 7B † [36] Overall 55.8 - that the generated SQL queries were more aligned with the data-
SFT CodeS-7B † [26] Overall 57.17 - base’s structure. This improvement is evidenced by an ablation
SFT CodeS-15B † [26] Overall 58.47 -
study, underscoring the efficacy of this approach.
One notable observation in our evaluation is the inconsistency in
performance between the development and test sets when using dif-
repeated calls, ensuring greater efficiency. This balance between ferent models. Specifically, when employing GPT-4o, the pipeline’s
cost and performance underscores the scalability and efficiency of performance showed an improvement on the test set compared
our pipeline for large-scale deployment. to the development set. However, this trend reversed with GPT-
Table 9: Comparison of Default and Enriched Natural Language Questions in the BIRD Development Set

Text                       Avg. Tokens
Question                    18.36
Enriched Question           81.51
Enrichment Reasoning       191.34
Fully Enriched Question    291.21

Table 10: Analysis of Computational Costs by Module on the BIRD Development Set

Module    Avg. Prompt Token Count    Avg. Completion Token Count
CSG       12612                      199
QE        16550                      292
SR         7403                      267

² Token counts were computed using the tiktoken Python package, developed by OpenAI, which provides a programmatic interface for tokenizing text with OpenAI model-specific tokenizers. The package is available at https://github.com/openai/tiktoken.
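Token counts such as those in Tables 9 and 10 can be reproduced with tiktoken along the lines of the following sketch; the choice of the GPT-4o tokenizer and the example string are our own assumptions:

import tiktoken

# Tokenizer matching the GPT-4o model family used in the experiments
# (an assumption; any model name supported by tiktoken works here).
encoding = tiktoken.encoding_for_model("gpt-4o")

question = ("Among the schools with the average score in Math over 560 "
            "in the SAT test, how many schools are directly charter-funded?")
print(len(encoding.encode(question)))  # compare with the Question row of Table 9

Summing the per-module averages in Table 10, a single E-SQL pass consumes roughly (12612 + 199) + (16550 + 292) + (7403 + 267) ≈ 37.3K tokens per question, which makes concrete why executing each module only once matters for cost.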
In our pipeline, the initially generated candidate SQL query is executed to identify execution errors and enhance the large language model's error awareness. While this step contributes to the overall pipeline latency, with an average execution time of 49.936 milliseconds per query, it plays a crucial role in ensuring accurate SQL refinement by providing valuable feedback on execution errors. It is important to note that the overall pipeline response time varies based on several factors, including the complexity of the natural language question, the length of the prompt, and the API response time of proprietary LLMs, which is influenced by server load and volume. These factors collectively contribute to the latency of the system.
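This execution-feedback step amounts to running the candidate query and capturing any SQLite error message for the SR prompt. A minimal sketch under our own naming, not the released implementation:

import sqlite3

def execution_feedback(db_path, sql):
    # Returns (rows, error): rows on success, or the SQLite error message
    # that can be passed to the SR module as the {EXECUTION_ERROR} slot.
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)  # e.g. "no such column: AvgScrMath"
    finally:
        conn.close()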
5 Discussion and Limitations

The results from our experiments highlight the significant influence of question enrichment and candidate predicate augmentation on the performance of the E-SQL pipeline. The question enrichment module, which bridges the gap between the natural language query and the database schema, was pivotal in improving query accuracy, particularly for challenging questions. By enriching the natural language question with database items, conditions, and SQL generation steps, the module enhanced direct schema linking, ensuring that the generated SQL queries were more aligned with the database's structure. This improvement is evidenced by an ablation study, underscoring the efficacy of this approach.

One notable observation in our evaluation is the inconsistency in performance between the development and test sets when using different models. Specifically, when employing GPT-4o, the pipeline's performance showed an improvement on the test set compared to the development set. However, this trend reversed with GPT-4o-mini, where performance decreased on the test set relative to the development set. Due to the BIRD test set not being publicly available, we were unable to analyze it directly to identify potential causes for this variation. Additionally, large language models are known to exhibit variability in performance across multiple runs, which might further contribute to this inconsistency. Thus, while the exact reasons behind these performance fluctuations remain unclear, they underline the need for further exploration under controlled conditions.

The prompt design plays a critical role in influencing model performance. The prompt templates utilized for each module of the E-SQL pipeline are publicly available in our GitHub repository. This study primarily emphasizes schema linking through question enrichment and data augmentation, deliberately leaving the exploration of alternative prompt templates beyond its scope.

Despite the advancements, there are some limitations to our approach. Due to hardware and cost constraints, almost all experiments were conducted using GPT-4o-mini and small open-source LLMs without fine-tuning. Among the small open-source LLMs, the whole E-SQL pipeline was executed only with Qwen2.5 Coder 7B Instruct, since the context length of the other small LLMs is not sufficient to run and observe the effect of the E-SQL pipeline. Developing more efficient schema linking techniques that operate effectively with small LLMs and limited context lengths represents a promising direction for future work.
6 Conclusions

In this study, we introduced E-SQL, a novel pipeline designed to address key challenges in Text-to-SQL translation by leveraging direct schema linking via question enrichment and incorporating candidate predicates. Our experiments demonstrated that the question enrichment module, which integrates natural language queries with relevant database elements and logical steps, significantly enhances query accuracy, particularly for complex queries. Additionally, the proposed candidate predicate augmentation technique further improves the performance of the pipeline. Moreover, our additional experiments reveal the importance and positive impact of enriched questions on the performance of small open-source LLMs with limited context lengths.

While some prior works have highlighted the utility of schema filtering, our findings reveal that incorporating schema filtering into a Text-to-SQL translation pipeline that leverages advanced LLMs results in performance degradation. This supports the notion that explicit schema filtering can be redundant in modern architectures that utilize the latest LLMs.

By focusing on question enrichment, data augmentation, and SQL refinement, E-SQL achieved competitive results on the BIRD benchmark. Specifically, E-SQL combined with GPT-4o achieved 65.58% and 66.29% execution accuracy on the development and test sets, respectively. These results underscore E-SQL's effectiveness in handling complex queries and present it as a promising approach for future Text-to-SQL tasks.

Despite a minor computational overhead due to increased token counts in question enrichment, its impact is negligible compared to the overall token usage and is outweighed by the significant performance gains. Additionally, E-SQL ensures cost-efficiency by executing each module only once, avoiding the excessive resource demands of repeated query generation and correction. This balance highlights E-SQL's scalability and suitability for resource-constrained deployments.

Further exploration of fine-tuning, iterative or multiple question refinements, and schema linking techniques optimized for small LLMs with limited context lengths is left for future work.

Acknowledgments

We would like to express our sincere gratitude to Dr. Arif Usta and Ekrem Polat for their invaluable insights and constructive discussions, which greatly contributed to the development of this work.
References
[1] Meta AI. 2024. The LLaMA 3 Herd of Models. https://ai.meta.com/research/publications/the-llama-3-herd-of-models/. Accessed: 2024-08-28.
[2] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, and et al. 2023. PaLM 2 Technical Report. ArXiv abs/2305.10403 (2023). https://arxiv.org/abs/2305.10403
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374
[5] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). Association for Computational Linguistics, Doha, Qatar, 1724–1734. https://doi.org/10.3115/v1/D14-1179
[6] DongHyun Choi, Myeong Cheol Shin, EungGyun Kim, and Dong Ryeol Shin. 2021. RYANSQL: Recursively Applying Sketch-based Slot Fillings for Complex Text-to-SQL in Cross-Domain Databases. Computational Linguistics 47, 2 (June 2021), 309–332. https://doi.org/10.1162/coli_a_00403
[7] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113. http://jmlr.org/papers/v24/22-1144.html
[8] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR. https://openreview.net/pdf?id=r1xMH1BtvB
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[10] Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Lu Chen, Jinshu Lin, and Dongfang Lou. 2023. C3: Zero-shot Text-to-SQL with ChatGPT. arXiv:2307.07306 [cs.CL] https://arxiv.org/abs/2307.07306
[11] Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. 2022. Successive Prompting for Decomposing Complex Questions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 1251–1265. https://doi.org/10.18653/v1/2022.emnlp-main.81
[12] Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, and Jianling Sun. 2023. CatSQL: Towards Real World Natural Language to SQL Applications. Proc. VLDB Endow. 16, 6 (feb 2023), 1534–1547. https://doi.org/10.14778/3583140.3583165
[13] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. Proc. VLDB Endow. 17, 5 (may 2024), 1132–1145. https://doi.org/10.14778/3641204.3641221
[14] Satya Krishna Gorti, Ilan Gofman, Zhaoyan Liu, Jiapeng Wu, Noël Vouitsis, Guangwei Yu, Jesse C. Cresswell, and Rasa Hosseinzadeh. 2024. MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation. arXiv:2410.12916 [cs.CL] https://arxiv.org/abs/2410.12916
[15] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. https://arxiv.org/abs/2401.14196
[16] Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 4524–4535. https://doi.org/10.18653/v1/P19-1444
[17] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[18] Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large Language Models Can Self-Improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 1051–1068. https://doi.org/10.18653/v1/2023.emnlp-main.67
[19] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. 2024. Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186 (2024).
[20] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2023. Decomposed Prompting: A Modular Approach for Solving Complex Tasks. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=_nGgzQjzaRy
[21] Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 22199–22213. https://proceedings.neurips.cc/paper_files/paper/2022/file/8bb0d291acd4acf06ef112099c16f326-Paper-Conference.pdf
[22] Dongjun Lee, Choongwon Park, Jaehyuk Kim, and Heesoo Park. 2024. MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation. arXiv:2405.07467 [cs.CL] https://arxiv.org/abs/2405.07467
[23] Wenqiang Lei, Weixin Wang, Zhixin Ma, Tian Gan, Wei Lu, Min-Yen Kan, and Tat-Seng Chua. 2020. Re-examining the Role of Schema Linking in Text-to-SQL. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 6943–6954. https://doi.org/10.18653/v1/2020.emnlp-main.564
[24] Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The Dawn of Natural Language to SQL: Are We Fully Ready?. In Proceedings of the VLDB Endowment (PVLDB), Vol. 17. https://doi.org/10.14778/3681954.3682003
[25] Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. 2023. RESDSQL: decoupling schema linking and skeleton parsing for text-to-SQL. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (AAAI'23/IAAI'23/EAAI'23). AAAI Press, Article 1466, 9 pages. https://doi.org/10.1609/aaai.v37i11.26535
[26] Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024. CodeS: Towards Building Open-source Language Models for Text-to-SQL. Proc. ACM Manag. Data 2, 3, Article 127 (may 2024), 28 pages. https://doi.org/10.1145/3654930
[27] Jinyang Li, Binyuan Hui, Reynold Cheng, Bowen Qin, Chenhao Ma, Nan Huo, Fei Huang, Wenyu Du, Luo Si, and Yongbin Li. 2023. Graphix-T5: mixing pre-trained transformers with graph-aware layers for text-to-SQL parsing. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (AAAI'23/IAAI'23/EAAI'23). AAAI Press, Article 1467, 9 pages. https://doi.org/10.1609/aaai.v37i11.26536
[28] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs. Advances in Neural Information Processing Systems 36 (2024).
[29] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Joel Lamy-Poirier, Joao Monteiro, Nicolas Gontier, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Ben Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason T Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Urvashi Bhattacharyya, Wenhao Yu, Sasha Luccioni, Paulo Villegas, Fedor Zhdanov, Tony Lee, Nadav Timor, Jennifer Ding, Claire S Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro Von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! Transactions on Machine Learning Research (2023). https://openreview.net/forum?id=KoFOg41haE Reproducibility Certification.
[30] Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2020. Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4870–4888. https://doi.org/10.18653/v1/2020.findings-emnlp.438
[31] Karime Maamari, Fadhil Abubaker, Daniel Jaroslawicz, and Amine Mhedhbi. 2024. The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models. arXiv:2408.07702 [cs.CL] https://arxiv.org/abs/2408.07702
[32] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=S37hOerQLB
[33] OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023). arXiv:2303.08774.
[34] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 27730–27744. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
[35] Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 36339–36348. https://proceedings.neurips.cc/paper_files/paper/2023/file/72223cc66f63ca1aa59edaec1b3670e6-Paper-Conference.pdf
[36] Mohammadreza Pourreza and Davood Rafiei. 2024. DTS-SQL: Decomposed Text-to-SQL with Small Large Language Models. arXiv:2402.01117 [cs.CL] https://arxiv.org/abs/2402.01117
[37] Ge Qu, Jinyang Li, Bowen Li, Bowen Qin, Nan Huo, Chenhao Ma, and Reynold Cheng. 2024. Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation. In Findings of the Association for Computational Linguistics ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 5456–5471. https://aclanthology.org/2024.findings-acl.324
[38] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 1, Article 140 (jan 2020), 67 pages.
[39] Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 9895–9901. https://doi.org/10.18653/v1/2021.emnlp-main.779
[40] Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. 2024. CHESS: Contextual Harnessing for Efficient SQL Synthesis. arXiv preprint arXiv:2405.16755 (2024). https://arxiv.org/abs/2405.16755
[41] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL] https://arxiv.org/abs/2302.13971
[42] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL] https://arxiv.org/abs/2307.09288
[43] Arif Usta, Akifhan Karakayali, and Özgür Ulusoy. 2021. DBTagger: multi-task learning for keyword mapping in NLIDBs using Bi-directional recurrent neural networks. Proc. VLDB Endow. 14, 5 (jan 2021), 813–821. https://doi.org/10.14778/3446095.3446103
[44] Arif Usta, Akifhan Karakayali, and Özgür Ulusoy. 2023. xDBTagger: explainable
natural language interface to databases using keyword mappings and schema
graph. The VLDB Journal 33, 2 (aug 2023), 301–321. https://doi.org/10.1007/
s00778-023-00809-w
[45] Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai,
Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, and Zhoujun Li. 2024. MAC-SQL: A
Multi-Agent Collaborative Framework for Text-to-SQL. arXiv:2312.11242 [cs.CL]
[46] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew
Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for
Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel
Tetreault (Eds.). Association for Computational Linguistics, Online, 7567–7578.
https://doi.org/10.18653/v1/2020.acl-main.677
[47] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan
Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Im-
proves Chain of Thought Reasoning in Language Models. In The Eleventh Inter-
national Conference on Learning Representations. https://openreview.net/forum?
id=1PL1NIMMrw
[48] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei
Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2024. Chain-of-thought prompting
elicits reasoning in large language models. In Proceedings of the 36th International
Conference on Neural Information Processing Systems (New Orleans, LA, USA)
(NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 1800, 14 pages.
[49] Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Jia Liu, Tao Gui,
Qi Zhang, and Xuanjing Huang. 2023. Self-Polish: Enhance Reasoning in Large
Language Models via Problem Refinement. In Findings of the Association for
Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika
Bali (Eds.). Association for Computational Linguistics, Singapore, 11383–11406.
https://doi.org/10.18653/v1/2023.findings-emnlp.762
[50] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Cheng-
peng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei,
Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang,
Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang,
Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang,
Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai,
Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng,
Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang
Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui,
Zhenru Zhang, and Zhihao Fan. 2024. Qwen2 Technical Report. arXiv preprint
arXiv:2407.10671 (2024).
[51] Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, and Caiming Xiong. 2021. GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing. In International Conference on Learning Representations. https://openreview.net/forum?id=kyaIeYj4zZ
[52] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li,
James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir
Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and
Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, Ellen Riloff,
David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for
Computational Linguistics, Brussels, Belgium, 3911–3921. https://doi.org/10.
18653/v1/D18-1425
[53] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic
Chain of Thought Prompting in Large Language Models. In The Eleventh Inter-
national Conference on Learning Representations. https://openreview.net/forum?
id=5NTt8GFjUHkr
[54] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generat-
ing Structured Queries from Natural Language using Reinforcement Learning.
arXiv:1709.00103 [cs.CL] https://arxiv.org/abs/1709.00103
[55] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi
Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi.
2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language
Models. In The Eleventh International Conference on Learning Representations.
https://openreview.net/forum?id=WZH7099tgfM
A Prompt Templates
In this section, the exact prompt templates used for each module of the E-SQL pipeline are provided.

A.1 Full Prompt Template for Candidate SQL Generation (CSG)


### You are an excellent data scientist. You can capture the link between the question and corresponding database and perfectly
generate valid SQLite SQL query to answer the question. Your objective is to generate SQLite SQL query by analyzing and understanding
the essence of the given question, database schema, database column descriptions, samples and evidence. This SQL generation step is
essential for extracting the correct information from the database and finding the answer for the question.

### Follow the instructions below:


# Step 1 - Read the Question and Evidence Carefully: Understand the primary focus and specific details of the question. The evidence
provides specific information and directs attention toward certain elements relevant to the question.
# Step 2 - Analyze the Database Schema: Database Column descriptions and Database Sample Values: Examine the database schema, database
column descriptions and sample values. Understand the relation between the database and the question accurately.
# Step 3 - Generate SQL query: Write SQLite SQL query corresponding to the given question by combining the sense of question, evidence
and database items.

{FEWSHOT_EXAMPLES}

### Task: Given the following question, database schema and evidence, generate SQLite SQL query in order to answer the question.
### Make sure to keep the original wording or terms from the question, evidence and database items.
### Make sure each table name and column name in the generated SQL is enclosed with backtick separately.
### Ensure the generated SQL is compatible with the database schema.
### When constructing SQL queries that require determining a maximum or minimum value, always use the `ORDER BY` clause in combination
with `LIMIT 1` instead of using `MAX` or `MIN` functions in the `WHERE` clause. Especially if there is more than one table in the FROM
clause, apply the `ORDER BY` clause in combination with `LIMIT 1` on the column of the joined table.
### Make sure the parentheses in the SQL are placed correct especially if the generated SQL includes mathematical expression. Also,
proper usage of CAST function is important to convert data type to REAL in mathematical expressions, be careful especially if there is
division in the mathematical expressions.
### Ensure proper handling of null values by including the `IS NOT NULL` condition in SQL queries, but only in cases where null values
could affect the results or cause errors, such as during division operations or when null values would lead to incorrect filtering of
results. Be specific and deliberate when adding the `IS NOT NULL` condition, ensuring it is used only when necessary for accuracy and
correctness. This is crucial to avoid errors and ensure accurate results.
You can leverage the database sample values to check if there could be potential null values.

{SCHEMA}
{DB_DESCRIPTIONS}
{DB_SAMPLES}
{QUESTION}
{EVIDENCE}

### Please respond with a JSON object structured as follows:

{"chain_of_thought_reasoning": "Explanation of the logical analysis and steps that result in the final SQLite SQL query.", "SQL": "
Generated SQL query as a single string"}

Let's think step by step and generate SQLite SQL query.
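The placeholders in curly braces ({SCHEMA}, {QUESTION}, and so on) are replaced with database- and question-specific text at run time. A minimal sketch of one way to instantiate the template follows; the variable names and the use of str.replace are our own assumptions, not the released code:

def fill_template(template, **slots):
    # str.replace is used instead of str.format so the literal JSON braces
    # inside the template do not need to be escaped.
    for name, value in slots.items():
        template = template.replace("{" + name + "}", value)
    return template

# CSG_TEMPLATE is assumed to hold the template text above; the slot values
# below are placeholders standing in for the corresponding pipeline inputs.
csg_prompt = fill_template(
    CSG_TEMPLATE,
    FEWSHOT_EXAMPLES=fewshot_block,           # few-shot demonstrations
    SCHEMA=ddl_statements,                    # CREATE TABLE statements
    DB_DESCRIPTIONS=column_description_text,  # BIRD column descriptions
    DB_SAMPLES=sample_value_text,             # sample values per column
    QUESTION="### Question: " + question,
    EVIDENCE="### Evidence: " + evidence,
)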


A.2 Full Prompt Template for Question Enrichment (QE)


### You are an excellent data scientist and can link the information between a question and corresponding database perfectly. Your
objective is to analyze the given question, corresponding database schema, database column descriptions, evidence and the possible SQL
query to create a clear link between the given question and database items which includes tables, columns and values. With the help of
link, rewrite new versions of the original question to be more related with database items, understandable, clear, absent of irrelevant
information and easier to translate into SQL queries. This question enrichment is essential for comprehending the question's intent
and identifying the related database items. The process involves pinpointing the relevant database components and expanding the
question to incorporate these items.

### Follow the instructions below:


# Step 1 - Read the Question Carefully: Understand the primary focus and specific details of the question. Identify named entities (
such as organizations, locations, etc.), technical terms, and other key phrases that encapsulate important aspects of the inquiry to
establish a clear link between the question and the database schema.
# Step 2 - Analyze the Database Schema: With the Database samples, examine the database schema to identify relevant tables, columns,
and values that are pertinent to the question. Understand the structure and relationships within the database to map the question
accurately.
# Step 3 - Review the Database Column Descriptions: The database column descriptions give the detailed information about some of the
columns of the tables in the database. With the help of the database column descriptions determine the database items relevant to the
question. Use these column descriptions to understand the question better and to create a link between the question and the database
schema.
# Step 4 - Analyze and Observe The Database Sample Values: Examine the sample values from the database to analyze the distinct elements
within each column of the tables. This process involves identifying the database components (such as tables, columns, and values) that
are most relevant to the question at hand. Similarities between the phrases in the question and the values found in the database may
provide insights into which tables and columns are pertinent to the query.
# Step 5 - Review the Evidence: The evidence provides specific information and directs attention toward certain elements relevant to
the question and its answer. Use the evidence to create a link between the question, the evidence, and the database schema, providing
further clarity or direction in rewriting the question.
# Step 6 - Analyze the Possible SQL Conditions: Analyze the given possible SQL conditions that are relevant to the question and
identify relations between the question components, phrases, and keywords.
# Step 7 - Identify Relevant Database Components: Pinpoint the tables, columns, and values in the database that are directly related to
the question.
# Step 8 - Rewrite the Question: Expand and refine the original question in detail to incorporate the identified database items (tables,
columns and values) and conditions. Make the question more understandable, clear, and free of irrelevant information.

{FEWSHOT_EXAMPLES}

### Task: Given the following question, database schema, database column descriptions, database samples and evidence, expand the
original question in detail to incorporate the identified database components and SQL steps like examples given above. Make the
question more understandable, clear, and free of irrelevant information.
### Ensure that question is expanded with original database items. Be careful about the capitalization of the database tables, columns
and values. Use tables and columns in database schema.

{SCHEMA}
{DB_DESCRIPTIONS}
{DB_SAMPLES}
{POSSIBLE_CONDITIONS}
{QUESTION}
{EVIDENCE}

### Please respond with a JSON object structured as follows:

```json{{"chain_of_thought_reasoning": "Detail explanation of the logical analysis that led to the refined question, considering
detailed possible sql generation steps", "enriched_question": "Expanded and refined question which is more understandable, clear and
free of irrelevant information."}}```

Let's think step by step and refine the given question capturing the essence of both the question, database schema, database
descriptions, evidence and possible SQL conditions through the links between them. If you do the task correctly, I will give you 1
million dollars. Only output a json as your response.
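Each module asks the model for a single JSON object, so the response has to be parsed before the pipeline can continue. A small sketch of such parsing, tolerating an optional ```json fence (our own helper, not the released code):

import json
import re

def parse_json_response(response_text):
    # Grab the outermost {...} span, whether or not the model wrapped it
    # in a ```json ... ``` fence, then decode it.
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model response")
    return json.loads(match.group(0))

# usage: parse_json_response(llm_output)["enriched_question"]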
A.3 Full Prompt Template for SQL Refinement (SR)


### You are an excellent data scientist. You can capture the link between the question and corresponding database and perfectly
generate valid SQLite SQL query to answer the question. Your objective is to generate SQLite SQL query by analyzing and understanding
the essence of the given question, database schema, database column descriptions, evidence, possible SQL and possible conditions. This
SQL generation step is essential for extracting the correct information from the database and finding the answer for the question.

### Follow the instructions below:


# Step 1 - Read the Question and Evidence: Understand the primary focus and specific details of the question. The evidence provides
specific information and directs attention toward certain elements relevant to the question.
# Step 2 - Analyze the Database Schema, Database Column descriptions: Examine the database schema, database column descriptions which
provides information about the database columns. Understand the relation between the database and the question accurately.
# Step 3 - Analyze the Possible SQL Query: Analyze the possible SQLite SQL query and identify possible mistakes that lead to incorrect
results, such as missing or wrong conditions, wrong functions, misuse of aggregate functions, wrong SQL syntax, unrecognized tokens or
ambiguous columns.
# Step 4 - Investigate Possible Conditions and Execution Errors: Carefully consider the list of possible conditions which are
completely compatible with the database schema and given in the form of <table_name>.<column_name><operation><value>. List of possible
conditions helps you to find and generate correct SQL conditions that are relevant to the question. If the given possible SQL query
gives execution error, it will be given. Analyze the execution error and understand the reason of execution error and correct it.
# Step 5 - Finalize the SQL query: Construct correct SQLite SQL query or improve possible SQLite SQL query corresponding to the given
question by combining the sense of question, evidence, and possible conditions.
# Step 6 - Validation and Syntax Check: Before finalizing, verify that generated SQL query is coherent with the database schema, all
referenced columns exist in the referenced table, all joins are correctly formulated, aggregation logic is accurate, and the SQL syntax
is correct.

### Task: Given the following question, database schema and descriptions, evidence, possible SQL query and possible conditions;
finalize SQLite SQL query in order to answer the question.
### Ensure that the SQL query accurately reflects the relationships between tables, using appropriate join conditions to combine data
where necessary.
### When using aggregate functions (e.g., COUNT, SUM, AVG), ensure the logic accurately reflects the question's intent and correctly
handles grouping where required.
### Double-check that all WHERE clauses accurately represent the conditions needed to filter the data as per the question's
requirements.
### Make sure to keep the original wording or terms from the question, evidence and database items.
### Make sure each table name and column name in the generated SQL is enclosed with backtick separately.
### Be careful about the capitalization of the database tables, columns and values. Use tables and columns in database schema. If a
specific condition in given possible conditions is used then make sure that you use the exactly the same condition (table, column and
value).
### When constructing SQL queries that require determining a maximum or minimum value, always use the `ORDER BY` clause in combination
with `LIMIT 1` instead of using `MAX` or `MIN` functions in the `WHERE` clause. Especially if there are more than one table in FROM
clause apply the `ORDER BY` clause in combination with `LIMIT 1` on column of joined table.
### Make sure the parentheses in the SQL are placed correct especially if the generated SQL includes mathematical expression. Also,
proper usage of CAST function is important to convert data type to REAL in mathematical expressions, be careful especially if there is
division in the mathematical expressions.
### Ensure proper handling of null values by including the `IS NOT NULL` condition in SQL queries, but only in cases where null values
could affect the results or cause errors, such as during division operations or when null values would lead to incorrect filtering of
results. Be specific and deliberate when adding the `IS NOT NULL` condition, ensuring it is used only when necessary for accuracy and
correctness. This is crucial to avoid errors and ensure accurate results.

{SCHEMA}
{DB_DESCRIPTIONS}
{QUESTION}
{EVIDENCE}
{POSSIBLE_CONDITIONS}
{POSSIBLE_SQL_Query}
{EXECUTION_ERROR}

### Please respond with a JSON object structured as follows:

```json{{"chain_of_thought_reasoning": "Explanation of the logical analysis and steps that result in the final SQLite SQL query.", "
SQL": "Finalized SQL query as a single string"}}```

Let's think step by step and generate SQLite SQL query.
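The {POSSIBLE_CONDITIONS} slot above expects conditions in the form <table_name>.<column_name><operation><value>. One way such candidate predicates could be harvested from the database is sketched below; this only illustrates the output format, and the actual CP module may extract values differently:

import sqlite3

def candidate_predicates(db_path, keywords, per_column_limit=3):
    # For each table/column, look for stored values matching a question
    # keyword and emit them in the <table>.<column> = '<value>' format.
    conn = sqlite3.connect(db_path)
    conditions = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info(`{table}`)")]
        for column in columns:
            for keyword in keywords:
                rows = conn.execute(
                    f"SELECT DISTINCT `{column}` FROM `{table}` "
                    f"WHERE `{column}` LIKE ? LIMIT ?",
                    (f"%{keyword}%", per_column_limit)).fetchall()
                conditions.extend(f"{table}.{column} = '{row[0]}'" for row in rows)
    conn.close()
    return conditions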


A.4 Full Prompt Template for Schema Filtering (SF)

### You are an excellent data scientist. You can capture the link between a question and corresponding database and determine the
useful database items (tables and columns) perfectly. Your objective is to analyze and understand the essence of the given question,
corresponding database schema, database column descriptions, samples and evidence and then select the useful database items such as
tables and columns. This database item filtering is essential for eliminating unnecessary information in the database so that
corresponding structured query language (SQL) of the question can be generated correctly in later steps.

### Follow the instructions below step by step:


# Step 1 - Read the Question Carefully: Understand the primary focus and specific details of the question. Identify named entities (
such as organizations, locations, etc.), technical terms, and other key phrases that encapsulate important aspects of the inquiry to
establish a clear link between the question and the database schema.
# Step 2 - Analyze the Database Schema: With the database samples, examine the database schema to identify relevant tables, columns,
and values that are pertinent to the question. Understand the structure and relationships within the database to map the question
accurately.
# Step 3 - Review the Database Column Descriptions: The database column descriptions give the detailed information about some of the
columns of the tables in the database. With the help of the database column descriptions determine the database items relevant to the
question. Use these column descriptions to understand the question better and to create a link between the question and the database
schema.
# Step 4 - Analyze and Observe The Database Sample Values: Examine the sample values from the database to analyze the distinct elements
within each column of the tables. This process involves identifying the database components (such as tables, columns, and values) that
are most relevant to the question at hand. Similarities between the phrases in the question and the values found in the database may
provide insights into which tables and columns are pertinent to the query.
# Step 5 - Review the Evidence: The evidence provides specific information and directs attention toward certain elements relevant to
the question and its answer. Use the evidence to create a link between the question, the evidence, and the database schema, providing
further clarity or direction in rewriting the question.
# Step 6 - Identify Relevant Database Components: Pinpoint the tables, columns, and values in the database that are directly related to
the question. Ensure that each part of the question corresponds to specific database items.
# Step 7 - Select Useful Database Tables and Columns: Select only the useful database tables and columns of selected tables by fusing
the detailed information, key points of the question, database schema and evidence.

{FEWSHOT_EXAMPLES}

### Task: Given the following question, database schema, database column descriptions and evidence, select only the necessary and
useful database tables, and necessary and useful columns of selected tables to filter the database items.
### Make sure to keep the original terms from database items.
### Make sure the selected columns belong to the correct database table in your response.

{SCHEMA}
{DB_DESCRIPTIONS}
{DB_SAMPLES}
{QUESTION}
{EVIDENCE}

### Please respond with a JSON object structured as follows:

```json{{"chain_of_thought_reasoning": "Explanation of the logical analysis that led to the selected useful database items.", "
tables_and_columns": {{"table_name1": ["column1", "column2", ...], "table_name2": ["column1", ...], ...}} }}```

Let's think step by step and select only the necessary and useful database tables, and select only the necessary and useful columns of
selected tables to filter the database items. If you do the task correctly, I will give you 1 million dollars. Only output a json as
your response.
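The SF module returns a JSON mapping of tables to their selected columns; applying it means pruning the schema text handed to later modules. A minimal sketch of such pruning (our own rendering, with column types omitted for brevity; note that the paper ultimately argues against using this filter with strong LLMs):

def prune_schema(selection):
    # selection: {"table_name1": ["column1", ...], ...} as returned by SF.
    lines = []
    for table, columns in selection.items():
        column_list = ", ".join(f"`{column}`" for column in columns)
        lines.append(f"CREATE TABLE `{table}` ({column_list});")
    return "\n".join(lines)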
B Manually Enriched Questions and Enrichment Reasoning


This section provides examples of manually enriched questions along with the rationale for their enrichment. Each manually enriched
question and its corresponding reasoning are open-sourced and available in our GitHub repository.

Question:
Among the schools with the average score in Math over 560 in the SAT test, how many schools are directly charter-funded?

Enriched Question:
Please find the number of schools (COUNT(frpm.`School Code`)) whose charter funding type is directly funded (frpm.`Charter Funding Type` = 'Directly funded'),
and whose AvgScrMath is larger than 560 in the SAT test (satscores.AvgScrMath > 560). To find the schools with the charter funding type information and the
average math score in the SAT, the frpm and satscores tables should be joined. Apply the charter funding type condition (frpm.`Charter Funding Type` = 'Directly funded')
and the average math score condition (satscores.AvgScrMath > 560). Calculate the number of schools using the COUNT aggregate function in the SELECT statement.

Enrichment Reasoning:
The information of whether a school is directly or locally funded can be found in the 'Charter Funding Type' column of the frpm table in the database.
The information on the average score in Math in the SAT test of schools can be found in the AvgScrMath column of the satscores table in the database. It is asked
to find the number of schools whose average score in Math is over 560 in the SAT test and that are directly charter-funded. The schools that satisfy the asked
conditions can be found by joining the frpm and satscores tables in the SQL statement. After applying the average math score condition (satscores.AvgScrMath >
560) and the funding type condition (frpm.`Charter Funding Type` = 'Directly funded'), School Codes should be counted with the COUNT aggregate function.
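For concreteness, one SQL realization of this enrichment can be executed as follows. This is a sketch: the database file path is illustrative, and the frpm.CDSCode = satscores.cds join key follows the BIRD california_schools schema and should be verified against the actual database.

import sqlite3

conn = sqlite3.connect("california_schools.sqlite")  # assumed local path
sql = (
    "SELECT COUNT(frpm.`School Code`) "
    "FROM frpm INNER JOIN satscores ON frpm.CDSCode = satscores.cds "
    "WHERE frpm.`Charter Funding Type` = 'Directly funded' "
    "AND satscores.AvgScrMath > 560"
)
print(conn.execute(sql).fetchone()[0])
conn.close()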

Question:
Please list the phone numbers of the direct charter-funded schools that are opened after 2000/1/1.

Enriched Question:
Please find the phone numbers (schools.Phone) of the schools which are charter schools (frpm.`Charter School (Y/N)` = 1) and whose charter funding type is
directly funded (frpm.`Charter Funding Type` = 'Directly funded') and OpenDate is later than 2000-01-01 (schools.OpenDate > '2000-01-01'). \n Join the frpm
and schools tables. Since the CDSCode column of the frpm table references the CDSCode column of the schools table, the joining operation should be performed on the CDSCode column
of both tables. \n Apply the condition of being a charter school (frpm.`Charter School (Y/N)` = 1), the charter funding type condition (frpm.`Charter Funding Type`
= 'Directly funded') and the opening date condition (schools.OpenDate > '2000-01-01'). \n Select the Phone column of the schools table.

Enrichment Reasoning:
The question asks for the phone numbers of the direct charter-funded schools that are opened after 2000/1/1, which is a date. The phone number information of schools can
be found in the Phone column of the schools table in the database. \n The opening date information of the schools can be found in the OpenDate column of
the schools table in the database. \n The information whether a school is direct charter-funded or not can be found in the `Charter Funding Type` column of
the frpm table in the database. \n It is asked to list the phone numbers of the direct charter-funded schools that are opened after 2000-01-01. \n To combine
and match the information in the frpm table and the schools table, join the frpm and schools tables. Since the CDSCode column of the frpm table references the CDSCode
column of the schools table, the joining operation should be performed on the CDSCode column of both tables. \n After applying the charter school condition (frpm.`Charter
School (Y/N)` = 1), the charter funding type condition (frpm.`Charter Funding Type` = 'Directly funded') and the opening date condition (schools.OpenDate >
'2000-01-01'), select the Phone column of the schools table.

Question:
Are there more male patients with creatinine not within the normal range than female? True or False?

Enriched Question:
Please find whether the number of male patients (SUM(CASE WHEN T1.SEX = 'M' THEN 1 ELSE 0 END)) whose creatinine level is not within the normal range (
Laboratory.CRE > = 1.5) is higher than the number of female patients (Patient.SEX = 'F') whose creatinine level is not within the normal range (SUM(CASE WHEN T1.SEX = '
F' THEN 1 ELSE 0 END)), returning only True or False.\n Join the Patient and Laboratory tables on the ID column of both tables. Apply the creatinine level condition (
Laboratory.CRE > = 1.5). Since a comparison of two different values for a single attribute, which is the sex of patients, is asked, it is useful to use a CASE WHEN
expression.\n Using the SUM aggregate function and a CASE WHEN expression, calculate the number of male (SUM(CASE WHEN T1.SEX = 'M' THEN 1 ELSE 0 END)) and
female (SUM(CASE WHEN T1.SEX = 'F' THEN 1 ELSE 0 END)) patients whose creatinine level is not within the normal range (Laboratory.CRE > = 1.5).

Enrichment Reasoning:
The sex information of a patient can be found from the SEX column of the Patient table in the database. The 'M' value in the SEX column indicates male while the 'F'
value indicates female.\n The creatinine information of a patient can be found from the CRE column of the Laboratory table in the database. \n If a patient's
creatinine value (Laboratory.CRE) is equal to or above 1.5 (Laboratory.CRE > = 1.5), then it is not within the normal range.\n It is asked to find whether the
number of male patients (Patient.SEX = 'M') whose creatinine level is not within the normal range (Laboratory.CRE > = 1.5) is higher than the number of female patients
(Patient.SEX = 'F') whose creatinine level is not within the normal range (Laboratory.CRE > = 1.5), returning only True or False.\n To match and combine
the laboratory results of a patient with detailed information about the patient, it is required to join the Patient and Laboratory tables on the ID column of both
tables.\n The creatinine level condition indicating not within the normal range (Laboratory.CRE > = 1.5) should be applied.\n Since a comparison of two
different values for a single attribute, which is the sex of patients, is asked, it is useful to use a CASE WHEN expression.\n Using the SUM aggregate function and a
CASE WHEN expression, the number of male patients (SUM(CASE WHEN T1.SEX = 'M' THEN 1 ELSE 0 END)) whose creatinine level is not within the normal range can be
found. Similarly, using the SUM aggregate function and a CASE WHEN expression, the number of female patients (SUM(CASE WHEN T1.SEX = 'F' THEN 1 ELSE 0 END))
whose creatinine level is not within the normal range can be found.\n Again using a CASE WHEN expression to compare the number of male and female patients, the
correct result can be returned in the form of 'True' or 'False'.
C E-SQL Execution Flow

Figure 7: E-SQL execution flow for the question with question ID 1448 in the development set
