
Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL

Zijin Hong1, Zheng Yuan2, Qinggang Zhang2, Hao Chen2, Junnan Dong2, Feiran Huang1, and Xiao Huang2*
1 Jinan University, Guangzhou, China
2 The Hong Kong Polytechnic University, Hong Kong SAR, China
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

arXiv:2406.08426v1 [cs.CL] 12 Jun 2024

Abstract—Generating accurate SQL according to natural language questions (text-to-SQL) is a long-standing problem since it is challenging in user question understanding, database schema comprehension, and SQL generation. Conventional text-to-SQL systems include human engineering and deep neural networks. Subsequently, pre-trained language models (PLMs) have been developed and utilized for text-to-SQL tasks, achieving promising performance. As modern databases become more complex and the corresponding user questions more challenging, PLMs with limited comprehension capabilities can lead to incorrect SQL generation. This necessitates more sophisticated and tailored optimization methods, which, in turn, restricts the applications of PLM-based systems. Most recently, large language models (LLMs) have demonstrated significant abilities in natural language understanding as the model scale keeps increasing. Therefore, integrating LLM-based implementations can bring unique opportunities, challenges, and solutions to text-to-SQL research. In this survey, we present a comprehensive review of LLM-based text-to-SQL. Specifically, we propose a brief overview of the current challenges and the evolutionary process of text-to-SQL. Then, we provide a detailed introduction to the datasets and metrics designed to evaluate text-to-SQL systems. After that, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we discuss the remaining challenges in this field and propose expectations for future directions.

Index Terms—Text-to-SQL, Large Language Models, Database, Natural Language Processing

[Fig. 1: An example of LLM-based text-to-SQL selected from Spider. The user proposes a question, "What cartoons were written by Joseph Kuhr?"; the LLM takes the question and the schema of its corresponding database (tables TV_Channel, TV_series, and Cartoon, where Cartoon has columns "id" real, "Title" text, "Directed_by" text, "Written_by" text, "Production_code" real, and "Channel" text, with PRIMARY KEY ("id") and FOREIGN KEY ("Channel") REFERENCES "TV_Channel"("id")) as the input, then generates a SQL query as the output: SELECT * FROM Cartoon WHERE Written_by = "Joseph Kuhr". The SQL query can be executed in the database to retrieve the content "Batman Series" that answers the user question.]

I. INTRODUCTION

TEXT-TO-SQL is a long-standing task in natural language processing research. It aims to convert (translate) natural language questions into database-executable SQL queries. Fig. 1 provides an example of large language model-based (LLM-based) text-to-SQL; given a user question "What cartoons were written by Joseph Kuhr?", the LLM takes the question and its corresponding database schema as the input and generates an SQL query as the output, which can be executed in the database to retrieve the content "Batman Series" for answering the user question. The above system builds a natural language interface to the database (NLIDB) with LLMs. Since SQL remains one of the most widely used programming languages, with over half (51.52%) of professional developers using SQL in their work but only around a third (35.29%) of them systematically trained1, the NLIDB enables non-skilled users to access structured databases like professional database engineers [1, 2] and also accelerates human-computer interaction [3]. Furthermore, amid the research hotspot of LLMs, text-to-SQL can provide a solution to the prevalent hallucination issue [4, 5] by incorporating realistic content from the database to fill the knowledge gaps of LLMs [6]. The significant value and potential of text-to-SQL have triggered a range of studies on its incorporation and optimization with LLMs [7–9]; consequently, LLM-based text-to-SQL remains a highly discussed research field within the NLP and database communities.

* Corresponding author. 1 https://survey.stackoverflow.co/2023
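To make the pipeline of Fig. 1 concrete, the following is a minimal sketch of the NLIDB loop in Python with sqlite3. The generate_sql stub stands in for the LLM call, and the in-memory table is a toy reconstruction of the Fig. 1 example; both are our illustrative assumptions rather than part of any surveyed system.

```python
import sqlite3

def generate_sql(question: str, schema: str) -> str:
    # Stub for the LLM call: a real system would prompt the model with
    # the schema and the question, then return the model's generated SQL.
    return "SELECT * FROM Cartoon WHERE Written_by = 'Joseph Kuhr'"

# A toy in-memory copy of the Cartoon table from the Fig. 1 example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cartoon (id REAL, Title TEXT, Directed_by TEXT,
                      Written_by TEXT, Production_code REAL, Channel TEXT);
INSERT INTO Cartoon VALUES (1, 'Batman Series', NULL, 'Joseph Kuhr', NULL, NULL);
""")

schema = 'TABLE Cartoon("id", "Title", "Directed_by", "Written_by", "Production_code", "Channel")'
question = "What cartoons were written by Joseph Kuhr?"
sql = generate_sql(question, schema)   # the LLM generates the SQL query
print(conn.execute(sql).fetchall())    # executing it retrieves the "Batman Series" row
```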


[Fig. 2: A sketch of the evolutionary process for text-to-SQL research from the perspective of implementation paradigm: Rule-based Methods (around 2015), Deep Neural Networks (around 2019), Pre-trained Language Models (around 2021), and Large Language Models (2023-2024), with implementation techniques such as parsing trees, SQL sketches, grammar rules, encoder-decoder architectures, schema linking, pre-training, fine-tuning, and in-context learning. Each stage is presented with some implementation techniques and representative works. The timestamps for the stages are not exactly accurate; we set each timestamp according to the release time of the representative works of each paradigm, with a margin of error of about one year before and after. The format is inspired by [29].]

Previous studies have made notable progress in the implementation of text-to-SQL and have undergone a long evolutionary process. Early efforts were mostly based on well-designed rules and templates [10, 11], specifically suitable for simple database scenarios. In recent years, with the heavy labor costs [12] brought by rule-based methods and the growing complexity of database environments [13–15], designing a rule or template for each scenario has become increasingly difficult and impractical. The development of deep neural networks advanced the progress of text-to-SQL [16, 17], as they can automatically learn a mapping from the user question to its corresponding SQL [18, 19]. Subsequently, pre-trained language models (PLMs) with strong semantic parsing capacity became the new paradigm for text-to-SQL [20], taking its performance to a new level [21–23]. Incrementally, PLM-based research focused on table content encoding [19, 24, 25] and pre-training [11, 20] has further advanced this field. Recently, LLM-based approaches implement text-to-SQL through the in-context learning (ICL) [7] and supervised fine-tuning (SFT) [9] paradigms, reaching state-of-the-art accuracy with well-designed frameworks and stronger comprehension capability compared to PLMs.

The overall implementation details of LLM-based text-to-SQL can be divided into three aspects: 1. Question understanding: the NL question is a semantic representation of the user's intention, which the corresponding generated SQL query is expected to align with; 2. Schema comprehension: the schema provides the table and column structure of the database, and the text-to-SQL system is required to identify the target components in the database that match the user question; 3. SQL generation: this involves incorporating the above parsing and then writing correct syntax to generate executable SQL queries that can retrieve the desired answer. LLMs have proven to perform well as a vanilla implementation [26, 27], benefiting from the more powerful semantic parsing enabled by their richer training corpus compared to PLMs [28, 29]. Further studies on enhancing LLMs for question understanding [7, 8], schema comprehension [30, 31], and SQL generation [32] are being increasingly released.

Despite the significant progress made in text-to-SQL research, several challenges remain that hinder the development of robust and generalized text-to-SQL systems [73]. In this survey, we aim to catch up with the recent advances and provide a comprehensive review of the current state-of-the-art (SOTA) models and approaches in LLM-based text-to-SQL. We begin by introducing the fundamental concepts and challenges associated with text-to-SQL, highlighting the importance of this task in various domains. We then delve into the evolution of LLMs and their application to text-to-SQL, discussing the key advancements and breakthroughs in this field. After the overview, we provide a detailed introduction to the recent advances of text-to-SQL incorporating LLMs. Specifically, the body of our survey covers a range of contents related to LLM-based text-to-SQL, including:
• Datasets and Benchmarks: We provide an overview of the commonly used datasets and benchmarks for evaluating LLM-based text-to-SQL systems. We discuss their characteristics, complexity, and the challenges they pose for text-to-SQL system development and evaluation.
• Evaluation Metrics: We present the evaluation metrics used to assess the performance of LLM-based text-to-SQL systems, including accuracy, exactness, and execution correctness. We discuss the advantages and limitations of each metric and their relevance to real-world applications.
• Methods and Models: We explore the different methods and models employed for LLM-based text-to-SQL, including in-context learning and fine-tuning-based paradigms. We discuss their implementation details, strengths, and adaptations specific to the text-to-SQL task.
• Expectations and Future Directions: We discuss the current challenges and limitations of LLM-based text-to-SQL, such as real-world robustness, computational efficiency, data privacy, and extensions. We also outline potential future research directions and opportunities for improvement.
We hope this survey will provide a clear overview of recent studies and inspire future work. A taxonomy tree is shown in Fig. 3.
[Fig. 3: Taxonomy tree of the research in LLM-based text-to-SQL. The display order in each node is organized by release time. The format is adapted from [72]. The tree is summarized below:

Datasets (§III-A)
  - Cross-domain (Original Datasets): BIRD [33], DuSQL [34], CoSQL [35], Spider [13], WikiSQL [14], KaggleDBQA [36]; (Post-annotated Datasets): ADVETA [37], Spider-SS [38], Spider-CG [38], Spider-DK [39], Spider-SYN [40], Spider-Realistic [41], CSpider [42], SParC [43]
  - Knowledge-augmented: BIRD [33], SQUALL [44], Spider-DK [39]
  - Cross-lingual: DuSQL [34], CSpider [42]
  - Context-dependent: CoSQL [35], Spider-SS [38], Spider-CG [38], SParC [43]
  - Robustness: ADVETA [37], Spider-SYN [40], Spider-Realistic [41]

Evaluation Metrics (§III-B)
  - Content Matching-based: Component Matching (CM) [13], Exact Matching (EM) [13]
  - Execution-based: Execution Accuracy (EX) [13], Valid Efficiency Score (VES) [33]

Methods (§IV)
  In-context Learning Paradigm (§IV-A)
  - Trivial Prompt: Zero-shot [26], [33], [27], [45], [46], [47], [48], [49], [8], [50]; Few-shot [33], [7], [51], [52], [53], [54], [49], [8], [32], [55]
  - Decomposition: Coder-Reviewer [56], DIN-SQL [7], QDecomp [51], C3 [30], MAC-SQL [57], DEA-SQL [58], SGU-SQL [32], MetaSQL [59], PET-SQL [60], PURPLE [61]
  - Prompt Optimization: DESEM+P [62], StructGPT [63], SD+SA+Voting [52], RAG+SP&DRC [64], C3 [30], DAIL-SQL [8], ODIS [54], ACT-SQL [49], FUSED [65], DELLM [31]
  - Reasoning Enhancement: COT [8, 32, 33, 51], QDecomp [51], Least-to-Most [51], SQL-PaLM [53], ACT-SQL [49], POT [55], SQL-CRAFT [55], FUXI [66]
  - Execution Refinement: MBR-Exec [67], Coder-Reviewer [56], LEVER [68], SELF-DEBUGGING [48], DESEM+P [62], DIN-SQL [7], SD+SA+Voting [52], SQL-PaLM [53], RAG+SP&DRC [64], C3 [30], MAC-SQL [57], DELLM [31], SQL-CRAFT [55], FUXI [66], PET-SQL [60], PURPLE [61]
  Fine-tuning Paradigm (§IV-B)
  - Supervised Fine-tuning: [45], [8], [50], [53]
  - Enhanced Architecture: CLLMs [69]
  - Pre-training: CodeS [9]
  - Data Augmentation: DAIL-SQL [8], Symbol-LLM [50], CodeS [9], StructLM [70]
  - Decomposition: DTS-SQL [71]]

II. OVERVIEW

Text-to-SQL is a task that aims to convert natural language questions into corresponding SQL queries that can be executed on a relational database. Formally, given a user question Q (also known as a user query, natural language question, NL question, etc.) and a database schema D, the goal is to generate an SQL query Y that can be executed on the database to obtain the desired answer. Text-to-SQL has the potential to democratize access to data by allowing users to interact with databases using natural language, without the need for specialized knowledge of SQL. This can benefit various domains, such as business intelligence, customer support, and scientific research, by enabling non-technical users to easily retrieve information from databases and facilitating more efficient data analysis.

A. Challenges in Text-to-SQL

1) Linguistic Complexity and Ambiguity: Natural language questions often contain complex linguistic structures, such as nested clauses, coreferences, and ellipses, which make it challenging to map them accurately to SQL queries. Additionally, natural language is inherently ambiguous, with multiple possible interpretations for a given question. Resolving these ambiguities and understanding the intent behind the question requires deep language understanding and the ability to incorporate context and domain knowledge.

2) Schema Understanding and Representation: To generate accurate SQL queries, text-to-SQL systems need a comprehensive understanding of the database schema, including table names, column names, and relationships between tables. However, database schemas can be complex and vary significantly across different domains. Representing and encoding the schema information in a way that can be effectively utilized by the text-to-SQL model is a challenging task.

3) Rare and Complex SQL Operations: Some SQL queries involve rare or complex operations, such as nested subqueries, outer joins, and window functions. These operations are less frequent in the training data and pose challenges for text-to-SQL models to generate accurately. Designing models that can handle a wide range of SQL operations, including rare and complex ones, is an important consideration.
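As an illustration of the operations named above, the sketch below runs a nested subquery and a window function on an assumed toy orders table; the table and queries are ours, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'a', 10), (2, 'a', 30), (3, 'b', 20);
""")

# Nested subquery: customers whose total spend exceeds the overall average amount.
nested = """
SELECT customer
FROM (SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer)
WHERE total > (SELECT AVG(amount) FROM orders);
"""

# Window function: a per-customer running total (SQLite >= 3.25).
window = """
SELECT id, customer,
       SUM(amount) OVER (PARTITION BY customer ORDER BY id) AS running_total
FROM orders;
"""

print(conn.execute(nested).fetchall())   # e.g., [('a',)]
print(conn.execute(window).fetchall())   # running totals per customer
```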
TABLE I: The statistics and analysis of well-known datasets of text-to-SQL ordered by release time. The original dataset
indicates that the dataset is designed with a corresponding database, while post-annotated datasets involve annotating new
components within existing datasets and databases rather than releasing a new database.
Original Dataset Release Time #Example #DB #Table/DB #Row/DB Characteristics
BIRD [33] May-2023 12,751 95 7.3 549K Cross-domain, Knowledge-augmented
KaggleDBQA [36] Jun-2021 272 8 2.3 280K Cross-domain
DuSQL [34] Nov-2020 23,797 200 4.1 - Cross-domain, Cross-lingual
SQUALL [44] Oct-2020 11,468 1,679 1 - Knowledge-augmented
CoSQL [35] Sep-2019 15,598 200 - - Cross-domain, Context-dependent
Spider [13] Sep-2018 10,181 200 5.1 2K Cross-domain
WikiSQL [14] Aug-2017 80,654 26,521 1 17 Cross-domain
Post-annotated Dataset Release Time Base Dataset Special Setting Characteristics
ADVETA [37] Dec-2022 Spider, etc. Adversarial table perturbation Robustness
Spider-SS&CG [38] May-2022 Spider Splitting example into sub-examples Context-dependent
Spider-DK [39] Sep-2021 Spider Adding domain knowledge Knowledge-augmented
Spider-SYN [40] Jun-2021 Spider Manual synonym replacement Robustness
Spider-Realistic [41] Oct-2020 Spider Removing column names in question Robustness
CSpider [42] Sep-2019 Spider Chinese version of Spider Cross-lingual
SParC [43] Jun-2019 Spider Annotate conversational contents Context-dependent

4) Cross-Domain Generalization: Text-to-SQL models often struggle to generalize across different database schemas and domains. Models trained on a specific domain may not perform well on questions from a different domain due to differences in vocabulary, schema structure, and question patterns. Developing models that can effectively adapt to new domains with minimal fine-tuning or domain-specific training data is an ongoing challenge.

B. Evolutionary Process

The field of text-to-SQL has witnessed significant advancements over the years, evolving from rule-based approaches to deep learning-based methods and, more recently, to the integration of pre-trained language models (PLMs) and large language models (LLMs), as shown in Fig. 2.

1) Rule-based Approaches: Early text-to-SQL systems relied heavily on rule-based approaches [10–12], where manually crafted rules and heuristics were used to map natural language questions to SQL queries. These approaches often involved extensive feature engineering and domain-specific knowledge. While rule-based methods achieved success in specific domains, they lacked the flexibility and generalization capabilities needed to handle diverse and complex questions.

2) Deep Learning-based Methods: With the rise of deep learning, sequence-to-sequence models, such as LSTMs and transformers, were adapted to generate SQL queries from natural language input [19, 74]. Typically, RYANSQL [19] introduced techniques like intermediate representations and sketch-based slot filling to handle complex questions and improve cross-domain generalization. Recently, researchers have introduced graph neural networks (GNNs) for text-to-SQL tasks, leveraging schema dependency graphs to capture the relationships between database elements [18, 75].

3) PLM-based Approaches: Pre-trained language models (PLMs) have emerged as a powerful solution for text-to-SQL, leveraging the vast amounts of linguistic knowledge and understanding captured during pre-training. The early adoption of PLMs in text-to-SQL primarily focused on fine-tuning existing PLMs, such as BERT [24] and RoBERTa [76], on text-to-SQL datasets. These PLMs, pre-trained on large amounts of text data, captured rich semantic representations and language understanding capabilities. By fine-tuning them on text-to-SQL tasks, researchers aimed to leverage the knowledge and linguistic understanding of PLMs to generate accurate SQL queries [20, 74]. Another line of research focuses on incorporating schema information into PLMs to improve their understanding of database structures and to generate more accurate SQL queries. Schema-aware PLMs are designed to capture the relationships and constraints present in the database schema [21].

4) LLM-based Implementation: Large language models (LLMs), such as the GPT series [77, 78], have gained significant attention in recent years due to their ability to generate coherent and fluent text. Researchers have started exploring the potential of LLMs for text-to-SQL by leveraging their vast knowledge and generative capabilities [8, 26]. These approaches often involve prompt engineering to guide proprietary LLMs in the generation process or fine-tuning open-source LLMs on text-to-SQL datasets.

The integration of LLMs in text-to-SQL is still an emerging area, and there is significant potential for further exploration and improvement. Researchers are investigating ways to better leverage the knowledge and reasoning capabilities of LLMs, incorporate domain-specific knowledge, and develop more efficient fine-tuning strategies. As the field continues to evolve, we can expect to see more advanced and powerful LLM-based approaches that push the boundaries of text-to-SQL performance and generalization.

III. BENCHMARKS & EVALUATION

In this section, we introduce the benchmarks for text-to-SQL, encompassing well-known datasets and evaluation metrics.
A. Datasets

As shown in Table I, we classify the datasets into "Original Datasets" and "Post-annotated Datasets" based on whether they were released with an original dataset and databases or created by adapting existing datasets and databases with special processing. We also highlight their release times. For the original datasets, we provide a detailed analysis, including the number of examples, databases, tables per database, and rows per database. For the post-annotated datasets, we identify their base dataset and describe the special processing applied to them. To illustrate the potential opportunities of each dataset, we categorize their characteristics into Cross-domain, Knowledge-augmented, Context-dependent, Robustness, and Cross-lingual, which we discuss in detail below.

1) Cross-domain Dataset: This refers to datasets where the background information of different databases comes from various domains. Since real-world text-to-SQL tasks often involve databases from multiple domains, most original text-to-SQL datasets [13, 14, 33–36] and post-annotated datasets [37–43] are in the cross-domain setting to fit well with the needs of cross-domain applications.

2) Knowledge-augmented Dataset: Interest in incorporating knowledge into text-to-SQL tasks has increased significantly in recent years. BIRD [33] employs domain experts to annotate each text-to-SQL sample with external knowledge, categorized into Numeric Reasoning Knowledge, Domain Knowledge, Synonym Knowledge, and Value Illustration. Similarly, Spider-DK [39] defines and adds five types of domain knowledge for a human-curated version of the Spider dataset: SELECT Columns Mentioned by Omission, Simple Inference Required, Synonym Substitution in Cell Value Word, One Non-Cell Value Word Generates a Condition, and Easy to Conflict with Other Domains. Both studies found that human-annotated knowledge significantly improves SQL generation accuracy for samples requiring external domain knowledge. Additionally, SQUALL [44] manually annotates alignments between the words in NL questions and the entities in SQL, providing finer-grained supervision than other datasets.

3) Context-dependent Dataset: SParC [43] and CoSQL [35] explore context-dependent SQL generation by constructing a conversational database querying system. Unlike traditional text-to-SQL datasets that have only a single (question, SQL) pair per example, SParC decomposes the (question, SQL) examples in the Spider dataset into multiple (sub-question, SQL) pairs to construct a simulated and meaningful interaction, including inter-related sub-questions that aid SQL generation and unrelated sub-questions that enhance data diversity. CoSQL, in comparison, involves conversational interactions in natural language, simulating real-life scenarios to increase complexity and diversity. Additionally, Spider-SS&CG [38] splits the NL questions in the Spider dataset into multiple sub-questions and sub-SQLs, demonstrating that training on these sub-examples can improve a text-to-SQL model's generalization ability on out-of-distribution samples.

4) Robustness Dataset: Evaluating the accuracy of text-to-SQL models with attacked or perturbed database contents (e.g., schema and tables) is crucial for assessing robustness. Spider-Realistic [41] removes explicit schema-related words from the NL questions, while Spider-SYN [40] replaces them with manually selected synonyms. ADVETA [37] introduces Adversarial Table Perturbation (ATP), which perturbs tables by replacing original column names with misleading alternatives and inserting new columns with high semantic association but low semantic equivalency. These perturbations lead to significant drops in accuracy, as a text-to-SQL model with low robustness may be misled by incorrect matches between tokens in NL questions and database entities.

5) Cross-lingual Dataset: SQL keywords, function names, table names, and column names are typically written in English, posing challenges for applications in other languages. CSpider [42] translates the Spider dataset into Chinese, identifying new challenges in word segmentation and cross-lingual matching between Chinese questions and English database contents. DuSQL [34] introduces a practical text-to-SQL dataset with Chinese questions and database contents provided in both English and Chinese.

B. Evaluation Metrics

We introduce four widely used evaluation metrics for the text-to-SQL task as follows: Component Matching and Exact Matching, which are based on SQL content matching, and Execution Accuracy and Valid Efficiency Score, which are based on execution results.

1) Content Matching-based Metrics: SQL content matching metrics focus on comparing the generated SQL query with the ground truth SQL query based on their structural and syntactic similarities.
• Component Matching (CM) [13] evaluates the model performance by measuring the exact match between predicted and ground truth SQL components—SELECT, WHERE, GROUP BY, ORDER BY, and KEYWORDS—using the F1 score. Each component is decomposed into sets of sub-components and compared for an exact match, accounting for SQL components without order constraints.
• Exact Matching (EM) [13] measures the percentage of examples whose predicted SQL query is equivalent to the ground truth SQL query. A predicted SQL query is considered correct only if all of its components, as described in Component Matching, match exactly with those of the ground truth query.

2) Execution-based Metrics: Execution result metrics assess the correctness of the generated SQL query by comparing the results obtained from executing the query on the target database with the expected results.
• Execution Accuracy (EX) [13] measures the correctness of a predicted SQL query by executing it in the corresponding database and comparing the executed results with those of the ground truth query.
• Valid Efficiency Score (VES) [33] is defined to measure the efficiency of valid SQL queries, which are the predicted SQL queries whose executed results exactly match the ground truth. Thus, VES evaluates both the efficiency and accuracy of predicted SQL queries. For an evaluation dataset with N examples, VES can be computed by:

VES = (1/N) Σ_{n=1..N} 1(Vn, V̂n) · R(Yn, Ŷn),   (1)
where Ŷn and V̂n are the predicted SQL query and its executed results, and Yn and Vn are those of the corresponding ground truth SQL query. 1(Vn, V̂n) is an indicator function, where:

1(Vn, V̂n) = 1 if Vn = V̂n, and 0 if Vn ≠ V̂n.   (2)

Then, R(Yn, Ŷn) = E(Yn)/E(Ŷn) denotes the relative execution efficiency of the predicted SQL query in comparison to the ground-truth query, where E(·) is the execution time of each SQL query in the database. [33] ensures the stability of this metric by computing the average of R(Yn, Ŷn) over 100 runs for each example.
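As a concrete reading of Eq. 1-2, the sketch below computes EX and VES over a list of (predicted, ground-truth) SQL pairs against a sqlite3 connection. It is a simplification of the official evaluators: a single timing run rather than the 100-run average of [33], and set comparison of result rows.

```python
import sqlite3
import time

def run(conn, sql):
    """Execute a query; return (result rows as a set, elapsed seconds)."""
    try:
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        return set(rows), time.perf_counter() - start
    except sqlite3.Error:
        return None, float("inf")        # invalid SQL: the indicator will be 0

def ex_and_ves(conn, pairs):
    """pairs: list of (predicted_sql, gold_sql); returns (EX, VES)."""
    hits, ves = 0, 0.0
    for pred, gold in pairs:
        v_pred, t_pred = run(conn, pred)
        v_gold, t_gold = run(conn, gold)
        if v_pred is not None and v_pred == v_gold:   # 1(Vn, V^n) = 1
            hits += 1
            ves += t_gold / max(t_pred, 1e-9)         # R(Yn, Y^n) = E(Yn)/E(Y^n)
    n = len(pairs)
    return hits / n, ves / n
```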
Most of the recent LLM-based text-to-SQL studies focus on four datasets, Spider [13], Spider-Realistic [41], Spider-SYN [40], and BIRD [33], and on three evaluation metrics, EM, EX, and VES; we will primarily focus on them in the following introduction.

IV. METHODS

The implementation of LLM-based applications mostly relies on in-context learning (prompt engineering) [79] and fine-tuning [80], since powerful proprietary models and well-architected open-source models are being released in large quantities [81–83]. The specific methods can generally be divided into two paradigms: in-context learning (ICL) and fine-tuning (FT). In this survey, we primarily focus on these two text-to-SQL paradigms and discuss them accordingly.

A. In-context Learning

Through extensive and widely recognized research, prompt engineering has been proven to play a decisive role in the performance of LLMs [28], also impacting text-to-SQL under different prompt styles [8, 46]. Accordingly, developing text-to-SQL methods in the in-context learning (ICL) paradigm is valuable for achieving promising improvements. The implementation of an LLM-based text-to-SQL process to generate an executable SQL query Y is formulated as:

Y = f(Q, S, I | θ),   (3)

where Q represents the user question; S is the database schema/content, which can be decomposed as S = ⟨C, T, K⟩, where C = {c1, c2, ...} and T = {t1, t2, ...} represent the collections of columns and tables, and K is potential external knowledge (e.g., foreign key relationships [49], schema linking [30], and domain knowledge [31, 33]). I represents the instruction for the text-to-SQL task, which provides indicative guidance for the LLM to generate an accurate SQL query. f(· | θ) is an LLM with parameters θ. In the in-context learning (ICL) paradigm, we utilize an off-the-shelf text-to-SQL model (i.e., the parameters θ of the model are frozen) for implementation. Various well-designed methods have been adopted in the ICL paradigm for text-to-SQL. We group them into five categories C0:4, including: C0-Trivial Prompt, C1-Decomposition, C2-Prompt Optimization, C3-Reasoning Enhancement, and C4-Execution Refinement; the details are shown in Tab. II.

TABLE II: Typical methods used for in-context learning (ICL) in LLM-based text-to-SQL. The full table of existing methods with categorization C1:4 and more details are listed in Table III.

Category | Methods | Adopted by | Applied LLMs
C0-Trivial Prompt | Zero-shot | [26] | ChatGPT
C0-Trivial Prompt | Few-shot | [8] | ChatGPT
C1-Decomposition | DIN-SQL | [7] | GPT-4
C2-Prompt Optimization | DAIL-SQL | [8] | GPT-4
C3-Reasoning Enhancement | ACT-SQL | [49] | GPT-4
C4-Execution Refinement | LEVER | [68] | Codex

C0-Trivial Prompt: Trained on massive data, LLMs have a strong overall proficiency in different downstream tasks with zero-shot and few-shot prompting [80, 84, 85], which is widely recognized and achieves promising results. In our survey, we categorize the above prompting approaches without a well-designed framework as trivial prompts. As introduced above, the formulated process of LLM-based text-to-SQL in Eq. 3 can also represent zero-shot prompting; the input P0 can be obtained by concatenating I, S, and Q:

P0 = I ⊕ S ⊕ Q.   (4)

To regulate the prompting process, we set the OpenAI demonstration2 as the standard (trivial) prompt [30] for text-to-SQL.

Zero-shot: Many research works [26, 27, 46] utilize zero-shot prompting, focusing mainly on the influence of the style of prompt construction and the zero-shot performance of various LLMs for text-to-SQL. As an empirical evaluation, [26] evaluates the baseline text-to-SQL capability of different early-developed LLMs [77, 86, 87] and also presents the results for different prompt styles. The results indicate that prompt design is critical for performance; through error analysis, [26] proposes that including more database content can harm the overall performance. Since ChatGPT emerged with impressive capabilities in conversational scenarios and code generation, [27] assesses its performance on text-to-SQL. In zero-shot settings, the results demonstrate that ChatGPT has a promising text-to-SQL ability compared to the state-of-the-art (SOTA) models. For fair comparability, [47] investigates effective prompt construction for LLM-based text-to-SQL; they study different styles of prompt construction and draw conclusions on zero-shot prompt design based on the comparisons. Primary and foreign keys carry connective knowledge across different tables; [49] study their impact by adding these keys to differently designed database prompt styles and examining the zero-shot prompting results. A benchmark evaluation [8] also studies the influence of foreign keys, with five different prompt representation styles, where each style can be considered a permutation and combination of the instruction, rule implication, and foreign keys. Apart from foreign keys, this study also explores zero-shot prompting combined with the "no explanation" rule and the rule implication "Let's think step by step". Empowered by the annotated knowledge of human experts, [33] follows the standard prompting and obtains improvements by combining the provided oracle knowledge. With the explosion of open-source LLMs, according to the results of similar evaluations, these models are also capable of the zero-shot text-to-SQL task [45, 46, 50], especially code generation models [46, 48]. For zero-shot prompting optimization, [46] raises a challenge for designing effective prompt templates for LLMs: former prompt constructions lack structural uniformity, which makes it hard to find out which concrete elements within a prompt construction template influence the performance of LLMs. They address this challenge by investigating a more unified series of prompt templates wrapped with different prefixes, infixes, and postfixes.

2 The prompt style follows the official documentation from the OpenAI platform: https://platform.openai.com/examples/default-sql-translate
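The following is a minimal sketch of the trivial zero-shot prompt of Eq. 4, P0 = I ⊕ S ⊕ Q, loosely imitating the OpenAI SQL-translate demonstration style cited above; the exact wording and the section markers are illustrative choices, not a prescribed format.

```python
INSTRUCTION = "### Complete a SQLite query that answers the question.\n"

def zero_shot_prompt(schema: str, question: str) -> str:
    # P0 = I (+) S (+) Q: instruction, then schema, then question.
    return (f"{INSTRUCTION}"
            f"### Schema:\n{schema}\n"
            f"### Question: {question}\n"
            f"### SQL:")

schema = ('TABLE Cartoon("id", "Title", "Directed_by", "Written_by", "Channel")\n'
          'TABLE TV_Channel("id", "series_name")')
print(zero_shot_prompt(schema, "What cartoons were written by Joseph Kuhr?"))
```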
TABLE III: Well-designed methods used in the in-context learning (ICL) paradigm for LLM-based text-to-SQL, ordered by release time. The methods are grouped into four categories based on their implementation perspective: C1-Decomposition, C2-Prompt Optimization, C3-Reasoning Enhancement, C4-Execution Refinement. A method in multiple categories is introduced in each respectively. * There are multiple applied LLMs in the corresponding method; we present the selected LLM with representative performance. † The COT method is reported in multiple venues: NeurIPS'23 [33], EMNLP'23 [51], VLDB'24 [8], arXiv'24 [32]
Methods Applied LLMs Benchmark Metrics C1 C2 C3 C4 Release Time Publication Venue
MBR-Exec [67] Codex [13] EX ✓ Apr-2022 EMNLP’22
Coder-Reviewer [56] Codex [13] EX ✓ ✓ Nov-2022 ICML’23
LEVER [68] Codex [13] EX ✓ Feb-2023 ICML’23
SELF-DEBUGGING [48] StarCoder* [13] EX ✓ Apr-2023 ICLR’24
DESEM+P [62] ChatGPT [13, 40] EX ✓ ✓ Apr-2023 PRICAI’23
DIN-SQL [7] GPT-4* [13, 33] EX, EM, VES ✓ ✓ Apr-2023 NeurIPS’23
COT [8, 32, 33, 51] GPT-4 [13, 33, 41] EX, VES ✓ May-2023 Multiple Venues†
StructGPT [63] ChatGPT* [13, 40, 41] EX ✓ May-2023 EMNLP’23
SD+SA+Voting [52] ChatGPT* [13, 40, 41] EX ✓ ✓ May-2023 EMNLP’23 Findings
QDecomp [51] Codex [13, 41] EX ✓ ✓ May-2023 EMNLP’23
Least-to-Most [51] Codex [13] EX ✓ May-2023 EMNLP’23
SQL-PaLM [53] PaLM-2 [13] EX ✓ ✓ May-2023 arXiv’23
RAG+SP&DRC [64] ChatGPT [13] EX ✓ ✓ Jul-2023 ICONIP’23
C3 [30] ChatGPT [13] EX ✓ ✓ ✓ Jul-2023 arXiv’23
DAIL-SQL [8] GPT-4* [13, 33, 41] EX, EM, VES ✓ Aug-2023 VLDB’24
ODIS [54] Codex* [13] EX ✓ Oct-2023 EMNLP’23 Findings
ACT-SQL [49] GPT-4* [13, 40] EX, EM ✓ ✓ Oct-2023 EMNLP’23 Findings
MAC-SQL [57] GPT-4* [13, 33] EX, EM, VES ✓ ✓ Dec-2023 arXiv’23
DEA-SQL [58] GPT-4 [13] EX ✓ Feb-2024 ACL’24 Findings
FUSED [65] ChatGPT* [13] EX ✓ Feb-2024 arXiv’24
DELLM [31] GPT-4* [13, 33] EX, VES ✓ ✓ Feb-2024 ACL’24 Findings
SGU-SQL [32] GPT-4* [13, 33] EX, EM ✓ Feb-2024 arXiv’24
POT [55] GPT-4* [13, 33] EX ✓ Feb-2024 arXiv’24
SQL-CRAFT [55] GPT-4* [13, 33] EX ✓ ✓ Feb-2024 arXiv’24
FUXI [66] GPT-4* [33] EX ✓ ✓ Feb-2024 arXiv’24
MetaSQL [59] GPT-4* [13] EX, EM ✓ Feb-2024 ICDE’24
PET-SQL [60] GPT-4 [13] EX ✓ ✓ Mar-2024 arXiv’24
PURPLE [61] GPT-4* [13, 40, 41] EX, EM ✓ ✓ Mar-2024 ICDE’24

Few-shot: The technique of few-shot prompting is widely used in both practical applications and well-designed research, and has been proven efficient for eliciting better performance from LLMs [28, 88]. The input prompt of the few-shot approach to LLM-based text-to-SQL can be formulated as an extension of Eq. 3:

Pn = {F1, F2, . . . , Fn} ⊕ P0,   (5)

where Pn represents the input prompt for n-shot learning and n is the number of provided instances (examples); Fi denotes a few-shot instance, which can be decomposed as Fi = (Si, Qi, Yi), where i is the serial number of the instance. The study of few-shot prompting focuses on the number of representations and the selection of few-shot instances.

As pilot experiments, few-shot prompting for text-to-SQL has been evaluated on multiple datasets with various LLMs [7, 32], achieving better performance than zero-shot approaches. [33] provides a 1-shot example combined with CoT [89] prompting; the decomposed reasoning steps of the given example trigger the text-to-SQL model to generate accurate SQL. [55] study the effect of the number of few-shot examples. [52] focus on sampling strategies by studying the similarity and diversity between different demonstrations, setting random sampling as the baseline and evaluating different strategies and their combinations for comparison. Furthermore, beyond similarity selection, [8] evaluated masked question similarity selection and the upper limit of similarity approaches with various numbers of few-shot examples. A study of difficulty-level sample selection [51] compared the performance of few-shot Codex with random selection and difficulty-based selection of few-shot instances on difficulty-categorized datasets [13, 41]; three difficulty-based selection strategies are devised based on the number of selected samples at different difficulty levels. [49] utilize a hybrid strategy for selecting samples, which combines static examples and similarity-based dynamic examples for few-shot prompting; in their evaluations, they also test the impact of different input schema styles and different numbers of static and dynamic exemplars.
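Eq. 5 amounts to prepending serialized instances to the zero-shot prompt. A minimal sketch, reusing the zero_shot_prompt helper from the earlier sketch; the demonstration formatting is one plausible choice among many studied above.

```python
def few_shot_prompt(instances, schema, question):
    # Pn = F1 (+) ... (+) Fn (+) P0, with each Fi = (Si, Qi, Yi).
    shots = [f"### Schema:\n{s_i}\n### Question: {q_i}\n### SQL: {y_i}"
             for s_i, q_i, y_i in instances]
    return "\n\n".join(shots) + "\n\n" + zero_shot_prompt(schema, question)

demo = [("TABLE singer(id, name, age)",
         "How many singers are there?",
         "SELECT COUNT(*) FROM singer")]
print(few_shot_prompt(demo,
                      "TABLE Cartoon(id, Title, Written_by)",
                      "What cartoons were written by Joseph Kuhr?"))
```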
The impact of cross-domain few-shot examples has also been studied [54]. When incorporating in-domain and out-of-domain examples in different numbers, in-domain demonstrations always outperform zero-shot and out-of-domain examples, and performance improves as the number of examples rises. To explore the detailed construction of the input prompt, [53] compare concise and verbose prompt design approaches. The former style splits the schema, the column names, and the primary and foreign keys by bars, while the latter organizes them as natural language descriptions.

C1-Decomposition: As an intuitive solution, decomposing a challenging user question into simpler sub-questions and using multi-step reasoning for implementation can reduce the complexity of the full text-to-SQL task [7, 51]. Dealing with less complexity, LLMs have the potential to perform better. The decomposition approaches for LLM-based text-to-SQL are categorized into two paradigms: (1) sub-task decomposing, which provides additional parsing to assist the final SQL generation by decomposing the overall text-to-SQL task into smaller, effective sub-tasks (e.g., schema linking, domain classification); and (2) sub-question decomposing, which divides the user question into sub-questions to reduce the question's complexity and difficulty, then generates sub-SQL by solving these questions to deduce the final SQL query.

DIN-SQL [7] proposed a decomposed in-context learning method consisting of four modules: schema linking, classification & decomposition, SQL generation, and self-correction. DIN-SQL first finishes the schema linking between the user question and the target database; the following module decomposes the user question into correlated sub-questions and performs a difficulty classification. Based on the above information, the SQL generation module generates the corresponding SQL, and the self-correction module identifies and corrects potential errors in the predicted SQL. This approach comprehensively considers the decomposition of both sub-tasks and sub-questions. The Coder-Reviewer [56] framework proposed a re-ranking method, combining Coder models for generation and Reviewer models to evaluate the likelihood of the instruction. Referring to Chain-of-Thought [89] and Least-to-Most prompting [90], QDecomp [51] introduces question decomposition prompting, which follows the question reduction stage in least-to-most prompting and instructs the LLM to decompose the original complex question into intermediate reasoning steps. C3 [30] consists of three key components: clear prompting, calibration bias prompting, and consistency; these components are accomplished by assigning ChatGPT different tasks. Firstly, the clear prompting component generates the schema linking and the distilled question-relevant schema as a clear prompt. Then, a multi-turn dialogue about text-to-SQL hints is utilized as a calibration bias prompt, which is combined with the clear prompt to guide the SQL generation. The generated SQL queries are selected through consistency and execution-based voting to get the final SQL. MAC-SQL [57] presents a multi-agent collaboration framework; the text-to-SQL process is finished through the collaboration of three agents: the Selector, the Decomposer, and the Refiner. The Selector preserves relevant tables for the user question; the Decomposer breaks down the user question into sub-questions and provides solutions; finally, the Refiner validates and refines the defective SQL. DEA-SQL [58] introduces a workflow paradigm aiming to enhance the attention and problem-solving scope of LLM-based text-to-SQL through decomposition. This method decomposes the overall task, enabling the SQL generation module to have the corresponding prerequisite (information determination, question classification) and subsequent (self-correction, active learning) sub-tasks. Through the workflow paradigm, an accurate SQL query is generated. SGU-SQL [32] is a structure-to-SQL framework, leveraging inherent structure information to assist SQL generation. Specifically, the framework constructs graph structures for the user question and the corresponding database, respectively, then uses the encoded graphs to construct structure linking [91, 92]. A meta operator decomposes the user question with a grammar tree and finally designs the input prompt with meta-operations in SQL. MetaSQL [59] introduces a three-stage approach for SQL generation, consisting of decomposition, generation, and ranking. The decomposition stage uses semantic decomposition and metadata composition to process the user question. Taking the previously processed data as input, a text-to-SQL model uses metadata-conditioned generation to generate candidate SQL queries. Finally, a two-stage ranking pipeline is applied to get the globally optimal SQL query. PET-SQL [60] proposed a prompt-enhanced two-stage framework. Firstly, an elaborated prompt instructs the LLM to generate a preliminary SQL (PreSQL), where some few-shot demonstrations are selected based on similarity. Then, schema linking is found based on the PreSQL and combined to prompt the LLM to generate the final SQL (FinSQL). Finally, multiple LLMs are utilized to generate FinSQLs, ensuring consistency based on the execution results.
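A schematic of the sub-question decomposing paradigm described above, in the spirit of frameworks like DIN-SQL and MAC-SQL rather than their actual code; llm is a stub for a chat-completion call, and the prompts are illustrative.

```python
def llm(prompt: str) -> str:
    # Stub for a chat-completion call returning the model's text.
    raise NotImplementedError

def decomposed_text_to_sql(question: str, schema: str) -> str:
    # Sub-task decomposing: schema linking narrows the schema first.
    linked = llm(f"List only the tables/columns needed.\nSchema:\n{schema}\nQ: {question}")
    # Sub-question decomposing: split the question into simpler steps.
    sub_questions = llm(f"Decompose into sub-questions, one per line:\n{question}").splitlines()
    # Solve each sub-question into a partial (sub-)SQL.
    partials = [llm(f"Schema: {linked}\nWrite SQL for: {q}") for q in sub_questions]
    # Compose the final SQL, then run a self-correction pass.
    sql = llm("Combine these into one SQL query:\n" + "\n".join(partials))
    return llm(f"Fix any mistakes in this SQL; otherwise return it unchanged:\n{sql}")
```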
C2-Prompt Optimization: As previously introduced, few-shot learning is widely studied for prompting LLMs [77]. For LLM-based text-to-SQL with in-context learning, trivial few-shot approaches have obtained promising results [7, 8, 33], and further optimization of few-shot prompting has the potential to achieve higher performance. Since the performance of SQL generation in off-the-shelf LLMs largely depends on the quality of the corresponding input prompt [93], many decisive factors that influence the quality of the prompt have become focuses of research [8] (e.g., the quality and quantity of the few-shot organization, the similarity between the user question and the few-shot instances, and external knowledge/hints). The process of prompt quality improvement is essentially prompt optimization, including few-shot sampling strategies, schema augmentation, and external knowledge generation.

DESEM [62] is a prompt engineering framework with de-semanticization and skeleton retrieval. The framework first employs a domain-specific word masking module to remove the semantic tokens in questions while preserving the questions' intentions. It then utilizes an adjustable prompting module that retrieves few-shot examples with identical question intentions and incorporates schema-relevance filtering to guide the LLM's SQL generation. The QDecomp [51] framework introduces the InterCOL mechanism to incrementally incorporate the decomposed sub-questions with correlated table and column names. With difficulty-based selection, the few-shot examples for QDecomp are sampled by difficulty level. Besides similarity-diversity sampling, [52] proposed the SD+SA+Voting (Similarity-Diversity+Schema Augmentation+Voting) sampling strategy. They first employ semantic similarity and k-Means cluster diversity for sampling few-shot examples, then enhance the prompt with schema knowledge (semantic or structural augmentation). The C3 [30] framework comprises a clear prompting component, which takes the question and schema as the LLM's input and generates a clear prompt that includes a schema with redundant information irrelevant to the user question removed and a schema linking, together with a calibration component providing hints; the LLM takes their composition as a context-augmented prompt for SQL generation. A retrieval-augmented framework is introduced with sample-aware prompting [64], which simplifies the original question, extracts the question skeleton from the simplified question, then performs sample retrieval in the repository according to skeleton similarities. The retrieved samples are combined with the original question for few-shot prompting. ODIS [54] introduces sample selection with out-of-domain demonstrations and in-domain synthetic data, retrieving few-shot demonstrations from hybrid sources to augment the prompt representations. DAIL-SQL [8] proposed a novel approach to address the issues in few-shot sampling and organization, presenting a better balance between the quality and quantity of few-shot examples. DAIL Selection first masks domain-specific words in the user question and the few-shot example questions, then ranks the candidate examples based on the embedded Euclidean distance. Meanwhile, the similarity between the pre-predicted SQL queries is calculated. Finally, the selection mechanism obtains the similarity-sorted candidates according to pre-set criteria, so the selected few-shot examples have good similarity with both the question and the SQL query. ACT-SQL [49] proposed dynamic examples in few-shot prompting, selected according to similarity scores. FUSED [65] is presented to build a high-diversity demonstration pool through human-free multiple-iteration synthesis to improve the diversity of few-shot demonstrations. The pipeline of FUSED samples the demonstrations to be fused by clustering, then fuses the sampled demonstrations to construct the pool, enhancing few-shot learning. The Knowledge-to-SQL [31] framework aims to build a Data Expert LLM (DELLM) to provide knowledge for SQL generation. The DELLM is trained by supervised fine-tuning using human expert annotations [33] and further refined by preference learning with the database's feedback. DELLM generates four categories of knowledge, and well-designed methods (e.g., DAIL-SQL [8], MAC-SQL [57]) incorporate the generated knowledge to achieve better performance in LLM-based text-to-SQL with in-context learning.
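A sketch of masked-question similarity selection in the spirit of DESEM and DAIL Selection: domain-specific words are masked, questions are embedded, and candidates are ranked by Euclidean distance. The regex-based masking and the embed callback are simplifications of what those frameworks actually do.

```python
import re
import numpy as np

def mask_domain_words(question: str, schema_terms: set) -> str:
    # Replace schema-specific tokens with a placeholder (simplified de-semanticization).
    return " ".join("<mask>" if w.lower() in schema_terms else w
                    for w in re.findall(r"\w+", question))

def select_examples(embed, question, pool, schema_terms, k=3):
    # pool: list of (question, sql) pairs; embed: text -> np.ndarray.
    target = embed(mask_domain_words(question, schema_terms))
    dists = [np.linalg.norm(target - embed(mask_domain_words(q, schema_terms)))
             for q, _ in pool]
    order = np.argsort(dists)[:k]   # k nearest by embedded Euclidean distance
    return [pool[i] for i in order]
```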
C3-Reasoning Enhancement: LLMs have exhibited promising capabilities in tasks involving commonsense reasoning, symbolic reasoning, and arithmetic reasoning [94]. Since numeric and synonym reasoning frequently occur in realistic text-to-SQL scenarios [33, 41], prompting strategies for LLM reasoning possess the potential to enhance SQL generation capabilities. Recent studies primarily focus on incorporating well-designed reasoning-enhanced methods for text-to-SQL adaptation, improving LLMs to address the challenge of complex questions that require multi-step reasoning and the issue of self-consistency [95] in SQL generation.

The Chain-of-Thought (CoT) prompting technique [89] involves a comprehensive reasoning process that guides LLMs toward accurate deduction, eliciting reasoning in LLMs. Studies of LLM-based text-to-SQL utilize CoT prompting as a rule implication [8], setting the instruction "Let's think step by step" in the prompt construction [8, 32, 33, 51]. However, the straightforward (original) CoT strategy has not demonstrated the potential in text-to-SQL tasks that it has in other reasoning tasks; studying CoT adaptations is still ongoing research [51]. Since CoT prompting always uses static, human-annotated examples as demonstrations, which requires empirical judgment for the effective selection of few-shot examples and makes manual annotation essential, ACT-SQL [49] proposed a method to generate CoT examples automatically. Specifically, given a question, ACT-SQL truncates a set of slices of the question and then enumerates every column appearing in the corresponding SQL query. Each column is linked with its most relevant slice through a similarity function and appended to the CoT prompt. Through a systematic study of enhancing LLM SQL generation with CoT prompting, QDecomp [51] presents a novel framework to address the challenge of how CoT should come up with the reasoning steps to predict the SQL query. The framework utilizes every slice of the SQL query to construct a logical step in CoT reasoning, then employs natural language templates to articulate each slice of the SQL query and arranges them in the logical execution order. Least-to-Most [90] is another prompting technique that decomposes questions into sub-questions and then sequentially solves them. As iterative prompting, pilot experiments [51] demonstrate that it may be unnecessary for text-to-SQL parsing; using detailed reasoning steps tends to cause more error propagation. As a variant of CoT, the Program-of-Thoughts (PoT) prompting strategy [96] was proposed to enhance arithmetic reasoning for LLMs. Through evaluation [55], PoT enhances LLM SQL generation, especially on complicated datasets [33]. SQL-CRAFT [55] is proposed to enhance LLM-based SQL generation by incorporating PoT prompting for Python-enhanced reasoning. The PoT strategy requires the model to simultaneously generate the Python code and the SQL queries, enforcing the model to incorporate Python code in its reasoning process. Self-Consistency [95] is a prompting strategy that improves reasoning in LLMs, leveraging the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. In the text-to-SQL task, self-consistency is adapted by sampling a set of different SQL queries and voting for the consistent SQL via execution feedback [30, 53]. Similarly, the SD+SA+Voting [52] framework eliminates those queries with execution errors identified by the deterministic database management system (DBMS) and opts for the prediction that garners the majority vote. Furthermore, motivated by recent research on extending the capabilities of LLMs with tools, FUXI [66] is proposed to enhance LLM SQL generation through effectively invoking crafted tools.
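A sketch of execution-based self-consistency as adapted above: sample several SQL candidates from the LLM, execute each, and return a candidate from the largest group of agreeing execution results. sample_sql is a stub for one stochastic model draw at nonzero temperature.

```python
def self_consistent_sql(conn, sample_sql, n=8):
    # Group sampled queries by their execution result; failing queries are dropped.
    groups = {}
    for _ in range(n):
        sql = sample_sql()
        try:
            result = frozenset(conn.execute(sql).fetchall())
        except Exception:
            continue
        groups.setdefault(result, []).append(sql)
    if not groups:
        return None
    majority = max(groups.values(), key=len)   # the most consistent execution result
    return majority[0]
```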
C4-Execution Refinement: In designing criteria for accurate SQL generation, whether a generated SQL query can be successfully executed and elicit a correct answer for the user question is always the priority [13]. As a complex programming task, generating the correct SQL in one go is challenging. Intuitively, considering the execution feedback/results during SQL generation assists alignment with the corresponding database environment, which allows LLMs to gather the potential execution errors and results to refine the generated SQL or to hold a majority vote [30]. Execution-aware methods in text-to-SQL incorporate execution feedback in two main approaches: 1) incorporating the feedback through second-round prompting for regeneration: every SQL query generated in the initial response is executed in the corresponding database, thus obtaining feedback from the database. This feedback might be an error, or it might yield results that are appended to the second-round prompt. Through in-context learning of this feedback, LLMs are able to refine or regenerate the original SQL, thereby enhancing accuracy. 2) utilizing execution-based selection strategies for the generated SQL: multiple generated SQL queries are sampled from the LLM, and each is executed in the database. Based on the results of each SQL query, selection strategies (e.g., self-consistency, majority vote [60]) are used to define a query from the SQL set that satisfies the criteria as the final predicted SQL.
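A sketch of the first approach (second-round prompting with execution feedback): probe the database, and on an error append the message to the prompt and regenerate. llm is a stub for the model call, and the round limit is an arbitrary choice.

```python
def refine_with_execution(conn, llm, prompt, max_rounds=3):
    sql = llm(prompt)
    for _ in range(max_rounds):
        try:
            conn.execute(sql)            # probe the database with the query
            return sql                   # executable: accept this SQL
        except Exception as err:         # otherwise, second-round prompting
            sql = llm(f"{prompt}\nPrevious SQL: {sql}\nExecution error: {err}\nFixed SQL:")
    return sql
```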
MBR-Exec [67] introduced a natural language to code (NL2Code) translation framework with execution, which executes each sampled SQL query and selects the example with the minimal execution-result-based Bayes risk [97]. LEVER [68] proposed an approach to verify NL2Code with execution, utilizing a generation module and an execution module to collect the sampled SQL set and its execution results, respectively, then using a learned verifier to output the probability of correctness. Similarly, the SELF-DEBUGGING [48] framework is presented to teach LLMs to debug their predicted SQL via few-shot demonstrations; the model is able to refine its mistakes by investigating the execution results and explaining the generated SQL in natural language without human intervention.

As previously introduced, to incorporate well-designed frameworks with execution feedback, two-stage implementations are widely used: 1. sampling a set of SQL queries; 2. majority vote (self-consistency). Specifically, the C3 [30] framework removes the errors and identifies the most consistent SQL. The retrieval-augmented framework [64] introduced a dynamic revision chain, combining fine-grained execution messages with database content to prompt the LLMs to convert the generated SQL query into a natural language explanation; the LLMs are requested to identify the semantic gaps and revise their own generated SQL. Although schema-filtering approaches enhance SQL generation, the generated SQL could be unexecutable; DESEM [62] incorporates a fallback revision to address this issue, revising and regenerating the SQL based on different kinds of errors and setting termination criteria to avoid loops. DIN-SQL [7] designed generic and gentle prompts in its self-correction module; the generic prompt requests the LLM to identify and correct the error, while the gentle prompt asks the model to check for potential issues. The multi-agent framework MAC-SQL [57] comprises a refiner agent, which is able to detect and automatically rectify SQL errors, taking the SQLite error and exception class to regenerate the fixed SQL. Since different questions may require different numbers of revisions, the SQL-CRAFT [55] framework introduced interactive correction with an automated control determination process to avoid over-correction or insufficient correction. FUXI [66] considers error feedback in tool-based reasoning for SQL generation. Knowledge-to-SQL [31] introduces a preference learning framework incorporating the database execution feedback with direct preference optimization [98] to refine the proposed DELLM. PET-SQL [60] proposed cross-consistency, which comprises two variants: 1) naive voting: instruct multiple LLMs to generate SQL queries, then utilize a majority vote for the final SQL based on the different execution results; 2) fine-grained voting: refine the naive voting by difficulty level to mitigate voting bias.

B. Fine-tuning

Since supervised fine-tuning (SFT) is the mainstream approach for training LLMs [29], for open-source LLMs (e.g., LLaMA-2 [82], Gemma [99]) the most straightforward way to make the model quickly adapt to a specific domain is to perform SFT on the model with collected domain labels. The SFT phase is typically the preliminary phase of well-designed training frameworks [98, 100], as well as of fine-tuning for text-to-SQL. The auto-regressive generation process of an SQL query Y can be formulated as follows:

Pπ(Y | P) = Π_{k=1..n} Pπ(yk | P, Y1:k−1),   (6)

where Y = {y1, y2, . . . , yn} is an SQL query of length n, yk is the k-th token of the SQL query, and Y1:k−1 is the prefix sequence of Y ahead of the token yk. Pπ(yk | ·) is the conditional probability of an LLM π generating the k-th token of Y based on the input prompt P and the prefix sequence.

Given a basic open-source model π0, the goal of SFT is to obtain a model πSFT by minimizing the cross-entropy loss:

LSFT = − Σ_{k=1..n} log Pπ0(ŷk = yk | P, Y1:k−1),   (7)

where ŷk is the k-th token of the generated SQL query Ŷ, and Y is the corresponding ground-truth label.

The SFT approach for text-to-SQL has been widely adopted in text-to-SQL research for various open-source LLMs [8, 9, 46]. Compared to in-context learning (ICL) approaches, fine-tuning paradigms are more of a starting point in LLM-based text-to-SQL. Currently, several studies exploring better fine-tuning methods have been released. We categorize the well-designed fine-tuning methods into different groups based on their mechanisms, as shown in Tab. IV.
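A sketch of the SFT objective of Eq. 7 in PyTorch, assuming a HuggingFace-style causal LM whose forward pass returns .logits; only the SQL tokens contribute to the loss, conditioning each token on the prompt P and the prefix Y1:k−1.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, sql_ids):
    # Concatenate prompt P and label SQL Y into one sequence.
    input_ids = torch.cat([prompt_ids, sql_ids]).unsqueeze(0)   # (1, T)
    logits = model(input_ids).logits                            # (1, T, vocab)
    # Position t predicts token t+1, so shift targets by one and
    # mask out prompt positions so only SQL tokens are scored (Eq. 7).
    targets = input_ids[:, 1:].clone()
    targets[:, : prompt_ids.numel() - 1] = -100                 # ignored by the loss
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)
```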
kinds of errors and sets termination criteria to avoid the loop. Enhanced Architecture: The widely-used generative pre-
DIN-SQL [7] designed a generic and gentle prompt in their trained transformer (GPT) framework utilizes decoder-only
In the fine-tuning paradigm, the SQL generation process of an auto-regressive LLM π given an input prompt P can be formulated as:

P_{\pi}(Y \mid P) = \prod_{k=1}^{n} P_{\pi}(y_k \mid P, Y_{1:k-1}),   (6)

where Y = {y_1, y_2, ..., y_n} is an SQL query of length n, y_k is the corresponding k-th token of the SQL query, and Y_{1:k-1} is the prefix sequence of Y preceding the token y_k. P_{\pi}(y_k \mid \cdot) is the conditional probability of the LLM π generating the k-th token of Y based on the input prompt P and the prefix sequence.

Given a basic open-source model π^0, the goal of SFT is to obtain a model π^{SFT} by minimizing the cross-entropy loss:

L_{SFT} = -\sum_{k=1}^{n} \log P_{\pi^0}(\hat{y}_k = y_k \mid P, Y_{1:k-1}),   (7)

where ŷ_k is the k-th token of the generated SQL query Ŷ, and Y is the corresponding ground-truth label. A minimal sketch of this objective is given below.
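For reference, Eq. (7) is the standard next-token cross-entropy restricted to the SQL tokens. The sketch below, assuming a Hugging Face-style causal language model, masks the prompt positions with the conventional ignore index of -100 so that the loss is accumulated only over the target query Y.

    import torch

    def sft_loss(model, tokenizer, prompt: str, sql: str) -> torch.Tensor:
        """Cross-entropy of Eq. (7): -sum_k log P(y_k | P, Y_{1:k-1})."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        sql_ids = tokenizer(sql, add_special_tokens=False,
                            return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, sql_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # no loss on the prompt P
        # Causal LMs shift the labels internally, so the returned loss is the
        # mean of the per-token terms of Eq. (7) over the unmasked positions.
        return model(input_ids=input_ids, labels=labels).loss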
The SFT approach has been widely adopted in text-to-SQL research for various open-source LLMs [8, 9, 46]. Compared to in-context learning (ICL) approaches, the fine-tuning paradigm is still at its starting point in LLM-based text-to-SQL. Currently, several studies exploring better fine-tuning methods have been released. We categorize the well-designed fine-tuning methods into different groups based on their mechanisms, as shown in Tab. IV.

Enhanced Architecture: The widely used generative pre-trained transformer (GPT) framework utilizes a decoder-only transformer architecture with conventional auto-regressive decoding for text generation. Recent studies on the efficiency of LLMs have revealed a common challenge: when generating long sequences with the auto-regressive paradigm, the need to incorporate the attention mechanism results in high latency [101, 102]. In LLM-based text-to-SQL, the speed of generating SQL queries is significantly slower than in traditional language modeling [21, 28], which has become a challenge in constructing high-efficiency local NLIDBs. As one of the solutions, CLLMs [69] address the above challenge with an enhanced model architecture, achieving a speedup for SQL generation.
TABLE IV: Well-designed methods used in fine-tuning (FT) for LLM-based text-to-SQL. The methods in each category are ordered by release time. *The method is utilized with multiple open-source LLMs; we select a representative model to present.

Category              | Adopted by      | Applied LLMs | Dataset  | EX | EM | VES | Release Time | Publication Venue
Enhanced Architecture | CLLMs [69]      | Deepseek*    | [13]     | ✓  |    |     | Mar-2024     | ICML'24
Pre-training          | CodeS [9]       | StarCoder    | [13, 33] | ✓  |    | ✓   | Feb-2024     | SIGMOD'24
Data Augmentation     | DAIL-SQL [8]    | LLaMA*       | [13, 41] | ✓  | ✓  |     | Aug-2023     | VLDB'24
Data Augmentation     | Symbol-LLM [50] | CodeLLaMA    | [13]     | ✓  |    |     | Nov-2023     | ACL'24
Data Augmentation     | CodeS [9]       | StarCoder    | [13, 33] | ✓  |    | ✓   | Feb-2024     | SIGMOD'24
Data Augmentation     | StructLM [70]   | CodeLLaMA    | [13]     | ✓  |    |     | Feb-2024     | arXiv'24
Decomposition         | DTS-SQL [71]    | Mistral*     | [13, 40] | ✓  | ✓  |     | Feb-2024     | arXiv'24
Data Augmentation: During the fine-tuning process, the most straightforward factor affecting the model's performance is the quality of the training labels [103]. Fine-tuning under low-quality or scarce training labels amounts to “making bricks without straw”: using high-quality or augmented data always surpasses a meticulously designed fine-tuning method applied to low-quality or raw data [29, 104]. Data-augmented fine-tuning in text-to-SQL has made substantial progress, focusing on enhancing data quality during the SFT process. DAIL-SQL [8] is designed as an in-context learning framework, utilizing a sampling strategy to obtain better few-shot instances; incorporating the sampled instances into the SFT process improves the performance of open-source LLMs. Symbol-LLM [50] proposes an injection and an infusion stage for data-augmented instruction tuning. CodeS [9] augments the training data through bi-directional generation with the help of ChatGPT, and StructLM [70] is trained on multiple structured-knowledge tasks to improve its overall capability. A minimal sketch of the bi-directional idea follows.
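A pattern shared by these data-augmentation methods is letting a stronger LLM synthesize new question-SQL pairs from seed data. The sketch below illustrates the bi-directional idea in a generic form; llm_complete is a hypothetical wrapper around any chat-completion API and is not part of the surveyed frameworks, and the prompt formats are illustrative only.

    def llm_complete(prompt: str) -> str:
        """Hypothetical wrapper around a proprietary chat-completion API."""
        raise NotImplementedError

    def sql_to_question(schema: str, sql: str) -> str:
        """Backward direction: synthesize a natural question for a SQL query."""
        return llm_complete(
            f"Database schema:\n{schema}\n"
            f"Write the one-sentence question answered by this query:\n{sql}"
        )

    def question_to_sql(schema: str, question: str) -> str:
        """Forward direction: synthesize a SQL label for a natural question."""
        return llm_complete(
            f"Database schema:\n{schema}\n"
            f"Write a SQLite query answering: {question}"
        )

    def augment(pairs: list[tuple[str, str]], schema: str) -> list[tuple[str, str]]:
        """Expand a seed set of (question, sql) pairs in both directions."""
        new_pairs = []
        for question, sql in pairs:
            new_pairs.append((sql_to_question(schema, sql), sql))
            new_pairs.append((question, question_to_sql(schema, question)))
        return new_pairs

Executing the synthesized SQL against the database and discarding failing pairs is a cheap additional filter that keeps the augmented labels consistent.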
Pre-training: Pre-training is a fundamental phase of the complete fine-tuning process, aimed at acquiring text generation capabilities through auto-regressive training on extensive data [105]. Conventionally, the current powerful proprietary LLMs (e.g., ChatGPT [106], GPT-4 [78], Claude [107]) are pre-trained on hybrid corpora and mostly benefit from dialogue scenarios that exercise text generation capability [77]. Code-specific LLMs (e.g., CodeLLaMA [108], StarCoder [109]) are pre-trained on code data [87]; the mixture of various programming languages enables these LLMs to generate code that meets the user's instruction [110]. As a sub-task of code generation, the main challenge for SQL-specific pre-training is that SQL/database-related content occupies only a small portion of the entire pre-training corpus. As a result, open-source LLMs with comparatively limited comprehension capacity (compared to ChatGPT or GPT-4) do not acquire a strong understanding of how to convert NL questions to SQL during their pre-training process. The pre-training phase of the CodeS [9] model consists of three stages of incremental pre-training: starting from a basic code-specific LLM [109], CodeS is further pre-trained on a hybrid training corpus including SQL-related data, NL-to-code data, and NL-related data, which significantly improves text-to-SQL understanding and performance. A sketch of such a corpus mixture follows.
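As a rough illustration of such a hybrid corpus, the sketch below draws pre-training batches from three sources according to fixed mixture weights; the weights are our own placeholder values, not the ratios used by CodeS.

    import random

    # Placeholder mixture weights for SQL-oriented incremental pre-training.
    MIXTURE = {
        "sql_related": 0.4,  # SQL queries, schemas, database documentation
        "nl_to_code": 0.4,   # paired natural language and code samples
        "nl_related": 0.2,   # general natural language text
    }

    def sample_batch(corpora: dict[str, list[str]], batch_size: int) -> list[str]:
        """Draw one pre-training batch according to the mixture weights."""
        names = list(MIXTURE)
        weights = [MIXTURE[n] for n in names]
        sources = random.choices(names, weights=weights, k=batch_size)
        return [random.choice(corpora[src]) for src in sources]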
Decomposition: Decomposing a task into multiple steps, or using multiple models to solve the task, is an intuitive solution for addressing complex scenarios, as previously introduced for the ICL paradigm in Sec. IV-A. The proprietary models utilized in ICL-based methods have a massive number of parameters, far beyond the parameter level of the open-source models used in fine-tuning methods, and thus inherently possess the capability to perform assigned sub-tasks well (through mechanisms such as few-shot learning) [30, 57]. To replicate this success with open-source models, it is necessary to reasonably assign the corresponding sub-tasks to them (such as generating external knowledge, schema linking, and distilling the schema), perform sub-task-specific fine-tuning, and construct the corresponding fine-tuning data, thereby assisting the final SQL generation. DTS-SQL [71] proposes such a two-stage decomposed text-to-SQL fine-tuning framework, designing a schema-linking pre-generation task ahead of the final SQL generation, as sketched below.
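The following is a minimal sketch of this two-stage pattern in the spirit of such decomposed fine-tuning; generate stands in for a greedy-decoding call into a locally fine-tuned model and is an assumption of ours, as are the prompt formats.

    def generate(model, prompt: str) -> str:
        """Hypothetical greedy-decoding call into a fine-tuned local LLM."""
        raise NotImplementedError

    def two_stage_text_to_sql(model, question: str, full_schema: str) -> str:
        # Stage 1: schema linking -- predict only the relevant tables/columns.
        linked_schema = generate(
            model,
            f"Schema:\n{full_schema}\nQuestion: {question}\n"
            "List the tables and columns needed to answer the question:",
        )
        # Stage 2: SQL generation conditioned on the filtered schema only,
        # which shortens the prompt and narrows the search space.
        return generate(
            model,
            f"Relevant schema:\n{linked_schema}\nQuestion: {question}\nSQL:",
        )

In practice the two stages can be fine-tuned separately, each with its own sub-task-specific training data.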
V. EXPECTATIONS

Despite the significant advancements made in text-to-SQL research, several challenges remain to be addressed. In this section, we discuss the remaining challenges that we expect future work to overcome.

A. Robustness in Real-world Applications

Text-to-SQL implemented by LLMs is expected to generalize and be robust across the complex scenarios of real-world applications. Despite recent advances making substantial progress on robustness-specific datasets [37, 41], performance still falls short of practical application [33], and there are challenges yet to be overcome in future studies. From the user aspect, the user is not always a clear question proposer: user questions may not contain the exact database values and may deviate from the standard datasets, including synonyms, typos, and vague expressions [40]. For instance, in the fine-tuning paradigm, models are trained on clearly indicative questions with concrete expressions. Since the model has not learned the mapping of realistic questions to the corresponding databases, a knowledge gap emerges when it is applied to real-world scenarios [33]. As reported in evaluations on datasets with synonymous and incomplete instructions [26, 51], the SQL queries generated by ChatGPT contain around 40% incorrect executions, roughly 10% lower than on the original evaluation [51]. Simultaneously, fine-tuning with local text-to-SQL datasets may involve non-standardized samples and labels.
As an example, the name of a table or column is not always an accurate representation of its content, which yields inconsistency in the training data construction and may leave a semantic gap between the database schema and the user question. To address this challenge, aligning LLMs against intention bias and designing training strategies for noisy scenarios will benefit recent advances. Meanwhile, the data size in real-world applications is relatively smaller than in research-oriented benchmarks. Since extending data at scale through human annotation incurs high labor costs, designing data-augmentation methods to obtain more question-SQL pairs would support LLMs under data scarcity, and studying the adaptation of fine-tuned open-source LLMs to small local datasets could be similarly beneficial. Furthermore, extensions to multi-lingual [42, 111] and multi-modal scenarios [112] should be studied comprehensively in future research, which would benefit more language groups and help build more general database interfaces.
B. Computational Efficiency

Computational efficiency is determined by inference speed and the cost of computational resources, and it is worth considering in both application and research work [49, 69]. With the increasing complexity of databases in up-to-date text-to-SQL benchmarks [15, 33], databases will carry more information (including more tables and columns), and the token length of the database schema will correspondingly increase, raising a series of challenges. When dealing with an ultra-complex database, taking the corresponding schema as input may mean that the cost of calling proprietary LLMs increases significantly, and the input may even exceed the model's maximum token length, especially with open-source models that have shorter context lengths. Meanwhile, another obvious challenge is that most works use the full schema as model input, which introduces significant redundancy [57]. Providing the LLM with a precise, question-related filtered schema directly, to reduce cost and redundancy, is a potential solution for improving computational efficiency [30]; designing an accurate method for schema filtering remains a future direction, and a minimal heuristic sketch follows.
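As a point of departure, the sketch below filters a schema with a deliberately simple lexical-overlap heuristic, keeping only the top-k tables whose names and columns share tokens with the question; learned schema-linking models would replace the scoring function in practice, and the example schema is our own.

    import re

    def filter_schema(question: str, schema: dict[str, list[str]], top_k: int = 3):
        """Keep the top-k tables most lexically related to the question.

        schema maps each table name to its column names; the score counts
        shared lowercase tokens between the question and the table's names.
        """
        q_tokens = set(re.findall(r"[a-z0-9]+", question.lower()))

        def score(table: str) -> int:
            names = [table] + schema[table]
            t_tokens = {tok for name in names
                        for tok in re.findall(r"[a-z0-9]+", name.lower())}
            return len(t_tokens & q_tokens)

        ranked = sorted(schema, key=score, reverse=True)
        return {table: schema[table] for table in ranked[:top_k]}

    # Example: only the question-related table survives filtering.
    schema = {"employee": ["id", "name", "salary"],
              "department": ["id", "budget"]}
    print(filter_schema("What is the average salary of each employee?", schema, top_k=1))
    # -> {'employee': ['id', 'name', 'salary']}

Shrinking the schema in this way shortens the prompt, which lowers API cost for proprietary models and fits within the shorter context windows of local models.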
Although the in-context learning paradigm achieves promising accuracy, computational efficiency remains a concern: well-designed methods with multi-stage frameworks or extended contexts increase the number of API calls, and while they enhance performance, they simultaneously lead to a substantial rise in cost [7]. As reported in related approaches [49], the trade-off between performance and computational efficiency should be considered carefully, and designing a comparable (or even better) in-context learning method with lower API cost would be a practical implementation; it is still under exploration. Compared to PLM-based methods, the inference speed of LLM-based methods is observably slower [21, 28]. Accelerating inference by shortening the input length and reducing the number of stages in the implementation is intuitive for the in-context learning paradigm. For local LLMs, starting from [69], more speedup strategies enhancing the model architecture can be studied in future exploration.

C. Data Privacy and Interpretability

As a part of the study of LLMs, LLM-based text-to-SQL also faces general challenges present in LLM research [4, 113, 114]. Potential improvements from the text-to-SQL perspective on these challenges are also expected, thereby extensively benefiting the study of LLMs. As previously discussed in Sec. IV-A, the in-context learning paradigm predominates in both volume and performance among recent studies, with the majority of work using proprietary models for implementation [7, 8]. A straightforward challenge arises regarding data privacy, since calling proprietary APIs to handle local databases with confidentiality requirements can pose a risk of data leakage. Using a local fine-tuning paradigm can partially address this issue; still, the current performance of vanilla fine-tuning is not ideal [8], and advanced fine-tuning frameworks potentially rely on proprietary LLMs for data augmentation [9]. Given the current status, more tailored frameworks for text-to-SQL in the local fine-tuning paradigm deserve widespread attention. Overall, the development of deep learning continually faces challenges regarding interpretability [114, 115]. As a long-standing challenge, considerable work has already been studied to address this issue [116, 117]. However, in text-to-SQL research, the interpretability of LLM-based implementations is still not being discussed, whether in the in-context learning or the fine-tuning paradigm. Approaches with a decomposition phase explain the text-to-SQL implementation process from the perspective of step-by-step generation [7, 51]. Building on this, combining advanced studies in interpretability [118, 119] to enhance text-to-SQL performance, and interpreting local model architectures from the perspective of database knowledge, remain future directions.

D. Extensions

As a sub-field of LLM and natural language understanding research, many studies from these fields have been adopted for text-to-SQL tasks, advancing its development [89, 95]. However, text-to-SQL research can also be extended to the larger scope of these fields. For instance, SQL generation is a part of code generation, and well-designed approaches in code generation also obtain promising performance in text-to-SQL [48, 68], generalizing across various programming languages. The potential extension of tailored text-to-SQL frameworks to NL-to-code studies can likewise be discussed: frameworks integrating execution output in NL-to-code also achieve solid performance in SQL generation [7], so attempting to extend execution-aware approaches in text-to-SQL with other advancing modules [30, 31] to code generation is worth discussing. From another perspective, we previously discussed that text-to-SQL can enhance LLM-based question answering (QA) by providing factual information. A database can store relational knowledge as structured information, and structure-based QA can potentially benefit from text-to-SQL (e.g., knowledge-based question answering, KBQA [120]). Constructing factual knowledge with database structure and then incorporating a text-to-SQL system to achieve information retrieval can potentially assist further QA with more accurate factual knowledge [121]. More extensions of text-to-SQL studies are expected in future work.
REFERENCES

[1] L. Wang, B. Qin, B. Hui, B. Li, M. Yang, B. Wang, B. Li, J. Sun, F. Huang, L. Si, and Y. Li, “Proton: Probing schema linking information from pre-trained language models for text-to-sql parsing,” in International Conference on Knowledge Discovery and Data Mining (KDD), 2022.
[2] B. Qin, B. Hui, L. Wang, M. Yang, J. Li, B. Li, R. Geng, R. Cao, J. Sun, L. Si et al., “A survey on text-to-sql parsing: Concepts, methods, and future directions,” arXiv preprint arXiv:2208.13629, 2022.
[3] S. Xu, S. Semnani, G. Campagna, and M. Lam, “Autoqa: From databases to qa semantic parsers with only synthetic training data,” in Empirical Methods in Natural Language Processing (EMNLP), 2020.
[4] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., “Siren’s song in the ai ocean: a survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023.
[5] P. Manakul, A. Liusie, and M. J. Gales, “Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models,” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[6] S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” in Association for Computational Linguistics (ACL), 2021.
[7] M. Pourreza and D. Rafiei, “DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
[8] D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, and J. Zhou, “Text-to-sql empowered by large language models: A benchmark evaluation,” in International Conference on Very Large Data Bases (VLDB), 2024.
[9] H. Li, J. Zhang, H. Liu, J. Fan, X. Zhang, J. Zhu, R. Wei, H. Pan, C. Li, and H. Chen, “Codes: Towards building open-source language models for text-to-sql,” arXiv preprint arXiv:2402.16347, 2024.
[10] F. Li and H. V. Jagadish, “Constructing an interactive natural language interface for relational databases,” in International Conference on Very Large Data Bases (VLDB), 2014.
[11] T. Yu, C.-S. Wu, X. V. Lin, B. Wang, Y. C. Tan, X. Yang, D. Radev, R. Socher, and C. Xiong, “Grappa: Grammar-augmented pre-training for table semantic parsing,” in International Conference on Learning Representations (ICLR), 2021.
[12] T. Mahmud, K. A. Hasan, M. Ahmed, and T. H. C. Chak, “A rule based approach for nlp based query processing,” in International Conference on Electrical Information and Communication Technologies (EICT), 2015.
[13] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. Radev, “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task,” in Empirical Methods in Natural Language Processing (EMNLP), 2018.
[14] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Generating structured queries from natural language using reinforcement learning,” arXiv preprint arXiv:1709.00103, 2017.
[15] M. Pourreza and D. Rafiei, “Evaluating cross-domain text-to-SQL models and benchmarks,” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[16] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2014.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
[18] B. Hui, X. Shi, R. Geng, B. Li, Y. Li, J. Sun, and X. Zhu, “Improving text-to-sql with schema dependency learning,” arXiv preprint arXiv:2103.04399, 2021.
[19] D. Choi, M. C. Shin, E. Kim, and D. R. Shin, “Ryansql: Recursively applying sketch-based slot fillings for complex text-to-sql in cross-domain databases,” in International Conference on Computational Linguistics (COLING), 2021.
[20] P. Yin, G. Neubig, W.-t. Yih, and S. Riedel, “Tabert: Pretraining for joint understanding of textual and tabular data,” arXiv preprint arXiv:2005.08314, 2020.
[21] H. Li, J. Zhang, C. Li, and H. Chen, “Resdsql: Decoupling schema linking and skeleton parsing for text-to-sql,” in Conference on Artificial Intelligence (AAAI), 2023.
[22] J. Li, B. Hui, R. Cheng, B. Qin, C. Ma, N. Huo, F. Huang, W. Du, L. Si, and Y. Li, “Graphix-t5: Mixing pre-trained transformers with graph-aware layers for text-to-sql parsing,” in Conference on Artificial Intelligence (AAAI), 2023.
[23] D. Rai, B. Wang, Y. Zhou, and Z. Yao, “Improving generalization in language model-based text-to-sql semantic parsing: Two simple semantic boundary-based techniques,” in Association for Computational Linguistics (ACL), 2023.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.
[25] Q. Lyu, K. Chakrabarti, S. Hathi, S. Kundu, J. Zhang, and Z. Chen, “Hybrid ranking network for text-to-sql,” arXiv preprint arXiv:2008.04759, 2020.
[26] N. Rajkumar, R. Li, and D. Bahdanau, “Evaluating the text-to-sql capabilities of large language models,” arXiv preprint arXiv:2204.00498, 2022.
[27] A. Liu, X. Hu, L. Wen, and P. S. Yu, “A comprehensive evaluation of chatgpt’s zero-shot text-to-sql capability,” arXiv preprint arXiv:2303.13547, 2023.
[28] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing the power of llms in practice: A survey on chatgpt and beyond,” Transactions on Knowledge Discovery from Data (TKDD), 2024.
[29] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
[30] X. Dong, C. Zhang, Y. Ge, Y. Mao, Y. Gao, J. Lin, D. Lou et al., “C3: Zero-shot text-to-sql with chatgpt,” arXiv preprint arXiv:2307.07306, 2023.
[31] Z. Hong, Z. Yuan, H. Chen, Q. Zhang, F. Huang, and X. Huang, “Knowledge-to-sql: Enhancing sql generation with data expert llm,” arXiv preprint arXiv:2402.11517, 2024.
[32] Q. Zhang, J. Dong, H. Chen, W. Li, F. Huang, and X. Huang, “Structure guided large language model for sql generation,” arXiv preprint arXiv:2402.13284, 2024.
[33] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. Chang, F. Huang, R. Cheng, and Y. Li, “Can LLM already serve as a database interface? a BIg bench for large-scale database grounded text-to-SQLs,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
[34] L. Wang, A. Zhang, K. Wu, K. Sun, Z. Li, H. Wu, M. Zhang, and H. Wang, “DuSQL: A large-scale and pragmatic Chinese text-to-SQL dataset,” in Empirical Methods in Natural Language Processing (EMNLP), 2020.
[35] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Li, Y. Jiang, M. Yasunaga, S. Shim, T. Chen, A. Fabbri, Z. Li, L. Chen, Y. Zhang, S. Dixit, V. Zhang, C. Xiong, R. Socher, W. Lasecki, and D. Radev, “CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases,” in Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[36] C.-H. Lee, O. Polozov, and M. Richardson, “KaggleDBQA: Realistic evaluation of text-to-SQL parsers,” in Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021.
[37] X. Pi, B. Wang, Y. Gao, J. Guo, Z. Li, and J.-G. Lou, “Towards robustness of text-to-SQL models against natural and realistic adversarial table perturbation,” in Association for Computational Linguistics (ACL), 2022.
[38] Y. Gan, X. Chen, Q. Huang, and M. Purver, “Measuring and improving compositional generalization in text-to-SQL via component alignment,” in Findings of North American Chapter of the Association for Computational Linguistics (NAACL), 2022.
[39] Y. Gan, X. Chen, and M. Purver, “Exploring underexplored limitations of cross-domain text-to-SQL generalization,” in Empirical Methods in Natural Language Processing (EMNLP), 2021.
[40] Y. Gan, X. Chen, Q. Huang, M. Purver, J. R. Woodward, J. Xie, and P. Huang, “Towards robustness of text-to-SQL models against synonym substitution,” in Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021.
[41] X. Deng, A. H. Awadallah, C. Meek, O. Polozov, H. Sun, and M. Richardson, “Structure-grounded pretraining for text-to-SQL,” in North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021.
[42] Q. Min, Y. Shi, and Y. Zhang, “A pilot study for Chinese SQL semantic parsing,” in Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[43] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, and D. Radev, “SParC: Cross-domain semantic parsing in context,” in Association for Computational Linguistics (ACL), 2019.
[44] T. Shi, C. Zhao, J. Boyd-Graber, H. Daumé III, and L. Lee, “On the potential of lexico-logical alignments for semantic parsing to SQL queries,” in Findings of Empirical Methods in Natural Language Processing (EMNLP), 2020.
[45] S. Xue, C. Jiang, W. Shi, F. Cheng, K. Chen, H. Yang, Z. Zhang, J. He, H. Zhang, G. Wei, W. Zhao, F. Zhou, D. Qi, H. Yi, S. Liu, and F. Chen, “Db-gpt: Empowering database interactions with private large language models,” arXiv preprint arXiv:2312.17449, 2024.
[46] B. Zhang, Y. Ye, G. Du, X. Hu, Z. Li, S. Yang, C. H. Liu, R. Zhao, Z. Li, and H. Mao, “Benchmarking the text-to-sql capability of large language models: A comprehensive evaluation,” arXiv preprint arXiv:2403.02951, 2024.
[47] S. Chang and E. Fosler-Lussier, “How to prompt LLMs for text-to-SQL: A study in zero-shot, single-domain, and cross-domain settings,” in NeurIPS 2023 Second Table Representation Learning Workshop (NeurIPS), 2023.
[48] X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large language models to self-debug,” in International Conference on Learning Representations (ICLR), 2024.
[49] H. Zhang, R. Cao, L. Chen, H. Xu, and K. Yu, “ACT-SQL: In-context learning for text-to-SQL with automatically-generated chain-of-thought,” in Findings of Empirical Methods in Natural Language Processing (EMNLP), 2023.
[50] F. Xu, Z. Wu, Q. Sun, S. Ren, F. Yuan, S. Yuan, Q. Lin, Y. Qiao, and J. Liu, “Symbol-llm: Towards foundational symbol-centric interface for large language models,” arXiv preprint arXiv:2311.09278, 2024.
[51] C.-Y. Tai, Z. Chen, T. Zhang, X. Deng, and H. Sun, “Exploring chain of thought style prompting for text-to-SQL,” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[52] L. Nan, Y. Zhao, W. Zou, N. Ri, J. Tae, E. Zhang, A. Cohan, and D. Radev, “Enhancing text-to-SQL capabilities of large language models: A study on prompt design strategies,” in Findings of Empirical Methods in Natural Language Processing (EMNLP), 2023.
[53] R. Sun, S. O. Arik, H. Nakhost, H. Dai, R. Sinha, P. Yin, and T. Pfister, “Sql-palm: Improved large language model adaptation for text-to-sql,” arXiv preprint arXiv:2306.00739, 2023.
[54] S. Chang and E. Fosler-Lussier, “Selective demonstrations for cross-domain text-to-SQL,” in Findings of Empirical Methods in Natural Language Processing (EMNLP), 2023.
[55] H. Xia, F. Jiang, N. Deng, C. Wang, G. Zhao, R. Mihalcea, and Y. Zhang, “Sql-craft: Text-to-sql through interactive refinement and enhanced reasoning,” arXiv preprint arXiv:2402.14851, 2024.
[56] T. Zhang, T. Yu, T. B. Hashimoto, M. Lewis, W.-t. Yih, D. Fried, and S. I. Wang, “Coder reviewer reranking for code generation,” in International Conference on Machine Learning (ICML), 2023.
[57] B. Wang, C. Ren, J. Yang, X. Liang, J. Bai, L. Chai, Z. Yan, Q.-W. Zhang, D. Yin, X. Sun, and Z. Li, “Mac-sql: A multi-agent collaborative framework for text-to-sql,” arXiv preprint arXiv:2312.11242, 2024.
[58] Y. Xie, X. Jin, T. Xie, M. Lin, L. Chen, C. Yu, L. Cheng, C. Zhuo, B. Hu, and Z. Li, “Decomposition for enhancing attention: Improving llm-based text-to-sql through workflow paradigm,” arXiv preprint arXiv:2402.10671, 2024.
[59] Y. Fan, Z. He, T. Ren, C. Huang, Y. Jing, K. Zhang, and X. S. Wang, “Metasql: A generate-then-rank framework for natural language to sql translation,” arXiv preprint arXiv:2402.17144, 2024.
[60] Z. Li, X. Wang, J. Zhao, S. Yang, G. Du, X. Hu, B. Zhang, Y. Ye, Z. Li, R. Zhao, and H. Mao, “Pet-sql: A prompt-enhanced two-stage text-to-sql framework with cross-consistency,” arXiv preprint arXiv:2403.09732, 2024.
[61] T. Ren, Y. Fan, Z. He, R. Huang, J. Dai, C. Huang, Y. Jing, K. Zhang, Y. Yang, and X. S. Wang, “Purple: Making a large language model a better sql writer,” arXiv preprint arXiv:2403.20014, 2024.
[62] C. Guo, Z. Tian, J. Tang, P. Wang, Z. Wen, K. Yang, and T. Wang, “Prompting gpt-3.5 for text-to-sql with de-semanticization and skeleton retrieval,” in Pacific Rim International Conference on Artificial Intelligence (PRICAI), 2024.
[63] J. Jiang, K. Zhou, Z. Dong, K. Ye, X. Zhao, and J.-R. Wen, “StructGPT: A general framework for large language model to reason over structured data,” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[64] C. Guo, Z. Tian, J. Tang, S. Li, Z. Wen, K. Wang, and T. Wang, “Retrieval-augmented gpt-3.5-based text-to-sql framework with sample-aware prompting and dynamic revision chain,” in International Conference on Neural Information Processing (ICONIP), 2024.
[65] D. Wang, L. Dou, X. Zhang, Q. Zhu, and W. Che, “Improving demonstration diversity by human-free fusing for text-to-sql,” arXiv preprint arXiv:2402.10663, 2024.
[66] Y. Gu, Y. Shu, H. Yu, X. Liu, Y. Dong, J. Tang, J. Srinivasa, H. Latapie, and Y. Su, “Middleware for llms: Tools are instrumental for language agents in complex environments,” arXiv preprint arXiv:2402.14672, 2024.
[67] F. Shi, D. Fried, M. Ghazvininejad, L. Zettlemoyer, and S. I. Wang, “Natural language to code translation with execution,” in Empirical Methods in Natural Language Processing (EMNLP), 2022.
[68] A. Ni, S. Iyer, D. Radev, V. Stoyanov, W.-t. Yih, S. I. Wang, and X. V. Lin, “Lever: Learning to verify language-to-code generation with execution,” in International Conference on Machine Learning (ICML), 2023.
[69] S. Kou, L. Hu, Z. He, Z. Deng, and H. Zhang, “Cllms: Consistency large language models,” arXiv preprint arXiv:2403.00835, 2024.
[70] A. Zhuang, G. Zhang, T. Zheng, X. Du, J. Wang, W. Ren, S. W. Huang, J. Fu, X. Yue, and W. Chen, “Structlm: Towards building generalist models for structured knowledge grounding,” arXiv preprint arXiv:2402.16671, 2024.
[71] M. Pourreza and D. Rafiei, “Dts-sql: Decomposed text-to-sql with small large language models,” arXiv preprint arXiv:2402.01117, 2024.
[72] D. Xu, W. Chen, W. Peng, C. Zhang, T. Xu, X. Zhao, X. Wu, Y. Zheng, and E. Chen, “Large language models for generative information extraction: A survey,” arXiv preprint arXiv:2312.17617, 2023.
[73] G. Katsogiannis-Meimarakis and G. Koutrika, “A survey on deep learning approaches for text-to-sql,” The VLDB Journal, 2023.
[74] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, and D. Zhang, “Towards complex text-to-sql in cross-domain database with intermediate representation,” arXiv preprint arXiv:1905.08205, 2019.
[75] X. Xu, C. Liu, and D. Song, “Sqlnet: Generating structured queries from natural language without reinforcement learning,” arXiv preprint arXiv:1711.04436, 2017.
[76] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[77] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
[78] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[79] J. Wang, E. Shi, S. Yu, Z. Wu, C. Ma, H. Dai, Q. Yang, Y. Kang, J. Wu, H. Hu et al., “Prompt engineering for healthcare: Methodologies and applications,” arXiv preprint arXiv:2304.14670, 2023.
[80] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021.
[81] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[82] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[83] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[84] L. Reynolds and K. McDonell, “Prompt programming for large language models: Beyond the few-shot paradigm,” in ACM Conference on Human Factors in Computing Systems (CHI), 2021.
[85] X. Ye and G. Durrett, “The unreliability of explanations in few-shot prompting for textual reasoning,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
[86] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research (JMLR), 2020.
[87] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[88] J. Zamfirescu-Pereira, R. Y. Wong, B. Hartmann, and Q. Yang, “Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts,” in ACM Conference on Human Factors in Computing Systems (CHI), 2023.
[89] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
[90] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le et al., “Least-to-most prompting enables complex reasoning in large language models,” arXiv preprint arXiv:2205.10625, 2022.
[91] W. Lei, W. Wang, Z. Ma, T. Gan, W. Lu, M.-Y. Kan, and T.-S. Chua, “Re-examining the role of schema linking in text-to-SQL,” in Empirical Methods in Natural Language Processing (EMNLP), 2020.
[92] Q. Liu, D. Yang, J. Zhang, J. Guo, B. Zhou, and J.-G. Lou, “Awakening latent grounding from pretrained language models for semantic parsing,” in Findings of Association for Computational Linguistics (ACL), 2021.
[93] Z. Tan, X. Liu, Q. Shu, X. Li, C. Wan, D. Liu, Q. Wan, and G. Liao, “Enhancing text-to-SQL capabilities of large language models through tailored promptings,” in International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), 2024.
[94] J. Huang and K. C.-C. Chang, “Towards reasoning in large language models: A survey,” in Findings of Association for Computational Linguistics (ACL), 2023.
[95] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in International Conference on Learning Representations (ICLR), 2023.
[96] W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,” Transactions on Machine Learning Research (TMLR), 2023.
[97] M. Müller and R. Sennrich, “Understanding the properties of minimum Bayes risk decoding in neural machine translation,” in Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021.
[98] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
[99] G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on gemini research and technology,” arXiv preprint arXiv:2403.08295, 2024.
[100] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
[101] Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in International Conference on Machine Learning (ICML), 2023.
[102] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” arXiv preprint arXiv:2302.01318, 2023.
[103] H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning from noisy labels with deep neural networks: A survey,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2023.
[104] N. Deng, Y. Chen, and Y. Zhang, “Recent advances in text-to-SQL: A survey of what we have and what we expect,” in International Conference on Computational Linguistics (COLING), 2022.
[105] D. Erhan, A. Courville, Y. Bengio, and P. Vincent, “Why does unsupervised pre-training help deep learning?” in Artificial Intelligence and Statistics (AISTATS), 2010.
[106] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu et al., “Summary of chatgpt-related research and perspective towards the future of large language models,” Meta-Radiology, 2023.
[107] Anthropic, “Introducing Claude,” 2023.
[108] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
[109] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. Dehaene, J. Lamy-Poirier, J. Monteiro, N. Gontier, M.-H. Yee, L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. T. Stillerman, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey, Z. Zhang, U. Bhattacharyya, W. Yu, S. Luccioni, P. Villegas, F. Zhdanov, T. Lee, N. Timor, J. Ding, C. S. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao, M. Mishra, A. Gu, C. J. Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried, D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes, T. Wolf, A. Guha, L. V. Werra, and H. de Vries, “Starcoder: may the source be with you!” Transactions on Machine Learning Research (TMLR), 2023.
[110] Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, and Q. Liu, “Aligning large language models with human: A survey,” arXiv preprint arXiv:2307.12966, 2023.
[111] A. Tuan Nguyen, M. H. Dao, and D. Q. Nguyen, “A pilot study of text-to-SQL semantic parsing for Vietnamese,” in Findings of Empirical Methods in Natural Language Processing (EMNLP), 2020.
[112] Y. Song, R. C.-W. Wong, and X. Zhao, “Speech-to-sql: toward speech-driven sql query generation from natural language question,” The VLDB Journal, 2024.
[113] B. Yan, K. Li, M. Xu, Y. Dong, Y. Zhang, Z. Ren, and X. Cheng, “On protecting the data privacy of large language models (llms): A survey,” arXiv preprint arXiv:2403.05156, 2024.
[114] C. Singh, J. P. Inala, M. Galley, R. Caruana, and J. Gao, “Rethinking interpretability in the era of large language models,” arXiv preprint arXiv:2402.01761, 2024.
[115] D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei, “Knowledge neurons in pretrained transformers,” in Association for Computational Linguistics (ACL), 2022.
[116] N. Zhang, Y. Yao, B. Tian, P. Wang, S. Deng, M. Wang, Z. Xi, S. Mao, J. Zhang, Y. Ni et al., “A comprehensive study of knowledge editing for large language models,” arXiv preprint arXiv:2401.01286, 2024.
[117] K. Meng, A. S. Sharma, A. J. Andonian, Y. Belinkov, and D. Bau, “Mass-editing memory in a transformer,” in International Conference on Learning Representations (ICLR), 2023.
[118] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, “Locating and editing factual associations in gpt,” Advances in Neural Information Processing Systems (NeurIPS), 2022.
[119] C. Zheng, L. Li, Q. Dong, Y. Fan, Z. Wu, J. Xu, and B. Chang, “Can we edit factual knowledge by in-context learning?” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[120] H. Luo, Z. Tang, S. Peng, Y. Guo, W. Zhang, C. Ma, G. Dong, M. Song, W. Lin et al., “Chatkbqa: A generate-then-retrieve framework for knowledge base question answering with fine-tuned large language models,” arXiv preprint arXiv:2310.08975, 2023.
[121] G. Xiong, J. Bao, and W. Zhao, “Interactive-kbqa: Multi-turn interactions for knowledge base question answering with large language models,” arXiv preprint arXiv:2402.15131, 2024.
[122] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
[123] Z. Hong and J. Liu, “Towards better question generation in qa-based event extraction,” arXiv preprint arXiv:2405.10517, 2024.
[124] Y. Liu, H. He, T. Han, X. Zhang, M. Liu, J. Tian, Y. Zhang, J. Wang, X. Gao, T. Zhong et al., “Understanding llms: A comprehensive overview from training to inference,” arXiv preprint arXiv:2401.02038, 2024.
[125] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, Y. Xue, J. Zhai, W. Chen, Z. Liu, P. Zhang, Y. Dong, and J. Tang, “GLM-130b: An open bilingual pre-trained model,” in International Conference on Learning Representations (ICLR), 2023.
[126] Q. Zhang, J. Dong, Q. Tan, and X. Huang, “Integrating entity attributes for error-aware knowledge graph embedding,” IEEE Transactions on Knowledge and Data Engineering (TKDE), 2024.
[127] Q. Zhang, J. Dong, H. Chen, X. Huang, D. Zha, and Z. Yu, “Knowgpt: Black-box knowledge injection for large language models,” arXiv preprint arXiv:2312.06185, 2023.
[128] F. Huang, Z. Yang, J. Jiang, Y. Bei, Y. Zhang, and H. Chen, “Large language model interaction simulator for cold-start item recommendation,” arXiv preprint arXiv:2402.09176, 2024.
[129] Y. Bei, H. Xu, S. Zhou, H. Chi, M. Zhang, Z. Li, and J. Bu, “Cpdg: A contrastive pre-training method for dynamic graph neural networks,” arXiv preprint arXiv:2307.02813, 2023.
[130] Y. Bei, H. Chen, S. Chen, X. Huang, S. Zhou, and F. Huang, “Non-recursive cluster-scale graph interacted model for click-through rate prediction,” in International Conference on Information and Knowledge Management (CIKM), 2023.
[131] Z. Yuan, D. Liu, W. Pan, and Z. Ming, “Sql-rank++: A novel listwise approach for collaborative ranking with implicit feedback,” in International Joint Conference on Neural Networks (IJCNN), 2022.
[132] H. Chen, Y. Bei, Q. Shen, Y. Xu, S. Zhou, W. Huang, F. Huang, S. Wang, and X. Huang, “Macro graph neural networks for online billion-scale recommender systems,” in International World Wide Web Conference (WWW), 2024.
