LLM-Based Text-to-SQL
Text-to-SQL is a long-standing task in natural language processing research. It aims to convert (translate) natural language questions into database-executable SQL queries. Fig. 1 provides an example of a large language model-based (LLM-based) text-to-SQL system. Given a user question such as "Could you tell me the names of the 5 leagues with the highest matches of all time and how many matches were played in the said league?", the LLM takes the question and its corresponding database schema as input and then generates an SQL query as output. This SQL query can be executed in the database to retrieve the relevant content to answer the user's question. The above system builds a natural language interface to the database (NLIDB) with LLMs. SQL remains one of the most widely used programming languages, with over half (51.52%) of professional developers using SQL in their work, yet only around a third (35.29%) of those developers are systematically trained¹; the NLIDB therefore enables non-skilled users to access structured databases like professional database engineers [1, 2] and also accelerates human-computer interaction [3]. Furthermore, amid the research hotspot of LLMs, text-to-SQL can provide a potential solution to the prevalent hallucination [4, 5] issue by incorporating realistic content from the database to fill the knowledge gaps of LLMs [6]. The significant value and potential of text-to-SQL have triggered a range of studies on its integration and optimization with LLMs [7–10]; consequently, LLM-based text-to-SQL remains a highly discussed research field within the NLP and database communities.

* Corresponding author.
¹ https://survey.stackoverflow.co/2023
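As a concrete sketch of the task, the example question above might map to a query like the following. The table and column names here are illustrative assumptions, not taken from any real benchmark database:

```python
# A hypothetical (question, SQL) pair illustrating the text-to-SQL task.
# The schema (League, Match tables with id/name/league_id columns) is assumed.
question = (
    "Could you tell me the names of the 5 leagues with the highest matches "
    "of all time and how many matches were played in the said league?"
)
predicted_sql = """
SELECT L.name, COUNT(M.id) AS match_count
FROM League AS L
JOIN Match AS M ON M.league_id = L.id
GROUP BY L.name
ORDER BY match_count DESC
LIMIT 5;
"""
```

Executing such a query against the database returns the five league names and their match counts, which answers the user's question directly.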
Fig. 2: A sketch of the evolutionary process for text-to-SQL research from the perspective of implementation paradigm. Each stage is presented with two representative implementation techniques. The timestamps for the stages are not exact; we set each timestamp according to the release time of the representative works of each implementation paradigm, with a margin of error of about one year before and after. The format is inspired by [29].
Previous studies have made notable progress in the implementation of text-to-SQL and have undergone a long evolutionary process. Early efforts were mostly based on well-designed rules and templates [11], specifically suitable for simple database scenarios. In recent years, with the heavy labor costs [12] brought by rule-based methods and the growing complexity of database environments [13–15], designing a rule or template for each scenario has become increasingly difficult and impractical. The development of deep neural networks advanced the progress of text-to-SQL [16, 17]: such networks can automatically learn a mapping from the user question to its corresponding SQL [18, 19]. Subsequently, pre-trained language models (PLMs) with strong semantic parsing capacity became the new paradigm for text-to-SQL systems [20], boosting performance to a new level [21–23]. Incremental research on PLM-based optimization, such as table content encoding [19, 24, 25] and pre-training [20, 26], has further advanced this field. Most recently, LLM-based approaches implement text-to-SQL through the in-context learning (ICL) [8] and fine-tuning (FT) [10] paradigms, reaching state-of-the-art accuracy with well-designed frameworks and stronger comprehension capability compared to PLMs.

The overall implementation details of LLM-based text-to-SQL can be divided into three aspects: 1. Question understanding: the NL question is a semantic representation of the user's intention, which the corresponding generated SQL query is expected to align with; 2. Schema comprehension: the schema provides the table and column structure of the database, and the text-to-SQL system is required to identify the target components that match the user question; 3. SQL generation: this involves incorporating the above parsing and then predicting the correct syntax to generate executable SQL queries that can retrieve the required answer. LLMs have proven to perform well as a vanilla implementation [7, 27], benefiting from the more powerful semantic parsing capacity enabled by richer training corpora [28, 29]. Further studies on enhancing LLMs for question understanding [8, 9], schema comprehension [30, 31], and SQL generation [32] are being released at an increasing rate.

Despite the significant progress made in text-to-SQL research, several challenges remain that hinder the development of robust and generalized text-to-SQL systems [73]. In this survey, we aim to catch up with the recent advances and provide a comprehensive review of the current state-of-the-art models and approaches in LLM-based text-to-SQL. We begin by introducing the fundamental concepts and challenges associated with text-to-SQL, highlighting the importance of this task in various domains. We then delve into the evolution of the implementation paradigm for text-to-SQL systems, discussing the key advancements and breakthroughs in this field. After the overview, we provide a detailed introduction and analysis of recent advances in text-to-SQL integrating LLMs. Specifically, the body of our survey covers a range of contents related to LLM-based text-to-SQL, including:
• Datasets and Benchmarks: We provide a detailed introduction to the commonly used datasets and benchmarks for evaluating LLM-based text-to-SQL systems. We discuss their characteristics, complexity, and the challenges they pose for text-to-SQL development and evaluation.
• Evaluation Metrics: We present the evaluation metrics used to assess the performance of LLM-based text-to-SQL systems, including both content matching-based and execution-based paradigms. We then briefly introduce the characteristics of each metric.
• Methods and Models: We present a systematic analysis of the different methods and models employed for LLM-based text-to-SQL, including in-context learning and fine-tuning-based paradigms. We discuss their implementation details, strengths, and adaptations specific to the text-to-SQL task from various implementation perspectives.
• Expectations and Future Directions: We discuss the remaining challenges and limitations of LLM-based text-to-SQL, such as real-world robustness, computational efficiency, data privacy, and extensions. We also outline potential future research directions and opportunities for improvement and optimization.
We hope this survey provides a clear overview of recent studies and inspires future research. Fig. 3 shows a taxonomy tree that summarizes the structure and contents of our survey.

II. OVERVIEW

Text-to-SQL is a task that aims to convert natural language questions into corresponding SQL queries that can be executed in a relational database. Formally, given a user question Q (also known as a user query, natural language question, etc.) and a database schema S, the goal of the task is to generate an SQL query Y that retrieves the required content from the database to answer the user question. Text-to-SQL has the
[Fig. 3, rendered here as a text outline: Datasets & Benchmarks — Datasets (§III-A), split into original datasets (e.g., BIRD [33], DuSQL [34], CoSQL [35], Spider [13], WikiSQL [14], KaggleDBQA [36], SQUALL [44], SParC [43]) and post-annotated datasets (e.g., ADVETA [37], Spider-SS and Spider-CG [38], Spider-DK [39], Spider-SYN [40], Spider-Realistic [41], CSpider [42]), characterized as cross-domain, knowledge-augmented, cross-lingual, context-dependent, or robustness-oriented; and Evaluation (§III-B), covering execution-based metrics such as Execution Accuracy (EX) [13] and Valid Efficiency Score (VES) [33]. Methods (§IV) — the in-context learning paradigm (§IV-A), spanning trivial prompts (zero-shot [7, 9, 27, 33, 45–50] and few-shot [8, 9, 32, 33, 49, 51–55]), decomposition (e.g., Coder-Reviewer [56], DIN-SQL [8], QDecomp [51], C3 [30], MAC-SQL [57], DEA-SQL [58], SGU-SQL [32], MetaSQL [59], PET-SQL [60], PURPLE [61]), prompt optimization (e.g., DESEM+P [62], StructGPT [63], SD+SA+Voting [52], RAG+SP&DRC [64], C3 [30], DAIL-SQL [9], ODIS [54], ACT-SQL [49], FUSED [65], DELLM [31]), reasoning enhancement (e.g., CoT [9, 32, 33, 51], QDecomp [51], Least-to-Most [51], SQL-PaLM [53], ACT-SQL [49], POT [55], SQL-CRAFT [55], FUXI [66]), and execution refinement (e.g., MBR-Exec [67], Coder-Reviewer [56], LEVER [68], SELF-DEBUGGING [48], DELLM [31], PET-SQL [60], PURPLE [61]); and the fine-tuning paradigm (§IV-B), spanning supervised fine-tuning [9, 45, 50, 53], enhanced architecture (CLLMs [69]), pre-training (CodeS [10]), and data augmentation (DAIL-SQL [9], Symbol-LLM [50], CodeS [10], StructLM [70]).]

Fig. 3: Taxonomy tree of the research in LLM-based text-to-SQL. The display order in each node is organized by release time. The format is adapted from [72].
potential to democratize access to data by allowing users to interact with databases using natural language, without the need for specialized knowledge of SQL programming [74]. This can benefit various domains, such as business intelligence, customer support, and scientific research, by enabling non-skilled users to easily retrieve target content from databases and facilitating more efficient data analysis.

A. Challenges in Text-to-SQL

The technical challenges for text-to-SQL implementations can be summarized as follows:
1) Linguistic Complexity and Ambiguity: Natural language questions often contain complex linguistic representations, such as nested clauses, coreferences, and ellipses, which make it challenging to map them accurately to the corresponding parts of SQL queries [41]. Additionally, natural language is inherently ambiguous, with multiple possible representations for a given user question [75, 76]. Resolving these ambiguities and understanding the intent behind the user question requires deep natural language understanding and the capability to incorporate context and domain knowledge [33].
2) Schema Understanding and Representation: To generate accurate SQL queries, text-to-SQL systems need to have a comprehensive understanding of the database schema, including table names, column names, and relationships between various tables. However, the database schema can be complex and vary significantly across different domains [13]. Representing and encoding the schema information in a way that can be effectively utilized by the text-to-SQL model is a challenging task.
3) Rare and Complex SQL Operations: Some SQL queries involve rare or complex operations and syntax in challenging scenarios, such as nested sub-queries, outer joins, and window functions. These operations are less frequent in the training
data and pose challenges for text-to-SQL systems to generate them accurately. Designing models that can generalize to a wide range of SQL operations, including rare and complex scenarios, is an essential consideration.
4) Cross-Domain Generalization: Text-to-SQL systems often struggle to generalize across various database scenarios and domains. Models trained on a specific domain may not perform well on questions from other domains due to the variety in vocabulary, database schema structure, and question patterns. Developing systems that can effectively generalize to new domains with minimal domain-specific training data or fine-tuning adaptation is a significant challenge [77].

B. Evolutionary Process

The research field of text-to-SQL has witnessed significant advancements over the years in the NLP community, having evolved from rule-based methods to deep learning-based approaches and, more recently, to integrating pre-trained language models (PLMs) and large language models (LLMs); a sketch of the evolutionary process is shown in Fig. 2.
1) Rule-based Methods: Early text-to-SQL systems relied heavily on rule-based methods [11, 12, 26], where manually crafted rules and heuristics were used to map natural language questions to SQL queries. These approaches often involved extensive feature engineering and domain-specific knowledge. While rule-based methods achieved success in specific simple domains, they lacked the flexibility and generalization capabilities needed to handle diverse and complex questions.
2) Deep Learning-based Approaches: With the rise of deep neural networks, sequence-to-sequence models and encoder-decoder structures, such as LSTMs [78] and transformers [17], were adapted to generate SQL queries from natural language input [19, 79]. Notably, RYANSQL [19] introduced techniques like intermediate representations and sketch-based slot filling to handle complex questions and improve cross-domain generalization. More recently, researchers have introduced graph neural networks (GNNs) for text-to-SQL tasks, leveraging schema dependency graphs to capture the relationships between database elements [18, 80].
3) PLM-based Implementation: Pre-trained language models (PLMs) have emerged as a powerful solution for text-to-SQL, leveraging the vast amounts of linguistic knowledge and semantic understanding captured during the pre-training process. The early adoption of PLMs in text-to-SQL primarily focused on fine-tuning off-the-shelf PLMs, such as BERT [24] and RoBERTa [81], on standard text-to-SQL datasets [13, 14]. These PLMs, pre-trained on large corpora, captured rich semantic representations and language understanding capabilities. By fine-tuning them on text-to-SQL tasks, researchers aimed to leverage this semantic and linguistic understanding to generate accurate SQL queries [20, 79, 82]. Another line of research focuses on incorporating schema information into PLMs to improve their understanding of database structures and enable them to generate more executable SQL queries; such schema-aware PLMs are designed to capture the relationships and constraints present in the database structure [21].
4) LLM-based Implementation: Large language models (LLMs), such as the GPT series [83–85], have gained significant attention in recent years due to their capability to generate coherent and fluent text. Researchers have started exploring the potential of LLMs for text-to-SQL by leveraging their extensive knowledge reserves and superior generation capabilities [7, 9]. These approaches often involve prompt engineering to guide proprietary LLMs in SQL generation [47] or fine-tuning open-source LLMs on text-to-SQL datasets [9].
The integration of LLMs in text-to-SQL is still an emerging research area with significant potential for further exploration and improvement. Researchers are investigating approaches to better leverage the knowledge and reasoning capabilities of LLMs, incorporate domain-specific knowledge [31, 33], and develop more efficient fine-tuning strategies [10]. As the field continues to evolve, we anticipate the development of more advanced and superior LLM-based implementations that will elevate the performance and generalization capabilities of text-to-SQL to new heights.

III. BENCHMARKS & EVALUATION

In this section, we introduce the benchmarks for text-to-SQL, encompassing well-known datasets and evaluation metrics.

A. Datasets

As shown in Tab. I, we classify the datasets into 'Original Datasets' and 'Post-annotated Datasets' based on whether they were released together with new databases or were created by adapting existing datasets and databases with special settings. For the original datasets, we provide a detailed analysis, including the number of examples, the number of databases, the number of tables per database, and the number of rows per database. For the post-annotated datasets, we identify their source dataset and describe the special setting applied to them. To illustrate the potential opportunities of each dataset, we annotate them based on their characteristics. The annotations are listed in the rightmost column of Tab. I, which we discuss in detail below.
1) Cross-domain Dataset: This refers to datasets where the background information of the different databases comes from various domains. Since real-world text-to-SQL applications often involve databases from multiple domains, most original text-to-SQL datasets [13, 14, 33–36] and post-annotated datasets [37–43] adopt the cross-domain setting to fit the requirements of cross-domain applications.
2) Knowledge-augmented Dataset: Interest in incorporating domain-specific knowledge into text-to-SQL tasks has increased significantly in recent years. BIRD [33] employs human database experts to annotate each text-to-SQL sample with external knowledge, categorized into numeric reasoning knowledge, domain knowledge, synonym knowledge, and value illustration. Similarly, Spider-DK [39] defines and adds five types of domain knowledge to a human-curated version of the Spider dataset [13]: SELECT columns mentioned by omission, simple inference required, synonym substitution in cell value words, one non-cell value word generating a condition, and easy
TABLE I: Statistics and analysis of well-known text-to-SQL datasets, ordered by release time. An original dataset is released together with its corresponding databases, while a post-annotated dataset annotates new components within existing datasets and databases rather than releasing a new database.
Original Dataset Release Time #Example #DB #Table/DB #Row/DB Characteristics
BIRD [33] May-2023 12,751 95 7.3 549K Cross-domain, Knowledge-augmented
KaggleDBQA [36] Jun-2021 272 8 2.3 280K Cross-domain
DuSQL [34] Nov-2020 23,797 200 4.1 - Cross-domain, Cross-lingual
SQUALL [44] Oct-2020 11,468 1,679 1 - Knowledge-augmented
CoSQL [35] Sep-2019 15,598 200 - - Cross-domain, Context-dependent
Spider [13] Sep-2018 10,181 200 5.1 2K Cross-domain
WikiSQL [14] Aug-2017 80,654 26,521 1 17 Cross-domain
Post-annotated Dataset Release Time Source Dataset Special Setting Characteristics
ADVETA [37] Dec-2022 Spider, etc. Adversarial table perturbation Robustness
Spider-SS&CG [38] May-2022 Spider Splitting example into sub-examples Context-dependent
Spider-DK [39] Sep-2021 Spider Adding domain knowledge Knowledge-augmented
Spider-SYN [40] Jun-2021 Spider Manual synonym replacement Robustness
Spider-Realistic [41] Oct-2020 Spider Removing column names in question Robustness
CSpider [42] Sep-2019 Spider Chinese version of Spider Cross-lingual
SParC [43] Jun-2019 Spider Annotate conversational contents Context-dependent
to conflict with other domains. Both studies found that human-annotated knowledge significantly improves SQL generation performance for samples requiring external domain knowledge. Additionally, SQUALL [44] manually annotates alignments between the words in NL questions and the entities in SQL, providing finer-grained supervision than other datasets.
3) Context-dependent Dataset: SParC [43] and CoSQL [35] explore context-dependent SQL generation by constructing a conversational database querying system. Unlike traditional text-to-SQL datasets that have only a single question-SQL pair per example, SParC decomposes the question-SQL examples in the Spider dataset into multiple sub-question-SQL pairs to construct a simulated and meaningful interaction, including inter-related sub-questions that aid SQL generation and unrelated sub-questions that enhance data diversity. CoSQL, in comparison, involves conversational interactions in natural language, simulating real-world scenarios to increase complexity and diversity. Additionally, Spider-SS&CG [38] splits the NL questions in the Spider dataset [13] into multiple sub-questions and sub-SQLs, demonstrating that training on these sub-examples can improve a text-to-SQL system's generalization ability on out-of-distribution samples.
4) Robustness Dataset: Evaluating the accuracy of text-to-SQL systems with polluted or perturbed database contents (e.g., schema and tables) is crucial for assessing robustness. Spider-Realistic [41] removes explicit schema-related words from the NL questions, while Spider-SYN [40] replaces them with manually selected synonyms. ADVETA [37] introduces adversarial table perturbation (ATP), which perturbs tables by replacing original column names with misleading alternatives and inserting new columns with high semantic associations but low semantic equivalency. These perturbations lead to significant drops in accuracy, as a text-to-SQL system with low robustness may be misled by incorrect matches between tokens in NL questions and database entities.
5) Cross-lingual Dataset: SQL keywords, function names, table names, and column names are typically written in English, posing challenges for applications in other languages. CSpider [42] translates the Spider dataset into Chinese, identifying new challenges in word segmentation and cross-lingual matching between Chinese questions and English database contents. DuSQL [34] introduces a practical text-to-SQL dataset with Chinese questions and database contents provided in both English and Chinese.

B. Evaluation Metrics

We introduce four widely used evaluation metrics for the text-to-SQL task: Component Matching and Exact Matching, which are based on SQL content matching, and Execution Accuracy and Valid Efficiency Score, which are based on execution results.
1) Content Matching-based Metrics: SQL content matching metrics compare the predicted SQL query with the ground truth SQL query based on their structural and syntactic similarities.
• Component Matching (CM) [13] evaluates the performance of a text-to-SQL system by measuring the exact match between predicted and ground truth SQL components (SELECT, WHERE, GROUP BY, ORDER BY, and KEYWORDS) using the F1 score. Each component is decomposed into sets of sub-components and compared for an exact match, accounting for SQL components without order constraints.
• Exact Matching (EM) [13] measures the percentage of examples whose predicted SQL query is identical to the ground truth SQL query. A predicted SQL query is considered correct only if all of its components, as described in CM, match exactly with those of the ground truth query.
2) Execution-based Metrics: Execution result metrics assess the correctness of the generated SQL query by comparing the results obtained from executing the query on the target database with the expected results.
• Execution Accuracy (EX) [13] measures the correctness of a predicted SQL query by executing it in the corresponding database and comparing the executed results with the results obtained by the ground truth query.
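As a minimal sketch (not the official Spider or BIRD evaluation code), EX-style comparison can be approximated by executing both queries on the same SQLite database and comparing the resulting row multisets; the official implementations additionally handle ordering, value normalization, and other corner cases:

```python
import sqlite3
from collections import Counter

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True if predicted and gold queries yield the same multiset of rows."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        # An invalid or non-executable predicted query counts as incorrect.
        return False
    finally:
        conn.close()
    # Compare as multisets so row order does not affect the result.
    return Counter(pred_rows) == Counter(gold_rows)
```

Averaging this boolean over a test set gives an EX-style accuracy score.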
• Valid Efficiency Score (VES) [33] is defined to measure the efficiency of valid SQL queries. A valid SQL query is a predicted SQL query whose executed results exactly match the ground truth results. Specifically, VES evaluates both the efficiency and accuracy of predicted SQL queries. For a dataset with N examples, VES can be computed by:

VES = (1/N) Σ_{n=1}^{N} 1(V_n, V̂_n) · R(Y_n, Ŷ_n),   (1)

where Ŷ_n and V̂_n are the predicted SQL query and its executed results, and Y_n and V_n are the ground truth SQL query and its corresponding executed results, respectively. 1(V_n, V̂_n) is an indicator function, where:

1(V_n, V̂_n) = 1 if V_n = V̂_n, and 0 if V_n ≠ V̂_n.   (2)

Then, R(Y_n, Ŷ_n) = √(E(Y_n)/E(Ŷ_n)) denotes the relative execution efficiency of the predicted SQL query in comparison to the ground-truth query, where E(·) is the execution time of each SQL query in the database. The BIRD benchmark [33] ensures the stability of this metric by computing the average of R(Y_n, Ŷ_n) over 100 runs for each example.
Most of the recent LLM-based text-to-SQL studies focus on four datasets (Spider [13], Spider-Realistic [41], Spider-SYN [40], and BIRD [33]) and three evaluation metrics (EM, EX, and VES); we will therefore primarily focus on these in the following analysis.

IV. METHODS

The implementation of current LLM-based applications mostly relies on the in-context learning (ICL, i.e., prompt engineering) [86–88] and fine-tuning (FT) [89, 90] paradigms, since powerful proprietary models and well-architected open-source models are being released in large quantities [45, 85, 91–94]. LLM-based text-to-SQL systems follow these paradigms for implementation, and we discuss them accordingly in this survey.

A. In-context Learning

Through extensive and widely recognized research, prompt engineering has been proven to play a decisive role in the performance of LLMs [28, 95], and it also impacts SQL generation under different prompt styles [9, 46]. Developing text-to-SQL methods in the in-context learning (ICL) paradigm is therefore valuable for achieving promising improvements. The LLM-based text-to-SQL process that generates an executable SQL query Y can be formulated as:

Y = f(Q, S, I | θ),   (3)

where Q represents the user question and S is the database schema/content, which can be decomposed as S = ⟨C, T, K⟩, where C = {c_1, c_2, ...} and T = {t_1, t_2, ...} represent the collections of columns and tables, and K is potential external knowledge (e.g., foreign key relationships [49], schema linking [30], and domain knowledge [31, 33]). I represents the instruction for the text-to-SQL task, which provides indicative guidance to trigger the LLM to generate an accurate SQL query. f(· | θ) is an LLM with parameters θ. In the in-context learning (ICL) paradigm, we utilize an off-the-shelf text-to-SQL model (i.e., the parameters θ of the model are frozen) to generate the predicted SQL query. Various well-designed methods within the ICL paradigm have been adopted for LLM-based text-to-SQL tasks. We group them into five categories C0:4: C0-Trivial Prompt, C1-Decomposition, C2-Prompt Optimization, C3-Reasoning Enhancement, and C4-Execution Refinement; representative methods of each category are given in Tab. II.

TABLE II: Typical methods used for in-context learning (ICL) in LLM-based text-to-SQL. The full table of existing methods with categorization C1:4 and more details are listed in Tab. III.

Methods                    Adopted by       Applied LLMs
C0-Trivial Prompt          Zero-shot [7]    ChatGPT
                           Few-shot [9]     ChatGPT
C1-Decomposition           DIN-SQL [8]      GPT-4
C2-Prompt Optimization     DAIL-SQL [9]     GPT-4
C3-Reasoning Enhancement   ACT-SQL [49]     GPT-4
C4-Execution Refinement    LEVER [68]       Codex

C0-Trivial Prompt: Trained on massive data, LLMs have a strong overall proficiency in different downstream tasks with zero-shot and few-shot prompting [89, 96, 97], which is widely recognized and used in real-world applications. In our survey, we categorize prompting approaches without a well-designed framework as trivial prompts (vanilla prompt engineering). As introduced above, Eq. 3 formulates the process of LLM-based text-to-SQL and can also represent zero-shot prompting. The overall input P_0 can be obtained by concatenating I, S, and Q:

P_0 = I ⊕ S ⊕ Q.   (4)

To regulate the prompting process, the OpenAI demonstration² is set as the standard (trivial) prompt [30] for text-to-SQL.
Zero-shot: Many research works [7, 27, 46] utilize zero-shot prompting, mainly studying the influence of the prompt construction style and the zero-shot performance of various LLMs for text-to-SQL. As an empirical evaluation, [7] evaluates the baseline text-to-SQL capabilities of different early-developed LLMs [84, 98, 99] and the performance of different prompting styles. The results indicate that prompt design is critical for performance; through error analysis, [7] proposes that including more database content can harm overall accuracy. Since ChatGPT emerged with impressive capabilities in conversational scenarios and code generation [100], [27] assesses its text-to-SQL performance. Under zero-shot settings, the results demonstrate that ChatGPT achieves promising text-to-SQL performance compared to state-of-the-art PLM-based systems. For fair comparability, [47] reveals effective prompt construction for LLM-based text-to-SQL; they study different

² The prompt style that follows the official document from the OpenAI platform: https://platform.openai.com/examples/default-sql-translate
TABLE III: Well-designed methods used in the in-context learning (ICL) paradigm for LLM-based text-to-SQL, ordered by release time. The methods are grouped into four categories based on their implementation perspective: C1-Decomposition, C2-Prompt Optimization, C3-Reasoning Enhancement, C4-Execution Refinement. Methods belonging to multiple categories are introduced in each respective category. * There are multiple applied LLMs in the corresponding method; we present the selected LLM with representative performance. † The CoT method is reported in multiple venues: NeurIPS'23 [33], EMNLP'23 [51], VLDB'24 [9], arXiv'24 [32].
Methods Applied LLMs Benchmark Metrics C1 C2 C3 C4 Release Time Publication Venue
MBR-Exec [67] Codex [13] EX ✓ Apr-2022 EMNLP’22
Coder-Reviewer [56] Codex [13] EX ✓ ✓ Nov-2022 ICML’23
LEVER [68] Codex [13] EX ✓ Feb-2023 ICML’23
SELF-DEBUGGING [48] StarCoder* [13] EX ✓ Apr-2023 ICLR’24
DESEM+P [62] ChatGPT [13, 40] EX ✓ ✓ Apr-2023 PRICAI’23
DIN-SQL [8] GPT-4* [13, 33] EX, EM, VES ✓ ✓ Apr-2023 NeurIPS’23
CoT [9, 32, 33, 51] GPT-4 [13, 33, 41] EX, VES ✓ May-2023 Multiple Venues†
StructGPT [63] ChatGPT* [13, 40, 41] EX ✓ May-2023 EMNLP’23
SD+SA+Voting [52] ChatGPT* [13, 40, 41] EX ✓ ✓ May-2023 EMNLP’23 Findings
QDecomp [51] Codex [13, 41] EX ✓ ✓ May-2023 EMNLP’23
Least-to-Most [51] Codex [13] EX ✓ May-2023 EMNLP’23
SQL-PaLM [53] PaLM-2 [13] EX ✓ ✓ May-2023 arXiv’23
RAG+SP&DRC [64] ChatGPT [13] EX ✓ ✓ Jul-2023 ICONIP’23
C3 [30] ChatGPT [13] EX ✓ ✓ ✓ Jul-2023 arXiv’23
DAIL-SQL [9] GPT-4* [13, 33, 41] EX, EM, VES ✓ Aug-2023 VLDB’24
ODIS [54] Codex* [13] EX ✓ Oct-2023 EMNLP’23 Findings
ACT-SQL [49] GPT-4* [13, 40] EX, EM ✓ ✓ Oct-2023 EMNLP’23 Findings
MAC-SQL [57] GPT-4* [13, 33] EX, EM, VES ✓ ✓ Dec-2023 arXiv’23
DEA-SQL [58] GPT-4 [13] EX ✓ Feb-2024 ACL’24 Findings
FUSED [65] ChatGPT* [13] EX ✓ Feb-2024 arXiv’24
DELLM [31] GPT-4* [13, 33] EX, VES ✓ ✓ Feb-2024 ACL’24 Findings
SGU-SQL [32] GPT-4* [13, 33] EX, EM ✓ Feb-2024 arXiv’24
POT [55] GPT-4* [13, 33] EX ✓ Feb-2024 arXiv’24
SQL-CRAFT [55] GPT-4* [13, 33] EX ✓ ✓ Feb-2024 arXiv’24
FUXI [66] GPT-4* [33] EX ✓ ✓ Feb-2024 arXiv’24
MetaSQL [59] GPT-4* [13] EX, EM ✓ Feb-2024 ICDE’24
PET-SQL [60] GPT-4 [13] EX ✓ ✓ Mar-2024 arXiv’24
PURPLE [61] GPT-4* [13, 40, 41] EX, EM ✓ ✓ Mar-2024 ICDE’24
styles of prompt construction and draw conclusions for zero-shot prompt design based on the comparisons.

Primary keys and foreign keys carry the connective knowledge between different tables. [49] studies their impact by incorporating these keys into various prompt styles with different database content to analyze zero-shot prompting results. A benchmark evaluation [9] also studies the influence of foreign keys with five different prompt representation styles; each style can be considered as a permutation and combination of the instruction, rule implication, and foreign keys. Apart from foreign keys, this study also explores zero-shot prompting combined with the rule implication "no explanation" to collect concise outputs. Empowered by the annotated external knowledge of human experts, [33] follows standard prompting and achieves improvements by incorporating the provided annotated oracle knowledge.

With the explosion of open-source LLMs, similar evaluations show that these models are also capable of the zero-shot text-to-SQL task [45, 46, 50], especially code generation models [46, 48]. For zero-shot prompting optimization, [46] raises the challenge of designing effective prompt templates for LLMs: earlier prompt constructions lack structural uniformity, which makes it hard to identify which concrete element of a prompt template influences the performance of LLMs. They address this challenge by investigating a more unified series of prompt templates wrapped with different prefixes, infixes, and postfixes.

Few-shot: The technique of few-shot prompting is widely used in both practical applications and well-designed research, and it has proven efficient for eliciting better performance from LLMs [28, 101]. The overall input prompt of the few-shot approach to LLM-based text-to-SQL can be formulated as an extension of Eq. 3:

    Pn = {F1, F2, . . . , Fn} ⊕ P0,    (5)

where Pn represents the input prompt for n-shot learning, n is the number of provided instances (examples), and Fi = (Si, Qi, Yi) denotes the i-th few-shot instance. The study of few-shot prompting focuses on the number of demonstrations and the selection of few-shot instances.

As pilot experiments, few-shot prompting for text-to-SQL has been evaluated on multiple datasets with various LLMs [8, 32], achieving better performance than zero-shot approaches. [33] provides a 1-shot example combined with CoT [102] prompting; the decomposed reasoning steps of the given example trigger the text-to-SQL model to generate accurate SQL. [55] studies the effect of the number of few-shot examples. [52] focuses on sampling strategies by studying the similarity and diversity between different demonstrations, setting random sampling as the baseline and evaluating different strategies and their combinations for comparison. Furthermore, beyond similarity selection, [9] evaluated masked question similarity selection and the upper limit of similarity approaches with various numbers of few-shot examples.
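As a concrete illustration of the construction in Eq. 5, an n-shot prompt simply prepends n solved (schema, question, SQL) instances to the zero-shot prompt of the target question. The sketch below is a minimal assumption-laden example: the comment-style serialization and section labels are illustrative, not a prescribed template from any of the surveyed methods.

```python
def serialize_instance(schema, question, sql=None):
    """Render one few-shot instance F_i = (S_i, Q_i, Y_i); the SQL slot is
    left open for the final, to-be-answered question (the zero-shot part P_0)."""
    block = f"-- Schema:\n{schema}\n-- Question: {question}\n-- SQL:"
    return block + (f" {sql}" if sql is not None else "")

def build_few_shot_prompt(examples, schema, question):
    """P_n = {F_1, ..., F_n} (+) P_0: concatenate n solved instances in front
    of the zero-shot prompt for the actual question."""
    shots = [serialize_instance(s, q, y) for s, q, y in examples]
    return "\n\n".join(shots + [serialize_instance(schema, question)])

examples = [("CREATE TABLE league(id INT, name TEXT);",
             "How many leagues are there?",
             "SELECT COUNT(*) FROM league;")]
prompt = build_few_shot_prompt(examples,
                               "CREATE TABLE match(id INT, league_id INT);",
                               "How many matches were played in total?")
print(prompt)
```

The trailing open "-- SQL:" marker is where the LLM is expected to continue generation.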
A study of difficulty-level sample selection [51] compared the performance of few-shot Codex with random selection and difficulty-based selection of few-shot instances on a difficulty-categorized dataset [13, 41]. Three difficulty-based selection strategies are devised based on the number of selected samples at different difficulty levels. [49] utilizes a hybrid strategy for selecting samples, which combines static examples and similarity-based dynamic examples for few-shot prompting. In their evaluations, they also test the impact of different input schema styles and of different numbers of static and dynamic exemplars.

The impact of cross-domain few-shot examples has also been studied [54]. When incorporating in-domain and out-of-domain examples in different numbers, the in-domain demonstrations always outperform zero-shot and out-of-domain examples, and performance improves as the number of examples rises. To explore the detailed construction of the input prompt, [53] compares concise and verbose prompt design approaches. The former style separates the schema, the column names, and the primary and foreign keys by vertical bars, while the latter organizes them as natural language descriptions.

C1 - Decomposition: As an intuitive solution, decomposing a challenging user question into simpler sub-questions and using multi-step reasoning for implementation can reduce the complexity of the full text-to-SQL task [8, 51]. Dealing with less complexity, LLMs have the potential to perform better. The decomposition approaches for LLM-based text-to-SQL fall into two paradigms: (1) sub-task decomposition, which provides additional parsing to assist the final SQL generation by decomposing the overall text-to-SQL task into smaller, effective sub-tasks (e.g., schema linking, domain classification); (2) sub-question decomposition, which divides the user question into sub-questions to reduce its complexity and difficulty, then derives the final SQL query by generating and solving the corresponding sub-SQL.

DIN-SQL [8] proposed a decomposed in-context learning method consisting of four modules: schema linking, classification & decomposition, SQL generation, and self-correction. DIN-SQL first performs schema linking between the user question and the target database; the following module then decomposes the user question into correlated sub-questions and performs a difficulty classification. Based on the above information, the SQL generation module generates the corresponding SQL, and the self-correction module identifies and corrects potential errors in the predicted SQL. This approach considers the decomposition of both sub-tasks and sub-questions. The Coder-Reviewer framework [56] proposed a re-ranking method, combining Coder models for generation with Reviewer models that evaluate the likelihood of the instruction. Referring to Chain-of-Thought [102] and Least-to-Most prompting [103], QDecomp [51] introduces question decomposition prompting, which follows the question reduction stage in least-to-most prompting and instructs the LLM to decompose the original complex question into intermediate reasoning steps. C3 [30] consists of three key components: clear prompting, calibration bias prompting, and consistency; these components are accomplished by assigning ChatGPT different tasks. First, the clear prompting component generates the schema linking and the distilled question-relevant schema as a clear prompt. Then, a multi-turn dialogue about text-to-SQL hints is utilized as a calibration bias prompt, which is combined with the clear prompt to guide the SQL generation. The generated SQL queries are selected by consistency and execution-based voting to obtain the final SQL. MAC-SQL [57] presents a multi-agent collaboration framework; the text-to-SQL process is completed through the collaboration of three agents: the Selector, the Decomposer, and the Refiner. The Selector preserves the tables relevant to the user question; the Decomposer breaks down the user question into sub-questions and provides solutions; finally, the Refiner validates and refines the defective SQL. DEA-SQL [58] introduces a workflow paradigm aiming to enhance the attention and problem-solving scope of LLM-based text-to-SQL through decomposition. This method decomposes the overall task, providing the SQL generation module with the corresponding prerequisite (information determination, question classification) and subsequent (self-correction, active learning) sub-tasks. Through the workflow paradigm, an accurate SQL query is generated. SGU-SQL [32] is a structure-to-SQL framework that leverages inherent structure information to assist SQL generation. Specifically, the framework constructs graph structures for the user question and the corresponding database, respectively, then uses the encoded graphs to construct structure linking [104, 105]. A meta-operator decomposes the user question with a grammar tree, and the input prompt is finally designed with the meta-operations of SQL. MetaSQL [59] introduces a three-stage approach for SQL generation: decomposition, generation, and ranking. The decomposition stage uses semantic decomposition and metadata composition to process the user question. Taking the previously processed data as input, a text-to-SQL model uses metadata-conditioned generation to produce candidate SQL queries. Finally, a two-stage ranking pipeline is applied to obtain the globally optimal SQL query. PET-SQL [60] proposed a prompt-enhanced two-stage framework. First, an elaborated prompt instructs the LLM to generate a preliminary SQL query (PreSQL), where the few-shot demonstrations are selected based on similarity. Then, schema linking is derived from the PreSQL and combined into the prompt for the LLM to generate the final SQL query (FinSQL). Finally, multiple LLMs are utilized to generate FinSQLs, ensuring consistency based on the execution results.

C2 - Prompt Optimization: As previously introduced, few-shot learning is widely studied for prompting LLMs [84]. For LLM-based text-to-SQL with in-context learning, trivial few-shot approaches have obtained promising results [8, 9, 33], and further optimization of few-shot prompting has the potential to achieve higher performance. Since the performance of SQL generation with off-the-shelf LLMs largely depends on the quality of the corresponding input prompt [106], many decisive factors that influence prompt quality have become research focuses [9] (e.g., the quality and quantity of the few-shot organization, the similarity between the user question and the few-shot instances, and external knowledge/hints). Improving prompt quality amounts to prompt optimization, including few-shot sampling strategies, schema augmentation, and external knowledge generation.
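One family of the sampling strategies discussed above — picking the demonstrations closest to the user question in an embedding space — can be sketched in a few lines. The bag-of-words "embedding" below is a deliberately simple stand-in for the pre-trained sentence encoders these methods actually use; the function names and toy pool are illustrative assumptions.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in encoder: bag-of-words counts; real systems would use a
    # pre-trained sentence-embedding model here.
    return Counter(text.lower().strip("?."). split())

def distance(a, b):
    # Euclidean distance between two sparse count vectors.
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def select_demonstrations(question, pool, k):
    """Rank candidate (question, SQL) pairs by embedding distance to the
    user question and keep the k nearest as few-shot demonstrations."""
    q = embed(question)
    return sorted(pool, key=lambda ex: distance(q, embed(ex[0])))[:k]

pool = [
    ("How many teams are there?", "SELECT COUNT(*) FROM team;"),
    ("List the names of all leagues.", "SELECT name FROM league;"),
    ("How many players are there?", "SELECT COUNT(*) FROM player;"),
]
print(select_demonstrations("How many matches are there?", pool, 2))
```

Diversity-aware variants (e.g., the k-Means clustering of SD+SA+Voting) would additionally spread the k picks across clusters instead of taking the plain nearest neighbors.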
DESEM [62] is a prompt engineering framework with desemanticization and skeleton retrieval. The framework first employs a domain-specific word masking module to remove the semantic tokens in questions while preserving the questions' intentions, and then utilizes an adjustable prompting module that retrieves few-shot examples with identical question intentions, incorporating schema-relevance filtering to guide the LLM's SQL generation. The QDecomp [51] framework introduces the InterCOL mechanism to incrementally incorporate the decomposed sub-questions with the correlated table and column names. With difficulty-based selection, the few-shot examples for QDecomp are sampled by difficulty level. Besides similarity-diversity sampling, [52] proposed the SD+SA+Voting (Similarity-Diversity + Schema Augmentation + Voting) sampling strategy. They first employ semantic similarity and k-Means cluster diversity for sampling few-shot examples and then enhance the prompt with schema knowledge (semantic or structural augmentation). The C3 [30] framework comprises a clear prompting component, which takes the question and schema as the LLM's input and generates a clear prompt that includes a schema with the information irrelevant to the user question removed together with a schema linking, and also a calibration component providing hints. The LLM takes their composition as a context-augmented prompt for SQL generation. A retrieval-augmented framework is introduced with sample-aware prompting [64], which simplifies the original question, extracts the question skeleton from the simplified question, and then performs sample retrieval in the repository according to skeleton similarities. The retrieved samples are combined with the original question for few-shot prompting. ODIS [54] introduces sample selection with out-of-domain demonstrations and in-domain synthetic data, retrieving few-shot demonstrations from hybrid sources to augment the prompt representations. DAIL-SQL [9] proposed a novel approach to address the issues in few-shot sampling and organization, presenting a better balance between the quality and quantity of few-shot examples. DAIL Selection first masks domain-specific words in the user question and the few-shot example questions, then ranks the candidate examples based on the Euclidean distance of their embeddings. Meanwhile, the similarity between the pre-predicted SQL queries is calculated. Finally, the selection mechanism obtains the similarity-sorted candidates according to pre-set criteria. With this method, the few-shot examples are guaranteed good similarity with both the question and the SQL query. ACT-SQL [49] proposed dynamic examples in few-shot prompting, selected according to similarity scores. FUSED [65] is presented to build a high-diversity demonstration pool through human-free multiple-iteration synthesis. The FUSED pipeline samples the demonstrations to be fused by clustering, then fuses the sampled demonstrations to construct the pool, enhancing few-shot learning. The Knowledge-to-SQL [31] framework aims to build a Data Expert LLM (DELLM) to provide knowledge for SQL generation. The DELLM is trained by supervised fine-tuning using human expert annotations [33] and further refined by preference learning with the database's feedback. DELLM generates four categories of knowledge, and well-designed methods (e.g., DAIL-SQL [9], MAC-SQL [57]) incorporate the generated knowledge to achieve better performance for LLM-based text-to-SQL with in-context learning.

C3 - Reasoning Enhancement: LLMs have exhibited promising capabilities in tasks involving commonsense reasoning, symbolic reasoning, and arithmetic reasoning [107]. Since numeric and synonym reasoning frequently occur in realistic text-to-SQL scenarios [33, 41], prompting strategies for LLM reasoning have the potential to enhance SQL generation capabilities. Recent studies primarily focus on incorporating well-designed reasoning-enhancement methods for text-to-SQL adaptation, improving LLMs on complex questions that require multi-step reasoning and on the issue of self-consistency [108] in SQL generation.

The Chain-of-Thought (CoT) prompting technique [102] involves a comprehensive reasoning process that guides LLMs towards accurate deduction, eliciting reasoning in LLMs. Studies of LLM-based text-to-SQL utilize CoT prompting as a rule implication [9], setting the instruction "Let's think step by step" in the prompt construction [9, 32, 33, 51]. However, the straightforward (original) CoT strategy has not demonstrated the potential in text-to-SQL tasks that it has shown in other reasoning tasks; studying CoT adaptations is still ongoing research [51]. CoT prompting always uses static, human-annotated examples as demonstrations, which requires empirical judgment for the effective selection of few-shot examples and makes manual annotation essential. As a solution, ACT-SQL [49] proposed a method to generate CoT examples automatically. Specifically, given a question, ACT-SQL truncates a set of slices of the question and then enumerates every column appearing in the corresponding SQL query. Each column is linked with its most relevant slice through a similarity function and appended to the CoT prompt. Through a systematic study of enhancing LLM SQL generation with CoT prompting, QDecomp [51] presents a novel framework to address the challenge of how CoT should come up with the reasoning steps to predict the SQL query. The framework utilizes every slice of the SQL query to construct a logical step in CoT reasoning, then employs natural language templates to articulate each slice and arranges them in the logical execution order. Least-to-Most [103] is another prompting technique that decomposes questions into sub-questions and then solves them sequentially. As an iterative prompting method, pilot experiments [51] demonstrate that it may be unnecessary for text-to-SQL parsing: using detailed reasoning steps tends to cause more error propagation issues. As a variant of CoT, the Program-of-Thoughts (PoT) prompting strategy [109] was proposed to enhance arithmetic reasoning in LLMs. Through evaluation [55], PoT enhances LLM SQL generation, especially on complicated datasets [33]. SQL-CRAFT [55] was proposed to enhance LLM-based SQL generation by incorporating PoT prompting for Python-enhanced reasoning. The PoT strategy requires the model to generate Python code and SQL queries simultaneously, enforcing the model to incorporate Python code in its reasoning process. Self-Consistency [108] is a prompting strategy that improves reasoning in LLMs, leveraging the intuition that a complex reasoning problem typically admits multiple different ways of thinking, leading to its unique correct answer.
In the text-to-SQL task, self-consistency is adapted by sampling a set of different SQL queries and voting for the consistent SQL via execution feedback [30, 53]. Similarly, the SD+SA+Voting [52] framework eliminates the candidates with execution errors identified by the deterministic database management system (DBMS) and opts for the prediction that garners the majority vote. Furthermore, motivated by recent research on extending the capabilities of LLMs with tools, FUXI [66] was proposed to enhance LLM SQL generation through effectively invoking crafted tools.

C4 - Execution Refinement: In designing criteria for accurate SQL generation, whether the generated SQL can be successfully executed and elicit a correct answer to the user question is always the priority [13]. As a complex programming task, generating correct SQL in one go is challenging. Intuitively, considering the execution feedback/results during SQL generation assists alignment with the corresponding database environment, allowing the LLM to gather the potential execution errors and results to refine the generated SQL or to hold a majority vote [30]. Execution-aware methods in text-to-SQL incorporate the execution feedback in two main ways: 1) Incorporating the feedback through second-round prompting for regeneration: every SQL query generated in the initial response is executed in the corresponding database to obtain feedback. This feedback might be an error, or it might yield results, and it is appended to the second-round prompt; through in-context learning of this feedback, the LLM is able to refine or regenerate the original SQL, thereby enhancing accuracy. 2) Utilizing execution-based selection strategies for the generated SQL: multiple generated SQL queries are sampled from the LLM and each is executed in the database; based on the results of each SQL query, selection strategies (e.g., self-consistency, majority vote [60]) define the query from the SQL set that satisfies the criteria as the final predicted SQL.

MRC-EXEC [67] introduced a natural language to code (NL2Code) translation framework with execution, which executes each sampled SQL query and selects the example with the minimal execution-result-based Bayes risk [110]. LEVER [68] proposed an approach to verify NL2Code with execution, utilizing a generation module and an execution module to collect the sampled SQL set and their execution results, respectively, then using a learned verifier to output the probability of correctness. Similarly, the SELF-DEBUGGING [48] framework is presented to teach LLMs to debug their predicted SQL via few-shot demonstrations. The model is able to refine its mistakes by investigating the execution results and explaining the generated SQL in natural language without human intervention.

As previously introduced, to incorporate execution feedback into well-designed frameworks, a two-stage implementation is widely used: 1. sampling a set of SQL queries; 2. majority voting (self-consistency). Specifically, the C3 [30] framework removes the errors and identifies the most consistent SQL; the retrieval-augmented framework [64] introduced a dynamic revision chain, combining fine-grained execution messages with database content to prompt the LLM to convert the generated SQL query into a natural language explanation; the LLM is then requested to identify the semantic gaps and revise its own generated SQL. Although schema-filtering approaches enhance SQL generation, the generated SQL could be unexecutable. DESEM [62] incorporates a fallback revision to address this issue; it revises and regenerates the SQL based on different kinds of errors and sets termination criteria to avoid revision loops. DIN-SQL [8] designed generic and gentle prompts in its self-correction module; the generic prompt requests the LLM to identify and correct the error, and the gentle prompt asks the model to check for potential issues. The multi-agent framework MAC-SQL [57] comprises a Refiner agent, which is able to detect and automatically rectify SQL errors, taking the SQLite error and exception class to regenerate the fixed SQL. Since different questions may require different numbers of revisions, the SQL-CRAFT [55] framework introduced interactive correction with an automated control determination process to avoid over-correction or insufficient correction. FUXI [66] considers error feedback in tool-based reasoning for SQL generation. Knowledge-to-SQL [31] introduces a preference learning framework incorporating the database execution feedback via direct preference optimization [111] for refining the proposed DELLM. PET-SQL [60] proposed cross-consistency, which comprises two variants: 1) naive voting: instruct multiple LLMs to generate SQL queries, then use a majority vote over the different execution results to select the final SQL; 2) fine-grained voting: refine the naive voting by difficulty level to mitigate voting bias.

B. Fine-tuning

Since supervised fine-tuning (SFT) is the mainstream approach for LLM training [29], for open-source LLMs (e.g., LLaMA-2 [93], Gemma [112]) the most straightforward way to adapt a model to a specific domain quickly is to perform SFT on collected domain labels. The SFT phase is typically the preliminary phase of well-designed training frameworks [111, 113], and of text-to-SQL fine-tuning as well. The auto-regressive generation process of an SQL query Y can be formulated as follows:

    Pπ(Y | P) = ∏_{k=1}^{n} Pπ(yk | P, Y1:k−1),    (6)

where Y = {y1, y2, . . . , yn} is an SQL query of length n, yk is the k-th token of the SQL query, and Y1:k−1 is the prefix sequence of Y ahead of the token yk. Pπ(yk | ·) is the conditional probability of an LLM π generating the k-th token of Y based on the input prompt P and the prefix sequence.

Given a basic open-source model π0, the goal of SFT is to obtain a model πSFT by minimizing the cross-entropy loss:

    LSFT = − ∑_{k=1}^{n} log Pπ0(ŷk = yk | P, Y1:k−1),    (7)

where ŷk is the k-th token of the generated SQL query Ŷ, and Y is the corresponding ground-truth label.

The SFT approach has been widely adopted in text-to-SQL research for various open-source LLMs [9, 10, 46].
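Eqs. 6 and 7 can be checked numerically with toy per-token conditionals; the probabilities below are made up for illustration and do not come from any real model.

```python
import math

def sequence_probability(cond_probs):
    """Eq. 6: P(Y | P) is the product over tokens of P(y_k | P, Y_{1:k-1})."""
    prob = 1.0
    for p in cond_probs:
        prob *= p
    return prob

def sft_loss(cond_probs):
    """Eq. 7: the cross-entropy loss sums -log of the probability the model
    assigns to each ground-truth token, i.e. it equals -log P(Y | P)."""
    return -sum(math.log(p) for p in cond_probs)

# Toy conditionals for a 4-token gold query, e.g. "SELECT name FROM league".
probs = [0.9, 0.8, 0.95, 0.7]
print(sequence_probability(probs))  # ~ 0.4788
print(sft_loss(probs))              # ~ 0.736, i.e. -log(0.4788)
```

Minimizing the loss therefore maximizes the probability the model assigns to the ground-truth SQL sequence; in practice the per-token conditionals come from the LLM's softmax output and the sum is backpropagated through the model.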
TABLE IV: Well-designed methods used in fine-tuning (FT) for LLM-based text-to-SQL. The methods in each category are
ordered by release time. * The methods are utilized in multiple open-source LLMs; we select a representative model to present.
Category Adopted by Applied LLMs Dataset EX EM VES Release Time Publication Venue
Enhanced Architecture CLLMs [69] Deepseek* [13] ✓ Mar-2024 ICML’24
Pre-training CodeS [10] StarCoder [13, 33] ✓ ✓ Feb-2024 SIGMOD’24
DAIL-SQL [71] LLaMA* [13, 41] ✓ ✓ Aug-2023 VLDB’24
Symbol-LLM [50] CodeLLaMA [13] ✓ Nov-2023 ACL’24
Data Augmentation
CodeS [10] StarCoder [13, 33] ✓ ✓ Feb-2024 SIGMOD’24
StructLM [70] CodeLLaMA [13] ✓ Feb-2024 arXiv’24
Decomposition DTS-SQL [71] Mistral* [13, 40] ✓ ✓ Feb-2024 arXiv’24
Compared to in-context learning (ICL) approaches, fine-tuning paradigms for LLM-based text-to-SQL are more at a starting point. Currently, several studies exploring better fine-tuning methods have been released. We categorize the well-designed fine-tuning methods into different groups based on their mechanisms, as shown in Tab. IV.

Enhanced Architecture: The widely used generative pre-trained transformer (GPT) framework utilizes a decoder-only transformer architecture and conventional auto-regressive decoding for text generation. Recent studies on the efficiency of LLMs have revealed a common challenge: when generating long sequences with the auto-regressive paradigm, the need to incorporate the attention mechanism results in high latency [114, 115]. In LLM-based text-to-SQL, the speed of generating SQL queries is significantly slower than in traditional language modeling [21, 28], which has become a challenge in constructing high-efficiency local NLIDBs. As one of the solutions, CLLMs [69] are designed to address the above challenge with an enhanced model architecture, achieving a speedup for SQL generation.

Data Augmentation: During fine-tuning, the most straightforward factor affecting the model's performance is the quality of the training labels [116]. Fine-tuning under low-quality or scarce training labels is "making bricks without straw": using high-quality or augmented data always surpasses a meticulously designed fine-tuning method applied to low-quality or raw data [29, 117]. Data-augmented fine-tuning in text-to-SQL has made substantial progress, focusing on enhancing data quality during the SFT process. DAIL-SQL [9] is designed as an in-context learning framework utilizing a sampling strategy for better few-shot instances; incorporating the sampled instances into the SFT process improves the performance of open-source LLMs. Symbol-LLM [50] proposes an injection stage and an infusion stage for data-augmented instruction tuning. CodeS [10] augments the training data through bi-directional generation with the help of ChatGPT. StructLM [70] is trained on multiple structured-knowledge tasks to improve its overall capability.

Pre-training: Pre-training is a fundamental phase of the complete fine-tuning process, aimed at acquiring text generation capabilities through auto-regressive training on extensive data [118]. Conventionally, the current powerful proprietary LLMs (e.g., ChatGPT [119], GPT-4 [85], Claude [120]) are pre-trained on hybrid corpora and mostly benefit from dialogue scenarios that exhibit text generation capability [84]. Code-specific LLMs (e.g., CodeLLaMA [121], StarCoder [122]) are pre-trained on code data [99], and the mixture of various programming languages enables these LLMs to generate code that meets the user's instruction [123]. As a sub-task of code generation, the main challenge for SQL-specific pre-training techniques is that SQL/database-related content occupies only a small portion of the entire pre-training corpus. As a result, open-source LLMs with comparatively limited comprehensive capacity (compared to ChatGPT or GPT-4) do not acquire a promising understanding of how to convert NL questions to SQL during their pre-training. The pre-training phase of the CodeS [10] model consists of three stages of incremental pre-training. Starting from a basic code-specific LLM [122], CodeS is further pre-trained on a hybrid training corpus, including SQL-related data, NL-to-code data, and NL-related data. Its text-to-SQL understanding and performance are significantly improved.

Decomposition: Decomposing a task into multiple steps, or using multiple models to solve the task, is an intuitive solution for addressing a complex scenario, as previously introduced for the ICL paradigm in Sec. IV-A. The proprietary models utilized in ICL-based methods have massive numbers of parameters, not at the same parameter level as the open-source models used in fine-tuning methods; they inherently possess the capability to perform assigned sub-tasks well (through mechanisms such as few-shot learning) [30, 57]. Thus, to replicate the success of this paradigm in ICL methods, it is necessary to reasonably assign the corresponding sub-tasks to open-source models (such as generating external knowledge, schema linking, and distilling the schema) for sub-task-specific fine-tuning, and to construct the corresponding fine-tuning data, thereby assisting the final SQL generation. DTS-SQL [71] proposed a two-stage decomposed text-to-SQL fine-tuning framework and designed a schema-linking pre-generation task ahead of the final SQL generation.

V. EXPECTATIONS

Despite the significant advancements made in text-to-SQL research, there are still several challenges that need to be addressed. In this section, we discuss the remaining challenges that we expect to be overcome in future work.

A. Robustness in Real-world Applications

Text-to-SQL implemented by LLMs is expected to generalize and be robust across complex scenarios in real-world applications.
Despite recent advances having made substantial progress on robustness-specific datasets [37, 41], performance still falls short of practical application [33], and there are challenges that are expected to be overcome in future studies. On the user side, users are not always clear question proposers: the user questions might not contain the exact database values, can deviate from the standard datasets, and may include synonyms, typos, and vague expressions [40]. For instance, in the fine-tuning paradigm, models are trained on clearly indicative questions with concrete expressions. Since the model has not learned the mapping of realistic questions to the corresponding databases, this leads to a knowledge gap when applied to real-world scenarios [33]. As reported in the corresponding evaluations on datasets with synonyms and incomplete instructions [7, 51], around 40% of the SQL queries generated by ChatGPT execute incorrectly, 10% lower than on the original evaluation [51]. Simultaneously, fine-tuning with local text-to-SQL datasets may involve non-standardized samples and labels. For example, the name of a table or column is not always an accurate representation of its content, which yields an inconsistency in the training data construction and may lead to a semantic gap between the database schema and the user question. To address this challenge, aligning LLMs against intention bias and designing training strategies for noisy scenarios will benefit recent advances. At the same time, the data size in real-world applications is relatively smaller than in research-oriented benchmarks. Since extending a large amount of data by human annotation incurs high labor costs, designing data-augmentation methods to obtain more question-SQL pairs will support the LLM under data scarcity. Also, studying the adaptation of fine-tuned open-source LLMs to local, small-size datasets can be potentially beneficial. Furthermore, extensions to multi-lingual [42, 124] and multi-modal scenarios [125] should be studied comprehensively in future research, which will benefit more language groups and help build more general database interfaces.

B. Computational Efficiency

Computational efficiency is determined by the inference speed and the cost of computational resources, which is worth considering in both applications and research work [49, 69]. With the increasing complexity of databases in up-to-date text-to-SQL benchmarks [15, 33], databases will carry more information (including more tables and columns), and the token length of the database schema will correspondingly increase, raising a series of challenges. When dealing with an ultra-complex database, taking the corresponding schema as input may mean that the cost of calling proprietary LLMs significantly increases, and the input potentially exceeds the model's maximum token length, especially when implementing open-source models that have shorter context lengths. Meanwhile, another obvious challenge is that most works use the full schema as model input, which introduces significant redundancy [57]. Providing the LLM with a precise, question-related filtered schema directly from the user end to reduce cost and redundancy is a potential solution to improve computational efficiency [30]; designing an accurate method for schema filtering remains a future direction. Although the in-context learning paradigm achieves promising accuracy, there is a computational efficiency concern: well-designed methods with multi-stage frameworks or extended contexts increase the number of API calls to enhance performance, which has simultaneously led to a substantial rise in costs [8]. As reported in related approaches [49], the trade-off between performance and computational efficiency should be considered carefully, and designing a comparable (or even better) in-context learning method with less API cost would be a practical implementation; this is still under exploration. Compared to PLM-based methods, the inference speed of LLM-based methods is observably slower [21, 28]. Accelerating inference by shortening the input length and reducing the number of stages in the implementation would be intuitive for the in-context learning paradigm. For local LLMs, starting from [69], more speedup strategies for enhancing the model's architecture can be studied in future exploration.

C. Data Privacy and Interpretability

As a part of the study of LLMs, LLM-based text-to-SQL also faces some general challenges present in LLM research [4, 126, 127]. Potential improvements from the text-to-SQL perspective are also expected for these challenges, thereby extensively benefiting the study of LLMs. As previously discussed in Sec. IV-A, the in-context learning paradigm predominates, in both number and performance, in recent studies, with the majority of work using proprietary models for implementation [8, 9]. A straightforward challenge arises regarding data privacy, as calling proprietary APIs to handle local databases with confidentiality requirements can pose a risk of data leakage. Using a local fine-tuning paradigm can partially address this issue. Still, the current performance of vanilla fine-tuning is not ideal [9], and advanced fine-tuning frameworks potentially rely on proprietary LLMs for data augmentation [10]. Based on the current status, more tailored frameworks in the local fine-tuning paradigm for text-to-SQL deserve widespread attention. Overall, the development of deep learning continually faces challenges regarding interpretability [127, 128]. As a long-standing challenge, considerable work has already been done to address this issue [129, 130]. However, in text-to-SQL research, the interpretability of LLM-based implementations is still not widely discussed, whether in the in-context learning or the fine-tuning paradigm. Approaches with a decomposition phase explain the text-to-SQL implementation process from the perspective of step-by-step generation [8, 51]. Building on this, combining advanced studies of interpretability [131, 132] to enhance text-to-SQL performance, and interpreting local model architectures from the database knowledge aspect, remain future directions.

D. Extensions

As a sub-field of LLM and natural language understanding research, many studies in these fields have been adopted for text-to-SQL tasks, advancing its development [102, 108].
However, text-to-SQL research can in turn be extended to larger-scope studies in these fields. For instance, SQL generation is a part of code generation. Well-designed code-generation approaches also obtain promising performance in text-to-SQL [48, 68], generalizing across various programming languages. The extension of tailored text-to-SQL frameworks to NL-to-code studies can likewise be discussed; for instance, frameworks that integrate execution output in NL-to-code can also achieve solid performance in SQL generation [8], and extending execution-aware text-to-SQL approaches with other advanced code-generation modules [30, 31] is worth exploring. From another perspective, we previously discussed that text-to-SQL can enhance LLM-based question answering (QA) by providing factual information. A database can store relational knowledge as structured information, and structure-based QA can potentially benefit from text-to-SQL (e.g., knowledge-based question answering, KBQA [133]). Constructing factual knowledge with database structure and then incorporating a text-to-SQL system for information retrieval can potentially assist downstream QA with more accurate factual knowledge [134]. More extensions of text-to-SQL studies are expected in future work.
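The schema-redundancy challenge raised under Computational Efficiency above can be illustrated with a minimal sketch: rank tables by lexical overlap between the question and the table/column names, and keep only the top candidates before building the prompt. This is only a toy heuristic under invented names (`filter_schema` and the example schema are not from any cited system); real pipelines use trained schema-linking models instead.

```python
import re


def filter_schema(question: str, schema: dict[str, list[str]],
                  top_k: int = 2) -> dict[str, list[str]]:
    """Keep the top_k tables whose names/columns lexically overlap the question.

    A deliberately simple stand-in for learned schema linking: it tokenizes
    the question and each table's identifiers, then ranks tables by the size
    of the token intersection.
    """
    q_tokens = set(re.findall(r"[a-z0-9]+", question.lower()))

    def score(table: str, columns: list[str]) -> int:
        words = set(re.findall(r"[a-z0-9]+", table.lower()))
        for col in columns:
            words |= set(re.findall(r"[a-z0-9]+", col.lower()))
        return len(words & q_tokens)

    ranked = sorted(schema.items(), key=lambda kv: score(*kv), reverse=True)
    return dict(ranked[:top_k])


# Toy schema, loosely modeled on the soccer example from the introduction.
schema = {
    "league": ["league_id", "league_name", "country"],
    "match": ["match_id", "league_id", "season", "home_goals"],
    "player": ["player_id", "player_name", "height"],
}
print(filter_schema("How many matches were played in each league?", schema))
```

Serializing only the returned tables into the prompt shrinks the input token count, which matters both for proprietary-API cost and for open-source models with short context windows.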
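As a concrete illustration of the execution-aware direction discussed above, the sketch below runs several candidate SQL queries against an in-memory SQLite database and keeps the first one that executes without error. This is a heavily simplified stand-in for execution-based verification and reranking (e.g., in the spirit of [8, 68], not their actual implementations); the candidate queries and the table are invented for the example.

```python
import sqlite3


def pick_executable(candidates: list[str], conn: sqlite3.Connection):
    """Return the first candidate query that executes successfully,
    together with its result rows.

    A minimal form of execution-guided selection; real rerankers also
    score the returned results, not just executability.
    """
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows
        except sqlite3.Error:
            continue  # syntax or schema errors disqualify this candidate
    return None, None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE league (league_id INTEGER, league_name TEXT)")
conn.executemany("INSERT INTO league VALUES (?, ?)",
                 [(1, "Premier League"), (2, "La Liga")])

# Hypothetical LLM outputs: the first references a non-existent table.
candidates = [
    "SELECT name FROM leagues",
    "SELECT league_name FROM league ORDER BY league_id",
]
sql, rows = pick_executable(candidates, conn)
print(sql, rows)
```

The same executed rows could then be handed to a downstream QA component, matching the retrieval-for-QA idea sketched above.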
REFERENCES

[1] L. Wang, B. Qin, B. Hui, B. Li, M. Yang, B. Wang, B. Li, J. Sun, F. Huang, L. Si, and Y. Li, “Proton: Probing schema linking information from pre-trained language models for text-to-sql parsing,” in Conference on Knowledge Discovery and Data Mining (KDD), 2022.
[2] B. Qin, B. Hui, L. Wang, M. Yang, J. Li, B. Li, R. Geng, R. Cao, J. Sun, L. Si et al., “A survey on text-to-sql parsing: Concepts, methods, and future directions,” arXiv preprint arXiv:2208.13629, 2022.
[3] S. Xu, S. Semnani, G. Campagna, and M. Lam, “Autoqa: From databases to qa semantic parsers with only synthetic training data,” in Empirical Methods in Natural Language Processing (EMNLP), 2020.
[4] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., “Siren’s song in the ai ocean: a survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023.
[5] P. Manakul, A. Liusie, and M. J. Gales, “Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models,” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[6] S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” in Association for Computational Linguistics (ACL), 2021.
[7] N. Rajkumar, R. Li, and D. Bahdanau, “Evaluating the text-to-sql capabilities of large language models,” arXiv preprint arXiv:2204.00498, 2022.
[8] M. Pourreza and D. Rafiei, “DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
[9] D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, and J. Zhou, “Text-to-sql empowered by large language models: A benchmark evaluation,” in International Conference on Very Large Data Bases (VLDB), 2024.
[10] H. Li, J. Zhang, H. Liu, J. Fan, X. Zhang, J. Zhu, R. Wei, H. Pan, C. Li, and H. Chen, “Codes: Towards building open-source language models for text-to-sql,” arXiv preprint arXiv:2402.16347, 2024.
[11] F. Li and H. V. Jagadish, “Constructing an interactive natural language interface for relational databases,” in International Conference on Very Large Data Bases (VLDB), 2014.
[12] T. Mahmud, K. A. Hasan, M. Ahmed, and T. H. C. Chak, “A rule based approach for nlp based query processing,” in International Conference on Electrical Information and Communication Technologies (EICT), 2015.
[13] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. Radev, “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task,” in Empirical Methods in Natural Language Processing (EMNLP), 2018.
[14] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Generating structured queries from natural language using reinforcement learning,” arXiv preprint arXiv:1709.00103, 2017.
[15] M. Pourreza and D. Rafiei, “Evaluating cross-domain text-to-SQL models and benchmarks,” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[16] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2014.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
[18] B. Hui, X. Shi, R. Geng, B. Li, Y. Li, J. Sun, and X. Zhu, “Improving text-to-sql with schema dependency learning,” arXiv preprint arXiv:2103.04399, 2021.
[19] D. Choi, M. C. Shin, E. Kim, and D. R. Shin, “Ryansql: Recursively applying sketch-based slot fillings for complex text-to-sql in cross-domain databases,” in International Conference on Computational Linguistics (COLING), 2021.
[20] P. Yin, G. Neubig, W.-t. Yih, and S. Riedel, “Tabert: Pretraining for joint understanding of textual and tabular data,” arXiv preprint arXiv:2005.08314, 2020.
[21] H. Li, J. Zhang, C. Li, and H. Chen, “Resdsql: Decoupling schema linking and skeleton parsing for text-to-sql,” in Conference on Artificial Intelligence (AAAI), 2023.
[22] J. Li, B. Hui, R. Cheng, B. Qin, C. Ma, N. Huo, F. Huang, W. Du, L. Si, and Y. Li, “Graphix-t5: Mixing pre-trained transformers with graph-aware layers for text-to-sql parsing,” in Conference on Artificial Intelligence (AAAI), 2023.
[23] D. Rai, B. Wang, Y. Zhou, and Z. Yao, “Improving generalization in language model-based text-to-sql semantic parsing: Two simple semantic boundary-based techniques,” in Association for Computational
Linguistics (ACL), 2023.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.
[25] Q. Lyu, K. Chakrabarti, S. Hathi, S. Kundu, J. Zhang, and Z. Chen, “Hybrid ranking network for text-to-sql,” arXiv preprint arXiv:2008.04759, 2020.
[26] T. Yu, C.-S. Wu, X. V. Lin, B. Wang, Y. C. Tan, X. Yang, D. Radev, R. Socher, and C. Xiong, “Grappa: Grammar-augmented pre-training for table semantic parsing,” in International Conference on Learning Representations (ICLR), 2021.
[27] A. Liu, X. Hu, L. Wen, and P. S. Yu, “A comprehensive evaluation of chatgpt’s zero-shot text-to-sql capability,” arXiv preprint arXiv:2303.13547, 2023.
[28] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing the power of llms in practice: A survey on chatgpt and beyond,” ACM Transactions on Knowledge Discovery from Data (TKDD), 2024.
[29] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
[30] X. Dong, C. Zhang, Y. Ge, Y. Mao, Y. Gao, J. Lin, D. Lou et al., “C3: Zero-shot text-to-sql with chatgpt,” arXiv preprint arXiv:2307.07306, 2023.
[31] Z. Hong, Z. Yuan, H. Chen, Q. Zhang, F. Huang, and X. Huang, “Knowledge-to-sql: Enhancing sql generation with data expert llm,” arXiv preprint arXiv:2402.11517, 2024.
[32] Q. Zhang, J. Dong, H. Chen, W. Li, F. Huang, and X. Huang, “Structure guided large language model for sql generation,” arXiv preprint arXiv:2402.13284, 2024.
[33] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. Chang, F. Huang, R. Cheng, and Y. Li, “Can LLM already serve as a database interface? a BIg bench for large-scale database grounded text-to-SQLs,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
[34] L. Wang, A. Zhang, K. Wu, K. Sun, Z. Li, H. Wu, M. Zhang, and H. Wang, “DuSQL: A large-scale and pragmatic Chinese text-to-SQL dataset,” in Empirical Methods in Natural Language Processing (EMNLP), 2020.
[35] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Li, Y. Jiang, M. Yasunaga, S. Shim, T. Chen, A. Fabbri, Z. Li, L. Chen, Y. Zhang, S. Dixit, V. Zhang, C. Xiong, R. Socher, W. Lasecki, and D. Radev, “CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases,” in Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[36] C.-H. Lee, O. Polozov, and M. Richardson, “KaggleDBQA: Realistic evaluation of text-to-SQL parsers,” in Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021.
[37] X. Pi, B. Wang, Y. Gao, J. Guo, Z. Li, and J.-G. Lou, “Towards robustness of text-to-SQL models against natural and realistic adversarial table perturbation,” in Association for Computational Linguistics (ACL), 2022.
[38] Y. Gan, X. Chen, Q. Huang, and M. Purver, “Measuring and improving compositional generalization in text-to-SQL via component alignment,” in Findings of North American Chapter of the Association for Computational Linguistics (NAACL), 2022.
[39] Y. Gan, X. Chen, and M. Purver, “Exploring underexplored limitations of cross-domain text-to-SQL generalization,” in Empirical Methods in Natural Language Processing (EMNLP), 2021.
[40] Y. Gan, X. Chen, Q. Huang, M. Purver, J. R. Woodward, J. Xie, and P. Huang, “Towards robustness of text-to-SQL models against synonym substitution,” in Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021.
[41] X. Deng, A. H. Awadallah, C. Meek, O. Polozov, H. Sun, and M. Richardson, “Structure-grounded pretraining for text-to-SQL,” in North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021.
[42] Q. Min, Y. Shi, and Y. Zhang, “A pilot study for Chinese SQL semantic parsing,” in Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[43] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, and D. Radev, “SParC: Cross-domain semantic parsing in context,” in Association for Computational Linguistics (ACL), 2019.
[44] T. Shi, C. Zhao, J. Boyd-Graber, H. Daumé III, and L. Lee, “On the potential of lexico-logical alignments for semantic parsing to SQL queries,” in Findings of Empirical Methods in Natural Language Processing (EMNLP), 2020.
[45] S. Xue, C. Jiang, W. Shi, F. Cheng, K. Chen, H. Yang, Z. Zhang, J. He, H. Zhang, G. Wei, W. Zhao, F. Zhou, D. Qi, H. Yi, S. Liu, and F. Chen, “Db-gpt: Empowering database interactions with private large language models,” arXiv preprint arXiv:2312.17449, 2024.
[46] B. Zhang, Y. Ye, G. Du, X. Hu, Z. Li, S. Yang, C. H. Liu, R. Zhao, Z. Li, and H. Mao, “Benchmarking the text-to-sql capability of large language models: A comprehensive evaluation,” arXiv preprint arXiv:2403.02951, 2024.
[47] S. Chang and E. Fosler-Lussier, “How to prompt LLMs for text-to-SQL: A study in zero-shot, single-domain, and cross-domain settings,” in NeurIPS 2023 Second Table
Representation Learning Workshop (NeurIPS), 2023.
[48] X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large language models to self-debug,” in International Conference on Learning Representations (ICLR), 2024.
[49] H. Zhang, R. Cao, L. Chen, H. Xu, and K. Yu, “ACT-SQL: In-context learning for text-to-SQL with automatically-generated chain-of-thought,” in Findings of Empirical Methods in Natural Language Processing (EMNLP), 2023.
[50] F. Xu, Z. Wu, Q. Sun, S. Ren, F. Yuan, S. Yuan, Q. Lin, Y. Qiao, and J. Liu, “Symbol-llm: Towards foundational symbol-centric interface for large language models,” arXiv preprint arXiv:2311.09278, 2024.
[51] C.-Y. Tai, Z. Chen, T. Zhang, X. Deng, and H. Sun, “Exploring chain of thought style prompting for text-to-SQL,” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[52] L. Nan, Y. Zhao, W. Zou, N. Ri, J. Tae, E. Zhang, A. Cohan, and D. Radev, “Enhancing text-to-SQL capabilities of large language models: A study on prompt design strategies,” in Findings of Empirical Methods in Natural Language Processing (EMNLP), 2023.
[53] R. Sun, S. O. Arik, H. Nakhost, H. Dai, R. Sinha, P. Yin, and T. Pfister, “Sql-palm: Improved large language model adaptation for text-to-sql,” arXiv preprint arXiv:2306.00739, 2023.
[54] S. Chang and E. Fosler-Lussier, “Selective demonstrations for cross-domain text-to-SQL,” in Findings of Empirical Methods in Natural Language Processing (EMNLP), 2023.
[55] H. Xia, F. Jiang, N. Deng, C. Wang, G. Zhao, R. Mihalcea, and Y. Zhang, “Sql-craft: Text-to-sql through interactive refinement and enhanced reasoning,” arXiv preprint arXiv:2402.14851, 2024.
[56] T. Zhang, T. Yu, T. B. Hashimoto, M. Lewis, W.-t. Yih, D. Fried, and S. I. Wang, “Coder reviewer reranking for code generation,” in International Conference on Machine Learning (ICML), 2023.
[57] B. Wang, C. Ren, J. Yang, X. Liang, J. Bai, L. Chai, Z. Yan, Q.-W. Zhang, D. Yin, X. Sun, and Z. Li, “Mac-sql: A multi-agent collaborative framework for text-to-sql,” arXiv preprint arXiv:2312.11242, 2024.
[58] Y. Xie, X. Jin, T. Xie, M. Lin, L. Chen, C. Yu, L. Cheng, C. Zhuo, B. Hu, and Z. Li, “Decomposition for enhancing attention: Improving llm-based text-to-sql through workflow paradigm,” arXiv preprint arXiv:2402.10671, 2024.
[59] Y. Fan, Z. He, T. Ren, C. Huang, Y. Jing, K. Zhang, and X. S. Wang, “Metasql: A generate-then-rank framework for natural language to sql translation,” arXiv preprint arXiv:2402.17144, 2024.
[60] Z. Li, X. Wang, J. Zhao, S. Yang, G. Du, X. Hu, B. Zhang, Y. Ye, Z. Li, R. Zhao, and H. Mao, “Pet-sql: A prompt-enhanced two-stage text-to-sql framework with cross-consistency,” arXiv preprint arXiv:2403.09732, 2024.
[61] T. Ren, Y. Fan, Z. He, R. Huang, J. Dai, C. Huang, Y. Jing, K. Zhang, Y. Yang, and X. S. Wang, “Purple: Making a large language model a better sql writer,” arXiv preprint arXiv:2403.20014, 2024.
[62] C. Guo, Z. Tian, J. Tang, P. Wang, Z. Wen, K. Yang, and T. Wang, “Prompting gpt-3.5 for text-to-sql with de-semanticization and skeleton retrieval,” in Pacific Rim International Conference on Artificial Intelligence (PRICAI), 2024.
[63] J. Jiang, K. Zhou, Z. Dong, K. Ye, X. Zhao, and J.-R. Wen, “StructGPT: A general framework for large language model to reason over structured data,” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[64] C. Guo, Z. Tian, J. Tang, S. Li, Z. Wen, K. Wang, and T. Wang, “Retrieval-augmented gpt-3.5-based text-to-sql framework with sample-aware prompting and dynamic revision chain,” in International Conference on Neural Information Processing (ICONIP), 2024.
[65] D. Wang, L. Dou, X. Zhang, Q. Zhu, and W. Che, “Improving demonstration diversity by human-free fusing for text-to-sql,” arXiv preprint arXiv:2402.10663, 2024.
[66] Y. Gu, Y. Shu, H. Yu, X. Liu, Y. Dong, J. Tang, J. Srinivasa, H. Latapie, and Y. Su, “Middleware for llms: Tools are instrumental for language agents in complex environments,” arXiv preprint arXiv:2402.14672, 2024.
[67] F. Shi, D. Fried, M. Ghazvininejad, L. Zettlemoyer, and S. I. Wang, “Natural language to code translation with execution,” in Empirical Methods in Natural Language Processing (EMNLP), 2022.
[68] A. Ni, S. Iyer, D. Radev, V. Stoyanov, W.-t. Yih, S. I. Wang, and X. V. Lin, “Lever: Learning to verify language-to-code generation with execution,” in International Conference on Machine Learning (ICML), 2023.
[69] S. Kou, L. Hu, Z. He, Z. Deng, and H. Zhang, “Cllms: Consistency large language models,” arXiv preprint arXiv:2403.00835, 2024.
[70] A. Zhuang, G. Zhang, T. Zheng, X. Du, J. Wang, W. Ren, S. W. Huang, J. Fu, X. Yue, and W. Chen, “Structlm: Towards building generalist models for structured knowledge grounding,” arXiv preprint arXiv:2402.16671, 2024.
[71] M. Pourreza and D. Rafiei, “Dts-sql: Decomposed text-to-sql with small large language models,” arXiv preprint arXiv:2402.01117, 2024.
[72] D. Xu, W. Chen, W. Peng, C. Zhang, T. Xu, X. Zhao, X. Wu, Y. Zheng, and E. Chen, “Large language models for generative information extraction: A survey,” arXiv preprint arXiv:2312.17617, 2023.
[73] G. Katsogiannis-Meimarakis and G. Koutrika, “A survey on deep learning approaches for text-to-sql,” The VLDB Journal, 2023.
[74] P. Ma and S. Wang, “Mt-teql: evaluating and augmenting neural nlidb on real-world linguistic and schema variations,” in International Conference on Very Large Data Bases (VLDB), 2021.
[75] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ questions for machine comprehension of text,” in Empirical Methods in Natural Language Processing (EMNLP), 2016.
[76] P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for SQuAD,” in Association for Computational Linguistics (ACL), 2018.
[77] H. Yang, Y. Zhang, J. Xu, H. Lu, P.-A. Heng, and W. Lam, “Unveiling the generalization power of fine-tuned large language models,” in North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2024.
[78] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997.
[79] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, and D. Zhang, “Towards complex text-to-sql in cross-domain database with intermediate representation,” arXiv preprint arXiv:1905.08205, 2019.
[80] X. Xu, C. Liu, and D. Song, “Sqlnet: Generating structured queries from natural language without reinforcement learning,” arXiv preprint arXiv:1711.04436, 2017.
[81] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[82] L. Dou, Y. Gao, X. Liu, M. Pan, D. Wang, W. Che, D. Zhan, M.-Y. Kan, and J.-G. Lou, “Towards knowledge-intensive text-to-SQL semantic parsing with formulaic knowledge,” in Empirical Methods in Natural Language Processing (EMNLP), 2022.
[83] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” OpenAI blog, 2018.
[84] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
[85] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[86] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, “A systematic survey of prompt engineering in large language models: Techniques and applications,” arXiv preprint arXiv:2402.07927, 2024.
[87] J. Wang, E. Shi, S. Yu, Z. Wu, C. Ma, H. Dai, Q. Yang, Y. Kang, J. Wu, H. Hu et al., “Prompt engineering for healthcare: Methodologies and applications,” arXiv preprint arXiv:2304.14670, 2023.
[88] B. Chen, Z. Zhang, N. Langrené, and S. Zhu, “Unleashing the potential of prompt engineering in large language models: a comprehensive review,” arXiv preprint arXiv:2310.14735, 2023.
[89] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021.
[90] Y. Zheng, R. Zhang, J. Zhang, Y. Ye, and Z. Luo, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” arXiv preprint arXiv:2403.13372, 2024.
[91] T. Wang, H. Lin, X. Han, L. Sun, X. Chen, H. Wang, and Z. Zeng, “Dbcopilot: Scaling natural language querying to massive databases,” arXiv preprint arXiv:2312.03463, 2023.
[92] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[93] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[94] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[95] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, 2019.
[96] L. Reynolds and K. McDonell, “Prompt programming for large language models: Beyond the few-shot paradigm,” in Conference on Human Factors in Computing Systems (CHI), 2021.
[97] X. Ye and G. Durrett, “The unreliability of explanations in few-shot prompting for textual reasoning,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
[98] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research (JMLR), 2020.
[99] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[100] P. P. Ray, “Chatgpt: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope,” Internet of Things and Cyber-Physical Systems, 2023.
[101] J. Zamfirescu-Pereira, R. Y. Wong, B. Hartmann, and Q. Yang, “Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts,” in Conference on Human Factors in Computing Systems (CHI), 2023.
[102] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
[103] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le et al., “Least-to-most prompting enables complex reasoning in large language models,” arXiv preprint arXiv:2205.10625, 2022.
[130] K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau, “Mass-editing memory in a transformer,” in International Conference on Learning Representations (ICLR), 2023.
[131] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, “Locating and editing factual associations in gpt,” Advances in Neural Information Processing Systems (NeurIPS), 2022.
[132] C. Zheng, L. Li, Q. Dong, Y. Fan, Z. Wu, J. Xu, and B. Chang, “Can we edit factual knowledge by in-context learning?” in Empirical Methods in Natural Language Processing (EMNLP), 2023.
[133] H. Luo, Z. Tang, S. Peng, Y. Guo, W. Zhang, C. Ma, G. Dong, M. Song, W. Lin et al., “Chatkbqa: A generate-then-retrieve framework for knowledge base question answering with fine-tuned large language models,” arXiv preprint arXiv:2310.08975, 2023.
[134] G. Xiong, J. Bao, and W. Zhao, “Interactive-kbqa: Multi-turn interactions for knowledge base question answering with large language models,” arXiv preprint arXiv:2402.15131, 2024.
[135] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
[136] Z. Hong and J. Liu, “Towards better question generation in qa-based event extraction,” arXiv preprint arXiv:2405.10517, 2024.
[137] Y. Liu, H. He, T. Han, X. Zhang, M. Liu, J. Tian, Y. Zhang, J. Wang, X. Gao, T. Zhong et al., “Understanding llms: A comprehensive overview from training to inference,” arXiv preprint arXiv:2401.02038, 2024.
[138] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, Y. Xue, J. Zhai, W. Chen, Z. Liu, P. Zhang, Y. Dong, and J. Tang, “GLM-130b: An open bilingual pre-trained model,” in International Conference on Learning Representations (ICLR), 2023.
[139] Q. Zhang, J. Dong, Q. Tan, and X. Huang, “Integrating entity attributes for error-aware knowledge graph embedding,” IEEE Transactions on Knowledge and Data Engineering (TKDE), 2024.
[140] Q. Zhang, J. Dong, H. Chen, X. Huang, D. Zha, and Z. Yu, “Knowgpt: Black-box knowledge injection for large language models,” arXiv preprint arXiv:2312.06185, 2023.
[141] F. Huang, Z. Yang, J. Jiang, Y. Bei, Y. Zhang, and H. Chen, “Large language model interaction simulator for cold-start item recommendation,” arXiv preprint arXiv:2402.09176, 2024.
[142] Y. Bei, H. Xu, S. Zhou, H. Chi, M. Zhang, Z. Li, and J. Bu, “Cpdg: A contrastive pre-training method for dynamic graph neural networks,” arXiv preprint arXiv:2307.02813, 2023.
[143] Y. Bei, H. Chen, S. Chen, X. Huang, S. Zhou, and F. Huang, “Non-recursive cluster-scale graph interacted model for click-through rate prediction,” in International Conference on Information and Knowledge Management (CIKM), 2023.
[144] Z. Yuan, D. Liu, W. Pan, and Z. Ming, “Sql-rank++: A novel listwise approach for collaborative ranking with implicit feedback,” in International Joint Conference on Neural Networks (IJCNN), 2022.
[145] H. Chen, Y. Bei, Q. Shen, Y. Xu, S. Zhou, W. Huang, F. Huang, S. Wang, and X. Huang, “Macro graph neural networks for online billion-scale recommender systems,” in International World Wide Web Conference (WWW), 2024.