
In-Context Reinforcement Learning based Retrieval-Augmented Generation for Text-to-SQL

Rishit Toteja, Arindam Sarkar, Prakash Mandayam Comar
Amazon
[email protected], [email protected], [email protected]

Abstract

Text-to-SQL simplifies database interactions by enabling non-experts to convert their natural language (NL) questions into Structured Query Language (SQL) queries. With advancements in Large Language Models (LLMs), in-context learning (ICL) has emerged as a popular choice for building Text-to-SQL systems. Real-world, industry-scale databases often comprise thousands of tables and hundreds of columns, which makes passing the entire schema as context to an LLM infeasibly expensive and necessitates access to the correct database and set of tables. Recently, Retrieval Augmented Generation (RAG) based methods have been proposed for retrieving the relevant subset of databases and tables for a given query. However, we observe that existing methods of synthetic query generation produce predominantly simple queries which might not be sufficiently representative of complex, real-world queries, thus negatively affecting the quality of the generated SQL. To address this, we propose an innovative in-context reinforcement learning (ICRL) based framework which refines the question generation process by enhancing the model's ability to produce the intricate queries that practitioners may pose during inference. In contrast to existing approaches, our framework ensures the generation of synthetic SQL queries which are diverse and complex. We demonstrate the effectiveness of our approach via multiple experiments against representative state-of-the-art models on public benchmark datasets and observe substantial improvements in performance and scalability. Our method achieves 15-20% higher recall on the database/table retrieval task than the existing state-of-the-art models for schema identification, and up to 2% higher execution accuracy for SQL generation.

1 Introduction

The complexity of formulating effective database queries demands significant manpower and technical expertise, underscoring the need for innovative Text-to-SQL solutions to bridge the gap between natural language and data management. Recently, Large Language Models (LLMs) finetuned for SQL generation have shown state-of-the-art results on the representative Text-to-SQL benchmarks Spider (https://yale-lily.github.io/spider/) and BIRD (https://bird-bench.github.io/). In typical real-world systems, databases are constantly evolving, and to accommodate new fields or relationships the LLMs need to be continually finetuned to maintain the quality of generation. However, the most performant LLMs have parameters at billion scale (Zhao et al., 2023), and finetuning these models is expensive and requires technical expertise. In-context and few-shot learning has emerged as a popular alternative and has been shown to be extremely effective on Text-to-SQL tasks by works like DIN-SQL (Pourreza and Rafiei, 2023). For syntactically correct SQL generation, the LLM needs to be schema aware. With industry-scale databases often consisting of thousands of tables and hundreds of columns, passing the entire schema as context to the LLM is prohibitively expensive. This necessitates a framework which can fetch the relevant schemas for correct SQL generation. In contrast to RAG for typical NLP tasks, relevant knowledge cannot simply be retrieved via similarity search in embedding space, as the table schemas might not be syntactically related to a natural language query. Towards this, schema routing was proposed in DBCopilot (Wang et al., 2024b) for effective synthetic data generation: it leverages the foreign-key linkage between tables to perform random walks and generates corresponding synthetic natural language queries to aid table retrieval for a query. This approach was shown to have state-of-the-art performance in database/table retrieval.

However, simply generating synthetic queries based on table relationships is not guaranteed to
be representative of human-generated queries, and consequently might under-represent complex queries involving diverse operators (Figure A.2). A simple prompting based approach might generate questions which require only trivial SQL operators, e.g., "What is the most expensive book based on purchase price?", and for complex queries it is susceptible to fetching irrelevant schemas (examples in Table 5). In this work, we propose a novel in-context reinforcement learning based framework to iteratively improve the quality of synthetic queries generated by a base LLM, employing a Feedback LLM which generates instructions to modify the base generation so as to maximize a reward function that encourages the generation of complex queries. The proposed ICRL approach refines the preceding example into synthetic NL queries like "What is the most expensive book based on purchase price for books written by authors whose last name starts with 'S', and what are the author and title of that book?" (refer to A.4 for more examples). This augmentation results in significant gains over the representative models on the schema retrieval task, outperforming both finetuned and ICL based models. When the proposed RAG mechanism is utilized for few-shot SQL generation, it outperforms the state-of-the-art ICL based models on SQL generation as well, further underlining the importance of correct schema retrieval for correct SQL generation.

2 Background

Early work in Text-to-SQL modeled the problem as a sequence-to-sequence task and proposed encoder-decoder architectures (Yu et al., 2018a). Qi et al. (2022) introduce a novel architecture, modifying the attention layer of the T5 encoder to include relation embeddings in the key and value entries. In contrast to training shallow Seq2Seq models, LLMs like GPT-4 (OpenAI) have recently been demonstrated to be effective in both zero-shot and few-shot scenarios, as shown in DAIL-SQL (Gao et al., 2024). Further performance improvement is observed with supervised finetuning, which adapts LLMs to domain-specific SQL generation using additional task-specific training data; finetuned LLMs like the CodeLlama 34B and 7B models released by Defog (https://huggingface.co/defog/) achieve the highest performance. Owing to the cost implications of finetuning LLMs, there is increased interest in prompting based techniques for Text-to-SQL given the schema and relevant examples in context (Guo et al., 2023; Wang et al., 2024c). LLM performance depends heavily on the demonstrations chosen for in-context learning, as shown in FUSED (Wang et al., 2024a).

RAG for Large Databases  RAG enhances the performance of LLMs (Lewis et al., 2020) on knowledge-intensive NLP tasks like Text-to-SQL by combining the strengths of pre-trained models with knowledge contained in specialized data stores (e.g., database/table metadata). This hybrid approach helps decrease the amount of context given to LLMs. With the arrival of large-context LLMs like Gemini 1.5 (https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/) and Claude (https://www.anthropic.com/news/claude-3-family), it is possible to feed entire database schemas directly as context. However, we show that in many cases this approach fails to identify the correct set of tables relevant to a user question. Thus, for massive databases containing thousands of tables, there is a need for intelligent retrieval to determine a high-recall subset of the databases/tables to enhance SQL generation (Kothyari et al., 2023).

LLMs with Iterative Feedback  LLMs exhibit a remarkable capability for improving from feedback (Kwon et al., 2023; Wang and Li, 2023). Madaan et al. (2023) propose a method for refining model outputs through an iterative process of self-feedback. Du et al. (2024) introduce a novel method to improve the response generation of LLMs by incorporating multiple rounds of debate between different agents. Here we propose an in-context reward-guided refinement of the base LLM, which iteratively improves the model output.

3 Methodology

Given a NL query, we first identify the most relevant schemas from diverse databases. To limit the schema search space, we reduce the scope from a large array of databases D to a smaller superset S ⊆ D. Our goal is to identify S so that it retains high recall of the databases and tables in the ground-truth SQL query. For efficient retrieval, we construct a graph to represent all databases (Wang et al., 2024b), with each traversal corresponding to a subset of schemas, and use the traversals to generate synthetic data to be stored in a knowledge base (KB) for the RAG mechanism.
[Figure 1: Overview of the proposed In-Context RL based RAG architecture for schema retrieval. Panels: (A) Graph Construction, serializing the database schema graph; (B) In-Context RL Framework, where a base LLM (CodeLlama in the illustration) generates a question and SQL for a schema, e.g. "How many bikes cost more than $1000?", and a Feedback LLM, given the reward, suggests revisions such as "Consider adding conditional logic and aggregation to the query"; (C) Building Knowledge Base, contrasting RL and non-RL synthetic questions; (D) Inference and Retrieval over the knowledge base.]

Schema Graph Construction  The database/table schema graph G is constructed by initializing a root node R, with type-1 edges to each database, type-2 edges to tables, and type-3 edges representing foreign-key relationships. We generate all possible traversals of G via fixed-length random walks, where each traversal represents a unique path from R through G, as detailed in Algorithms A.1 and A.2.

Synthetic Question Generation  In order to identify S from the entire database collection represented in graph G, we begin by selecting a particular traversal from the graph. This step, referred to as serialization, involves mapping out a specific path through the graph, starting from the root node and following the edges through various databases and tables. Once a traversal is serialized, we carefully prompt an LLM to generate the corresponding natural language question(s) and SQL solution(s), and store the triplet in a KB keyed by the NL question.

3.1 In-Context RL Framework

To ensure that the generated synthetic questions are relevant as well as sufficiently complex to reflect the nature of human queries, we propose an in-context reinforcement learning (ICRL) framework to iteratively improve the LLM's generation of synthetic questions. For each traversal, we formulate the interactive process as a Markov Decision Process (MDP). The state (s_t) at time t includes the current context provided to the LLM and its parameters (θ_t), comprising the schema information and previously generated questions. The action (a_t) is the synthetic question (q_s^t) generated by the LLM. For a pretrained LLM with frozen θ_t, the policy determining the action, π(a_t | s_t, ...), is the probability of generating a sequence of tokens given the context.

Reward Function  To encourage the creation of synthetic questions with the desired complexity, we use a reward function based on keywords (k_j) from the intermediate SQL query generated by an LLM when given the synthetic question as input. Specifically, we define four keyword buckets: B_1 (data retrieval and filtering), B_2 (data modification), B_3 (conditional logic), and B_4 (aggregation and sorting). The complexity score of each bucket is calculated as:

    c(B_i) = \frac{\sum_{k_j \in S_t} \mathbf{1}(k_j \in B_i)}{\sum_{B_i} \sum_{k_j \in S_t} \mathbf{1}(k_j \in B_i)}    (1)

Details of the carefully curated keyword scores are provided in A.3. These are selected to closely mimic the SQL operators used by human practitioners for complex NL queries. For instance, a query with multiple JOIN/GROUP BY statements is likely to be more complex than a query with only SELECT/AND/OR operators. The resulting reward function R(S_t) is based on bucket frequencies and their weights:

    R(S_t) = \sum_i f(B_i, S_t) \cdot c(B_i)    (2)

where f(B_i, S_t) is the frequency of bucket B_i within the SQL query S_t.
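To make Eqs. (1) and (2) concrete, they can be computed with a simple keyword matcher over the generated SQL. The following is a minimal sketch, assuming plain regex matching and an abbreviated version of the bucket contents from A.3; function and bucket names are illustrative only:

import re
from collections import Counter

# Abbreviated keyword buckets (the full per-keyword weights are listed in A.3).
BUCKETS = {
    "B1_retrieval_filtering": ["SELECT", "FROM", "JOIN", "WHERE",
                               "GROUP BY", "HAVING", "ORDER BY", "DISTINCT"],
    "B2_modification": ["INSERT", "UPDATE", "DELETE"],
    "B3_conditional": ["AND", "OR", "NOT", "IN", "BETWEEN", "LIKE", "CASE"],
    "B4_aggregation_sorting": ["AVG", "SUM", "COUNT", "MIN", "MAX", "ASC", "DESC"],
}

def bucket_frequencies(sql: str) -> Counter:
    """f(B_i, S_t): number of keyword occurrences of each bucket in the query."""
    text = sql.upper()
    freq = Counter()
    for bucket, keywords in BUCKETS.items():
        for kw in keywords:
            freq[bucket] += len(re.findall(rf"\b{kw}\b", text))
    return freq

def reward(sql: str) -> float:
    """R(S_t) = sum_i f(B_i, S_t) * c(B_i), with c(B_i) from Eq. (1)."""
    freq = bucket_frequencies(sql)
    total = sum(freq.values())
    if total == 0:
        return 0.0
    return sum(f * (f / total) for f in freq.values())  # c(B_i) = f_i / total

# A trivial SELECT scores lower than a query with joins and aggregation.
print(reward("SELECT title FROM book"))
print(reward("SELECT a.name, MAX(b.price) FROM book b JOIN author a "
             "ON a.id = b.author_id GROUP BY a.name"))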
Table 1: Table retrieval recall (%) on the Spider-Dev and Bird-Dev datasets.

                              Spider-Dev                     Bird-Dev
Model                         R@1    R@2    R@5    R@10      R@1    R@2    R@5    R@10
DBCopilot                     49.94  81.01  85.34  85.34     34.30  56.73  61.02  61.02
RAG (BM25)                    42.71  54.07  66.83  77.31     26.92  35.33  46.61  54.82
RAG (embedding)               64.92  76.52  89.24  93.99     43.54  59.32  79.79  90.61
RAG (emb.) + SXFMR            64.74  79.36  90.35  94.36     43.87  51.10  61.73  70.20
RAG (emb.) + ICRL (iter=1)    71.12  81.50  91.61  95.90     50.19  65.18  82.07  92.24
RAG (emb.) + ICRL (iter=2)    71.44  81.64  91.66  95.94     51.63  66.03  82.40  92.24

Table 2: LLM Aided RAG (prompting an LLM to perform aggregation on the top-k retrieved schemas).

                                     Spider-Dev (R@1)        Bird-Dev (R@1)
Top-K                                RAG     RAG + ICRL      RAG     RAG + ICRL
5                                    86.21   89.10           73.01   75.22
10                                   88.91   91.10           79.07   80.89
15                                   89.98   91.80           80.70   82.98
20                                   90.59   92.50           80.96   82.26
All DBs in-context (Claude)          61.62                   80.76

Table 3: EX (Execution Accuracy) and I/P (input) tokens on the Spider-Dev dataset.

Setting                 Approach         I/P Tokens   EX (%)
DB schemas present      DIN-SQL          9916.56      74.2
                        DBCopilot        199.24       74.4
                        FUSED            778.25       74.9
                        DAIL-SQL         922.21       75.5
                        Ours (0-shot)    249.24       75.9
                        Ours (1-shot)    387.26       76.6
DB schemas inferred     DBCopilot        93.62        64.12
                        Ours (0-shot)    196.39       65.2
                        Ours (1-shot)    318.93       69.6

As shown in Figure 1, the base LLM initially generates a synthetic question q_s^0 relevant to the schema, with a corresponding SQL query S_0. The reward R(S_0) is calculated based on its complexity and keyword distribution. Subsequently, the Feedback-LLM receives in-context examples, the synthetic question q_s^t, and the reward signal R(S_t). This Feedback-LLM generates textual feedback to guide the base LLM in modifying q_s^t. Once incorporated in the context, this feedback effectively modifies the generation policy and, as we show in our experiments, improves the base LLM's generation in the next iteration. At each iteration t, the state s_{t+1} is updated with the initial context and the additional feedback. The framework iteratively refines the synthetic question q_s based on the reward signal to obtain the final q_s^final, which is indexed in the knowledge base.
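The refinement loop can be summarized in a few lines of code. Below is a compressed sketch that reuses the reward function sketched in Section 3.1 and assumes base_llm and feedback_llm are generic callables wrapping the underlying models; the actual prompt templates and model clients are not shown:

def icrl_refine(schema: str, base_llm, feedback_llm, n_iters: int = 2):
    """Iteratively refine one traversal's synthetic (question, SQL) pair."""
    context = (f"Schema: {schema}\n"
               "Write a natural language question and its SQL query.")
    question, sql = base_llm(context)            # q_s^0 and S_0
    for t in range(n_iters):
        r = reward(sql)                          # R(S_t), Eq. (2)
        # The Feedback-LLM sees the current question and its reward and
        # returns textual advice, e.g. "add conditional logic and aggregation".
        advice = feedback_llm(schema, question, r)
        # Appending feedback to the context effectively modifies the
        # generation policy of the frozen base LLM for the next iteration.
        context += (f"\nPrevious question: {question}"
                    f"\nReward: {r:.2f}"
                    f"\nFeedback: {advice}")
        question, sql = base_llm(context)        # q_s^{t+1} and S_{t+1}
    return question, sql                         # q_s^final, indexed in the KB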
LLM Aided Schema Pooling  We further improve the recall of schema retrieval by leveraging the reasoning capabilities of the LLM: given the user query, the top-k schemas are first fetched from the KB, and then, given this context, an LLM is prompted to select the most relevant schemas from the candidates.
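A sketch of retrieval plus LLM-aided pooling follows, assuming the KB is a list of (embedding, synthetic_question, schema) triples produced during indexing and that embed and llm are generic callables; all names here are illustrative:

import numpy as np

def retrieve_top_k(user_question, kb, embed, k=10):
    """Rank KB entries by cosine similarity between the user question and
    the stored synthetic questions; return the k best schemas."""
    q = np.asarray(embed(user_question), dtype=float)
    q /= np.linalg.norm(q)
    scored = []
    for emb, syn_q, schema in kb:
        e = np.asarray(emb, dtype=float)
        scored.append((float(q @ e / np.linalg.norm(e)), schema))
    scored.sort(key=lambda x: -x[0])
    return [schema for _, schema in scored[:k]]

def pool_schemas(user_question, candidates, llm):
    """LLM-aided pooling: ask the LLM to keep only the relevant candidates."""
    prompt = ("Question: " + user_question + "\nCandidate schemas:\n"
              + "\n".join(map(str, candidates))
              + "\nReturn only the schemas required to answer the question.")
    return llm(prompt)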
3.2 SQL Generation

Once the schema is selected, we prompt an LLM to generate the SQL query. The LLM is provided with examples from the KB in the form (q, D, S), where q is the user question, D is the database schema, and S is the retrieved SQL query. These, together with the chosen schema, are used to produce the final SQL query.

4 Experiments and Results

We use Claude (Sonnet) for synthetic question generation and schema retrieval. Cohere-Embed-English-v3 (https://cohere.com/blog/introducing-embed-v3) is used for generating the embeddings indexed in the knowledge base. Additionally, we compare the embedding based retrieval with BM25, a standard ranking function in search engines for document relevance, and with SXFMR (Reimers and Gurevych, 2019), which applies contrastive learning to Transformer-based models for generic embedding retrieval on related text pairs. For SQL generation, we use the GPT-3.5 Turbo model via the official API (OpenAI).

Evaluation Metrics  We employ standard evaluation metrics for the text-to-SQL task. For schema retrieval, we compute table recall, measuring the percentage of top-k retrieved schemas that match the gold schema. We use Execution Accuracy (EX) to evaluate the generated SQL queries. All experiments were performed on the Spider-Dev (Yu et al., 2018b) and Bird-Dev (Li et al., 2023) datasets (details in A.1).
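For reference, one plausible reading of the table-recall metric is sketched below; the exact matching protocol may differ:

def table_recall_at_k(examples, k: int) -> float:
    """examples: (retrieved_schemas, gold_tables) pairs, where retrieved_schemas
    is a ranked list of table-name sets and gold_tables the tables in the gold
    SQL. A question counts as a hit when the union of the top-k retrieved
    schemas covers every gold table."""
    hits = 0
    for retrieved, gold in examples:
        pool = set()
        for schema in retrieved[:k]:
            pool |= set(schema)
        hits += gold <= pool
    return 100.0 * hits / len(examples)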
Schema Retrieval Recall  The feedback-based model variants achieve superior recall compared to the representative baselines, including DBCopilot, which is a finetuned model. Specifically, R@1 improved by 21.5% on Spider-Dev and by 17.3% on Bird-Dev, demonstrating the effectiveness of the ICRL approach. It can be observed from Table 2 that directly providing all databases as context to the LLM does not always yield optimal performance, which might be attributed to distraction in the presence of irrelevant tokens (Shi et al., 2023).

Execution Accuracy (EX)  When the gold database schema is provided as context, our approach achieves high execution accuracy, as shown in Table 3. Both the zero-shot and 1-shot variants of our model achieve around 2% higher EX, surpassing all the baselines. Compared to our model, DIN-SQL and DAIL-SQL, the current state-of-the-art models in this setting, use significantly more tokens; we achieve substantial cost reductions, requiring 25.6x and 2.38x fewer tokens than DIN-SQL and DAIL-SQL, respectively. When the gold schema context is absent, our schema retrieval method outperforms DBCopilot, achieving 69.6% EX in the 1-shot scenario.

5 Conclusion

While LLMs are trained on vast amounts of public data, they cannot readily handle domain-specific or confidential industry-scale databases. The impracticality and cost of finetuning LLMs on dynamic databases underscores the importance of efficient schema retrieval as a key step in Text-to-SQL applications. In this work, we propose a novel in-context reinforcement learning based RAG framework for efficient schema and in-context example retrieval for Text-to-SQL tasks. Our approach requires no specialized finetuning, is built from composable prompting-based modules, and outperforms representative state-of-the-art baselines on both the schema retrieval and SQL generation tasks. While we benchmark the presented approach on Text-to-SQL, it generalizes to other problems requiring iterative refinement on top of LLMs.

6 Limitations

Since we do not use fine-tuned LLMs for SQL generation, the models may still lack information or understanding about the specific databases in context. Moreover, while powerful, the proposed model may not inherently understand or provide meaningful interpretations of the database schemas it works with, especially if those schemas lack natural language descriptions.

References

BM25. Okapi BM25. Wikipedia, the free encyclopedia.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2024. Improving factuality and reasoning in language models through multiagent debate.

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. Text-to-SQL empowered by large language models: A benchmark evaluation. Proc. VLDB Endow., 17(5):1132-1145.

Chunxi Guo, Zhiliang Tian, Jintao Tang, Pancheng Wang, Zhihua Wen, Kang Yang, and Ting Wang. 2023. Prompting GPT-3.5 for text-to-SQL with de-semanticization and skeleton retrieval. In PRICAI 2023: Trends in Artificial Intelligence: 20th Pacific Rim International Conference on Artificial Intelligence, Proceedings, Part II, pages 262-274, Berlin, Heidelberg. Springer-Verlag.

Mayank Kothyari, Dhruva Dhingra, Sunita Sarawagi, and Soumen Chakrabarti. 2023. CRUSH4SQL: Collective retrieval using schema hallucination for Text2SQL. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14054-14066, Singapore. Association for Computational Linguistics.

Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. 2023. Reward design with language models. In The Eleventh International Conference on Learning Representations.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459-9474. Curran Associates, Inc.

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM already serve as a database interface? A BIg bench for large-scale database grounded text-to-SQLs. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, pages 46534-46594. Curran Associates, Inc.
OpenAI. Models overview. https://platform.openai.com/docs/models/overview/.

Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. In Advances in Neural Information Processing Systems, volume 36, pages 36339-36348. Curran Associates, Inc.

Jiexing Qi, Jingyao Tang, Ziwei He, Xiangpeng Wan, Yu Cheng, Chenghu Zhou, Xinbing Wang, Quanshi Zhang, and Zhouhan Lin. 2022. RASAT: Integrating relational structures into pretrained Seq2Seq model for text-to-SQL. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3215-3229, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982-3992, Hong Kong, China. Association for Computational Linguistics.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210-31227. PMLR.

Danqing Wang and Lei Li. 2023. Learning from mistakes via cooperative study assistant for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10667-10685, Singapore. Association for Computational Linguistics.

Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, and Wanxiang Che. 2024a. Improving demonstration diversity by human-free fusing for text-to-SQL. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1193-1207, Miami, Florida, USA. Association for Computational Linguistics.

Tianshu Wang, Hongyu Lin, Xianpei Han, Le Sun, Xiaoyang Chen, Hao Wang, and Zhenyu Zeng. 2024b. DBCopilot: Scaling natural language querying to massive databases. Preprint, arXiv:2312.03463.

Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. 2024c. Chain-of-table: Evolving tables in the reasoning chain for table understanding. In The Twelfth International Conference on Learning Representations.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 588-594, New Orleans, Louisiana. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911-3921, Brussels, Belgium. Association for Computational Linguistics.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. Preprint, arXiv:2303.18223.

A Appendix

A.1 Dataset Description

We used two development sets for our experiments, Spider and Bird. Both are cross-domain English datasets widely used for benchmarking. Bird bridges the gap between text-to-SQL research and real-world applications by dealing with large and messy database values. The statistics below include the total size and the distribution of queries by difficulty level.

Table 4: Statistics of Spider-Dev and Bird-Dev

Dataset         Easy (Simple)   Medium (Moderate)   Hard (Challenging)   Extra
Spider (2147)   470             857                 463                  357
Bird (1534)     925             465                 144                  -

A.2 Construction of the Database Graph

The database graph G starts with a root node R. An edge of type-1 connects R to each database Di. From each database Di, edges of type-2 connect to its constituent tables Tij. Additionally, edges of type-3 represent relationships between tables within the same database, specifically foreign-key constraints: if a table Tij in database Di references another table Tik, a type-3 edge connects Tij to Tik (Algorithm A.1). Once the database graph is constructed, we generate traversals on this graph following Algorithm A.2.
Table 5: Comparison of retrieved queries and schemas across the different retrieval methods. Each entry shows the ground-truth question and gold schema, followed by the synthetic question and top-1 retrieved schema (R@1) under simple retrieval and under ICRL-augmented retrieval.

1. Ground truth: "What is the title of the book written by George Orwell that has the lowest sale price?" Gold schema: ['Book', 'Author_book', 'Author']
   Simple retrieval: "Which books have the highest sale price?" -> ['Author_Book', 'Book']
   ICRL augmented: "What are the titles of books written by authors whose books have a sale price between $10 and $20, ordered by the highest sale price?" -> ['Book', 'Author_book', 'Author']

2. Ground truth: "What are the distinct ids of customers who bought lemon flavored cake?" Gold schema: ['items', 'receipts', 'goods']
   Simple retrieval: "What are the details of customers with a specific customer ID?" -> ['Customers']
   ICRL augmented: "Which items of a specific flavor and food type were ordered by customers with IDs between 100 and 200 on a given date range?" -> ['items', 'receipts', 'goods']

3. Ground truth: "What is the least common detention type? Show the code and the description." Gold schema: ['Detention', 'Ref_Detention_Type']
   Simple retrieval: "What is the description for a given detention type code?" -> ['Ref_Detention_Type']
   ICRL augmented: "What is the most common type of detention given, and for those detentions, retrieve the detention summary and other details where the summary contains the word 'behavior'?" -> ['Detention', 'Ref_Detention_Type']

4. Ground truth: "What are the prices and sizes of all products whose price is above the mean?" Gold schema: ['Products']
   Simple retrieval: "Which food items have a price above a certain amount?" -> ['goods']
   ICRL augmented: "What is the most expensive product of a certain color and size range, and how does its price compare to the average price of products in that category?" -> ['Products']

5. Ground truth: "Which make has more than one team?" Gold schema: ['team']
   Simple retrieval: "What is the number of drivers per team?" -> ['team_driver']
   ICRL augmented: "Which car owners also sponsor at least two teams, and retrieve the team names and car makes for those teams?" -> ['team']

[Figure 2: Comparison of incorrect-prediction distributions for RL and non-RL retrieval on the Spider and Bird Dev sets at top-k @2, @5, and LLM Aided RAG (k=1). Panels: Spider (k=2), Spider (k=5), Spider (LLM Aided RAG, k=1); Bird (k=2), Bird (k=5), Bird (LLM Aided RAG, k=1). x-axis: number of tables in the schema; y-axis: % incorrect predictions.]

A.3 Complexity Scores

To encourage the LLM to generate diverse and composite questions, we meticulously designed a reward function based on keyword categories. These scores can be further adjusted to suit specific use cases. Notably, these settings yielded the best results in our experiments.

Data retrieval and filtering: SELECT (1), FROM (1), JOIN (2), INNER JOIN (3), LEFT JOIN (3), RIGHT JOIN (3), 4 (score to be confirmed), ON (2), WHERE (2), GROUP BY (3), HAVING (3), ORDER BY (2), DISTINCT (2), LIMIT (1).
Data modification: INSERT (2), UPDATE (3), DELETE (4).
Conditional logic: AND (1), OR (1), NOT (1), IN (2), BETWEEN (2), LIKE (2), CASE (3), WHEN (2), THEN (2), ELSE (2), END (1).
Aggregation and sorting: AVG (3), SUM (3), COUNT (3), MIN (3), MAX (3), ASC (1), DESC (1).
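These weights can be carried as plain configuration. A sketch follows; the dictionary layout and names are illustrative, and the unnamed score-4 entry from the list above is omitted:

# Per-keyword complexity weights from A.3, grouped by bucket.
KEYWORD_SCORES = {
    "data_retrieval_filtering": {
        "SELECT": 1, "FROM": 1, "JOIN": 2, "INNER JOIN": 3, "LEFT JOIN": 3,
        "RIGHT JOIN": 3, "ON": 2, "WHERE": 2, "GROUP BY": 3, "HAVING": 3,
        "ORDER BY": 2, "DISTINCT": 2, "LIMIT": 1,
    },
    "data_modification": {"INSERT": 2, "UPDATE": 3, "DELETE": 4},
    "conditional_logic": {
        "AND": 1, "OR": 1, "NOT": 1, "IN": 2, "BETWEEN": 2, "LIKE": 2,
        "CASE": 3, "WHEN": 2, "THEN": 2, "ELSE": 2, "END": 1,
    },
    "aggregation_sorting": {
        "AVG": 3, "SUM": 3, "COUNT": 3, "MIN": 3, "MAX": 3, "ASC": 1, "DESC": 1,
    },
}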
A.4 Solving Complex User Questions using ICRL

We analyze the retrieval methods by their error rates across query complexity levels, proxied by the number of tables in the ground-truth query schema. We compare three retrieval settings on the Spider and Bird Dev sets: top 2, top 5, and LLM-aided RAG (top 1), i.e., the number of schemas retrieved from the knowledge base. Figure A.2 underscores the effectiveness of RL-based retrieval methods in achieving lower error rates for complex queries, which are common across industries, without compromising accuracy on simpler queries. Table A.5 highlights the effectiveness of ICRL-augmented retrieval over simple retrieval in generating synthetic questions and retrieving relevant schemas. The schemas retrieved by the ICRL approach are more aligned with the ground truth, and the formulated synthetic questions better capture the complexity of the original queries. This demonstrates that reinforcement learning feedback significantly helps in enhancing schema identification recall.
Algorithm 1 Construction of Database Schema Graph
Input: Metadata: List of databases and schemas
Output: Graph G representing the schema
1: Initialize root node R and graph G
2: G.add_node(R)
3: for each database D in databases do
4: G.add_node(D)
5: G.add_edge(R, D, 1) ▷ Type-1 edge
6: for each table T in D.tables do
7: G.add_node(T )
8: G.add_edge(D, T, 2) ▷ Type-2 edge
9: for each foreign key F K in T.foreign_keys do
10: G.add_edge(T, F K.related_table, 3) ▷ Type-3 edge
11: end for
12: end for
13: end for
14: return G
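A minimal Python rendering of Algorithm 1 is sketched below, assuming networkx and a simple metadata structure; the input format shown is illustrative:

import networkx as nx

def build_schema_graph(databases):
    """Algorithm 1 in Python: root -> databases (type-1), database -> tables
    (type-2), table -> FK-referenced table (type-3).

    `databases` is a list like:
      [{"name": "bike_racing",
        "tables": {"bikes": ["teams"], "teams": []}}]  # table -> FK targets
    """
    G = nx.DiGraph()
    G.add_node("ROOT")
    for db in databases:
        G.add_edge("ROOT", db["name"], etype=1)           # type-1 edge
        for table, fk_targets in db["tables"].items():
            t = f"{db['name']}.{table}"
            G.add_edge(db["name"], t, etype=2)            # type-2 edge
            for target in fk_targets:
                G.add_edge(t, f"{db['name']}.{target}", etype=3)  # type-3 edge
    return G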

Algorithm 2 Serialization of Graph with Cutoff


Input: Graph G, starting node R, cutoff length k
Output: List of all traversals from R with cutoff length k
1: function GET _ ALL _ TRAVERSALS(G, R, k)
2: traversals ← []
3: function SERIALIZE(node, path, depth, visited)
4: path ← path + [node]
5: traversals ← traversals + [path]
6: visited[node] ← True
7: if depth < k then
8: if G.has_children(node) = True then
9: children ← G.get_children(node)
10: if length(children) > 1 and edge_type = 3 then
11: subsets_children ← power_set(children)
12: for each child c in children do
13: for each subset s in subsets_children do
14: if ∀n ∈ s, visited[n] = False then
15: SERIALIZE (c, path + s, depth + length(s) + 1, visited)
16: end if
17: end for
18: end for
19: else
20: for each child c in children do
21: if visited[c] = False then
22: SERIALIZE (c, path, depth + 1, visited)
23: end if
24: end for
25: end if
26: end if
27: end if
28: visited[node] ← False
29: path ← path − [node]
30: end function
31: SERIALIZE (R, [], 0, {False} for all nodes in G)
32: return traversals
33: end function
34: k ← specified cutoff length
35: samples ← GET _ ALL _ TRAVERSALS(G, R, k)
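For intuition, a simplified random-walk sampler over the same graph is sketched below. Unlike the exhaustive power-set serialization of Algorithm 2, it draws individual fixed-length paths; this simplification is ours:

import random

def sample_traversal(G, root="ROOT", cutoff=4, rng=random):
    """Sample one path from the root, following at most `cutoff` edges.
    Each sampled path is a schema subset to verbalize into a synthetic question."""
    path, node = [], root
    for _ in range(cutoff):
        children = [c for c in G.successors(node) if c not in path]
        if not children:
            break
        node = rng.choice(children)
        path.append(node)
    return path  # e.g. ["bike_racing", "bike_racing.bikes", "bike_racing.teams"]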
