In Context Reinforcement Learning Based Retrieval Augmented Generation for Text to SQL
[Figure 1 diagram: a user question ("How many bikes cost more than $1000?") with a CodeLlama-generated SQL query (... FROM bikes WHERE price > 1000;), textual feedback ("Consider adding conditional logic. For example, you could include ..."), synthetic questions, a traversal schema (['Achievements', 'Student_Loans']), and the knowledge base.]
Figure 1: Overview of the proposed In-Context RL based RAG architecture for schema retrieval.
...base (KB) for the RAG mechanism.

Schema Graph Construction The database/table schema graph G is constructed by initializing a root node R, with type-1 edges to each database, type-2 edges to tables, and type-3 edges representing foreign key relationships. We generate all possible traversals on G via fixed-length random walks, where each traversal represents a unique path from R through G, as detailed in Algorithms A.1 and A.2.

Synthetic Question Generation In order to identify S from the entire database represented in graph G, we begin by selecting a particular traversal from the graph. This step, referred to as serialization, involves mapping out a specific path through the graph, starting from the root node and following the edges through the various databases and tables. Once a traversal is serialized, we carefully prompt an LLM to generate the corresponding natural language question(s) and SQL solution(s), and store the triplet in a KB keyed by the NL question.
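To make the indexing step above concrete, the following is a minimal sketch of a knowledge base that stores the generated (question, schema, SQL) triplets and retrieves them by embedding similarity over the NL question. The `KBEntry` fields and the `embed` callable are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch (not the paper's implementation) of a knowledge base that
# stores LLM-generated (question, schema, SQL) triplets keyed by the NL question.
# `embed` is an assumed callable mapping text to a vector (e.g. a wrapper around
# an embedding API such as Cohere-Embed-English-v3).
from dataclasses import dataclass
import numpy as np

@dataclass
class KBEntry:
    question: str   # synthetic NL question (the retrieval key)
    schema: list    # serialized traversal, e.g. ['Achievements', 'Student_Loans']
    sql: str        # LLM-generated reference SQL for the question

class KnowledgeBase:
    def __init__(self, embed):
        self.embed = embed
        self.entries, self.vectors = [], []

    def add(self, entry: KBEntry) -> None:
        # Index the triplet under the embedding of its natural-language question.
        self.entries.append(entry)
        self.vectors.append(np.asarray(self.embed(entry.question), dtype=float))

    def retrieve(self, user_question: str, k: int = 5) -> list:
        # Return the k entries whose questions are most similar to the user question.
        q = np.asarray(self.embed(user_question), dtype=float)
        sims = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
                for v in self.vectors]
        top = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:k]
        return [self.entries[i] for i in top]
```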
...parameters (θt), comprising schema information and previously generated questions. The action (at) is the synthetic question (qst) generated by the LLM. For a pretrained LLM with frozen θt, the policy determining the action π(at | st, ...) is the probability of generating a sequence of tokens given the context.

Reward Function To encourage the creation of synthetic questions with the desired complexity, we use a reward function based on keywords (kj) from the intermediate SQL query St generated by an LLM when given the synthetic question as input. Specifically, we define four keyword buckets: B1 (data retrieval and filtering), B2 (data modification), B3 (conditional logic), and B4 (aggregation and sorting). The complexity score of each bucket is calculated as

c(Bi) = f(Bi, St) / Σ_{Bi} f(Bi, St),   with   f(Bi, St) = Σ_{kj ∈ St} 1(kj ∈ Bi),        (1)

where f(Bi, St) is the frequency of bucket Bi within the SQL query St.
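A minimal sketch of Eq. (1) is given below; the keyword lists assigned to B1-B4 here are illustrative assumptions rather than the exact lists used.

```python
import re

# Illustrative keyword buckets; the exact keyword lists behind B1-B4 are an
# assumption made for this sketch.
BUCKETS = {
    "B1": {"SELECT", "FROM", "WHERE", "DISTINCT", "LIKE", "IN"},              # retrieval and filtering
    "B2": {"INSERT", "UPDATE", "DELETE"},                                     # data modification
    "B3": {"CASE", "WHEN", "IF", "EXISTS"},                                   # conditional logic
    "B4": {"GROUP", "HAVING", "ORDER", "COUNT", "SUM", "AVG", "MIN", "MAX"},  # aggregation and sorting
}

def complexity_scores(sql: str) -> dict:
    """Eq. (1): c(B_i) = f(B_i, S_t) / sum over buckets of f(B_i, S_t)."""
    tokens = re.findall(r"[A-Za-z_]+", sql.upper())
    freq = {b: sum(tok in kws for tok in tokens) for b, kws in BUCKETS.items()}
    total = sum(freq.values())
    return {b: (f / total if total else 0.0) for b, f in freq.items()}

# Example: a plain filtering query concentrates its mass in B1/B4, so a reward
# that favours under-used buckets steers the next synthetic question elsewhere.
print(complexity_scores("SELECT COUNT(*) FROM bikes WHERE price > 1000"))
```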
Model | Spider-Dev R@1 | R@2 | R@5 | R@10 | Bird-Dev R@1 | R@2 | R@5 | R@10
DBCopilot | 49.94 | 81.01 | 85.34 | 85.34 | 34.30 | 56.73 | 61.02 | 61.02
RAG (BM25) | 42.71 | 54.07 | 66.83 | 77.31 | 26.92 | 35.33 | 46.61 | 54.82
RAG (embedding) | 64.92 | 76.52 | 89.24 | 93.99 | 43.54 | 59.32 | 79.79 | 90.61
RAG (emb.) + SXFMR | 64.74 | 79.36 | 90.35 | 94.36 | 43.87 | 51.10 | 61.73 | 70.20
RAG (emb.) + ICRL (iter=1) | 71.12 | 81.50 | 91.61 | 95.90 | 50.19 | 65.18 | 82.07 | 92.24
RAG (emb.) + ICRL (iter=2) | 71.44 | 81.64 | 91.66 | 95.94 | 51.63 | 66.03 | 82.40 | 92.24
Table 2: LLM Aided RAG (prompting the LLM to perform aggregation on top-k retrieved schemas).

Top-K | Spider-Dev (R@1) RAG | Spider-Dev (R@1) RAG + ICRL | Bird-Dev (R@1) RAG | Bird-Dev (R@1) RAG + ICRL
5 | 86.21 | 89.10 | 73.01 | 75.22
10 | 88.91 | 91.10 | 79.07 | 80.89
15 | 89.98 | 91.80 | 80.70 | 82.98
20 | 90.59 | 92.50 | 80.96 | 82.26
All DBs provided in-context (Claude): 61.62 (Spider-Dev R@1), 80.76 (Bird-Dev R@1)

Table 3: EX (Execution Accuracy) and I/P (input) tokens on the Spider-Dev dataset.

Setting | Approach | I/P Tokens | EX (%)
DB schemas present | DIN-SQL | 9916.56 | 74.2
DB schemas present | DBCopilot | 199.24 | 74.4
DB schemas present | FUSED | 778.25 | 74.9
DB schemas present | DAIL-SQL | 922.21 | 75.5
DB schemas present | Ours (0-shot) | 249.24 | 75.9
DB schemas present | Ours (1-shot) | 387.26 | 76.6
DB schemas inferred | DBCopilot | 93.62 | 64.12
DB schemas inferred | Ours (0-shot) | 196.39 | 65.2
DB schemas inferred | Ours (1-shot) | 318.93 | 69.6
As shown in Figure 1, initially the base LLM generates a synthetic question qs0 relevant to the schema, with a corresponding SQL query S0. The reward R(S0) is calculated based on its complexity and keyword distribution. Subsequently, the Feedback-LLM receives in-context examples, the synthetic question qst, and the reward signal R(St). The Feedback-LLM generates textual feedback to guide the base LLM in modifying qst. Once incorporated in the context, this feedback effectively modifies the generation policy and, as we show in our experiments, improves the base LLM's generation for the next iteration. At each iteration t, the state st+1 is updated with the initial context and the additional feedback. The framework iteratively refines the synthetic question qs based on the reward signal to get the final qs_final, which is indexed in the knowledge base.
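The loop below is a minimal sketch of this refinement cycle. `base_llm`, `feedback_llm`, and `sql_llm` are assumed prompt-in/text-out callables, `reward` scores an SQL string (e.g. using the bucket scores of Eq. 1), and the prompts are illustrative rather than the ones actually used.

```python
# Sketch of the in-context refinement loop from Figure 1; prompts and callables
# are assumptions, not the paper's exact implementation.
def refine_synthetic_question(schema: str, base_llm, feedback_llm, sql_llm,
                              reward, n_iters: int = 2) -> str:
    context = f"Schema:\n{schema}\nWrite one natural-language question over this schema."
    question = base_llm(context)                          # initial action q_s0
    for t in range(n_iters):
        sql = sql_llm(f"Schema:\n{schema}\nQuestion: {question}\nSQL:")
        r = reward(sql)                                   # scalar reward R(S_t)
        feedback = feedback_llm(
            f"Question: {question}\nSQL: {sql}\nReward: {r:.2f}\n"
            "Suggest how to rewrite the question so that its SQL becomes more complex."
        )
        # Appending the feedback to the context acts as the policy update: the
        # frozen LLM conditions on it when generating the next question (state s_{t+1}).
        context += (f"\nPrevious question: {question}\nFeedback: {feedback}\n"
                    "Revised question:")
        question = base_llm(context)
    return question  # q_s_final, to be indexed in the knowledge base
```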
LLM Aided Schema Pooling We further improve the recall of schema retrieval by leveraging the reasoning capabilities of an LLM: given the user query, the top-k schemas are first fetched from the KB, and then, given this context, an LLM is prompted to select the most relevant schemas from the candidates.
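A minimal sketch of this pooling step, assuming the KnowledgeBase interface from the earlier sketch and a generic `llm` callable:

```python
# Sketch of LLM-aided schema pooling: fetch top-k candidates from the KB, then
# ask an LLM to keep only the relevant ones. `kb` and `llm` follow the
# interfaces assumed in the earlier sketches.
def pool_schemas(user_question: str, kb, llm, k: int = 10) -> list:
    candidates = [entry.schema for entry in kb.retrieve(user_question, k=k)]
    listing = "\n".join(f"{i}: {schema}" for i, schema in enumerate(candidates))
    answer = llm(
        f"Question: {user_question}\nCandidate schemas:\n{listing}\n"
        "Return the indices of the schemas needed to answer the question, comma-separated."
    )
    keep = {int(tok) for tok in answer.replace(",", " ").split() if tok.isdigit()}
    return [candidates[i] for i in sorted(keep) if i < len(candidates)]
```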
3.2 SQL Generation

Once a schema is selected, we prompt an LLM to generate the SQL query. The LLM is provided with examples from the KB in the form (q, D, S), where q is the user question, D is the database schema, and S is the retrieved SQL query. These, along with the chosen schema, are used to produce the final SQL query.
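A minimal sketch of this few-shot prompting step, reusing the KBEntry fields from the earlier sketch; `llm` is an assumed prompt-in/text-out callable (e.g. a GPT-3.5 Turbo wrapper):

```python
# Sketch of the final SQL-generation prompt: retrieved (q, D, S) exemplars are
# prepended as few-shot demonstrations ahead of the chosen schema and question.
def generate_sql(user_question: str, chosen_schema, exemplars, llm) -> str:
    shots = "\n\n".join(
        f"Schema: {e.schema}\nQuestion: {e.question}\nSQL: {e.sql}" for e in exemplars
    )
    prompt = (f"{shots}\n\n"
              f"Schema: {chosen_schema}\n"
              f"Question: {user_question}\n"
              "SQL:")
    return llm(prompt).strip()
```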
4 Experiments and Results

We use Claude (Sonnet) for synthetic question generation and schema retrieval. Cohere-Embed-English-v3 (https://cohere.com/blog/introducing-embed-v3) is used for generating the embeddings indexed in the knowledge base. Additionally, we compare the embedding-based retrieval with BM25, a standard ranking function used in search engines for document relevance, and SXFMR (Reimers and Gurevych, 2019), which applies contrastive learning to Transformer-based models for generic embedding retrieval on related text pairs. For SQL generation, we use the GPT-3.5 Turbo model via the official API (OpenAI).

Evaluation Metrics We employ standard evaluation metrics for the text-to-SQL task. For schema retrieval, we compute table recall, measuring the percentage of top-k retrieved schemas that match the gold schema. We consider Execution Accuracy (EX) for evaluating the generated SQL query. All experiments were performed on the Spider-Dev (Yu et al., 2018b) and Bird-Dev (Li et al., 2023) datasets (details in A.1).
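As one straightforward reading of the table-recall metric described above, the helper below computes the fraction of gold tables covered by the top-k retrieved schemas; the exact aggregation used for R@k may differ.

```python
# One possible reading of table recall: fraction of gold tables covered by the
# union of the top-k retrieved schemas. The paper's exact aggregation may differ.
def table_recall_at_k(retrieved_schemas: list, gold_tables: set, k: int) -> float:
    covered = {t for schema in retrieved_schemas[:k] for t in schema}
    return len(gold_tables & covered) / max(len(gold_tables), 1)

# Example: table_recall_at_k([['singer', 'concert'], ['stadium']], {'singer', 'concert'}, k=1) == 1.0
```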
Schema Retrieval Recall The feedback-based model variants achieve superior recall compared to the representative baselines, including DBCopilot, which is a finetuned model. Specifically, on Spider-Dev, R@1 improved by 21.5%, and on Bird-Dev, R@1 increased by 17.3%, demonstrating the effectiveness of the ICRL approach. It can be observed from Table 2 that directly providing all databases as context to the LLM does not always yield optimal performance, which might be attributed to distraction in the presence of irrelevant tokens (Shi et al., 2023).

Execution Accuracy (EX): When the gold database schema is provided as context, our approach achieves high execution accuracy (EX), as shown in Table 3. Both the zero-shot and 1-shot variants of our model achieve roughly 2% higher EX, surpassing all the baselines. Compared to our model, DIN-SQL and DAIL-SQL, the current state-of-the-art models in this setting, use significantly more tokens. In contrast, we achieve substantial cost reductions, requiring 25.6x and 2.38x fewer tokens than DIN-SQL and DAIL-SQL, respectively. When the gold schema context is absent, our schema retrieval method outperforms DBCopilot, achieving 69.6% EX in the 1-shot scenario.

5 Conclusion

While LLMs are trained on vast amounts of public data, they are unable to readily handle domain-specific or confidential industry-scale databases. The impracticality and cost of finetuning LLMs on dynamic databases underscores the importance of efficient schema retrieval as a key step in Text-to-SQL applications. In this work, we propose a novel in-context reinforcement learning based RAG framework for efficient schema and in-context example retrieval for Text-to-SQL tasks. Our approach requires no specialized finetuning, is based on composable prompting-based modules, and outperforms representative state-of-the-art baselines on both the schema retrieval and SQL generation tasks. While we benchmark the presented approach on Text-to-SQL, it is generalizable to other problems requiring iterative refinement on top of LLMs as well.

6 Limitations

Since we are not using fine-tuned LLMs for SQL generation, they may still lack information or understanding about the specific databases in context. Apart from this, while powerful, the proposed model may not inherently understand or provide meaningful interpretations of the database schemas it is working with, especially if those schemas do not have natural language descriptions.

References

BM25. Okapi BM25. Wikipedia, the free encyclopedia.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2024. Improving factuality and reasoning in language models through multiagent debate.

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. Text-to-SQL empowered by large language models: A benchmark evaluation. Proc. VLDB Endow., 17(5):1132–1145.

Chunxi Guo, Zhiliang Tian, Jintao Tang, Pancheng Wang, Zhihua Wen, Kang Yang, and Ting Wang. 2023. Prompting GPT-3.5 for text-to-SQL with de-semanticization and skeleton retrieval. In PRICAI 2023: Trends in Artificial Intelligence: 20th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2023, Jakarta, Indonesia, November 15–19, 2023, Proceedings, Part II, pages 262–274, Berlin, Heidelberg. Springer-Verlag.

Mayank Kothyari, Dhruva Dhingra, Sunita Sarawagi, and Soumen Chakrabarti. 2023. CRUSH4SQL: Collective retrieval using schema hallucination for Text2SQL. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14054–14066, Singapore. Association for Computational Linguistics.

Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. 2023. Reward design with language models. In The Eleventh International Conference on Learning Representations.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM already serve as a database interface? A BIg bench for large-scale database grounded text-to-SQLs. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, pages 46534–46594. Curran Associates, Inc.
OpenAI. Models overview. https://platform.openai.com/docs/models/overview/.

Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. In Advances in Neural Information Processing Systems, volume 36, pages 36339–36348. Curran Associates, Inc.

Jiexing Qi, Jingyao Tang, Ziwei He, Xiangpeng Wan, Yu Cheng, Chenghu Zhou, Xinbing Wang, Quanshi Zhang, and Zhouhan Lin. 2022. RASAT: Integrating relational structures into pretrained Seq2Seq model for text-to-SQL. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3215–3229, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.

Danqing Wang and Lei Li. 2023. Learning from mistakes via cooperative study assistant for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10667–10685, Singapore. Association for Computational Linguistics.

Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, and Wanxiang Che. 2024a. Improving demonstration diversity by human-free fusing for text-to-SQL. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1193–1207, Miami, Florida, USA. Association for Computational Linguistics.

Tianshu Wang, Hongyu Lin, Xianpei Han, Le Sun, Xiaoyang Chen, Hao Wang, and Zhenyu Zeng. 2024b. DBCopilot: Scaling natural language querying to massive databases. Preprint, arXiv:2312.03463.

Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. 2024c. Chain-of-table: Evolving tables in the reasoning chain for table understanding. In The Twelfth International Conference on Learning Representations.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 588–594, New Orleans, Louisiana. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. Preprint, arXiv:2303.18223.

A Appendix

A.1 Dataset Description

We used two development sets for our experiments, Spider and Bird. Spider and Bird are cross-domain English datasets widely used for benchmarking. Bird bridges the gap between text-to-SQL research and real-world applications by dealing with large and messy database values. The statistics below include the total size of each set and the distribution of queries by difficulty level.

Table 4: Statistics of Spider-Dev and Bird-Dev

Dataset | Easy (Simple) | Medium (Moderate) | Hard (Challenging) | Extra
Spider (2147) | 470 | 857 | 463 | 357
Bird (1534) | 925 | 465 | 144 | -

A.2 Construction of the Database Graph

The database graph G starts with a root node R. An edge of type-1 connects R to each database Di. From each database Di, edges of type-2 connect to its constituent tables Tij. Additionally, edges of type-3 represent relationships between tables within the same database, specifically foreign key constraints: if a table Tij in database Di references another table Tik, a type-3 edge connects Tij to Tik (Algorithm A.1). Once the database graph is constructed, we generate traversals on this graph following Algorithm A.2.
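As a concrete illustration of this construction, the sketch below builds such a graph and samples fixed-length random walks from the root. It follows the description above rather than the exact Algorithms A.1/A.2, and the input format and networkx usage are assumptions.

```python
# Sketch of Appendix A.2: build the database graph and sample fixed-length
# random-walk traversals from the root. `databases` maps each database name to
# {table_name: [foreign-key referenced tables]} and is an assumed input format.
import random
import networkx as nx

def build_schema_graph(databases: dict) -> nx.Graph:
    G = nx.Graph()
    G.add_node("ROOT")
    for db, tables in databases.items():
        G.add_edge("ROOT", db, etype=1)                  # type-1: root -> database
        for table, fk_targets in tables.items():
            G.add_edge(db, f"{db}.{table}", etype=2)     # type-2: database -> table
            for ref in fk_targets:                       # type-3: foreign-key links
                G.add_edge(f"{db}.{table}", f"{db}.{ref}", etype=3)
    return G

def random_walk_traversals(G: nx.Graph, length: int = 3, n_walks: int = 100) -> list:
    """Each deduplicated walk starting at ROOT is one traversal to be serialized."""
    walks = set()
    for _ in range(n_walks):
        node, walk = "ROOT", ["ROOT"]
        for _ in range(length):
            neighbours = [n for n in G.neighbors(node) if n not in walk]
            if not neighbours:
                break
            node = random.choice(neighbours)
            walk.append(node)
        walks.add(tuple(walk))
    return [list(w) for w in walks]

# Example (hypothetical schema):
# dbs = {"school": {"Achievements": ["Students"], "Student_Loans": ["Students"], "Students": []}}
# traversals = random_walk_traversals(build_schema_graph(dbs))
```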
Table 5: Comparison of retrieved queries and schemas across the different retrieval methods.

Question | Gold Schema | Simple Retrieval: Synthetic Question | Schema (R@1) | ICRL Augmented Retrieval: Synthetic Question | Schema (R@1)