
A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability

Aiwei Liu1, Xuming Hu1, Lijie Wen1, Philip S. Yu1,2
1 Tsinghua University
2 University of Illinois at Chicago
1 {liuaw20, hxm19}@mails.tsinghua.edu.cn
1 [email protected]  2 [email protected]

arXiv:2303.13547v1 [cs.CL] 12 Mar 2023

Abstract

This paper presents the first comprehensive analysis of ChatGPT's Text-to-SQL ability. Given the recent emergence of the large-scale conversational language model ChatGPT and its impressive capabilities in both conversation and code generation, we sought to evaluate its Text-to-SQL performance. We conducted experiments on 12 benchmark datasets with different languages, settings, or scenarios, and the results demonstrate that ChatGPT has strong Text-to-SQL abilities. Although there is still a gap from the current state-of-the-art (SOTA) model performance, considering that the experiments were conducted in a zero-shot scenario, ChatGPT's performance is still impressive. Notably, in the ADVETA (RPL) scenario, zero-shot ChatGPT even outperforms the SOTA model that requires fine-tuning on the Spider dataset by 4.1%, demonstrating its potential for practical applications. To support further research in related fields, we have made the data generated by ChatGPT publicly available at https://github.com/THU-BPM/chatgpt-sql.

1 Introduction

With the increasing attention given to large-scale language models, they have become an essential component of natural language processing. As the size of pre-trained models grows, their usage is also gradually changing. Unlike models such as BERT (Devlin et al., 2018) and T5 (Raffel et al., 2020), which require fine-tuning with a small amount of data, models such as GPT-3 (Brown et al., 2020) require prompt design to generate target outputs. The recent ChatGPT model (https://chat.openai.com/), which employs Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017), simplifies prompt design, enabling better use of the zero-shot ability of large-scale pre-trained models in a conversational way. Based on this, many works have begun to analyze the zero-shot ability of ChatGPT in various natural language processing tasks, such as information extraction (Wei et al., 2023), text summarization (Wang et al., 2023), and mathematical abilities (Frieder et al., 2023). Given ChatGPT's strong ability in code generation, and the fact that code generation models usually require a large amount of annotated data to produce good results, a zero-shot code generation model is very important. This paper conducts the first comprehensive evaluation of ChatGPT's zero-shot performance on a challenging code generation task: Text-to-SQL.

The Text-to-SQL task involves converting user input text into SQL statements that can be executed on a database, allowing non-expert users to better access the contents of a database. Designing Text-to-SQL models is typically challenging because they must work across different databases and handle a wide variety of user input texts and database structures. Due to the complexity of the Text-to-SQL task, a comprehensive evaluation of its performance requires considering a variety of scenarios in addition to the classic Spider dataset (Yu et al., 2018). For example, Spider-SYN (Gan et al., 2021a) focuses on scenarios where the data schema mentioned in the user input is synonymous with the database schema, Spider-DK (Gan et al., 2021b) considers scenarios where the input question contains additional knowledge, Spider-CG (Gan et al., 2022) emphasizes the compositional generalization ability of models, and ADVETA (Pi et al., 2022) considers scenarios where column names in the database have been modified. Additionally, to better reflect real-world scenarios, SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a) incorporate multi-turn interaction between the user and the system. Finally, to evaluate models' multilingual capabilities, CSpider (Min et al., 2019) and DuSQL (Wang et al., 2020) evaluate Text-to-SQL performance in Chinese.

During our experiments, we evaluate the ability
of ChatGPT on 12 different Text-to-SQL benchmark datasets. Based on the experimental results, we draw the following observations:

1. Compared to the current state-of-the-art (SOTA) model that uses the complete training data, ChatGPT, without using any task-specific training data, performs only 14% worse. This already demonstrates that ChatGPT is a strong zero-shot Text-to-SQL converter.

2. The robustness of ChatGPT in generating SQL statements is very strong: the performance gap between ChatGPT and the SOTA models is only 7.8% on some robustness settings of the Spider dataset, which is lower than the 14% gap on the standard Spider dataset.

3. In the ADVETA (Pi et al., 2022) scenario, where the column names in the database are adversarially modified, ChatGPT's performance even surpasses that of the current SOTA models by 4.1%.

4. The exact match score of the SQL generated by ChatGPT is very low, because there are many different ways to express SQL queries with the same purpose. We therefore mainly use execution accuracy as the evaluation metric.

Overall, our experiments demonstrate that ChatGPT has strong Text-to-SQL capabilities and robustness, and it outperforms SOTA models in certain scenarios.

2 Method

To enable ChatGPT to generate accurate SQL outputs, we used the prompts shown in Figure 1. To ensure a fair demonstration of ChatGPT's Text-to-SQL capabilities, we directly adopted the Text-to-SQL prompt used on the OpenAI demo website (https://platform.openai.com/examples/default-sql-translate) without conducting further prompt exploration.

The upper half of Figure 1 shows the prompt for a single-turn Text-to-SQL scenario, where only the database and question information are required in the prompt. Meanwhile, to facilitate further evaluation, we emphasize in the prompt that the generated SQL statements must be executable in an SQLite database. The lower half of Figure 1 shows the prompt for a multi-turn Text-to-SQL scenario, where the prompt for the first interaction is the same as in the single-turn scenario, and subsequent interactions only require the new questions.

### Complete sqlite SQL query only and with no explanation
### Sqlite SQL tables, with their properties:
#
# AREA_CODE_STATE(area_code,state);
# CONTESTANTS(contestant_number,contestant_name);
# VOTES(vote_id,phone_number,state,contestant_number,created).
#
### Return the names of the contestants whose names contain the substring 'Al'.
SELECT

SELECT contestant_name from contestants where contestant_name like %al%

### Complete sqlite SQL query only and with no explanation
### Sqlite SQL tables, with their properties:
#
# airlines(uid,Airline,Abbreviation,Country);
# airports(City,AirportCode,AirportName,Country,CountryAbbrev);
# flights(Airline,FlightNo,SourceAirport,DestAirport).
#
### What are all the abbreviations?
SELECT

SELECT * FROM AIRLINES

### What is the abbreviation for Jetblue Airways?
SELECT

SELECT * FROM AIRLINES WHERE Airline = "JetBlue Airways"

### What is the country corresponding it?
SELECT

SELECT Country FROM AIRLINES WHERE Airline = "JetBlue Airways"

Figure 1: Example prompts for Text-to-SQL using ChatGPT. The prompt at the top is for the single-turn scenario, while the one below is for multi-turn scenarios, where only the new questions are added in each interaction.

3 Experiment

3.1 Experiment Setup

Datasets. We conduct extensive experiments on twelve public benchmark datasets as follows: (1) Spider (Yu et al., 2018) is a large-scale cross-domain Text-to-SQL benchmark. It contains 8659 training samples across 146 databases and 1034 evaluation samples across 20 databases. (2) Spider-SYN (Gan et al., 2021a) is a challenging variant of the Spider evaluation dataset, constructed by manually modifying natural language questions with synonym substitutions. (3) Spider-DK (Gan et al., 2021b) is a human-curated dataset based on Spider, which samples 535 question-SQL pairs across 10 databases from the Spider development set and modifies them to incorporate domain knowledge. (4) Spider-Realistic (Deng et al., 2020) is a new evaluation set based on the Spider dev set with explicit mentions of column names removed, which contains 508 samples.
                      |        Spider         |      Spider-SYN        |    Spider-Realistic
Methods / Datasets    |  VA    EX        TS   |  VA    EX         TS   |  VA    EX         TS
T5-3B + PICARD        |  98.4  79.3      69.4 |  98.2  69.8       61.8 |  97.1  71.4       61.7
RASAT + PICARD        |  98.8  80.5      70.3 |  98.3  70.7       62.4 |  97.4  71.9       62.6
RESDSQL-3B + NatSQL   |  99.1  84.1      73.5 |  98.8  76.9       66.8 |  98.4  81.9       70.1
ChatGPT               |  97.7  70.1(14↓) 60.1 |  96.2  58.6(18.3↓) 48.5 |  96.8  63.4(18.5↓) 49.2

Table 1: Comparison of the performance of ChatGPT and other models on the Spider, Spider-SYN, and Spider-Realistic datasets.
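The single-turn prompt at the top of Figure 1 can be assembled mechanically from a schema and a question. Below is a minimal sketch of that format; the helper name `build_prompt` and the dict-based schema representation are ours for illustration, not from the paper:

```python
def build_prompt(tables: dict, question: str) -> str:
    """Assemble a Figure-1-style single-turn Text-to-SQL prompt."""
    lines = [
        "### Complete sqlite SQL query only and with no explanation",
        "### Sqlite SQL tables, with their properties:",
        "#",
    ]
    # One "# table(col1,col2,...)" line per table; in Figure 1 the last
    # table line ends with "." and the others with ";".
    names = list(tables)
    for i, name in enumerate(names):
        sep = "." if i == len(names) - 1 else ";"
        lines.append(f"# {name}({','.join(tables[name])}){sep}")
    lines.append("#")
    lines.append(f"### {question}")
    lines.append("SELECT")  # the model continues the query from this cue
    return "\n".join(lines)

schema = {
    "CONTESTANTS": ["contestant_number", "contestant_name"],
    "VOTES": ["vote_id", "phone_number", "state", "contestant_number", "created"],
}
prompt = build_prompt(
    schema,
    "Return the names of the contestants whose names contain the substring 'Al'.",
)
```

The trailing `SELECT` cue mirrors the OpenAI demo prompt, which expects the model to continue the SQL query from that token.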

                      |      Spider-DK        |      ADVETA(RPL)       |     ADVETA(ADD)
Methods / Datasets    |  VA    EX        TS   |  VA    EX         TS   |  VA    EX        TS
T5-3B + PICARD        |  97.8  62.5      -    |  92.7  50.6       -    |  97.2  69.4      -
RASAT + PICARD        |  98.5  63.9      -    |  92.9  51.5       -    |  97.4  70.7      -
RESDSQL-3B + NatSQL   |  98.8  66.0      -    |  93.9  54.4       -    |  97.9  71.9      -
ChatGPT               |  96.4  62.6(3.4↓) -   |  91.4  58.5(4.1↑) -    |  93.1  68.1(3.8↓) -

Table 2: Performance of different methods on the Spider-DK, ADVETA(RPL), and ADVETA(ADD) benchmark datasets.
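The multi-turn scheme in the lower half of Figure 1 only appends the new question (plus the `SELECT` cue) to the running conversation. A minimal sketch, with the helper name and the shortened first prompt being our own illustration:

```python
def add_turn(conversation: str, new_question: str) -> str:
    """Append a follow-up question in the Figure-1 multi-turn format."""
    return conversation + f"\n\n### {new_question}\nSELECT"

# First turn: full prompt (instructions + schema + question), abridged here.
first_turn = (
    "### Complete sqlite SQL query only and with no explanation\n"
    "### Sqlite SQL tables, with their properties:\n"
    "#\n"
    "# airlines(uid,Airline,Abbreviation,Country);\n"
    "#\n"
    "### What are all the abbreviations?\n"
    "SELECT"
)
# Later turns: only the new question is added.
conversation = add_turn(first_turn, "What is the abbreviation for Jetblue Airways?")
```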

(5) Spider-CG(SUB) and Spider-CG(APP) (Gan et al., 2022) are two evaluation datasets that measure the compositional generalization of models; they are constructed by substituting sub-sentences between different examples and by appending a sub-sentence to another sentence, respectively. (6) ADVETA(RPL) and ADVETA(ADD) (Pi et al., 2022) are two challenging test datasets for the Spider dataset, composed of adversarial replacements of column names and additions of new column names, respectively. (7) CSpider (Min et al., 2019) is constructed by translating Spider into Chinese and is the same size as the original Spider dataset. (8) DuSQL (Wang et al., 2020) is a larger-scale Chinese Text-to-SQL dataset with 23,797 question/SQL pairs. (9) SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a) are two multi-turn Text-to-SQL datasets with 1625 and 1007 questions in their dev sets, respectively.

Evaluation Metrics. We mainly adopt three evaluation metrics: valid SQL (VA), execution accuracy (EX), and test-suite accuracy (TS). Valid SQL (VA) is the proportion of SQL statements that can be executed successfully. Execution accuracy (EX) is the proportion of examples whose execution results match those of the standard SQL statements. Test-suite accuracy (TS) (Zhong et al., 2020) achieves high code coverage from a distilled test suite of the database and is also execution-based. Note that we do not use the mainstream exact match accuracy, as SQL queries that achieve the same goal can often be expressed in different ways, making it difficult for zero-shot ChatGPT to achieve high exact match accuracy.

Baselines. Due to our exclusive reliance on execution-based evaluation, we did not employ baselines such as RatSQL (Wang et al., 2019) and LGESQL (Cao et al., 2021), which generate only SQL skeletons without generating values. Instead, we primarily utilize three baselines: (1) PICARD (Scholak et al., 2021) is a method for constraining the auto-regressive decoders of language models through incremental parsing. (2) RASAT (Qi et al., 2022) introduces relation-aware self-attention into Transformer models and also utilizes constrained auto-regressive decoding. (3) RESDSQL (Li et al., 2023) proposes a ranking-enhanced encoding and skeleton-aware decoding framework to decouple schema linking from skeleton parsing. Among these, PICARD and RASAT are based on the T5-3B (Raffel et al., 2020) model.

3.2 Main Experiment

Evaluation on the Spider Dataset. In Table 1, we present a comparison between ChatGPT and the current state-of-the-art (SOTA) models. Overall, ChatGPT exhibits a strong Text-to-SQL ability.
                      |    Spider-CG(SUB)      |    Spider-CG(APP)
Methods / Datasets    |  VA    EX         TS   |  VA    EX         TS
T5-3B + PICARD        |  98.4  82.1       74.3 |  95.8  68.0       60.5
RASAT + PICARD        |  99.0  82.6       76.1 |  96.2  68.6       61.0
RESDSQL-3B + NatSQL   |  99.4  83.3       77.5 |  96.4  69.4       62.4
ChatGPT               |  98.3  76.6(6.7↓) 67.2 |  91.2  61.3(8.1↓) 47.9

Table 3: Performance of different methods on the Spider-CG(SUB) and Spider-CG(APP) benchmark datasets.
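The execution-based metrics VA and EX described in Section 3.1 can be sketched with Python's built-in sqlite3 module. The helper name `va_and_ex`, the toy airline table, and the order-insensitive result comparison are our own assumptions; the official Spider evaluation (and the distilled test suites behind TS) is more involved:

```python
import sqlite3

def va_and_ex(conn, predicted: str, gold: str):
    """Return (valid, execution_match) for a predicted query vs. the gold query."""
    try:
        pred_rows = conn.execute(predicted).fetchall()
    except sqlite3.Error:
        return False, False  # invalid SQL counts against both VA and EX
    gold_rows = conn.execute(gold).fetchall()
    # Order-insensitive comparison of the two result sets.
    return True, sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Toy database for illustration (not from Spider).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airlines (uid INTEGER, Airline TEXT, Abbreviation TEXT)")
conn.execute("INSERT INTO airlines VALUES (1, 'JetBlue Airways', 'JetBlue')")
va, ex = va_and_ex(
    conn,
    "SELECT Abbreviation FROM airlines WHERE Airline = 'JetBlue Airways'",
    "SELECT Abbreviation FROM airlines WHERE Airline = 'JetBlue Airways'",
)
```

Corpus-level VA and EX are then simply the fraction of examples for which the first and second flags, respectively, are true.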

Despite the 14% gap in execution accuracy compared to the current SOTA models and a 13.4% gap in test-suite accuracy, it is remarkable that ChatGPT achieved such results in a zero-shot scenario, considering that it was not fine-tuned on the Spider training set.

Evaluation on the Spider-SYN and Spider-Realistic Datasets. Table 1 also includes a comparison of ChatGPT's performance on the Spider-SYN and Spider-Realistic datasets. The main difference between these datasets and the Spider dev set is that they eliminate the explicit appearance of the database schema in the questions. Overall, although ChatGPT still performs well in these two settings, the performance gap between ChatGPT and the original SOTA models becomes slightly larger than on the Spider dataset. This suggests that the current models have already achieved sufficient robustness in these two scenarios.

Evaluation on the Spider-DK and ADVETA Datasets. In Table 2, we further compare and analyze ChatGPT's performance on Spider-DK, ADVETA(RPL), and ADVETA(ADD). We find that ChatGPT performs exceptionally well on these datasets, with very small performance gaps compared to the current SOTA models. In fact, ChatGPT outperforms all current SOTA models on ADVETA(RPL). For the Spider-DK dataset, we speculate that ChatGPT's excellent performance is due to the additional knowledge provided by its large-scale pretraining. As for scenarios such as ADVETA, where the dataset's column names undergo adversarial modifications, the poor generalization of current models may be due to the significant distribution shift from the original dataset. Overall, ChatGPT exhibits strong robustness in scenarios that require additional knowledge or in which adversarial modifications are applied to the database column names.

Evaluation on the Spider-CG Dataset. In Table 3, we further analyze ChatGPT's ability in the compositional generalization scenario. In Spider-CG(SUB), SQL substructures are replaced to form combinations that do not exist in the training set. In this scenario, ChatGPT even outperforms its results on the original Spider dev set. Even on the more challenging Spider-CG(APP) dataset, ChatGPT achieves strong performance, and the performance gap with SOTA models is relatively smaller than on the original Spider dataset. Overall, since ChatGPT is a zero-shot model, it is not as affected by compositional generalization as the SOTA models; zero-shot models have greater advantages in the compositional generalization setting.

                      |     SParC     |     CoSQL
Methods / Datasets    |  VA    EX     |  VA    EX
T5-3B + PICARD        |  -     -      |  97.5  64.7
RASAT + PICARD        |  98.4  74.0   |  97.8  66.3
ChatGPT               |  97.3  63.1   |  95.8  60.7

Table 4: The performance of ChatGPT on two multi-turn Text-to-SQL datasets: SParC and CoSQL.

                      |    CSpider    |     DuSQL
Methods / Datasets    |  VA    EX     |  VA    EX
ChatGPT               |  96.0  65.1   |  82.7  53.7

Table 5: The performance of ChatGPT on two Chinese Text-to-SQL datasets: CSpider and DuSQL.

Evaluation on multi-turn Text-to-SQL scenarios. Given ChatGPT's strong contextual modeling ability, we further evaluate its performance on multi-turn Text-to-SQL scenarios: SParC and CoSQL. As shown in Table 4, ChatGPT exhibits strong multi-turn Text-to-SQL ability. Although there is still a gap compared to the current SOTA models, the gap is relatively smaller than on the single-turn Spider dataset. Meanwhile,
ChatGPT also performs relatively better on the CoSQL dataset, which has more interactions per dialogue on average; this indicates that ChatGPT's strong contextual modeling ability is very helpful for multi-turn Text-to-SQL.

Evaluation on Chinese Text-to-SQL scenarios. We further evaluate ChatGPT's Text-to-SQL ability in other languages in Table 5. The experiments are mainly conducted on two datasets, CSpider and DuSQL; only the questions are in Chinese for CSpider, while both the schema names and the questions are in Chinese for DuSQL. The results show that while ChatGPT performs well in the Chinese Text-to-SQL scenario, there is still a performance gap compared to the English scenario. Moreover, performance is even worse when the table and column names are also in Chinese, with a large number of generated SQL queries being non-executable and a lower execution accuracy. This suggests that the cross-lingual generalization ability of ChatGPT requires further improvement.

3.3 Case Study

In Table 6, we present four typical prediction errors made by ChatGPT on the Spider dev set. The first error case shows that ChatGPT tends to design JOIN statements more finely by using LEFT JOIN, but this level of granularity is not present in the original Spider dev set. The second error case arises from ChatGPT's confusion regarding the database structure: it is not clear which column the term "full name" refers to. The third error is due to the generated SQL statement lacking the correct semantic interpretation, resulting in an incorrect WHERE clause with a nested SQL statement. The fourth error is due to mistakes in copying specific values: the case sensitivity of the original value was not preserved when regenerating it.

In summary, ChatGPT's errors mostly occur in small details, and some of these issues can be addressed and improved in later stages of development, such as the first, third, and fourth cases. However, errors like the second case, which indicate a lack of understanding of the database schema, may require further improvements to the model's ability to resolve.

Question: Show the stadium name and the number of concerts in each stadium.
ChatGPT: SELECT stadium.Name, COUNT(concert.concert_ID) FROM stadium LEFT JOIN concert ON stadium.Stadium_ID = concert.Stadium_ID GROUP BY stadium.Name;
Gold: SELECT T2.name, count(*) FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id GROUP BY T1.stadium_id

Question: How many car models were produced by the maker with full name American Motor Company?
ChatGPT: SELECT COUNT(*) FROM model_list WHERE Maker = 'American Motor Company'
Gold: SELECT count(*) FROM CAR_MAKERS AS T1 JOIN model_list AS T2 ON T1.Id = T2.Maker WHERE T1.FullName = 'American Motor Company';

Question: How many cars have a larger accelerate than the car with the largest horsepower?
ChatGPT: SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT MAX(Horsepower) FROM cars_data)
Gold: SELECT COUNT(*) FROM CARS_DATA WHERE Accelerate > (SELECT Accelerate FROM CARS_DATA ORDER BY Horsepower DESC LIMIT 1);

Question: What is the abbreviation of Airline "JetBlue Airways"?
ChatGPT: SELECT Abbreviation FROM airlines WHERE Airline = 'Jetblue Airways';
Gold: SELECT Abbreviation FROM AIRLINES WHERE Airline = "JetBlue Airways";

Table 6: Case study: four cases of incorrect predictions generated by ChatGPT on the Spider development set.

4 Related Work

Text-to-SQL is an important semantic parsing task that converts natural language questions posed by users into SQL statements that can be executed on a database. On the classic Spider dataset (Yu et al., 2018), classic works such as RatSQL (Wang et al., 2019) and LGESQL (Cao et al., 2021) have achieved excellent results. Since Text-to-SQL is a very complex task involving both user input questions and database structure, the robustness of the model is crucial. To explore this issue, Gan et al. (2021a) proposed the Spider-SYN dataset to evaluate the robustness of models under synonym substitution. Some works, such as Proton (Wang et al., 2022) and ISESL-SQL (Liu et al., 2022), are also devoted to improving the robustness of models in this scenario. Meanwhile, many works explore the robustness of the Text-to-SQL task in other scenarios. The Spider-DK dataset (Gan et al., 2021b) evaluates the robustness of models in scenarios requiring additional
knowledge. The Spider-Realistic dataset (Deng et al., 2020) removes the explicit appearance of schema information from user questions, thereby increasing the difficulty of the original task. The Spider-CG dataset (Gan et al., 2022) evaluates the robustness of models in compositional generalization scenarios. The ADVETA dataset (Pi et al., 2022) evaluates the robustness of models in scenarios involving adversarial modifications of database table information. In addition, to verify the robustness of models in cross-lingual scenarios, CSpider (Min et al., 2019) and DuSQL (Wang et al., 2020) have been proposed to evaluate models in the Chinese language. To evaluate the performance of Text-to-SQL in more realistic scenarios, SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a) have been proposed to evaluate multi-turn Text-to-SQL. Models such as STAR (Cai et al., 2022) and CQR-SQL (Xiao et al., 2022) have also achieved good results in this scenario.

Currently, several methods have attempted to exploit large-scale language models for Text-to-SQL. PICARD (Scholak et al., 2021) and RASAT (Qi et al., 2022) utilize the T5-3B model but still require training data for fine-tuning. Rajkumar et al. (2022) investigated the Text-to-SQL capabilities of the GPT-3 model in a zero-shot setting. Cheng et al. (2022) proposed the BINDER model based on GPT-3 Codex, which has similar Text-to-SQL generation capabilities but needs in-context exemplar annotations. However, these works do not provide a comprehensive evaluation of Text-to-SQL and are limited to a few datasets without other robustness settings. In this work, we are the first to evaluate the comprehensive Text-to-SQL capabilities of ChatGPT.

5 Conclusion

In this work, we conducted a comprehensive analysis of ChatGPT's zero-shot ability in Text-to-SQL. We found that even without using any training data, ChatGPT still has strong Text-to-SQL ability, although there is still some gap compared to the current SOTA models. Additionally, ChatGPT demonstrated strong robustness, performing relatively better on most robustness benchmarks and even surpassing the current SOTA models on the ADVETA benchmark. Although this paper has made some findings, we only used a common prompt to evaluate ChatGPT's ability; in future work, better prompts could be designed to explore ChatGPT's Text-to-SQL ability.

6 Future Work

In future work, we will primarily consider the following two directions to further explore ChatGPT's capabilities in the Text-to-SQL task. Firstly, we will conduct more interactions with ChatGPT to address the issue of generating non-executable SQL statements; for example, ChatGPT can engage in multi-turn dialogues in which the database's error messages are provided, to further ensure the validity of the generated SQL statements. Secondly, we will add more highly correlated in-context examples to the prompt to enhance ChatGPT's Text-to-SQL generation ability.

References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Zefeng Cai, Xiangyu Li, Binyuan Hui, Min Yang, Bowen Li, Binhua Li, Zheng Cao, Weijie Li, Fei Huang, Luo Si, et al. 2022. STAR: SQL guided pre-training for context-dependent text-to-SQL parsing. arXiv preprint arXiv:2210.11888.

Ruisheng Cao, Lu Chen, Zhi Chen, Yanbin Zhao, Su Zhu, and Kai Yu. 2021. LGESQL: Line graph enhanced text-to-SQL model with mixed local and non-local relations. arXiv preprint arXiv:2106.01093.

Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, et al. 2022. Binding language models in symbolic languages. arXiv preprint arXiv:2210.02875.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

Xiang Deng, Ahmed Hassan Awadallah, Christopher Meek, Oleksandr Polozov, Huan Sun, and Matthew Richardson. 2020. Structure-grounded pretraining for text-to-SQL. arXiv preprint arXiv:2010.12773.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and Julius Berner. 2023. Mathematical capabilities of ChatGPT. arXiv preprint arXiv:2301.13867.

Yujian Gan, Xinyun Chen, Qiuping Huang, and Matthew Purver. 2022. Measuring and improving compositional generalization in text-to-SQL via component alignment.

Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver, John R Woodward, Jinxia Xie, and Pengsheng Huang. 2021a. Towards robustness of text-to-SQL models against synonym substitution. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2505–2515.

Yujian Gan, Xinyun Chen, and Matthew Purver. 2021b. Exploring underexplored limitations of cross-domain text-to-SQL generalization. arXiv preprint arXiv:2109.05157.

Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. 2023. Decoupling the skeleton parsing and schema linking for text-to-SQL. arXiv preprint arXiv:2302.05965.

Aiwei Liu, Xuming Hu, Li Lin, and Lijie Wen. 2022. Semantic enhanced text-to-SQL parsing via iteratively learning schema linking graph. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1021–1030.

Qingkai Min, Yuefeng Shi, and Yue Zhang. 2019. A pilot study for Chinese SQL semantic parsing. arXiv preprint arXiv:1909.13293.

Xinyu Pi, Bing Wang, Yan Gao, Jiaqi Guo, Zhoujun Li, and Jian-Guang Lou. 2022. Towards robustness of text-to-SQL models against natural and realistic adversarial table perturbation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2007–2022, Dublin, Ireland. Association for Computational Linguistics.

Jiexing Qi, Jingyao Tang, Ziwei He, Xiangpeng Wan, Chenghu Zhou, Xinbing Wang, Quanshi Zhang, and Zhouhan Lin. 2022. RASAT: Integrating relational structures into pretrained seq2seq model for text-to-SQL. arXiv preprint arXiv:2205.06983.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-SQL capabilities of large language models. arXiv preprint arXiv:2204.00498.

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv preprint arXiv:2109.05093.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. arXiv preprint arXiv:1911.04942.

Jiaan Wang, Yunlong Liang, Fandong Meng, Zhixu Li, Jianfeng Qu, and Jie Zhou. 2023. Cross-lingual summarization via ChatGPT. arXiv preprint arXiv:2302.14229.

Lihan Wang, Bowen Qin, Binyuan Hui, Bowen Li, Min Yang, Bailin Wang, Binhua Li, Jian Sun, Fei Huang, Luo Si, et al. 2022. Proton: Probing schema linking information from pre-trained language models for text-to-SQL parsing. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1889–1898.

Lijie Wang, Ao Zhang, Kun Wu, Ke Sun, Zhenghua Li, Hua Wu, Min Zhang, and Haifeng Wang. 2020. DuSQL: A large-scale and pragmatic Chinese text-to-SQL dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6923–6935.

Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, et al. 2023. Zero-shot information extraction via chatting with ChatGPT. arXiv preprint arXiv:2302.10205.

Dongling Xiao, Linzheng Chai, Qian-Wen Zhang, Zhao Yan, Zhoujun Li, and Yunbo Cao. 2022. CQR-SQL: Conversational question reformulation enhanced context-dependent text-to-SQL parsers. arXiv preprint arXiv:2205.07686.

Tao Yu, Rui Zhang, He Yang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, et al. 2019a. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. arXiv preprint arXiv:1909.05378.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.

Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, et al. 2019b. SParC: Cross-domain semantic parsing in context. arXiv preprint arXiv:1906.02285.

Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic evaluation for text-to-SQL with distilled test suites. arXiv preprint arXiv:2010.02840.
