A Comprehensive Evaluation of ChatGPT's Zero-shot Text-to-SQL Capability
Aiwei Liu1, Xuming Hu1, Lijie Wen1, Philip S. Yu1,2
1 Tsinghua University
2 University of Illinois at Chicago
1 {liuaw20, hxm19}@mails.tsinghua.edu.cn
1 [email protected]  2 [email protected]
(SOTA) model that uses the complete training data, ChatGPT performs only 14% worse without using any task-specific training data. This already demonstrates that ChatGPT is a strong zero-shot Text-to-SQL converter.

2. The robustness of ChatGPT in generating SQL statements is very strong: the performance gap between ChatGPT and the SOTA models is only 7.8% under some robustness settings of the Spider dataset, lower than the 14% gap on the standard Spider dataset.

3. In the ADVETA (Pi et al., 2022) scenario, where the column names in the database are adversarially modified, ChatGPT's performance even surpasses that of the current SOTA models by 4.1%.

4. The Exact Match accuracy of the SQL generated by ChatGPT is very low, because the same query intent can be expressed by many different SQL statements. Therefore, we mainly use execution accuracy as the evaluation metric.

Overall, our experiments demonstrate that ChatGPT has strong Text-to-SQL capabilities and robustness, and it outperforms SOTA models in certain scenarios.

Single-turn prompt:

### Complete sqlite SQL query only and with no explanation
### Sqlite SQL tables, with their properties:
#
# airlines(uid,Airline,Abbreviation,Country);
# airports(City,AirportCode,AirportName,Country,CountryAbbrev);
# flights(Airline,FlightNo,SourceAirport,DestAirport).
#
### What are all the abbreviations?
SELECT

Multi-turn interaction (the schema header of the first prompt is identical to the single-turn prompt; ChatGPT's reply follows each question):

### What are all the abbreviations?
SELECT
SELECT * FROM AIRLINES
### What is the abbreviation for Jetblue Airways?
SELECT
SELECT * FROM AIRLINES WHERE Airline = "JetBlue Airways"
### What is the country corresponding it?
SELECT
SELECT Country FROM AIRLINES WHERE Airline = "JetBlue Airways"

Figure 1: Example prompts for Text-to-SQL using ChatGPT. The prompt at the top is for the single-turn scenario, while the one below is for multi-turn scenarios, where only new questions are added in each interaction.

In the multi-turn scenario, the prompt for the first interaction is the same as that in the single-turn scenario, and for subsequent interactions, only the new questions are required.
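To make the prompt format concrete, the following is a minimal Python sketch of how such prompts could be assembled. It is our own illustration rather than code from this work: the helper names (single_turn_prompt, follow_up_prompt) and the schema dictionary are assumptions, while the serialization mirrors the format shown in Figure 1.

def single_turn_prompt(tables, question):
    """Build the single-turn prompt: task instruction, schema description, question."""
    lines = [
        "### Complete sqlite SQL query only and with no explanation",
        "### Sqlite SQL tables, with their properties:",
        "#",
    ]
    for name, columns in tables.items():
        lines.append("# {}({});".format(name, ",".join(columns)))
    lines += ["#", "### " + question, "SELECT"]
    return "\n".join(lines)

def follow_up_prompt(question):
    """In the multi-turn scenario, later interactions send only the new question."""
    return "### " + question + "\nSELECT"

schema = {
    "airlines": ["uid", "Airline", "Abbreviation", "Country"],
    "airports": ["City", "AirportCode", "AirportName", "Country", "CountryAbbrev"],
    "flights": ["Airline", "FlightNo", "SourceAirport", "DestAirport"],
}
print(single_turn_prompt(schema, "What are all the abbreviations?"))
print(follow_up_prompt("What is the abbreviation for Jetblue Airways?"))

The trailing SELECT at the end of each prompt, as in Figure 1, cues the model to continue directly with the SQL completion rather than with an explanation.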
3 Experiment

3.1 Experiment Setup

Table 1: Comparison of the performance of ChatGPT and other models on the Spider, Spider-SYN, and Spider-Realistic datasets.

Table 2: Performance of different methods on the Spider-DK, ADVETA(RPL), and ADVETA(ADD) benchmark datasets.
(5) Spider-CG(SUB) and Spider-CG(APP) (Gan et al., 2022) are two evaluation datasets that measure the compositional generalization of models; they are constructed by substituting sub-sentences between different examples and by appending a sub-sentence to another sentence, respectively. (6) ADVETA(RPL) and ADVETA(ADD) (Pi et al., 2022) are two challenging test sets for the Spider dataset, composed of adversarial replacements of column names and additions of new column names, respectively. (7) The CSpider (Min et al., 2019) dataset is constructed by translating Spider into Chinese and is the same size as the original Spider dataset. (8) DuSQL (Wang et al., 2020) is a larger-scale Chinese Text-to-SQL dataset with 23,797 question/SQL pairs. (9) SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a) are two multi-turn Text-to-SQL datasets with 1,625 and 1,007 questions in their dev sets, respectively.
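Most of these benchmarks are distributed in a Spider-style format. As an illustrative sketch (not code from this work), the snippet below assumes the standard Spider release layout, namely a dev.json file with db_id, question, and query fields plus per-database SQLite files under database/<db_id>/, and yields (db_path, question, gold_sql) triples; the multi-turn datasets SParC and CoSQL use a different, interaction-based format not covered here.

import json
import os

def load_spider_split(data_dir, split_file="dev.json"):
    """Yield (db_path, question, gold_sql) triples from a Spider-style JSON split."""
    with open(os.path.join(data_dir, split_file), encoding="utf-8") as f:
        examples = json.load(f)
    for ex in examples:
        db_id = ex["db_id"]
        db_path = os.path.join(data_dir, "database", db_id, db_id + ".sqlite")
        yield db_path, ex["question"], ex["query"]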
Evaluation Metrics. We mainly adopt three evaluation metrics: valid SQL (VA), execution accuracy (EX), and test-suite accuracy (TS). Valid SQL (VA) is the proportion of predicted SQL statements that can be executed successfully. Execution accuracy (EX) is the proportion of examples whose execution results match those of the gold SQL statements. Test-suite accuracy (TS) (Zhong et al., 2020) is also execution-based and achieves high code coverage by executing queries against a distilled test suite of databases. Note that we do not use the mainstream exact match accuracy, as SQL queries that achieve the same goal can often be expressed in different ways, making it difficult for zero-shot ChatGPT to achieve high exact match accuracy.
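As an illustration of these execution-based metrics, the following minimal Python sketch computes VA and EX by running both the predicted and the gold SQL against the corresponding SQLite database. It is an assumption of how such a check could be implemented, not the official Spider test-suite evaluation of Zhong et al. (2020); the function names and the (db_path, predicted_sql, gold_sql) input format are our own.

import sqlite3

def run_query(db_path, sql):
    """Execute a query on a SQLite database; return its rows, or None if it fails."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()
    except sqlite3.Error:
        return None

def evaluate(examples):
    """examples: iterable of (db_path, predicted_sql, gold_sql) triples."""
    total = valid = correct = 0
    for db_path, pred_sql, gold_sql in examples:
        total += 1
        pred_rows = run_query(db_path, pred_sql)
        gold_rows = run_query(db_path, gold_sql)
        if pred_rows is not None:
            valid += 1      # VA: the prediction executes without error
        if pred_rows is not None and gold_rows is not None and \
                sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr):
            correct += 1    # EX: execution results match the gold query (order-insensitive)
    return valid / total, correct / total   # (VA, EX)

For example, SELECT Country FROM airlines WHERE Airline = "JetBlue Airways" and SELECT T1.Country FROM airlines AS T1 WHERE T1.Airline = "JetBlue Airways" would fail exact match despite returning identical results, which is exactly the behavior that motivates execution-based evaluation here.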
Baselines. Due to our exclusive reliance on execution-based evaluation, we did not employ baselines such as RAT-SQL (Wang et al., 2019) and LGESQL (Cao et al., 2021), which generate only SQL skeletons without generating values. Instead, we primarily utilized three baselines: (1) PICARD (Scholak et al., 2021) constrains the auto-regressive decoders of language models through incremental parsing. (2) RASAT (Qi et al., 2022) introduces relation-aware self-attention into Transformer models and also uses constrained auto-regressive decoding. (3) RESDSQL (Li et al., 2023) proposes a ranking-enhanced encoding and skeleton-aware decoding framework to decouple schema linking from skeleton parsing. Among these, PICARD and RASAT are based on the T5-3B (Raffel et al., 2020) model.

3.2 Main Experiment

Evaluation on Spider Dataset. In Table 1, we present a comparison between ChatGPT and the current state-of-the-art (SOTA) models. Overall, ChatGPT exhibits a strong Text-to-SQL ability. Despite the 14% gap in execution accuracy compared to the current SOTA models and a 13.4%
Methods / Datasets        Spider-CG(SUB)              Spider-CG(APP)
                          VA    EX          TS        VA    EX          TS
T5-3B + PICARD            98.4  82.1        74.3      95.8  68.0        60.5
RASAT + PICARD            99.0  82.6        76.1      96.2  68.6        61.0
RESDSQL-3B + NatSQL       99.4  83.3        77.5      96.4  69.4        62.4
ChatGPT                   98.3  76.6 (6.7↓) 67.2      91.2  61.3 (8.1↓) 47.9

Table 3: Performance of different methods on the Spider-CG(SUB) and Spider-CG(APP) benchmark datasets.