
DuSQL: A Large-Scale and Pragmatic Chinese Text-to-SQL Dataset

Lijie Wang1, Ao Zhang1, Kun Wu2, Ke Sun1, Zhenghua Li2,
Hua Wu1, Min Zhang2, Haifeng Wang1
1. Baidu Inc, Beijing, China
2. Institute of Artificial Intelligence, School of Computer Science and Technology,
Soochow University, Suzhou, China
{wanglijie,zhangao,sunke,wu_hua,wanghaifeng}@baidu.com
[email protected]; {zhli13,minzhang}@suda.edu.cn

Abstract
Due to the lack of labeled data, previous research on text-to-SQL parsing mainly focuses on English. Representative English datasets include ATIS, WikiSQL, Spider, etc. This paper presents DuSQL, a large-scale and pragmatic Chinese dataset for the cross-domain text-to-SQL task, containing 200 databases, 813 tables, and 23,797 question/SQL pairs. Our new dataset has three major characteristics. First, by manually analyzing questions from several representative applications, we try to figure out the true distribution of SQL queries in real-life needs. Second, DuSQL contains a considerable proportion of SQL queries involving row or column calculations, motivated by our analysis on the SQL query distributions. Finally, we adopt an effective data construction framework via human-computer collaboration. The basic idea is automatically generating SQL queries based on the SQL grammar and constrained by the given database. This paper describes in detail the construction process and data statistics of DuSQL. Moreover, we present and compare the performance of several open-source text-to-SQL parsers with minor modifications to accommodate Chinese, including a simple yet effective extension to IRNet for handling calculation SQL queries.

Figure 1: Illustration of the text-to-SQL task.

1 Introduction

In the past few decades, a large amount of research has focused on searching answers from unstructured texts given natural questions, which is also known as the question answering (QA) task (Burke et al., 1997; Kwok et al., 2001; Allam and Haggag, 2012; Nguyen et al., 2016). However, a lot of high-quality knowledge or data are actually stored in databases in the real world. It is thus extremely useful to allow ordinary users to directly interact with databases via natural questions. To meet this need, researchers have proposed the text-to-SQL task with released English datasets for model training and evaluation, such as ATIS (Iyer et al., 2017), GeoQuery (Popescu et al., 2003), WikiSQL (Zhong et al., 2017), and Spider (Yu et al., 2018b).

Formally, given a natural language (NL) question and a relational database, the text-to-SQL task aims to produce a legal and executable SQL query that leads directly to the correct answer, as depicted in Figure 1. A database is composed of multiple tables and denoted as DB = {T1, T2, ..., Tn}. A table is composed of multiple columns and denoted as Ti = {col1, col2, ..., colm}. Tables are usually linked with each other by foreign keys.

The earliest datasets include ATIS (Iyer et al., 2017), GeoQuery (Popescu et al., 2003), Restaurants (Tang and Mooney, 2001), Academic (Li and Jagadish, 2014), etc. Each dataset only has a single database containing a certain number of tables. All question/SQL pairs of the train/dev/test sets are generated against the same database. Many interesting approaches have been proposed to handle those datasets (Iyer et al., 2017; Yaghmazadeh et al., 2017; Finegan-Dollak et al., 2018).

However, real-world applications usually in-

volve more than one database, and require the
model to be able to generalize to and handle unseen
databases during evaluation. To accommodate this
need, the WikiSQL dataset is then released by
Zhong et al. (2017). It consists of 80,654 ques-
tion/SQL pairs for 24,241 single-table databases.
They propose a new data split setting to ensure that
databases in train/dev/test do not overlap. However,
they focus on very simple SQL queries containing
one SELECT statement with one WHERE clause.
In addition, Sun et al. (2020) released TableQA, a
Chinese dataset similar to the WikiSQL dataset.

Yu et al. (2018b) released the more challenging Spider dataset, consisting of 10,181 question/SQL pairs against 200 multi-table databases. Compared with WikiSQL and TableQA, Spider is much more complex for two reasons: 1) the need of selecting relevant tables; 2) many nested queries and advanced SQL clauses like GROUP BY and ORDER BY.

As far as we know, most existing datasets are constructed for English. Another issue is that they do not refer to the question distribution in real-world applications during data construction. Take Spider as an example. Given a database, annotators are asked to write many SQL queries from scratch. The only requirement is that the SQL queries have to cover a list of SQL clauses and nested queries. Meanwhile, the annotators write NL questions corresponding to the SQL queries. In particular, all these datasets contain very few questions involving calculations between rows or columns, which we find are very common in real applications.

This paper presents DuSQL, a large-scale and pragmatic Chinese text-to-SQL dataset, containing 200 databases, 813 tables, and 23,797 question/SQL pairs. Specifically, our contributions are summarized as follows.

• In order to determine a more realistic distribution of SQL queries, we collect user questions from three representative database-oriented applications and perform manual analysis. In particular, we find that a considerable proportion of questions require row/column calculations, which are not included in existing datasets.

• We adopt an effective data construction framework via human-computer collaboration. The basic idea is automatically generating SQL queries based on the SQL grammar and constrained by the given database. For each SQL query, we first generate a pseudo question by traversing it in the execution order and then ask annotators to paraphrase it into a NL question.

• We conduct experiments on DuSQL using three open-source parsing models. In particular, we extend the state-of-the-art IRNet (Guo et al., 2019) model to accommodate the characteristics of DuSQL. Results and analysis show that DuSQL is a very challenging dataset. We will release our data at https://github.com/luge-ai/luge-ai/tree/master/semantic-parsing.

Figure 2: The SQL query distributions of the three applications. Please kindly note that a query may belong to multiple types.

2 SQL Query Distribution

As far as we know, existing text-to-SQL datasets mainly consider the complexity of SQL syntax when creating SQL queries. For example, WikiSQL has only simple SQL queries containing SELECT and WHERE clauses. Spider covers 15 SQL clauses including SELECT, WHERE, ORDER BY, GROUP BY, etc., and allows nested queries.

However, to build a pragmatic text-to-SQL system that allows ordinary users to directly interact with databases via NL questions, it is very important to know the SQL query distribution in real-world applications, from the aspect of user need rather than SQL syntax. Our analysis shows that Spider mainly covers three types of SQL queries, i.e., matching, sorting, and clustering, whereas WikiSQL only has matching queries. Neither of them contains the calculation type, which we find composes a large portion of questions in certain real-world applications.
To find out the SQL query distribution in real-life applications, we consider the following three representative types of database-oriented applications, and conduct manual analysis against user questions. We ask annotators to divide user questions into five categories (see Appendix B for details), i.e., matching, sorting, clustering, calculation, and others.

Information retrieval applications. We use Baidu, the Chinese search engine, as a typical information retrieval application. Nowadays, search engines are still the most important way for web users to acquire answers. Thanks to the progress in knowledge graph research, search engines can return structured tables or even direct answers from infobox websites such as Wikipedia and Baidu Encyclopedia. From one-day Baidu search logs, we randomly select 1,000 questions for which one of the returned top-10 relevant web sites is from infobox websites. Then, we manually classify each question into the above five types.

Customer service robots. Big companies build AI robots to answer questions of customers, which usually require access to industrial databases. We provide a free trial API[1] for developers to create customer service robots. With the permission of the developers, we randomly select 1,500 questions and the corresponding databases from their created robots. These questions cover multiple domains such as banks, airlines, and communication carriers.

[1] The API is publicly available at https://ai.baidu.com/unit/v2#/innovationtec/home.

Data analysis robots. Every day, innumerable tables are generated, such as financial statements, business orders, etc. To perform data analysis over such data, companies hire professionals to write SQL queries. Obviously, it is extremely useful to build robots that allow financial experts to directly perform data analysis using NL questions. We collect 500 questions from our data analysis robot.

Figure 2 shows the query distributions of the three applications. It is obvious that calculation questions occupy a considerable proportion in all three applications. For customer service robots, users mainly try to search for information, and therefore most questions belong to the matching type. Yet, 8% of the questions require calculation SQL queries to be answered. For data analysis robots, calculation questions dominate the distribution, since users try to figure out useful clues behind the data.

To gain more insights, we further divide calculation questions into three subtypes according to the SQL syntax, i.e., row calculation, column calculation, and calculation with a constant. Figure 3 shows some examples.

Row Calculation
How much bigger is Guangzhou than Shenzhen?
SELECT a.area(km2) - b.area(km2) FROM
(SELECT area(km2) FROM T1 WHERE name = 'Guangzhou') a,
(SELECT area(km2) FROM T1 WHERE name = 'Shenzhen') b

Column Calculation
What is the population density of Hefei?
SELECT population / area(km2) FROM T1 WHERE name = 'Hefei'

Calculation with a Constant
How old is Jenny?
SELECT curdate - birthday FROM student WHERE name = 'Jenny'
How far is Beijing's population from 23 million?
SELECT 23000000 - population FROM T1 WHERE name = 'Beijing'

Figure 3: Examples in the calculation type, including questions and SQL queries. The first example of calculation with a constant is based on a database that has a "student" table with the schema of {name, birthday, height, age}. Other examples are based on the database in Figure 1.

Figure 4: The construction workflow of DuSQL.

3 Corpus Construction

Building a large-scale text-to-SQL dataset with multi-table databases is extremely challenging. First, though there are a large number of independent tables on the Internet, connections among the tables are usually unavailable. Therefore, great efforts are needed to create multi-table databases. Second, it is usually difficult to obtain NL questions against certain databases. Third, given a question and the corresponding database, we need proficient annotators who understand both the database schema and the SQL syntax to write a SQL query for the question.

Different from previous works, which usually rely on humans to create both NL questions and SQL queries (Yu et al., 2018b), we build our dataset via human-computer collaboration, as illustrated in Figure 4.
The key idea is to automatically generate SQL queries paired with pseudo questions given a database. Then the pseudo questions are paraphrased into NL questions by humans. Finally, to guarantee data quality, low-confidence SQL queries and NL questions are detected according to overlap and similarity metrics, and are further checked by humans.

3.1 Database Creation

Most mature databases used in industry are not publicly available, so we collect our databases mainly from the Internet. However, databases available on the Internet are in the form of independent tables, which need to be linked with other tables. We create databases in three steps: table acquisition, table merging, and foreign key creation.

We collect websites to crawl tables, ensuring that they cover multiple domains. As the largest Chinese encyclopedia, Baidu Baike contains more than 17 million entries across more than 200 domains. We start with all the entries in Baike as the initial sites, and extend the collection based on the reference sites in each entry page. We keep sites where tables are crawled. The final collection contains entries of Baike, annual report websites[2], vertical domain websites[3], and other websites such as community forums[4]. Table 1 shows the data distribution regarding database sources.

[2] QuestMobile, 199it, tianyancha, etc.
[3] State Statistical Bureau, China Industrial Information Network, Shopping websites, Booking websites, etc.
[4] Baidu Tieba, Newsmth, Hupu, etc.

Source                      Proportion (%)
Baike                       40.3
Vertical domain websites    31.3
Annual report               23.4
Others                      5.0

Table 1: The distribution of database sources.
To make a domain correspond to a database, we merge tables with the same schema into a new table with a new schema, e.g., tables about China cities with the schema of {population, area, ...} are merged into a new table with the schema of {termid, name, population, area, ...}, where termid is randomly generated as the primary key and name is the name of the city. Meanwhile, we add a type for each column according to the form of its values, where the column type is one of text, number, and date.
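To make this column-typing step concrete, the sketch below assigns one of the three types by inspecting a column's cell values. The regular expressions and the 80% majority threshold are illustrative assumptions rather than details reported in the paper.

import re

DATE_PAT = re.compile(r"^\d{4}([-/年]\d{1,2}([-/月]\d{1,2}日?)?)?$")
NUM_PAT = re.compile(r"^-?\d+(\.\d+)?$")

def infer_column_type(cells):
    """Assign 'date', 'number', or 'text' to a column from its cell values."""
    cells = [str(c).strip() for c in cells if str(c).strip()]
    if not cells:
        return "text"
    n_date = sum(bool(DATE_PAT.match(c)) for c in cells)
    n_num = sum(bool(NUM_PAT.match(c)) for c in cells)
    # Majority vote: a hypothetical 80% threshold decides the type.
    if n_date / len(cells) >= 0.8:
        return "date"
    if n_num / len(cells) >= 0.8:
        return "number"
    return "text"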
We create foreign keys between two tables via entity linking, e.g., a table named "Livable cities in 2019" with the schema of {city_name, ranker, ...} joins to a table named "China cities" with the schema of {term_id, name, area, ...} through the links of entities in "city_name" and "name". According to the foreign keys, all tables are split into separate graphs, each of which consists of several joined tables. We choose 200 graphs to create databases, and manually check and correct the foreign keys for each database.

Overall, we create 200 databases with 813 tables, covering about 70% of Baike entries from more than 160 domains such as movies, actors, cities, animals, and foods. Since some tables are sensitive, we use the column header of each table and populate it with randomly selected values from the original table.
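The entity-linking and graph-splitting steps can be pictured with the following sketch, which links columns whose cell values overlap and then groups tables into connected components. The 0.5 overlap threshold and the helper names are hypothetical; the paper does not specify the linking procedure beyond the example above.

from itertools import combinations

def link_tables(tables):
    """tables: dict name -> {column: set of cell values}.
    Return candidate foreign keys as (table1, col1, table2, col2)."""
    links = []
    for (n1, t1), (n2, t2) in combinations(tables.items(), 2):
        for c1, v1 in t1.items():
            for c2, v2 in t2.items():
                # Hypothetical rule: enough shared entities -> candidate link.
                if v1 and v2 and len(v1 & v2) / min(len(v1), len(v2)) >= 0.5:
                    links.append((n1, c1, n2, c2))
    return links

def split_into_graphs(tables, links):
    """Group tables into connected components according to the links."""
    parent = {n: n for n in tables}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for n1, _, n2, _ in links:
        parent[find(n1)] = find(n2)
    groups = {}
    for n in tables:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())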

6926
Figure 5: An example of SQL query generation from the grammar. We show a part of production rules (all rules
are shown in Appendix A). The leaf nodes in red are from the database.

3.2 Automatic Generation of SQL Queries

Given a database, we want to generate as many common SQL queries as possible. Both manually writing SQL queries and checking their quality take a significant amount of time. Obviously, SQL queries can be automatically generated from the grammar. We utilize production rules from the grammar to automatically generate SQL queries, instead of asking annotators to write them. According to the difficulty[5] and semantic correctness of a SQL query, we prune the rule paths in the generation. Then, we sample the generated SQL queries according to the distribution in Figure 2 and carry out the follow-up work based on them.

As illustrated in Figure 5, a SQL query can be represented as a tree using the rule sequence of {SQLs = SQL, SQL = Select Where, Select = SELECT A, Where = WHERE Conditions, ...}, all of which are production rules of the grammar. Guided by the SQL query distributions in real applications, we design production rules to ensure that all common SQL queries can be generated, e.g., the rule of {C = table.column mathop table.column} allows calculations between columns or rows. By exercising every rule of the grammar, we can generate SQL queries covering patterns of different complexity.

We consider two aspects in the automatic SQL generation: the difficulty and semantic correctness of a SQL query. To control the difficulty of the generated queries, we make some restrictions based on our analysis of real-life questions: first, a SQL query contains only one nested query; second, there are no more than three conditions in a where clause and no more than four answers in a select statement; third, a SQL query has at most one math operation; fourth, most text values are from the databases[6]. To ensure the semantic correctness of the generated query, we abide by the preconditions of each clause and expression in the generation, e.g., the expression of {A > SQL} requires that the nested SQL returns a number value. The full list of preconditions is shown in Appendix C.

Under these requirements, we generate a large number of candidate SQL queries against the 200 databases. Among them, only a tiny proportion of SQL queries are of the calculation type, since only a few columns support calculation operations. We keep all queries of the calculation type, randomly select sorting and clustering queries of the same size, and select matching queries[7] of three times the size. We make sure that these selected queries are spread across all 200 databases. Then these queries are used as input for the follow-up work.

[5] We observe that very complex queries are rare in search logs. Since our SQL queries are automatically generated, without complexity control the proportion of complex queries would dominate the space, thus deviating from the real query distribution.
[6] The text values in a SQL query are from the database to reduce the difficulty of SQL prediction. We plan to remove this restriction in the next release version of DuSQL.
[7] Including combinations of the matching type and other types, e.g., the SQL query of {SELECT ... WHERE ... ORDER BY ...} represents the combination of matching and sorting types.
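A minimal sketch of grammar-driven generation is given below: production rules are expanded recursively into a token sequence. Only a toy fragment of the grammar in Appendix A is included here, and the difficulty restrictions and preconditions described above are omitted.

import random

# A toy fragment of the grammar in Appendix A; the full rule set and the
# difficulty/precondition pruning described above are not reproduced.
GRAMMAR = {
    "SQL": [["Select"], ["Select", "Where"]],
    "Select": [["select", "A"], ["select", "A", "A"]],
    "Where": [["where", "Condition"]],
    "Condition": [["A", "op", "value"]],
    "A": [["table.column"], ["agg", "table.column"]],
}

def expand(symbol, depth=0, max_depth=6):
    """Recursively expand a non-terminal into a flat token sequence."""
    if symbol not in GRAMMAR:          # terminal symbol
        return [symbol]
    rules = GRAMMAR[symbol]
    # Prefer the shortest rule near the depth limit to keep queries simple.
    rule = random.choice(rules if depth < max_depth else rules[:1])
    tokens = []
    for s in rule:
        tokens.extend(expand(s, depth + 1, max_depth))
    return tokens

print(" ".join(expand("SQL")))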
3.3 Semi-automatic Generation of Questions

For each SQL query, we automatically generate a pseudo question to explain it. Then the pseudo questions are shown to annotators, who can understand them and paraphrase them into NL questions without looking at the databases and SQL queries.

Figure 6: An example of the pseudo question generation according to the execution order of the SQL query. The numbers in circles represent the order of execution.

We generate a pseudo question for a SQL query according to its execution order. As shown in Figure 6, the entire pseudo question of the SQL query consists of pseudo descriptions of all its clauses according to their execution order. The pseudo description of a clause consists of pseudo descriptions of all its components. We give a description

6927
for each component, e.g., list for SELECT, average
for the aggregator of avg. Appendix D shows the
descriptions for all components. To ensure that the
pseudo question is clear and reflects the meaning
of the SQL query, intermediate variables are intro-
duced to express sub-SQL queries, e.g., “v1” in
the example of Figure 6 represents the result of the
nested query and is used as a value in other expres-
sions.
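The following sketch illustrates the idea on a heavily simplified query structure: each clause is verbalized with the component descriptions of Appendix D, roughly following SQL execution order. The dict layout and helper names are ours; the real pipeline walks the full query tree and introduces intermediate variables such as "v1" for nested queries, which are omitted here.

# Component descriptions follow Appendix D; the clause order below
# (WHERE -> GROUP BY -> SELECT) approximates SQL execution order.
AGG_DESC = {"min": "minimum", "max": "maximum", "count": "the number of",
            "sum": "total", "avg": "average", "none": ""}
OP_DESC_NUM = {"=": "is equal to", ">": "more than", "<": "less than"}

def pseudo_question(sql):
    """sql: a simplified dict such as
    {"select": [("avg", "population")],
     "where": [("area", ">", "10000")],
     "group_by": "province"}"""
    parts = []
    for col, op, val in sql.get("where", []):
        parts.append(f"when {col} {OP_DESC_NUM.get(op, op)} {val}")
    if "group_by" in sql:
        parts.append(f"for each {sql['group_by']}")
    sel = ", ".join(f"{AGG_DESC[agg]} {col}".strip()
                    for agg, col in sql["select"])
    parts.append(f"list {sel}")
    return ", ".join(parts)

print(pseudo_question({"select": [("avg", "population")],
                       "where": [("area", ">", "10000")],
                       "group_by": "province"}))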
We ask two annotators[8] to reformulate the pseudo questions into NL questions[9], and filter out two kinds of questions: 1) incomprehensible ones, which are semantically unclear; 2) unnatural ones, which are not the focus of humans[10]. During the process of paraphrasing, 6.7% of the question/SQL pairs are filtered, among which 76.5% are complex queries. Then we ask other annotators to check the correctness of the reformulated questions, and find that 8% of the questions are inaccurate.

[8] They are full-time employees and familiar with the SQL language. Meanwhile, they have lots of experience in annotating QA data.
[9] Some values in SQL queries are rewritten as synonyms.
[10] E.g., "When province is Sichuan, list the total rank of these cities." for the SQL query {SELECT sum(rank) FROM T2 WHERE province = 'Sichuan'} is considered as an unnatural question.
taining matching questions. Spider and CSpider
To guarantee data quality, we automatically detect (Min et al., 2019) mainly cover matching, sort-
low-quality question/SQL pairs according to the ing, clustering and their combinations. There are
following evaluation metrics. very few questions in the calculation type, and
• Overlap. To ensure the naturalness of our all of them only need column calculations. Spi-
questions, we calculate the overlap between the der does not focus on questions that require the
pseudo question and the corresponding NL ques- common knowledge and math operation. Accord-
tion. The question with an overlap higher than ing to our analysis in Figure 2, the calculation
0.6 is considered to be of low quality. type is very common, accounting for 8% to 65%
in different applications. DuSQL, a pragmatic
• Similarity. To ensure that the question contains industry-oriented dataset, conforms to the distribu-
enough information for the SQL query, we train tion of SQL queries in real applications. Mean-
a similarity model based on question/SQL pairs. while, DuSQL is larger, twice the size of other
The question with a similarity score less than 0.8 complex datasets. DuSQL contains 200 databases,
is considered to be of low quality. covering about 70% of entries in Baike and more
In the first round, about 18% of question/SQL than 160 domains, e.g., cities, singers, movies, an-
pairs are of low quality. We ask annotators to check imals, etc. We provide content for each database.
these pairs and correct the error pairs. This process All the values of a SQL query can be found in the
iterates through the collaboration of human and database, except for numeric values. All table and
computer until the above metrics no longer chang- column names in the database are clear and self-
ing. It iterates twice in the construction of DuSQL. contained. In addition, we provide English schema
for each database, including table names and col-
3.5 Dataset Statistics umn headers.
We summarize the statistics of DuSQL and other
cross-domain datasets in Table 2, and give some 4 Benchmark Approaches
8
They are full-time employees and familiar with SQL lan- All existing text-to-SQL works focus on English
guage. Meanwhile, they have lots of experience in annotating
QA data. datasets. Considering that DuSQL is the most sim-
9
Some values in SQL queries are rewritten as synonyms. ilar with Spider, we choose the following three rep-
10
E.g., “When province is Sichuan, list the total rank of resentative publicly available parsers as our bench-
these cities.” for the SQL query {SELECT sum(rank) From
T2 WHERE province = ‘Sichuan’} is considered as an unnat- mark approaches, to understand the performance
ural question, as the total rank would not be asked by humans. of existing approaches on our new Chinese dataset.

Dataset   Size    DB      Table/DB  Matching  Sorting  Clustering  Calc-Column  Calc-Row  Calc-Constant  Others
WikiSQL   80,654  26,251  1         80,654    0        0           0            0         0              0
TableQA   49,974  5,291   1         49,974    0        0           0            0         0              0
Spider    9,693   200     5.1       6,450     863      1,059       13           0         0              1,308
CSpider   9,691   166     5.3       6,448     862      1,048       13           0         0              1,318
Ours      23,797  200     4.1       6,440     2,276    3,768       1,760        1,385     1,097          7,071

Table 2: Statistics and comparisons of all existing cross-domain text-to-SQL datasets. The statistics of Spider are based on the published data, which only contain the train and development sets. "Others" consists of combinations between the matching, sorting, and clustering types.

Seq2Seq+Copying (Zhong et al., 2017) incorporates the database schemas into the model input and uses a copying mechanism in the decoder.

SyntaxSQLNet (Yu et al., 2018a) proposes a SQL syntax tree-based network to generate SQL structures, and uses the generation path history and table-aware column attention in the decoder.

IRNet (Guo et al., 2019) designs an intermediate representation called SemQL for encoding higher-level abstraction structures than SQL, and then uses a grammar-based decoder (Yin and Neubig, 2017) to synthesize a SemQL query. At present, IRNet reports the state-of-the-art results on the Spider dataset.

Both SyntaxSQLNet and IRNet utilize a grammar derived from SQL structures to guide SQL generation and conduct experiments on the Spider dataset. However, neither of their grammars can handle calculation questions. Another major difference between our dataset and Spider is that our evaluation metric (see Section 5) also considers value prediction, since the values in a SQL query come from the corresponding question or database, both of which are available inputs to the model. Please refer to our discussion in Section 3 for details. Due to the characteristics of our dataset, all three models perform poorly on DuSQL. Therefore, we extend the IRNet model to accommodate DuSQL as follows.

Firstly, we extend the grammar of SemQL to accommodate the two characteristics of our dataset, as shown in Figure 8. The production rules in bold are added to parse calculation questions. Other production rules are modified based on the original rules to support value prediction (due to space limitations, we attach the full list of the extended grammar in Appendix F). Then we use all the n-grams of length 1-6 in the question to match database cells or number/date values to determine candidate values for the predicted SQL query. The values are used in the same way as the columns and tables in the IRNet model.

Z ::= + R R | − R R | × R R | ÷ R R
Filter ::= = A V | != A V | > A V | < A V
           | >= A V | <= A V | like A V
Superlative ::= des A V | asc A V
A ::= max MathA | min MathA | count MathA
      | sum MathA | avg MathA | none MathA
MathA ::= + A A | − A A | × A A | ÷ A A
V ::= value

Figure 8: The extended grammar for SemQL.
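The value-candidate step can be sketched as follows, using character n-grams since the questions are Chinese. How partial matches against database cells are handled is not specified in the paper, so exact membership is assumed here.

import re

def candidate_values(question, db_cells, max_n=6):
    """Collect candidate values for value prediction: question n-grams
    (length 1-6) that match a database cell, plus literal numbers."""
    cands = set()
    for n in range(1, max_n + 1):
        for i in range(len(question) - n + 1):
            gram = question[i:i + n]
            if gram in db_cells:            # db_cells: set of cell strings
                cands.add(gram)
    # Numbers (and similarly dates) are taken directly from the question.
    cands.update(re.findall(r"\d+(?:\.\d+)?", question))
    return cands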
5 Experiments

Data Settings. Following WikiSQL, we split our dataset into train/dev/test so that the databases are non-overlapping among the three subsets. In other words, all question/SQL pairs for the same database are in the same subset. This is also referred to as the cross-domain parsing problem, since some database schemas in dev/test do not appear in train. In the end, the 200 databases are split into 160/17/23, and the 23,797 question/SQL pairs are split into 18,602/2,039/3,156.

Evaluation Metrics. Evaluation metrics for the text-to-SQL task include component matching, exact matching, and execution accuracy. Component matching (Yu et al., 2018b) uses the F1 score to evaluate the performance of the model on each clause. Exact matching, namely the percentage of questions whose predicted SQL query is equivalent to the gold SQL query, is widely used in text-to-SQL tasks. Execution accuracy, namely the percentage of questions whose predicted SQL query obtains the correct answer, assumes that each SQL query has an answer.

We use exact matching as the main metric, and follow Xu et al. (2017) and Yu et al. (2018b) to handle the "ordering issue". Finally, we report the model performance with (w) and without (w/o) value evaluation.
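As a sketch of how the "ordering issue" is typically handled, the comparison below treats order-insensitive clauses as sets. The parsed-query layout is an assumption for illustration, not the official evaluation script.

def exact_match(pred, gold):
    """Compare two parsed SQL queries clause by clause. Clauses whose
    internal order does not matter (SELECT items, WHERE conditions) are
    compared as sets, following Xu et al. (2017) and Yu et al. (2018b)."""
    keys = set(pred) | set(gold)
    for k in keys:
        p, g = pred.get(k), gold.get(k)
        if k in ("select", "where"):
            if set(map(tuple, p or [])) != set(map(tuple, g or [])):
                return False
        elif p != g:
            return False
    return True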

Methods    Calc-Column  Calc-Row  Calc-Constant  Calc-All  Matching  Sorting  Clustering  Others
IRNet      0            0         0              0         25.0      32.8     34.2        8.7
IRNetExt   22.0         34.3      37.9           29.7      52.1      68.7     60.8        52.5

Table 3: Performances of different SQL query types.

Methods             w/o values        w/ values
                    Dev     Test      Dev     Test
Seq2SeqCopying      6.6     3.9       2.6     1.9
SyntaxSQLNet        14.6    8.6       7.1     5.2
IRNet               38.4    34.2      18.4    15.4
IRNetExt            59.8    54.3      56.2    50.1
  w/o calculation   50.2    48.2      46.6    45.6
  w/o value         50.1    43.5      19.4    17.9

Table 4: Performance of the benchmark approaches.

Main results. Table 4 shows the performance of the benchmark approaches. The performance of Seq2SeqCopying is the lowest. It uses the copying mechanism to reduce errors posed by out-of-domain words in the databases of the test set, but it predicts lots of invalid SQL queries with grammatical errors, since its decoder does not consider SQL structures at all.

SyntaxSQLNet and IRNet outperform Seq2SeqCopying by utilizing a grammar derived from SQL structures to guide SQL generation. In particular, IRNet utilizes SemQL as an abstract representation of SQL queries. However, neither of the two vanilla models handles calculation questions and value prediction properly. The basic IRNet achieves only 34.2/15.4 accuracy on the test set w/o and w/ value evaluation.

We can see that by simply extending IRNet to parse calculation questions and predict values, the IRNetExt model achieves much higher accuracy (54.3/50.1).

Ablation study. We perform an ablation study to gain more insights into the contribution of our extensions. As shown in Table 4, the accuracy on the test set drops by 4.5 when the added production rules are excluded from the grammar of SemQL. The accuracy on the calculation type is 0, which composes 20.7% of the questions in the test set. After excluding the prediction of values, the test performance drops significantly for two reasons. First, there are a large number of questions that contain values, accounting for about 75% in the dev set and 70% in the test set. Second, the generation of where clauses can be improved by leveraging the column-cell relationship.

Analysis. Table 3 shows the performance on different SQL query types. First, the grammar extension is effective: the accuracy on all types is significantly improved. Second, the accuracy on the calculation type is lower than that on other types, as many calculation questions require incorporating common knowledge, e.g., age = dateOfDeath - dateOfBirth. How to represent and incorporate such knowledge into the model is very challenging. Third, questions requiring common knowledge perform poorly, as they need understanding rather than matching, such as the matching issue between "the oldest" and "age".

6 Related Work

Semantic parsing. Semantic parsing aims to map NL utterances into semantic representations, such as logical forms (Liang, 2013), SQL queries (Tang and Mooney, 2001), Python code (Ling et al., 2016), etc. In order to facilitate model training and evaluation, researchers have released a variety of datasets. ATIS and GeoQuery are two popular early datasets originally in logical forms, and have been converted into SQL queries (Iyer et al., 2017; Popescu et al., 2003). As two recently released datasets, WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b) have attracted extensive research attention. It is also noteworthy that Min et al. (2019) propose the CSpider dataset by translating the English questions of Spider into Chinese.

Data construction methods. As discussed in Section 3, creating a large-scale semantic parsing dataset is extremely challenging. To construct Spider, Yu et al. (2018b) ask annotators to write both questions and SQL queries given a database. Both Iyer et al. (2017) and Herzig and Berant (2019) assume that the database and questions are given and try to reduce the effort of creating semantic representations. Our data construction is most closely related to Overnight (Wang et al., 2015), which proposes to automatically generate logical forms based on a hand-crafted grammar and ask annotators to paraphrase pseudo questions into NL questions.

Overnight focuses on logic form (LF) based semantic representations, while our work focuses on the SQL representation. The differences are two-fold. First, the databases of Overnight are much simpler, composed of a set of entity-property-entity triples. Second, the LF operations of Overnight are much simpler, consisting of only matching and aggregation operations, such as count, min, max. Our dataset is more complex and thus imposes more challenges on the data construction.

Text-to-SQL parsing approaches. Seq2Seq models achieve the state-of-the-art results on single-database datasets such as ATIS and GeoQuery (Dong and Lapata, 2016). With the release of the WikiSQL dataset, researchers have made efforts to handle unseen databases by using the database schema as input. Two mainstream approaches are the Seq2Seq model with a copy mechanism (Sun et al., 2018) and the Seq2Set model (Xu et al., 2017). With BERT representations (Devlin et al., 2019), the execution accuracy exceeds 90% (He et al., 2019; Guo and Gao, 2019).

For the more challenging Spider dataset with multi-table databases, Guo et al. (2019) introduce an intermediate representation (SemQL) for SQL queries and use a grammar-based decoder to generate SemQL, reporting state-of-the-art performance. Bogin et al. (2019) propose to encode the database schema with a graph neural network. Recently, Wang et al. (2019) propose RATSQL, which uses relation-aware self-attention to better encode the question and the database schema simultaneously.

7 Conclusion

We present the first large-scale and pragmatic Chinese dataset for cross-domain text-to-SQL parsing. Based on the analysis of questions from real-world applications, our dataset contains a considerable proportion of questions that require row/column calculations. We extend the state-of-the-art IRNet model on Spider to accommodate DuSQL, and obtain a substantial performance boost. Yet, there is still large room for improvement, especially on calculation questions, which usually require incorporating common-sense knowledge into the model. For future work, we will continually improve the scale and quality of our dataset, to facilitate future research and to meet the needs of database-oriented applications.

Acknowledgments

We thank the three anonymous reviewers for their helpful feedback and discussion on this work. Zhenghua Li and Min Zhang were supported by the National Natural Science Foundation of China (Grant No. 61525205, 61876116), and a Project Funded by the Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions.

References

Ali Mohamed Nabil Allam and Mohamed Hassan Haggag. 2012. The question answering systems: A survey. International Journal of Research and Reviews in Information Sciences (IJRRIS), 2(3).

Ben Bogin, Jonathan Berant, and Matt Gardner. 2019. Representing schema structure with graph neural networks for text-to-sql parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4560–4565.

Robin D Burke, Kristian J Hammond, Vladimir Kulyukin, Steven L Lytinen, Noriko Tomuro, and Scott Schoenberg. 1997. Question answering from frequently asked question files: Experiences with the faq finder system. AI magazine, 18(2):57–57.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43.

Catherine Finegan-Dollak, Jonathan K Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-sql evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-sql in cross-domain database with intermediate representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4524–4535.

Tong Guo and Huilin Gao. 2019. Content enhanced bert-based text-to-sql generation. arXiv preprint arXiv:1910.07179.

Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-sql: reinforce schema representation with context. arXiv preprint arXiv:1908.08113.

Jonathan Herzig and Jonathan Berant. 2019. Don't paraphrase, detect! rapid and effective data collection for semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3801–3811.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 963–973.

Cody CT Kwok, Oren Etzioni, and Daniel S Weld. 2001. Scaling question answering to the web. In Proceedings of the 10th international conference on World Wide Web, pages 150–161.

Fei Li and HV Jagadish. 2014. Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment, 8(1):73–84.

Percy Liang. 2013. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408.

Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiskỳ, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 599–609.

Qingkai Min, Yuefeng Shi, and Yue Zhang. 2019. A pilot study for chinese sql semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3643–3649.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: a human-generated machine reading comprehension dataset.

Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th international conference on Intelligent user interfaces, pages 149–157.

Ningyuan Sun, Xuefeng Yang, and Yunfeng Liu. 2020. Tableqa: a large-scale chinese text-to-sql dataset for table-aware sql generation. arXiv preprint arXiv:2006.06434.

Yibo Sun, Duyu Tang, Nan Duan, Jianshu Ji, Guihong Cao, Xiaocheng Feng, Bing Qin, Ting Liu, and Ming Zhou. 2018. Semantic parsing with syntax- and table-aware sql generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 361–372.

Lappoon R Tang and Raymond J Mooney. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In European Conference on Machine Learning, pages 466–477. Springer.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. arXiv preprint arXiv:1911.04942.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. Sqlnet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. Sqlizer: query synthesis from natural language. Proceedings of the ACM on Programming Languages, 1(OOPSLA):1–26.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450.

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018a. Syntaxsqlnet: Syntax tree networks for complex and cross-domain text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1653–1663.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

A The Grammar for SQL Generation

Figure 9 shows the production rules used for SQL generation. All kinds of SQL queries can be generated by exercising each rule, e.g., the rule of {Condition = A op SQL} for nested query generation, and the rules of {C = table.column1 mathop table.column2} and {C = table1.column mathop table2.column} for calculation query generation.

SQLs ::= SQL intersect SQLs | SQL union SQLs
         | SQL except SQLs | SQL
SQL ::= Select | Select Where
        | Select GroupC | Select Where GroupC
        | Select OrderC | Select Where OrderC
        | Select from SQL, SQL
Select ::= select A | select A A
           | select A A A | select A A A A
Where ::= where Conditions
GroupC ::= group by C
           | group by C having Conditions
           | group by C OrderC
OrderC ::= order by C Dir | order by C Dir limit value
           | order by A Dir limit value
Dir ::= asc / desc
Conditions ::= Condition | Condition and Conditions
               | Condition or Conditions
Condition ::= A op value | A op SQL
A ::= min C | max C | count C | sum C | avg C | C
C ::= table.column
      | table.column1 mathop table.column2
      | table1.column mathop table2.column
mathop ::= + | - | * | /
op ::= = | != | > | >= | < | <= | like | in | not in

Figure 9: The production rules for SQL generation.

B Query Type Definition

Question classification is mostly based on the operations used in the corresponding SQL queries. Matching means the answer can be directly obtained from the database. Sorting means we need to sort the returned results or only return the top-k results. Clustering means we have to perform aggregations (count, min/max, etc.) on each cluster. Calculation means we need to calculate between columns or rows to get the answer. Others usually corresponds to questions requiring reasoning or subjective questions, e.g., "Is Beijing bigger than Shanghai?" and "Is the ticket expensive?". Figure 10 shows some examples for the types in Figure 2, except for the calculation type (shown in Figure 3) and the other type, which do not have corresponding SQL queries.

Matching
List cities with a population less than 10 million.
SELECT name FROM T1 WHERE population < 10000000

Sorting
Give the top 5 cities with the largest population.
SELECT name FROM T1 ORDER BY population DESC LIMIT 5

Clustering
Give the total population of each province.
SELECT province, sum(population) FROM T1 GROUP BY province

Figure 10: Examples of types in Figure 2. All of them are based on the database in Figure 1.

C Preconditions in SQL Generation

To ensure the semantic correctness of the generated SQL query, we define the preconditions for each production rule, and abide by these preconditions in the SQL query generation (a schematic check is sketched after this list).

• For the generation of a SQL query with multiple SQLs, e.g., {SQLs ::= SQL union SQLs}: the columns in the select clause of the previous SQL match the columns in the select clause of the subsequent SQL, i.e., the columns of the two select clauses are the same or connected by foreign keys.

• For the rule generating GroupC: the C is generated from the rule of {C ::= table.column}, where the column can perform the clustering operation, that is to say, the table can be divided into several sub-tables according to the values of this column.

• For the rule of {Condition ::= A op value}: op ∈ {<, <=, >, >=, =, !=, like}. If op ∈ {<, <=, >, >=}, A and value must be of the number or date type. If op is like, A must be of the text type.

• For the rule of {Condition ::= A op SQL}: op ∈ {<, <=, >, >=, =, !=, in, not in}. If op ∈ {<, <=, >, >=, =, !=}, A and SQL must be of the number type, and {>= min, <= max} are invalid. If op ∈ {in, not in}, SQL must return a set.

• For the rule generating A: {avg C | sum C} require that C is of the number type, {min C | max C} require that C is of the number or date type, and {count C} requires that C is of the text type.
• For the rule of {C ::= t1.column mathop t2.column}: the two columns are of the same type, either number or date. Then we have to make sure that the columns are comparable, based on rules built by search log analysis.

• For the rule of {C ::= t1.column1 mathop t1.column2}: the numerical units of these two columns can perform the corresponding mathematical operations, e.g., CNY/person × person = CNY.
| >= A V | <= A V | like A V | not_like A V
cal operations, e.g., CNY/per × person = CNY.
| = A R | != A R | > A R | < A R
D Descriptions of SQL Components | >= A R | <= A R | in A VR | not_in A R
Order ::= des A | asc A
We provide a description for each basic component, Superlative ::= des A V | asc A V
as follows: A ::= max C T | min C T | count C T
| sum C T | avg C T | none C T
• The descriptions for aggregators of {min, max, | max MathA | min MathA | count MathA
count, sum, avg} are {minimum, maximum, the | sum MathA | avg MathA | none MathA
number of, total, average}. MathA ::= +AA|− AA|×AA|÷AA
C ::= column
• The descriptions for operators of {=, !=, >, >=, T ::= table
<, <=, like, in, not in} are based on the column V ::= value
type. The descriptions for {=, !=, like, in, not in}
with the text type are {is, is not, contain, in, not Figure 11: The extended grammar for SemQL.
in}, descriptions for {=, !=, >, >=, <, <=} with
the number type are {is equal to, is not equal to,
E Dataset Statistics From Spider
more than, no less than, less than, no more than},
and descriptions for {=, !=, >, >=, <, <=} with Table 5 shows the statistics of our dataset and other
the date type are {in, not in, after, in or after, be- cross-domain datasets in the way of Spider. We
fore, in or before}. provide enough examples for both advanced SQL
clauses and the calculation type.
• The descriptions for math operators of {+, -, *,
/ } are {sum, difference, product, times}. F The extended grammar of SemQL
• The descriptions for the condition relations {and, We extend the grammar used in IRNet model to
or} are {and, or}. accommodate DuSQL, as shown in Figure 11. The
Figure 8 shows the main changes.
• The descriptions for {asc, desc} are {in the as-
cending, in the descending}.

• The descriptions for columns, tables, and values


are equal to themselves.

Meanwhile, we provide the description for each


production rule, as shown in Figure 12.

Components -> Pseudo Descriptions
SQL intersect SQLs -> SQL, as set1, SQLs, as set2, belong to set1 and set2
SQL union SQLs -> SQL, as set1, SQLs, as set2, belong to set1 or set2
SQL except SQLs -> SQL, as set1, SQLs, as set2, belong to set1 but not belong to set2
select A ... A -> list A, ... and A
where Conditions -> when Conditions
group by C -> for each C
group by C having Conditions -> the C that Conditions
group by C OrderC -> the C with OrderC
order by C Dir -> sorted by C Dir
order by C Dir limit value -> the top value sorted by C Dir
order by A Dir limit value -> the top value sorted by A Dir
A op value -> A op value
A op SQL -> SQL as v1, A op v1
agg C -> agg C
count * -> the number of table
T1.C + T2.C -> the sum of T1.C and T2.C
T1.C − T2.C -> the difference between T1.C and T2.C
T1.C ∗ T2.C -> the product of T1.C and T2.C
T1.C / T2.C -> T1.C is times of T2.C

Figure 12: The pseudo descriptions for all production rules.

