DuSQL
Abstract
Due to the lack of labeled data, previous research on text-to-SQL parsing mainly focuses on English. Representative English datasets include ATIS, WikiSQL, Spider, etc. This paper presents DuSQL, a large-scale and pragmatic Chinese dataset for the cross-domain text-to-SQL task, containing 200 databases, 813 tables, and 23,797 question/SQL pairs. Our new dataset has three major characteristics. First, by manually analyzing questions from several representative applications, we try to figure out the true distribution of SQL queries in real-life needs. Second, DuSQL contains a considerable proportion of SQL queries involving row or column calculations, motivated by our analysis of the SQL query distributions. Finally, we adopt an effective data construction framework via human-computer collaboration. The basic idea is to automatically generate SQL queries based on the SQL grammar and constrained by the given database. This paper describes in detail the construction process and data statistics of DuSQL. Moreover, we present and compare the performance of several open-source text-to-SQL parsers with minor modifications to accommodate Chinese, including a simple yet effective extension to IRNet for handling calculation SQL queries.

Figure 1: Illustration of the text-to-SQL task.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6923–6935, November 16–20, 2020. © 2020 Association for Computational Linguistics

1 Introduction

In the past few decades, a large amount of research has focused on searching for answers in unstructured texts given natural questions, which is also known as the question answering (QA) task (Burke et al., 1997; Kwok et al., 2001; Allam and Haggag, 2012; Nguyen et al., 2016). However, a lot of high-quality knowledge or data is actually stored in databases in the real world. It is thus extremely useful to allow ordinary users to directly interact with databases via natural questions. To meet this need, researchers have proposed the text-to-SQL task with released English datasets for model training and evaluation, such as ATIS (Iyer et al., 2017), GeoQuery (Popescu et al., 2003), WikiSQL (Zhong et al., 2017), and Spider (Yu et al., 2018b).

Formally, given a natural language (NL) question and a relational database, the text-to-SQL task aims to produce a legal and executable SQL query that leads directly to the correct answer, as depicted in Figure 1. A database is composed of multiple tables and denoted as DB = {T1, T2, ..., Tn}. A table is composed of multiple columns and denoted as Ti = {col1, col2, ..., colm}. Tables are usually linked with each other by foreign keys.

The earliest datasets include ATIS (Iyer et al., 2017), GeoQuery (Popescu et al., 2003), Restaurants (Tang and Mooney, 2001), Academic (Li and Jagadish, 2014), etc. Each dataset only has a single database containing a certain number of tables. All question/SQL pairs of the train/dev/test sets are generated against the same database. Many interesting approaches have been proposed to handle those datasets (Iyer et al., 2017; Yaghmazadeh et al., 2017; Finegan-Dollak et al., 2018).

However, real-world applications usually involve more than one database, and require the
model to be able to generalize to and handle unseen
databases during evaluation. To accommodate this
need, the WikiSQL dataset is then released by
Zhong et al. (2017). It consists of 80,654 ques-
tion/SQL pairs for 24,241 single-table databases.
They propose a new data split setting to ensure that
databases in train/dev/test do not overlap. However,
they focus on very simple SQL queries containing
one SELECT statement with one WHERE clause.
In addition, Sun et al. (2020) released TableQA, a
Chinese dataset similar to the WikiSQL dataset.

Figure 2: The SQL query distributions of the three applications. Please kindly note that a query may belong to multiple types.

Yu et al. (2018b) released the more challenging Spider dataset, consisting of 10,181 question/SQL pairs against 200 multi-table databases. Compared with WikiSQL and TableQA, Spider is much more complex for two reasons: 1) the need to select relevant tables; 2) many nested queries and advanced SQL clauses like GROUP BY and ORDER BY.

As far as we know, most existing datasets are constructed for English. Another issue is that they do not refer to the question distribution in real-world applications during data construction. Take Spider as an example: given a database, annotators are asked to write many SQL queries from scratch. The only requirement is that the SQL queries have to cover a list of SQL clauses and nested queries. Meanwhile, the annotators write NL questions corresponding to the SQL queries. In particular, all these datasets contain very few questions involving calculations between rows or columns, which we find are very common in real applications.

This paper presents DuSQL, a large-scale and pragmatic Chinese text-to-SQL dataset, containing 200 databases, 813 tables, and 23,797 question/SQL pairs. Specifically, our contributions are summarized as follows.

• In order to determine a more realistic distribution of SQL queries, we collect user questions from three representative database-oriented applications and perform manual analysis. In particular, we find that a considerable proportion of questions require row/column calculations, which are not included in existing datasets.

• We adopt an effective data construction framework via human-computer collaboration. The basic idea is to automatically generate SQL queries based on the SQL grammar and constrained by the given database. For each SQL query, we first generate a pseudo question by traversing it in the execution order and then ask annotators to paraphrase it into a NL question.

• We conduct experiments on DuSQL using three open-source parsing models. In particular, we extend the state-of-the-art IRNet (Guo et al., 2019) model to accommodate the characteristics of DuSQL. Results and analysis show that DuSQL is a very challenging dataset. We will release our data at https://github.com/luge-ai/luge-ai/tree/master/semantic-parsing.

2 SQL Query Distribution

As far as we know, existing text-to-SQL datasets mainly consider the complexity of SQL syntax when creating SQL queries. For example, WikiSQL has only simple SQL queries containing SELECT and WHERE clauses. Spider covers 15 SQL clauses including SELECT, WHERE, ORDER BY, GROUP BY, etc., and allows nested queries.

However, to build a pragmatic text-to-SQL system that allows ordinary users to directly interact with databases via NL questions, it is very important to know the SQL query distribution in real-world applications, from the aspect of user need rather than SQL syntax. Our analysis shows that Spider mainly covers three types of SQL queries, i.e., matching, sorting, and clustering, whereas WikiSQL only has matching queries. Neither of them contains the calculation type, which we find composes a large portion of questions in certain real-world applications.

To find out the SQL query distribution in real-life applications, we consider the following three
representative types of database-oriented applications, and conduct manual analysis against user questions. We ask annotators to divide user questions into five categories (see Appendix B for details), i.e., matching, sorting, clustering, calculation, and others.

Information retrieval applications. We use Baidu, the Chinese search engine, as a typical in-

Row Calculation
How much bigger is Guangzhou than Shenzhen?
SELECT a.area(km2) - b.area(km2) FROM
    (SELECT area(km2) FROM T1 WHERE name = 'Guangzhou') a,
    (SELECT area(km2) FROM T1 WHERE name = 'Shenzhen') b

Column Calculation
What is the population density of Hefei?
SELECT population / area(km2) FROM T1 WHERE name = 'Hefei'
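The two calculation patterns above can be executed end-to-end. The following is a minimal sketch using Python's built-in sqlite3, with a hypothetical table T1(name, area_km2, population) and invented values standing in for the database in Figure 1 (SQLite column names cannot contain parentheses, so area(km2) becomes area_km2 here).

```python
import sqlite3

# Toy stand-in for table T1 from Figure 1; the values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (name TEXT, area_km2 REAL, population REAL)")
conn.executemany(
    "INSERT INTO T1 VALUES (?, ?, ?)",
    [("Guangzhou", 7434.0, 15300000), ("Shenzhen", 1997.0, 13400000),
     ("Hefei", 11445.0, 9370000)],
)

# Row calculation: arithmetic between two rows of the same column.
row_diff = conn.execute(
    "SELECT a.area_km2 - b.area_km2 "
    "FROM (SELECT area_km2 FROM T1 WHERE name = 'Guangzhou') a, "
    "     (SELECT area_km2 FROM T1 WHERE name = 'Shenzhen') b"
).fetchone()[0]

# Column calculation: arithmetic between two columns of one row.
density = conn.execute(
    "SELECT population / area_km2 FROM T1 WHERE name = 'Hefei'"
).fetchone()[0]

print(row_diff)  # 7434.0 - 1997.0 = 5437.0
print(density)
```

Note that both queries return a single scalar; the row calculation builds two one-row derived tables and subtracts across them, which is exactly the nested shape shown in the figure.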
as illustrated in Figure 4. The key idea is to automatically generate SQL queries paired with pseudo questions given a database. Then the pseudo questions are paraphrased into NL questions by humans. Finally, to guarantee data quality, low-confidence SQL queries and NL questions are detected according to their overlap and similarity metrics, and are further checked by humans.

Source                      Proportion
Baike                       40.3
Vertical domain websites    31.3
Annual report               23.4
Others                      5.0

Table 1: The distribution of database sources.

3.1 Database Creation

Most mature databases used in industry are not publicly available, so we collect our databases mainly from the Internet. However, databases available on the Internet are in the form of independent tables, which need to be linked with other tables. We create databases in three steps: table acquisition, table merging, and foreign key creation.

We collect websites to crawl tables, ensuring that they cover multiple domains. As the largest Chinese encyclopedia, Baidu Baike contains more than 17 million entries across more than 200 domains. We start with all the entries in Baike as the initial sites, and extend the collection based on the reference sites in each entry page. We keep sites where tables are crawled. The final collection contains entries of Baike, annual report websites2, vertical domain websites3, and other websites such as community forums4. Table 1 shows the data distribution regarding database sources.

To make a domain correspond to a database, we merge tables with the same schema into a new table with a new schema, e.g., tables about China cities with the schema of {population, area, ...} are merged into a new table with the schema of {termid, name, population, area, ...}, where termid is randomly generated as the primary key and name is the name of the city. Meanwhile, we add a type for each column according to the form of its values, where the column type is one of text, number, and date.

We create foreign keys between two tables via entity linking, e.g., a table named "Livable cities in 2019" with the schema of {city_name, ranker, ...} joins to a table named "China cities" with the schema of {term_id, name, area, ...} through the links of entities in "city_name" and "name". According to foreign keys, all tables are split into separate graphs, each of which consists of several joined tables. We choose 200 graphs to create databases, and manually check and correct the foreign keys for each database.

Overall, we create 200 databases with 813 tables, covering about 70% of Baike entries from more than 160 domains such as movies, actors, cities, animals, and foods. Since some tables are sensitive, we use the column header of each table, and populate it with randomly selected values from the original table.

3.2 Automatic Generation of SQL Queries

Given a database, we want to generate as many common SQL queries as possible. Both manually writing SQL queries and quality-checking take a significant amount of time. Obviously, SQL queries can be automatically generated from the grammar. We utilize production rules from the grammar to automatically generate SQL queries, instead of asking annotators to write them. According to the difficulty5 and semantic correctness of a SQL query, we prune the rule paths in the generation. Then, we sample the generated SQL queries according to the distribution in Figure 2 and carry out the follow-up work based on them.

As illustrated in Figure 5, a SQL query can be represented as a tree using the rule sequence of {SQLs = SQL, SQL = Select Where, Select = SELECT A, Where = WHERE Conditions, ...}, all of which are production rules of the grammar. Guided by the SQL query distributions in real applications, we design production rules to ensure that all common SQL queries can be generated, e.g., the rule of {C = table.column mathop table.column} allows calculations between columns or rows. By exercising every rule of the grammar, we can generate SQL queries covering patterns of different complexity.

We consider two aspects in the automatic SQL generation: the difficulty and semantic correctness.

2 QuestMobile, 199it, tianyancha, etc.
3 State Statistical Bureau, China Industrial Information Network, shopping websites, booking websites, etc.
4 Baidu Tieba, Newsmth, Hupu, etc.
5 We observe that very complex queries are rare in search logs. Since our SQL queries are automatically generated, without complexity control the proportion of complex queries would dominate the space, thus deviating from the real query distribution.
Figure 5: An example of SQL query generation from the grammar. We show a part of production rules (all rules
are shown in Appendix A). The leaf nodes in red are from the database.
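The grammar-driven generation described in Section 3.2 can be sketched in a few lines: pick production rules at random, expand recursively, and cap the expansion depth as a crude stand-in for the paper's difficulty-based pruning of rule paths. The grammar fragment and generator below are a toy simplification of the Figure 9 rules, not the authors' implementation.

```python
import random

# A tiny fragment of the Figure 9 grammar. Nonterminals map to lists of
# right-hand sides; anything not in the dict is treated as a terminal.
GRAMMAR = {
    "SQL": [["Select"], ["Select", "Where"]],
    "Select": [["SELECT", "A"]],
    "Where": [["WHERE", "Condition"]],
    "Condition": [["A", "op", "value"]],
    "A": [["C"], ["max", "C"], ["count", "C"]],
    "C": [["table.column"]],
    "op": [["="], [">"], ["<"]],
    "value": [["1"]],
}

def expand(symbol, rng, depth=0, max_depth=6):
    """Recursively expand a symbol into a list of terminals. The depth cap
    is a crude stand-in for difficulty-based pruning: past it, we always
    take the first (simplest) rule so expansion terminates."""
    if symbol not in GRAMMAR:  # terminal
        return [symbol]
    rhs = GRAMMAR[symbol][0] if depth >= max_depth else rng.choice(GRAMMAR[symbol])
    out = []
    for s in rhs:
        out.extend(expand(s, rng, depth + 1, max_depth))
    return out

rng = random.Random(0)
queries = {" ".join(expand("SQL", rng)) for _ in range(50)}
print(len(queries), "distinct query patterns, e.g.:")
print(sorted(queries)[0])
```

Sampling the resulting pattern set against the distribution in Figure 2 (rather than uniformly) would then correspond to the paper's second step.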
for each component, e.g., list for SELECT, average for the aggregator avg. Appendix D shows the descriptions for all components. To ensure that the pseudo question is clear and reflects the meaning of the SQL query, intermediate variables are introduced to express sub-SQL queries, e.g., "v1" in the example of Figure 6 represents the result of the nested query and is used as a value in other expressions.

We ask two annotators8 to reformulate pseudo questions into NL questions9, and filter out two kinds of questions: 1) incomprehensible ones, which are semantically unclear; 2) unnatural ones, which are not the focus of humans10. During the process of paraphrasing, 6.7% of question/SQL pairs are filtered, among which 76.5% are complex queries. Then we ask other annotators to check the correctness of the reformulated questions, and find that 8% of questions are inaccurate.

3.4 Review and Checking

To guarantee data quality, we automatically detect low-quality question/SQL pairs according to the following evaluation metrics.

• Overlap. To ensure the naturalness of our questions, we calculate the overlap between the pseudo question and the corresponding NL question. A question with an overlap higher than 0.6 is considered to be of low quality.

• Similarity. To ensure that the question contains enough information for the SQL query, we train a similarity model based on question/SQL pairs. A question with a similarity score less than 0.8 is considered to be of low quality.

In the first round, about 18% of question/SQL pairs are of low quality. We ask annotators to check these pairs and correct the erroneous ones. This process iterates through the collaboration of human and computer until the above metrics no longer change; it iterates twice in the construction of DuSQL.

3.5 Dataset Statistics

We summarize the statistics of DuSQL and other cross-domain datasets in Table 2, and give some examples in Figure 7.

Figure 7: SQL query examples.

DuSQL contains enough question/SQL pairs for all common types. WikiSQL and TableQA are simple datasets, only containing matching questions. Spider and CSpider (Min et al., 2019) mainly cover matching, sorting, clustering, and their combinations. There are very few questions of the calculation type, and all of them only need column calculations. Spider does not focus on questions that require common knowledge and math operations. According to our analysis in Figure 2, the calculation type is very common, accounting for 8% to 65% in different applications. DuSQL, a pragmatic industry-oriented dataset, conforms to the distribution of SQL queries in real applications. Meanwhile, DuSQL is larger, twice the size of other complex datasets. DuSQL contains 200 databases, covering about 70% of entries in Baike and more than 160 domains, e.g., cities, singers, movies, animals, etc. We provide content for each database. All the values of a SQL query can be found in the database, except for numeric values. All table and column names in the database are clear and self-contained. In addition, we provide an English schema for each database, including table names and column headers.

4 Benchmark Approaches

All existing text-to-SQL works focus on English datasets. Considering that DuSQL is most similar to Spider, we choose the following three representative publicly available parsers as our benchmark approaches, to understand the performance of existing approaches on our new Chinese dataset.

8 They are full-time employees familiar with the SQL language. Meanwhile, they have lots of experience in annotating QA data.
9 Some values in SQL queries are rewritten as synonyms.
10 E.g., "When province is Sichuan, list the total rank of these cities." for the SQL query {SELECT sum(rank) FROM T2 WHERE province = 'Sichuan'} is considered an unnatural question, as the total rank would not be asked by humans.
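The overlap filter of Section 3.4 can be sketched concretely. The paper does not specify its exact overlap formula, so word-level Jaccard overlap is used here as one plausible stand-in; only the 0.6 threshold comes from the text.

```python
def overlap(pseudo_question, nl_question):
    """Word-level Jaccard overlap between the pseudo question and the
    paraphrased NL question. The paper does not fix the exact formula;
    token-set Jaccard is an assumed stand-in."""
    a, b = set(pseudo_question.split()), set(nl_question.split())
    return len(a & b) / len(a | b) if a | b else 0.0

THRESHOLD = 0.6  # from Section 3.4: higher overlap means low naturalness

# A barely edited paraphrase keeps most tokens and is flagged ...
flagged = overlap("list the name of each city",
                  "list the name of every city") > THRESHOLD
# ... while a genuine rewording falls below the threshold and is kept.
kept = overlap("list the name of each city",
               "which cities are there") <= THRESHOLD
print(flagged, kept)  # True True
```

The intuition is that a high overlap means the annotator copied the templated pseudo question nearly verbatim, so the "NL" question is unlikely to be natural.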
Dataset   Size    DB      Table/DB  Matching  Sorting  Clustering  Calc-Column  Calc-Row  Calc-Constant  Others
WikiSQL   80,654  26,251  1         80,654    0        0           0            0         0              0
TableQA   49,974  5,291   1         49,974    0        0           0            0         0              0
Spider    9,693   200     5.1       6,450     863      1,059       13           0         0              1,308
CSpider   9,691   166     5.3       6,448     862      1,048       13           0         0              1,318
Ours      23,797  200     4.1       6,440     2,276    3,768       1,760        1,385     1,097          7,071

Table 2: Statistics and comparisons of all existing cross-domain text-to-SQL datasets. The statistics of Spider are based on the published data, only containing the train and development sets. "Others" consists of combinations of the matching, sorting, and clustering types.
Methods    Calc-Column  Calc-Row  Calc-Constant  Calc-All  Matching  Sorting  Clustering  Others
IRNet      0            0         0              0         25.0      32.8     34.2        8.7
IRNetExt   22.0         34.3      37.9           29.7      52.1      68.7     60.8        52.5
tions. Overnight focuses on logic form (LF) based semantic representation, while our work focuses on SQL representation. The differences are two-fold. First, the databases of Overnight are much simpler, composed of a set of entity-property-entity triples. Second, the LF operations of Overnight are much simpler, consisting of only matching and aggregation operations, such as count, min, max. Our dataset is more complex and thus imposes more challenges on the data construction.

Acknowledgments

We thank the three anonymous reviewers for their helpful feedback and discussion on this work. Zhenghua Li and Min Zhang were supported by the National Natural Science Foundation of China (Grant No. 61525205, 61876116), and a Project Funded by the Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions.
Tong Guo and Huilin Gao. 2019. Content enhanced bert-based text-to-sql generation. arXiv preprint arXiv:1910.07179.

Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-sql: reinforce schema representation with context. arXiv preprint arXiv:1908.08113.

Jonathan Herzig and Jonathan Berant. 2019. Don't paraphrase, detect! rapid and effective data collection for semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3801–3811.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 963–973.

Cody CT Kwok, Oren Etzioni, and Daniel S Weld. 2001. Scaling question answering to the web. In Proceedings of the 10th International Conference on World Wide Web, pages 150–161.

Fei Li and HV Jagadish. 2014. Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment, 8(1):73–84.

Percy Liang. 2013. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408.

Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 599–609.

Qingkai Min, Yuefeng Shi, and Yue Zhang. 2019. A pilot study for chinese sql semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3643–3649.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: a human-generated machine reading comprehension dataset.

Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces, pages 149–157.

Ningyuan Sun, Xuefeng Yang, and Yunfeng Liu. 2020. Tableqa: a large-scale chinese text-to-sql dataset for table-aware sql generation. arXiv preprint arXiv:2006.06434.

Yibo Sun, Duyu Tang, Nan Duan, Jianshu Ji, Guihong Cao, Xiaocheng Feng, Bing Qin, Ting Liu, and Ming Zhou. 2018. Semantic parsing with syntax- and table-aware sql generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 361–372.

Lappoon R Tang and Raymond J Mooney. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In European Conference on Machine Learning, pages 466–477. Springer.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. arXiv preprint arXiv:1911.04942.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. Sqlnet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. Sqlizer: query synthesis from natural language. Proceedings of the ACM on Programming Languages, 1(OOPSLA):1–26.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450.

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018a. Syntaxsqlnet: Syntax tree networks for complex and cross-domain text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1653–1663.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
Matching
List cities with a population less than 10 million.
SELECT name FROM T1 WHERE population < 10000000

Sorting
Give the top 5 cities with the largest population.
SELECT name FROM T1 ORDER BY population DESC LIMIT 5

Clustering
Give the total population of each province.
SELECT province, sum(population) FROM T1 GROUP BY province

Figure 10: Examples of types in Figure 2. All of them are based on the database in Figure 1.

do not have corresponding SQL queries.

A The Grammar for SQL Generation

Figure 9 shows the production rules used for SQL generation. All kinds of SQL queries can be generated by exercising each rule, e.g., the rule {Condition ::= A op SQL} for nested query generation, and the rules {C ::= table.column1 mathop table.column2} and {C ::= table1.column mathop table2.column} for calculation query generation.

SQLs ::= SQL intersect SQLs | SQL union SQLs
       | SQL except SQLs | SQL
SQL ::= Select | Select Where
      | Select GroupC | Select Where GroupC
      | Select OrderC | Select Where OrderC
      | Select from SQL, SQL
Select ::= select A | select A A
         | select A A A | select A A A A
Where ::= where Conditions
GroupC ::= group by C
         | group by C having Conditions
         | group by C OrderC
OrderC ::= order by C Dir | order by C Dir limit value
         | order by A Dir limit value
Dir ::= asc | desc
Conditions ::= Condition | Condition and Conditions
             | Condition or Conditions
Condition ::= A op value | A op SQL
A ::= min C | max C | count C | sum C | avg C | C
C ::= table.column
    | table.column1 mathop table.column2
    | table1.column mathop table2.column
mathop ::= + | - | * | /
op ::= = | != | > | >= | < | <= | like | in | not in

Figure 9: The production rules for SQL generation.

C Preconditions in SQL Generation

To ensure the semantic correctness of the generated SQL query, we define preconditions for each production rule, and abide by these preconditions in the SQL query generation.

• For the generation of a SQL query with multiple SQLs, e.g., {SQLs ::= SQL union SQLs}: the columns in the select clause of the previous SQL must match the columns in the select clause of the subsequent SQL, i.e., the columns of the two select clauses are the same or connected by foreign keys.

• For the rule generating GroupC: the C is generated from the rule {C ::= table.column}, where the column can perform the clustering operation, that is to say, the table can be divided into several sub-tables according to the values of this column.
Calculation
Dataset Size DB Table/DB Order Group Having Nest
Column Row Constant
WikiSQL 80,654 26,251 1 0 0 0 0 0 0 0
TableQA 49,974 5,291 1 0 0 0 0 0 0 0
Spider 10,181 200 5.1 1,335 1,491 388 844 13 0 0
CSpider 9,691 166 5.3 1,052 1,123 505 913 13 0 0
Ours 23,797 200 4.1 4,959 3,029 3,432 2,208 1,760 1,385 1,097
Table 5: Comparisons of cross-domain text-to-SQL datasets. The statistics of Spider are from Yu et al. (2018b).
The statistics of CSpider are based on the released data.
Components                      Pseudo Descriptions
SQL intersect SQLs              SQL as set1, SQLs as set2, belong to set1 and set2
SQL union SQLs                  SQL as set1, SQLs as set2, belong to set1 or set2
SQL except SQLs                 SQL as set1, SQLs as set2, belong to set1 but not to set2
select A ... A                  list A, ... and A
where Conditions                when Conditions
group by C                      for each C
group by C having Conditions    the C that Conditions
group by C OrderC               the C with OrderC
order by C Dir                  sorted by C Dir
order by C Dir limit value      the top value sorted by C Dir
order by A Dir limit value      the top value sorted by A Dir
A op value                      A op value
A op SQL                        SQL as v1, A op v1
agg C                           agg C
count *                         the number of table
T1.C + T2.C                     the sum of T1.C and T2.C
T1.C − T2.C                     the difference between T1.C and T2.C
T1.C ∗ T2.C                     the product of T1.C and T2.C
T1.C / T2.C                     T1.C is times of T2.C
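The component-to-description mapping above can be applied mechanically during the execution-order traversal of Section 3.3. Below is a toy sketch of a few of the templates as format strings, including the "A op SQL" row, which introduces the intermediate variable v1 for nested queries; the template names and slot names are invented for illustration.

```python
# A few of the component-to-pseudo templates from the table above,
# encoded as format strings (template and slot names are hypothetical).
TEMPLATES = {
    "where": "when {conditions}",
    "group_by": "for each {c}",
    "order_limit": "the top {value} sorted by {c} {dir}",
    "nested_cond": "{sql} as v1, {a} {op} v1",
}

def render(name, **slots):
    """Fill one pseudo-description template with its component slots."""
    return TEMPLATES[name].format(**slots)

# A nested condition introduces the intermediate variable v1 so that the
# pseudo question stays readable, as in the paper's Figure 6 example.
inner = "list the average population"
pseudo = render("nested_cond", sql=inner, a="population", op=">")
print(pseudo)  # -> list the average population as v1, population > v1
```

Annotators then paraphrase such templated output into a fluent NL question, which is what the overlap metric of Section 3.4 later checks.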