Ftoon Kedwan - NLP Application - Natural Language Questions and SQL Using Computational Linguistics-CRC Press (2023)
Ftoon Kedwan
Cover Image Credit: Shutterstock.com
Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of
their use. The authors and publishers have attempted to trace the copyright holders of all material
reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and
let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/b23367
Typeset in Times
by SPi Technologies India Pvt Ltd (Straive)
Contents
Preface ix
1 Introduction 1
Basic Research Framework Organization 4
2 Background Study 7
NLQ Input Processing Interface, the NLIDB 7
Interactive Form-Based Interface 7
Keyword-Based Query Interface 8
NLQ-Based Interface 8
Part of Speech (POS) Recognition 9
Linguistic Components Layers 9
Syntactic parser (rule-based) 9
Semantic parser (rule-based) 10
Lexicon 11
Intermediate Language Representation Layer 11
Annotator 11
Disambiguation 12
Matcher/Mapper 13
SQL Template Generator 14
SQL Execution and Result 15
3 Literature Review 17
Related Works 17
NLP 17
ML Algorithms 21
NLQ to SQL Mapping 23
Current Research Work Justification 32
Authoring Interface-Based Systems 33
Enriching the NLQ/SQL Pair 33
Using MLA Algorithms 33
Restricted NLQ Input 33
Lambda Calculus 33
4 Implementation Plan 47
NLQ Input Interface 47
POS Recognition 48
Disambiguation 54
Matcher/Mapper 55
Mapping NLQ Tokens into RDB Elements 56
Mapping RDB Lexica into SQL Clauses 59
SQL Template Generator 60
SQL Execution and Result 62
Appendix 1 103
Appendix 2 105
Appendix 3 107
Appendix 4 109
Appendix 5 111
Appendix 6 113
Appendix 7 115
Appendix 8 117
Appendix 9 119
Glossary 147
References 149
Index 161
Preface
1 Introduction
NLP is a subfield of computer science and engineering under the field of Artificial Intelligence (AI), as illustrated in Figure 1.
It developed from the study of language and computational linguistics [1, 2] and is often used to interpret an input Natural Language Question (NLQ) [3]. NLP's goal is to analyze and facilitate the interaction between human language and computing machines. Human-Computer Interaction (HCI) becomes a part of NLP when the interaction involves the use of natural language. Under NLP there are several subareas, including Question Answering Systems (QAS) [4], such as Siri for iPhones [5], and summarization tools [6], which produce a summary of a long document's contents or even generate slide presentations. Real-time machine translation [7], such as Google Translate [8] or BabelFish [9], is another example of an NLP subarea. In addition, document classification [10] via learning models is a well-known NLP subarea; it is used to train a classification algorithm to identify the category a document should be placed under, for example in news article categorization or spam filtering. Speech Recognition Models [11], which recognize spoken words but work well only in specific domains, are yet another NLP subarea.
In the current research, the framework starts by processing simple Online Transactional Processing (OLTP) queries. OLTP queries are simple SELECT, FROM and WHERE statements of the Structured Query Language (SQL), which is the simplest query form. As in Figure 1, NLP uses deep learning techniques as part of the AI area. It focuses on computational linguistics to analyze HCI in terms of the language used for this interactive communication. Basically, NLP bridges the gap between computers and humans and facilitates information exchange and retrieval from an adopted DataBase (DB), which is, in this case, accessed through a Natural Language Interface for DataBase (NLIDB).

[Figure 1: NLP shown nested within Computer Science, AI and Deep Learning, alongside HCI and NLIDB.]
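A minimal sketch of such an OLTP-style query, using Python's sqlite3 on an in-memory DB. The Patient table and its columns are invented stand-ins for illustration, not the book's actual PTSD schema:

```python
import sqlite3

# Build a tiny in-memory RDB with a hypothetical Patient table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patient (patient_id INTEGER, name TEXT, age INTEGER)")
conn.execute("INSERT INTO Patient VALUES (1, 'Alice', 34), (2, 'Bob', 52)")

# The simplest OLTP form: a single SELECT ... FROM ... WHERE statement.
rows = conn.execute("SELECT name FROM Patient WHERE age > 40").fetchall()
print(rows)  # [('Bob',)]
```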
A Relational DataBase (RDB) model [12], originally introduced in 1970, is used as the basis of the background data storage and management structure. RDB has been chosen as the data storage medium for the proposed research because of the relationships between its entities, including their table attributes, subfields, and values. These relationships hold significant information themselves, as if they were whole separate entities. In addition, the information stored in those relationships proved to increase the accuracy of data retrieval, as will be demonstrated later in Chapter 6 on implementation testing and performance measurements.
The representation of RDB elements (i.e., tables, attributes, relationships, etc.) in Figure 2 describes the relationships between the entity sets to express parts of the Post-Traumatic Stress Disorder (PTSD) RDB semantics. An Entity-Relationship Diagram (ERD) [13] was therefore used to demonstrate the RDB structure, because it makes data relationships more visible.
In 1976, Chen [13] was the first to graphically model RDB schema entities using an ERD. In [14], an ERD was used to represent NLQ constructs by analyzing the NLQ constructs' inter-relationship with the ERD or even with the Class Diagram Conceptual Schema [15].
Table 1 is an example of the NLQ MetaTable, which breaks down the entered NLQ into its constituent tokens. Table 2 is an example of the RDB elements MetaTable, which describes each RDB element in terms of its nature, category, syntactic role, etc. These two tables will be referenced and elaborated on frequently throughout this research document.
2 Background Study
Lexicon
Lexicon studies examine a language's vocabulary, that is, its words and phrases. In the current research work, RDB lexica are stored in the RDB MetaTable, which is used to map the NLQ words to their formal representations in an RDB (i.e., table names, attribute names, etc.). Lexica are analyzed by both syntactic and semantic parsers.
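A rough sketch of such a lexicon lookup, with an invented slice of the RDB MetaTable (the entries and schema names are illustrative assumptions):

```python
# Hypothetical slice of the RDB MetaTable: each lexicon entry links a
# surface word (including simple variants) to a formal RDB element.
RDB_METATABLE = {
    "patient":  {"element": "Patient",      "kind": "table"},
    "patients": {"element": "Patient",      "kind": "table"},
    "age":      {"element": "Patient.age",  "kind": "attribute"},
    "name":     {"element": "Patient.name", "kind": "attribute"},
}

def lookup_lexicon(token: str):
    """Map one NLQ word to its formal RDB representation, if any."""
    return RDB_METATABLE.get(token.lower())

print(lookup_lexicon("Patients"))  # {'element': 'Patient', 'kind': 'table'}
```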
Annotator
Annotations are metadata that provide additional information on particular
data. Natural language datasets are called corpora (the plural of corpus). When
a corpus is annotated, this is called an annotated corpus. Corpora are used to
train Machine Learning Algorithms (MLA). The annotation types include:
• POS Annotation.
• Phrase Structure Annotation.
• Dependency Structure Annotation.
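The three annotation types above can be sketched on one NLQ. The tags and structures below are hand-assigned for illustration (Penn-Treebank-style POS labels); in practice a trained tagger or parser produces them:

```python
nlq = "Show the age of each patient"

# POS annotation: one (token, tag) pair per word.
pos_annotation = [("Show", "VB"), ("the", "DT"), ("age", "NN"),
                  ("of", "IN"), ("each", "DT"), ("patient", "NN")]

# Phrase-structure annotation: bracketed constituents.
phrase_structure = ("(VP (VB Show) (NP (DT the) (NN age)) "
                    "(PP (IN of) (NP (DT each) (NN patient))))")

# Dependency-structure annotation: (head, relation, dependent) triples.
dependencies = [("Show", "obj", "age"), ("age", "det", "the"),
                ("age", "nmod", "patient"), ("patient", "det", "each")]

# Annotations make simple queries over the corpus trivial, e.g. all nouns:
nouns = [w for w, t in pos_annotation if t == "NN"]
print(nouns)  # ['age', 'patient']
```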
In the current research work, all of the above underlying NLP layers are considered and implemented as part of the POS recognition module, with the exception of the intermediate language representation layer. The employed POS recognition Python library is "speech_recognition". This module is essential to a thorough understanding of the given NLQ for proper and accurate translation.
DISAMBIGUATION
In the current research work, disambiguation is only required when the mapper finds more than one matching RDB element for a certain NLQ token. Generally, word-meaning disambiguation has special techniques and algorithms. An example is the statistical keyword-meaning disambiguation process using N-Gram Vectors [33], where N ranges from 1 to the length of the text. N-Gram Vector statistics are gathered using a training corpus of the English language, while a customised corpus gathers statistics as the system is being used. The latter corpus requires that the user review the presented NLQ interpretation and make any necessary changes before submitting it for execution. N-Gram Vectors are principally used for capturing lexical context. They are considered a measure of how likely a given token has a particular meaning in a particular user NLQ input, because every token's meaning depends on the context in which it is used. This meaning-comparison procedure follows a vector similarity ranking measure, where the higher the meaning's rank, the closer it is to the true meaning.
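A rough sketch of this context-vector ranking, using unigram vectors (the N=1 case) and cosine similarity. The sense inventory and co-occurrence counts are invented for illustration; a real system gathers them from a training corpus:

```python
from collections import Counter
from math import sqrt

# Toy sense inventory: context words typically co-occurring with each
# sense of the ambiguous token "discharge" (counts are invented).
SENSE_CONTEXTS = {
    "discharge/medical": Counter({"patient": 5, "hospital": 4, "date": 3}),
    "discharge/electrical": Counter({"battery": 5, "current": 4, "voltage": 3}),
}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(context_words):
    """Rank the candidate senses by vector similarity to the NLQ context."""
    ctx = Counter(context_words)
    return max(SENSE_CONTEXTS, key=lambda s: cosine(ctx, SENSE_CONTEXTS[s]))

print(disambiguate(["patient", "hospital", "date"]))  # discharge/medical
```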
Other word-meaning disambiguation studies involve computational linguistics and statistical language understanding. Such studies focus on word sense disambiguation within a large corpus. Another method to disambiguate a word sense is using word collocations. The word collocations method measures the meaning likelihood of more than one word when they all exist in the same NLQ. Iftikhar [34] proposed a solution to the disambiguation problem by parsing domain-specific English NLQs and generating SQL queries using the Stanford Parser [21]. This approach is widely used with Not Only Structured Query Language (NoSQL) DBs for automatic query and design. Rajender [35] used a controlled NLIDB interface and recommended SQL query features to reduce the ambiguity in an NLQ input, because the less ambiguous the domain, the more accurate the results the system can produce. What can also help resolve such ambiguity is weighted relationships or links [36] between RDB elements. In this method, a relationship's weight increases by one each time that particular relationship is used. As such, it is a given that the bigger the relationship's weight, the more often the related RDB elements are queried. This helps the NLIDB system recommend smarter options for the user to select from, ordered so that the topmost option is the most likely.
In the current research work, NLQ disambiguation is not the main focus. Thus, a simple NLQ disambiguation module is used, applying Stanford CoreNLP [21] and the "nltk.corpus" Python library [37]. Those tools are solely used to check for NLQ validity. A syntactic rules checker is also used to catch any NLQ grammatical mistakes. The adopted procedure is interactive: an error message pops up asking the user to rephrase a word or choose from a few potential spelling corrections. A Naïve Bayes Classifier [38] is implemented to simply classify the user's response as positive or negative (i.e., Yes or No).
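A minimal Naïve Bayes sketch for this yes/no confirmation step, written from scratch with add-one smoothing. The tiny training set is invented; real training data would come from logged user responses:

```python
from collections import Counter
from math import log

# Hand-made examples of positive and negative user confirmations.
TRAIN = [("yes", "pos"), ("yes please", "pos"), ("that is correct", "pos"),
         ("sure", "pos"), ("no", "neg"), ("no thanks", "neg"),
         ("that is wrong", "neg"), ("not really", "neg")]

class NaiveBayes:
    def __init__(self, samples):
        self.word_counts = {"pos": Counter(), "neg": Counter()}
        self.label_counts = Counter()
        for text, label in samples:
            self.label_counts[label] += 1
            self.word_counts[label].update(text.split())
        self.vocab = len({w for c in self.word_counts.values() for w in c})

    def classify(self, text):
        best, best_lp = None, float("-inf")
        for label in ("pos", "neg"):
            # log prior + log likelihood of each word, add-one smoothed
            lp = log(self.label_counts[label] / sum(self.label_counts.values()))
            total = sum(self.word_counts[label].values())
            for w in text.split():
                lp += log((self.word_counts[label][w] + 1) / (total + self.vocab))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes(TRAIN)
print(nb.classify("yes that is correct"))  # pos
print(nb.classify("no that is wrong"))     # neg
```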
MATCHER/MAPPER
The Matcher/Mapper module is considered the most complicated part in NLP science [14]. Therefore, keyword-based search has attracted many researchers as the simplest mapping approach, since keywords are explicitly identified. Researchers used it to improve information retrieval from DBs and solve (or avoid) the challenge of understanding NLQ tokens and mapping them into the underlying DB or schema model [39]. The mapper can be an Entity-Attribute-Value (EAV) Mapper [40], an Entity Relational (ER) Mapper [13], or an eXtensible Markup Language (XML) Documents Mapper [41]. NLQs are translated into SQL queries if the used DB or schema model is EAV or ER. NLQs can be translated into XML documents only if the system employs a document-based data model in the underlying information system. More background information on this module is given in Chapter 3, the literature review.
In the current research work, the adopted mapper is the EAV mapper, in addition to the RDB relationships. NLQ tokens are mapped into RDB lexica using the NLQ and RDB MetaTables (Tables 1 and 2, respectively). The matching lexica will then be mapped to the SQL clauses. This mapping uses the proposed rule-based algorithm, which is based on the observational assumptions table discussed later in Table 4 (in Chapter 4).
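A simplified sketch of this token-to-clause mapping. The MetaTable fragment is invented, and the single rule used here (tables feed FROM, attributes feed SELECT, values feed WHERE) is only a stand-in for the fuller rule set of Table 4:

```python
# Hypothetical fragment of the RDB elements MetaTable (cf. Table 2):
# token -> (RDB lexicon, element kind).
RDB_ELEMENTS = {
    "patient": ("Patient", "table"),
    "age":     ("age", "attribute"),
    "ptsd":    ("'PTSD'", "value"),
}

def map_tokens(tokens):
    """Bucket matched RDB lexica under their target SQL clauses."""
    buckets = {"SELECT": [], "FROM": [], "WHERE": []}
    for tok in tokens:
        match = RDB_ELEMENTS.get(tok.lower())
        if match is None:
            continue  # token has no RDB counterpart (e.g., a stop word)
        lexicon, kind = match
        if kind == "table":
            buckets["FROM"].append(lexicon)
        elif kind == "attribute":
            buckets["SELECT"].append(lexicon)
        else:
            buckets["WHERE"].append(lexicon)
    return buckets

print(map_tokens(["show", "age", "of", "patient", "with", "ptsd"]))
```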
In the proposed research work, the focus is on the Data Query Language (DQL), which has one command phrase (SELECT) in addition to other supplementary clauses (e.g., WHERE, AS, FROM). DQL, though it has only one main command phrase, is the most used among SQL's phrases, especially when it comes to RDB operations. The SELECT keyword is used to send an inquiry to the RDB seeking a particular piece of information. This could be done via a command-line prompt (i.e., a terminal) or through an Application Program Interface (API). This research proposes a translation algorithm from NLQ into SQL using the SELECT command phrase and its supplementary clauses.
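A minimal sketch of filling the DQL SELECT template from its clause parts (the function and its parameters are illustrative assumptions; the actual generator also handles joins, aliases, and the other supplementary clauses):

```python
def build_select(select_cols, from_tables, where_conds=None):
    """Assemble a SELECT ... FROM ... [WHERE ...] statement."""
    sql = f"SELECT {', '.join(select_cols)} FROM {', '.join(from_tables)}"
    if where_conds:
        sql += f" WHERE {' AND '.join(where_conds)}"
    return sql

print(build_select(["name", "age"], ["Patient"], ["age > 40"]))
# SELECT name, age FROM Patient WHERE age > 40
```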
3 Literature Review

Related Works

NLP
LUNAR [42–44] was developed in 1971 to take NLQs about moon rock samples and present answers from two RDBs, using Woods' Procedural Semantics for literature references and an Augmented Transition Network (ATN) parser for chemical data analysis. However, it only handled 78% of the NLQs due to linguistic limitations, as it managed a very narrow and specific domain.
The Philips Question Answering Machine (Philiqa) [4] was developed in 1977. Philiqa separates the syntactic and semantic parsing of the NLQ, as its semantic parser is composed of three layers, namely English Formal Language, World Model Language, and schema DB metadata.
LIFER/LADDER [45] was developed a year later, in 1978, as a DB NLP system interface to retrieve information regarding US Navy ships. It used a semantic grammar to parse NLQs. Although LIFER/LADDER supported querying distributed DBs, it could only retrieve data for queries related to a single table, or multi-table queries with easy join conditions.
In 1983, ASK [46] was developed as a learning and information management system. ASK had the ability to communicate with several external DBs via the user's NLQ interface. ASK is considered a learning system due to its ability to learn new concepts and enhance its performance through the user's interaction with the system.
TEAM [16] was developed in 1987 as an NLIDB with high portability and easy configurability on any DB system without compatibility issues, which negatively affected TEAM's core functionality. An NLIDB is an NL-based query interface which the user can use as a means of interaction with the DB to access or retrieve data using NLQs. This interface tries to understand the NLQ by parsing and tokenizing it into tokens or lexica, and then applying syntactic or semantic analysis to identify the terms used in the SQL query formation.
In 1997, a method of conceptual query statement filtration and a processing interface using NLP was defined [17] to analyze predicates using full-fledged NL parsing and finally generate a structured query statement.
In 2002, the Clinical Data Analytics Language (CliniDAL) [27] initiated a solution for the mapping problem of keyword-based search using the similarity-based Top-k algorithm. The Top-k algorithm searches for the k records of a dictionary whose similarity to a certain NLQ exceeds a predefined similarity threshold. This algorithm was successful, with an accuracy of around 84%.
In 2002, a Vietnamese NLIDB interface was developed for economic survey DBs. This proposal also included a WordNet-based NLI to RDBs to access and query DBs using a user's NL [18].
In the same year, 2002, DBXplorer employed two preprocessing steps: PUBLISH, which builds symbol tables and associated structures, and SEARCH, which fetches the matching rows from published DBs. Together, PUBLISH and SEARCH enable keyword-based search over RDBs [12].
In 2004, PRECISE [47] was developed to use the Semantically Tractable Sentences concept. The semantic interpretation of sentences in PRECISE is done by analyzing language dictionaries and semantic constraints. PRECISE matches the NLQ tokens with the corresponding DB structures in two stages. In Stage 1, it narrows down possible DB matches to the NLQ tokens using the Maximum Flow Algorithm (MFA). MFA finds the best single flow network as a directed graph, finally specifying one source and one path to increase the flow's strength, and returns the maximum number of keywords back to the system. Stage 2 analyzes the sentence's syntactic structure. After that, PRECISE uses the information returned from both stages to accurately transform the NLQ into an equivalent SQL query. However, PRECISE has a poor knowledge base, as it only retrieves keyword-based results due to NL's general complexity and ambiguity.
In 2005, an NLP question answering system on RDBs was defined for NLQ to SQL query analysis and processing, improving on work on XML processing for structural analysis with DB support. Query mapping was used to derive the DB results [48].
In 2006, NUITS system was implemented as a search algorithm using
DB schema structure and content-level clustering to translate a conditional
keyword-based query and retrieve resulting tuples [49].
The Mayo Clinic information extraction system [50] extracts information from free-text fields (i.e., clinical notes), including named entities (i.e., diseases, signs, symptoms, procedures, etc.) and their related attributes (i.e., context, …) so the Mapper can map them to their internal conceptual representation in the Clinical Information System (CIS). This mapping process is done using the similarity-based Top-k algorithm, in addition to embedded NLP tools (e.g., tokenization, abbreviation expansion and lemmatization). The mapping results are stored in a generic context model which feeds the translation process with necessary information, just like an index. As such, the corresponding tables and fields for the NLQ tokens are extracted from the data source context model to be fed into the SQL SELECT clause, and the value tables are extracted from the SELECT clause to generate an SQL FROM clause. The query (parse) tree leaf nodes represent query constraint categories, which reflect conjunction or disjunction using algebraic computations over clinical attributes and their values. CliniDAL uses a unique unified term, TERM_ID, and its synonyms as the internal identifier for software processing purposes, as in composing SQL statements.
In 2017, the special-purpose CliniDAL [42] was introduced to integrate a concept-based free-text search into its underlying structure of parsing, mapping, translation and temporal expression recognition, which function independently of the CIS, in order to query both structured and unstructured schema fields (e.g., patient notes) for thorough knowledge retrieval from the CIS. This translation is done using the pivoting approach, joining fact tables computationally instead of using the DBMS functionalities. Aggregation and statistical functions require further post-processing to be applied and computed.
In 2018, an NLI to RDBs called QUEST [56] was developed on top of the IBM Watson UIMA pipeline (McCord) and Cognos. QUEST focused on nested queries, rather than simple queries, without restricting the user to guided NLQs. The QUEST workflow consists of two major components, the QUEST Engine Box and the Cognos Engine Box. The QUEST Engine Box includes the schema-independent rule templates, which work by extracting lexicalized rules from the schema annotation file in the online part of this box, alongside the rule-based semantic parsing module that generates the lexicalized rules used in semantic parsing. The QUEST Engine Box also includes a semantic parser built on many Watson NLP components, including the English Slot Grammar (ESG) parser, Predicate Argument Structure (PAS), and the Subtree Pattern Matching Framework (SPMF). This box finally produces a list of SQL sub-queries that are then fed into QUEST's other box, the Cognos Engine Box, which focuses on final SQL statement generation and execution on an IBM DB2 server. QUEST proved to be quite accurate compared to previous similar attempts.
Despite the success of the above NLIDB attempts, token-based [57], form/template-based [58], menu-based [59] and controlled NL-based search are similar but simpler approaches that still require much effort, as the accuracy of the translation process depends on the accuracy of the mapping process.
ML Algorithms
NLP tools and ML techniques are among the most advanced practices for information extraction at present [42]. ML techniques can be either rule-based or hybrid approaches for feature identification and selection or rule classification processes. A novel supervised learning model was proposed for i2b2 [60] that integrates rule-based engines and two ML algorithms for medication information extraction (i.e., drug names, dosage, mode, frequency, duration). This integration proved efficient at extracting the reason for drug administration from unstructured clinical records, with an F-score of 0.856.
ML algorithms have a wide range of applications in dimensionality reduction, clustering, classification, and multilinear subspace learning [61, 62]. In NLP, ML is used to extract query patterns to improve response time by creating links between the NLP input sources and prefetching predicted sets of SQL templates into a temporary cache memory [63]. ML algorithms are typically bundled in a library and integrated with query and analytics systems. The main ML properties include scalability, distributed execution and lightweight algorithms [62].
NLP and knowledge-based ML algorithms have been used in knowledge processing and retrieval since the 1980s [64]. Boyan et al. [65] optimized web search engines by indexing plain-text documents, then used reinforcement learning techniques to adjust document rankings by propagating rewards through a graph. Chen et al. [66] used inductive learning techniques, including symbolic ID3 learning, genetic algorithms, and simulated annealing, to enhance information processing, retrieval and knowledge representation. Similarly, Hazlehurst et al. [67] used ML in their query engine system to facilitate automatic information retrieval based on query similarity measures, through the development of an Intelligent Query Engine (IQE) system.
Unsupervised learning by probabilistic Latent Semantic Analysis is used in information retrieval, NLP, and ML [68]. Hofmann used text and linguistic datasets to develop an automated document indexing technique using a temperature-controlled version of the Expectation Maximization algorithm for model fitting [68]. Further, Popov et al. introduced the Knowledge and Information Management framework for automatic annotation, indexing, extraction and retrieval of documents from RDF repositories based on semantic queries [69].
Rukshan et al. [70] developed a rule-based NL Web Interface for DBs (NLWIDB). They built their rules by teaching the system to recognise rules representing the different tables and attributes in the NLWIDB system; to identify escape words and ignore them; and to use DB data dictionaries, rules for the aggregate function MAX, and rules indicating the different ways to express an 'and'/'as well as' concept or an interval 'equal' concept. Data dictionaries are often used to define the relationships between attributes, to know, for example, which attribute comes first in a comparative operation and which falls afterwards in the comparative structure. This NLWIDB is similar to the current research idea; however, we intend to build a simpler and more generic algorithm that can be applied to systems in various domains. Our algorithm does not use DB elements as the basis of rule identification, as in Rukshan et al.'s research; rather, it uses general sentence-structure pattern recognition to form an equivalent SQL statement.
SPARK [71] maps query keywords to ontology resources. The translation result is a ranked list of queries in an RDF-based Query Language (QL) format, called SPARQL, created in SPARK using a probabilistic ranking model. The ontology resources used by SPARK are mapped items represented in a graph format used to feed the SPARQL queries.
Similar to SPARK, PANTO [60] translates keyword-based queries to SPARQL, but PANTO can handle complex query keywords (i.e., negation, comparatives and superlatives). Also, PANTO uses a parse tree, instead of a graph representation, to represent the intermediate results for generating a SPARQL query.
The i2b2 medication extraction challenge [60] proposed high-accuracy information extraction of medication concepts from clinical notes using Named Entity Recognition approaches, with pure ML methods or hybrid approaches of ML and rule-based systems for concept identification and relation extraction.
The Keyword++ framework [43] improves NLIDB and addresses NLQs' incompleteness and imprecision when searching a DB. Keyword++ works by translating keyword-based queries into SQL via mapping the keywords to their predicates. The scoring process is done using differential query pairs.
The Keymantic system [72] handles keyword-based queries over RDBs using schema data types and other intensional knowledge, in addition to web-based lexical resources or ontologies. Keymantic generates mapping configurations of keywords to their consequent DB terms to determine the best configuration to be used in the SQL generation.
HeidelTime [73, 74] is a temporal tagger that uses a hybrid rule-based and
ML approach for extracting and classifying temporal expressions on clinical
textual reports which also successfully solved the i2b2 NLP challenge.
CliniDAL [27] composes Restricted Natural Language Queries (RNLQs) to extract knowledge from CISs for analytics purposes. CliniDAL's RNLQ-to-SQL mapping and translation algorithms are enhanced by a temporal analyzer component that employs a two-layer rule-based method to interpret the temporal expressions of the query, whether they are absolute times or relative times/events. The Temporal Analyzer automatically finds and maps those expressions to their corresponding temporal entities in the underlying data elements of the CIS's different data design models.
Tseng and Chen [79] aim at validating the conceptual data modeling power in the NLIDB area by extending Unified Modeling Language (UML) [80, 81] concepts. They use the extended UML class diagram's representations to capture and transform NLQs with fuzzy semantics into the logical form of SQLs for DB access, with the help of a Structured Object Model (SOM) representation [82] that is applied to transform class diagrams into SQLs for query execution [50]. This approach maps semantic roles to a class diagram schema [80, 81, 83] and their application concepts; the class diagram is one of the nine UML diagrams and is used to demonstrate the relationships (e.g., Generalization and Association) among a group of classes. Carlson described several constraints for building semantic roles in English sentences [84].
UML is a standard graphical notation for Object-Oriented (OO) modeling and an information systems design tool used for requirement analysis and software design. UML class diagrams are used to model the DB's static relationships and static data models (the DB schema) by referring to the DB's conceptual schema. The SOM methodology is a conceptual data-model-driven programming tool used to navigate, analyze, and design DB applications and process DB queries [79].
The authors of [79] aim to explore NLQ constructs' relationships with the OO world for the purpose of mapping NLQ constructs that contain vague terms specified by fuzzy modifiers (i.e., 'good' or 'bad') into the corresponding class diagrams through an NLI, to eventually form an SQL statement which, upon execution, delivers answers with a corresponding degree of vagueness. The authors focused on fuzzy set theory [49] because it is a method of representing vague data with imprecise terms or linguistic variables [85, 86]. Linguistic variables consist of NL words or sentences (i.e., old, young), excluding numbers (i.e., 20 or 30); yet imprecise NLQ terms and concepts can be precisely modeled using these linguistic variables by specifying natural and simple specifications and characterizations of imprecise concepts and values.
In [79], real-world objects' connectivity paths are mapped to SQLs during NLQ execution by extracting the class diagram from the NLQ in the form of a sub-graph/tree (a validation sub-tree) of the SOM diagram that contains the relevant objects connecting the source and the target, which the user identified earlier in the form of objects and attributes. The source is the objects and their associations that have valued attributes, illustrating the relationship of the objects and attributes of interest, while the object of ultimate destination is the target. Results are then sent to the connectivity matrix to look for any existing logical path between the source and the target, and eventually map the logical path to an equivalent QL statement, which can be simplified by inner joins. The schema and membership functions represented in the class diagram are used to link each fuzzy modifier with its corresponding fuzzy classes.
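A minimal fuzzy-set sketch in the spirit of the linguistic variables above: membership functions for "young" and "old" over an age attribute. The breakpoints are invented for illustration, not taken from [79]:

```python
# Membership functions for the linguistic variable "age".
# An NLQ term like "old patients" maps to a degree of membership
# in [0, 1] rather than a crisp predicate.
def mu_young(age):
    if age <= 25:
        return 1.0
    if age >= 45:
        return 0.0
    return (45 - age) / 20  # linear ramp down between 25 and 45

def mu_old(age):
    if age <= 45:
        return 0.0
    if age >= 65:
        return 1.0
    return (age - 45) / 20  # linear ramp up between 45 and 65

print(mu_old(55), mu_young(55))  # 0.5 0.0
```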
… method in the syntax analysis. Since the LIFER/LADDER system only supports simple SQL formation, this translation architecture is largely restricted.
For RDBMSs, Gage [105] proposed a method combining an AI application with fuzzy logic applications, phrase recognition and substitution, multilingual solutions and SQL keyword mapping to transform NLQs into SQLs.
Alessandra [24] used Syntactic Pairing for semantic mapping between NLQs and SQLs for eventual NLQ translation, using an SVM algorithm to design an RDB of syntax trees for NLQ/SQL pairs, and kernel functions to encode those pairs.
Gauri [106] also used a semantic grammar for NLQ-into-SQL translation. In the semantic analysis, the author used the Lexicon to store all grammatical words, and a post-preprocessor to transform NLQs' semantic representations into SQL. However, this architecture can only translate simple NLQs and is not flexible.
Karande and Patil [51] used grammar and parsing in an NLIDB system for data selection and extraction, performing simple SQLs (i.e., SQL with a join operation or a few constraints) on a DB. This architecture used an ATN parser to generate parse trees.
Ott [107] explained the process of automatic SQL generation via an NLIDB using an internal intermediate semantic representation language based on the formal logic of the NLQs, which is then mapped to SQL+. This approach is based on First-Order Predicate Calculus Logic, embodied by a DB-Oriented Logical Form (DBLF), with some SQL operators and functions (e.g., negation, aggregation, range, and the set operator for SQL SELECT).
This approach, called SQL+, aims to solve some of the SQL restrictions, such as handling ordinals (e.g., the 6th lowest, the 3rd highest) via loop operators. To replace the loop operator, SQL+ expressions are entered into a programming interface to SQL supplied with cursor management.
SQL+ strives to preserve the full power of the NLQ by augmenting the SQLs so that each NLQ token is represented and answered by SQL expressions. Experiment results prove that even complex queries can be generated following three strategies, the Join, the Temporary Relation and the Negation Strategy, in addition to a mixture of these strategies [107].
For the join strategy, and in the DBLF formula, for each new relation reference, a join operation is built in the SQL FROM clause recursively in a top-down direction. Universal quantifiers are usually implemented by creating counters using double-nested constructs, as in (NOT EXISTS [sub-SQL]), which has been used in the TQA system [108]. However, [107] uses the temporary relation creation strategy instead to handle universal and numeric quantifiers, as well as ordinals and mixtures of aggregate functions. The temporary relations are created using the join strategy so that they can easily be embedded in any SQL expression. Hence, whenever there is a quantifier, a temporary relation
30 NLP Application
is built for it recursively. For the negation strategy, and in DBLF, negation is handled by setting the "reverse" marker for yes/no questions if the negation occurs at the beginning of the sentence, and by using (NOT IN [subquery]) constructs in the case of verb negation and negated quantifiers in other positions. Both negation handling methods are applicable when the negation occurs in front of a simple predicate, in which case the number and position of negation particles are not restricted. For the mixed strategies, any of the previous three strategies can be combined arbitrarily, as in building temporary relations when aggregate functions or ordinals occur.
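The (NOT IN [subquery]) construct for verb negation can be pictured as a small string-assembly step. The sketch below is only illustrative; the table and column names (`emp`, `works_on`) and the helper `negate_membership` are invented here and do not come from Ott's system.

```python
# Illustrative sketch of the verb-negation construct: a membership
# subquery is wrapped in NOT IN. All table/column names are hypothetical.

def negate_membership(outer_col, sub_select, sub_table, sub_where=""):
    """Render a (NOT IN [subquery]) condition for a negated verb."""
    sub = f"SELECT {sub_select} FROM {sub_table}"
    if sub_where:
        sub += f" WHERE {sub_where}"
    return f"{outer_col} NOT IN ({sub})"

# "Which employees do not work on any project?"
condition = negate_membership("emp.id", "emp_id", "works_on")
sql = f"SELECT emp.name FROM emp WHERE {condition}"
print(sql)  # SELECT emp.name FROM emp WHERE emp.id NOT IN (SELECT emp_id FROM works_on)
```

A temporary-relation variant would first materialize the subquery as its own relation and join against it, following the same assembly pattern.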
TQA [108] is an NLI that transforms NLQs into SQL directly using semantic grammar and deep structure grammar to obtain higher performance and better ellipsis and anaphora handling. However, TQA and similar systems are largely non-transportable and barely adaptable to other DB domains.
PHLIQA1 [4], ASK [46] and TEAM [16] adopt intermediate semantic representation languages connected with a Conceptual Schema (CS) to provide an efficient NLQ-into-SQL transformation tool. The CS helps map NLQ tokens to their DB lexicon representations because it stores all NLQ tokens, DB terms of relations and attributes and their taxonomies, in addition to the DB hierarchy structure and metadata.
The USL system [109] is adaptable and transportable because it has a customization device. Yet its intermediate semantic and structure language is syntax-oriented rather than based on predicate logic. Hence, some semantic meanings are represented through tree structure forms.
TQA and USL techniques together form the LanguageAccess system
[107], which has a unique SQL generation component that uses the DBLF
and the Conceptual Logical Form (CLF) as two distinct intermediate semantic
representation languages.
The LanguageAccess system works through several steps. First, a phrase structure grammar parses an NLQ to generate parse trees, which are then mapped to CLF using the CS. Generated CLF formulae are paraphrased in NL and presented to the end user for interpretation of ambiguous tokens and meaning verification. Once the end user chooses a CLF formula, it gets transformed, using the CS, to DBLF, the source of SQL generation, which is then transformed into SQLs. DBLF considers the internal representations of DB values (e.g. strings, numbers), the order of temporary relations, and the mechanism for delivering the generated expressions to DBs.
Authors of [7, 110, 111] used NLQ semantic parsing to model algorithms that map NLQs to SQLs. Similar research work was done by [112] using a specific semantic grammar. Authors of [7, 110, 113] used lambda calculus and applied it to NLQ meaning representation for the NLQ-to-SQL mapping process. Furthermore, [114] used the ILP framework's defined rules and constraints to map NLQs using
3 • Literature Review 31
their semantic parsing. For the same purpose, [7, 110, 111] followed a time-consuming and expensive approach by producing NLQ tokens' meaning representations manually. Similarly, [112] developed an authoring system through extensive expert time and effort on semantic grammar specification. Authors of [7, 110, 113] developed a supervision-extensive system using lambda calculus to map NLQs to their corresponding meaning representations.
Similar to KRISP [78], Giordani and Moschitti [115] developed a model using only Q/A pairs of syntactic trees, as the SQL compiler provides the NLQ derivation tree required to translate factoid NLQs into structural RDB SQLs. The model uses generative parsers that are discriminatively reranked using an advanced ML SVM-ranker based on string tree kernels. The reranker reorders the list of potential NLQ/SQL pairs and recalls the correct answers in this system with a recall of 94%.
The system in [115] does not depend on NLQ-annotated meaning resources (e.g. Prolog data, Lambda calculus, MR, or SQLs) or any manual semantic representations, except for some synonym relations that are missing in WordNet. The first phase is the generation phase, where NLQ tokens' lexical dependencies and a DB metadata-induced lexicon, in addition to WordNet, are used, instead of a full semantic interpretation of the NLQ, to build the SQL clauses (i.e. SELECT, WHERE, FROM, joins, etc.) recursively with the help of some rules and a heuristic weighting scheme. The DB metadata performs the relation disambiguation tasks and includes DB data types, Primary Keys (PKs), Foreign Keys (FKs) and other constraints, and the names of entities, columns and tables according to domain semantics; it is also called the DB catalog, usually stored as INFO_SCHEMA (IS) in a DB. The output of the generation phase is a ranked list of potential SQLs created by the generative parser.
Dependency Syntactic Parsing is used to extract NLQ tokens' lexical relations and dependencies. According to [19], WordNet is efficient at expanding predicate arguments to their meaning interpretations and synonyms; however, WordNet generalizes the relation arguments but does not guarantee the NLQ's lack of ambiguity and noise, which affects its meaning interpretation significantly. Therefore, this system generates every possible SQL with all of its clauses, including ambiguous ones, based on matches between NLQ lexical and grammatical dependency relations, extracted by the Stanford Dependencies Parser [116], and the logical and syntactic formulation structures of the SQL clauses.
The first relation executed on the GEOQUERIES corpus in the [115] algorithm is the FROM clause relation, which finds the corresponding DB tuples considering the optional condition in the WHERE clause and then matches the results with the SELECT clause attributes. In case of any empty clauses or nested-query mismatching, this algorithm generates no results; otherwise, correct SQLs are generated among the top three SQLs 93% of the time using a standard 10-fold cross-validation performance measure. This high accuracy and
recall are due to the robust, heuristic weights-based reranker built using SVM-Light-TK6, which extends the SVM-Light optimizer [117] by employing tree kernels [118, 119] to use the addition STKn + STKs or the multiplication STKn × STKs. Default reranker parameters are used, such as normalized kernels, λ = 0.4 and cost and trade-off parameters = 1. However, this approach mandates that possible SQLs exist in advance, as the algorithm cannot generate new SQLs; it only verifies whether an entered NLQ has a corresponding SQL to produce a correct answer.
Conceptually similar to [115], Lu et al.'s [120] mapping system does not depend on NLQ annotation either, but on a generative model and (MODELIII+R), a discriminative reranking technique. Likewise, the DCS system [121] does not depend on DB annotation and also works as a mapping system enriched with prototype triggers (DCS+). In addition, SEMRESP employs a semantic parser learner [122] that learns from Q/A pairs but works best on annotated logical forms (SQLs). Kwiatkowski et al. [123] developed the UBL system which, when trained on SQLs and Q/A pairs, is able to use restricted lexical items together with some CCG combinatory rules to learn newly entered NLQ lexicons.
rule-based approaches. As such, the third approach, which relies on MLA algorithms, requires the presence of huge domain-specific (specific keywords used in the NLQ) corpora of NLQ/SQL translation pairs. Such a corpus is difficult to create because it is a very time-consuming and tedious task requiring a domain expert. An NLQ/SQL pairs corpus requires hundreds of manually written pairs examined by a domain expert to train and test the system [70, 76, 138].
Avinash [102] employed a domain-specific ontology for the NLQ's semantic analysis. As a result, Avinash's algorithm suffers from the over-customization problem, making the system non-functional on any other domain. It is also neither transportable nor adaptable to other DB environments, except with extensive re-customisation. Such domain-specific systems assume the user is familiar with the DB schema, data and contents. On the other hand, the current research work uses simple algorithmic rules and is domain-independent. Hence, it does not assume prior knowledge of the adopted RDB schema or require any annotated corpora for training the system. Instead, it uses linguistic tools to understand and translate the input NLQ; the NLQ/SQL pairs are used only for algorithm testing and validation purposes. Furthermore, relying heavily on MLAs proved ineffective in decreasing translation error rates or increasing accuracy [139]. This remains the case even after supplying the MLA algorithm with a dedicated Error Handling Module [77].
In this regard, the current research work took proactive measures by using NLP linguistic techniques to make sure the NLQ is fully understood and well interpreted. This full interpretation happens through the intermediate linguistic layers and the RDB MetaTable before any further processing, to avoid potential future errors or jeopardized accuracy. Computational linguistics is used here in the form of linguistics-based mapping constraints applied via manually written rule-based algorithms. Those manually written algorithms are mainly observational assumptions summarised in Table 4 (Chapter 4), which specifies RDB schema categories and semantic roles to map the identified RDB lexica into the SQL clauses and keywords.
Generally speaking, rule/grammar-based approaches [102] require extensive manual rule definition and customization in case of any DB change to maintain accuracy [140]. However, the rule-based observational algorithm implemented in the current research work is totally domain-independent and transportable to any NLQ translation framework. Generally, mapping is a complicated science [14], and systems with low mapping accuracy are immediately abandoned by end users due to the lack of system reliability and trust. Hence, this research work proposes a cutting-edge translation mechanism using computational linguistics. Several aspects of the proposed research contribution will be discussed in reference to the two mapping algorithms in Figure 8 (Chapter 4).
corpus as a training and testing dataset. Such datasets are created using complex, cross-domain semantic parsing and SQL pattern coverage. However, Spider's performance surprisingly resulted in very low matching and mapping accuracy. Hence, the current research work is distinct from most previous language translation mechanism efforts because the focus here gives the highest priority to the simplicity and accuracy of the algorithm's matching outcome.
The current research work employs the NLQ MetaTable (Table 1) to map NLQ tokens into RDB lexica. The NLQ MetaTable covers NLQ words, their linguistic or syntactic roles (noun, verb, etc.), matching RDB category (table, value, etc.), generic data type (words, digits, mixed, etc.), uniqueness as a PK or FK, besides their synonyms and enclosing source (i.e., tables or attributes). MetaTables are used to check for tokens' existence as a first goal, and then to map them to their logical role as a relationship, table, attribute or value. The general-purpose English language ontology (WordNet) is used to support the MetaTables with words' synonyms, semantic meanings and lexical analysis.
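One NLQ MetaTable entry can be sketched as a small record type. The field names below are assumptions inferred from the description above, not the book's exact schema, and the sample entry is invented.

```python
# A sketch of one NLQ MetaTable entry; field names are assumptions
# based on the description in the text, not the book's exact schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MetaEntry:
    word: str             # the NLQ word
    syntactic_role: str   # noun, verb, ...
    rdb_category: str     # table, attribute, value or relationship
    data_type: str        # words, digits, mixed
    is_unique: bool       # unique as a PK or FK
    source: str           # enclosing table or attribute
    synonyms: List[str] = field(default_factory=list)

nlq_metatable = {
    "patient": MetaEntry("patient", "noun", "table", "words",
                         False, "patient", ["client", "case"]),
}

def lookup(token: str) -> Optional[str]:
    """First check the token's existence, then return its logical role."""
    entry = nlq_metatable.get(token.lower())
    return entry.rdb_category if entry else None

print(lookup("Patient"))  # table
```

The two-step use (existence check, then logical role) mirrors the two goals stated above for the MetaTables.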
The implemented MetaTables fill the low-accuracy gap in language translation algorithms that do not use any sort of deep DB schema data dictionary, such as [81, 123], or use just a limited data dictionary, such as [43]. According to [19], WordNet is efficient at expanding NLQ predicate arguments to their meaning interpretations and synonyms. However, WordNet generalizes the relation arguments and does not guarantee the NLQ's lack of ambiguity and noise, which significantly affects its meaning interpretation.
Hence, supportive techniques are employed in the current research work, such as the disambiguation module. In addition, to avoid confusion around the RDB's unique values, data profiling [121] is performed on large RDBs' statistics to automatically compile the mapping table of unique values, PKs and FKs, based on which RDB elements are queried more often. Mapping tables are manually built for smaller RDBs, while a data-profiling technique is used to build them for larger RDBs. Unique values are stored in the mapping table with their hosting sources specified, while a hashing function is used to access them instantly.
Besides, the RDB lexical join conditions are also discovered between any two words or values. The join is based on the words' or values' connectivity status with each other or on their having a common parent node in the dependency tree. The parsing helps with the NLQ semantics extraction and the RDB lexical data selection. RDB element relationships are controlled by using only verbs to represent any connectivity in the RDB schema. The verbs' parameters (subject or object) are mapped with the RDB relationship's corresponding elements: tables, attributes or values. If the NLQ verb is unidentified or missing, the relationship between NLQ tokens is found by analysing the intrarelationships of the matching RDB lexica with each other.
There are other methods in the literature that identify lexical dependencies and grammatical relations, such as the Stanford Dependencies Parser [145], the Dependency Syntactic Parser [134] and the Dependency-Based Compositional Semantics (DCS) Parser [92]. The current research work uses a simple way of representing RDB elements' inter-/intra-relationships. This representation restricts the RDB schema relationships to the form of a verb for easy mapping between NLQ verbs and RDB relationships.
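The verb-keyed representation can be pictured as follows. The relationship entries (`treats`, `takes`) and table names are invented for illustration; the fallback branch stands in for the intrarelationship analysis used when the NLQ verb is missing.

```python
# Hedged sketch: schema relationships are stored under verb names, so an NLQ
# verb maps directly to an RDB relationship; when the verb is missing, the
# relationship is inferred from the matched lexica. All names are invented.

relationships = {
    # verb -> (subject table, relationship name, object table)
    "treats": ("physician", "treats", "patient"),
    "takes":  ("patient", "takes", "medication"),
}

# fallback index: which relationship connects a given pair of tables
by_tables = {}
for verb, (subj, rel, obj) in relationships.items():
    by_tables[frozenset((subj, obj))] = rel

def find_relationship(verb, lexica=None):
    """Map an NLQ verb to a relationship; fall back to lexica analysis."""
    if verb in relationships:
        return relationships[verb][1]
    # NLQ verb unidentified or missing: analyse the matched RDB lexica
    return by_tables.get(frozenset(lexica)) if lexica else None

print(find_relationship("treats"))                         # treats
print(find_relationship(None, ("patient", "medication")))  # takes
```

Restricting relationships to verbs keeps this lookup a single dictionary access in both directions.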
(attribute) to connect them together for the UML class diagram extraction phase. The results are derived from the connectivity matrix by searching for any existing logical path between the source and the target to eventually map them into an equivalent SQL template. In comparison, and since the current research work aims for a seemingly natural HCI interaction, the user does not have to identify any semantic roles in their NLQ, because the underlying NLP tools do this for them. Also, the relationships are identified by the NLQ verbs, so the user communicates more information in their NLQ using the current research algorithm compared with the other works in the literature. Hence, it is considered more advanced and user-friendly than that in [119]. Moreover, not only objects and attributes are extracted from the NLQ; the proposed research work extracts much lower-level linguistic and semantic roles (i.e., gerunds and prepositions), which help select the matching RDB lexica with higher accuracy and precision.
Complexity vs Performance
The current research work is considerably simpler than most complex mapping approaches such as [29], as it relies on fewer, but more effective, underlying NLP linguistic tools and mapping rules. An example of a complex language translation model is the Generative Pre-trained Transformer 3 (GPT-3) [30], introduced in May 2020. GPT-3 is an AI deep learning language translation model developed by OpenAI [31]. GPT-3 is an enormous artificial neural network model with a capacity of 175 billion machine learning parameters [32]. Hence, the performance and quality of GPT-3 language translation and question-answering models are very high [2]. GPT-3 is used to generate NLP applications, convert NLQs into SQL, produce human-like text and design machine learning models.
However, GPT-3's pre-trained language representation must be trained on text, in-context information and big data (i.e., a DB that contains all internet contents, a huge library of books and all of Wikipedia) to make any predictions [31]. Furthermore, for model training, GPT-3 uses model parallelism within each matrix multiply to train the incredibly large GPT-3 models [30]. The model training is executed on Microsoft's high-bandwidth clusters of V100 GPUs. Training on such advanced computational resources contributes largely to the excellent performance of GPT-3 models. The biggest weaknesses of this model are its extreme complexity, its advanced technology requirements and the fact that it is only effective once trained, because GPT-3 does not have access to the underlying table schema.
Neural networks have not been used in the current research work, nor for any of the mapping mechanisms. The reason will become clearer with some recent work examples such as [124, 156]. SEQ2SQL [124] is a deep Sequence-to-Sequence Neural Network Algorithm [157] for generating an SQL from an NLQ semantic parsing tree. SEQ2SQL uses a Reinforcement Learning Algorithm [124] and rewards from in-the-loop query execution to learn an SQL generation policy. It uses a dataset of 80,654 hand-annotated NLQ/SQL pairs to generate the SQL conditions, which are incompatible with Cross Entropy Loss Optimization [158] training tasks. Seq2SQL's execution accuracy is 59.4% and its logical form accuracy is 48.3%.
SEQ2SQL does not use any manually written rule-based grammar like the one implemented in the current research work. In another recent work from 2019 [156], a sequence-to-sequence neural network model was shown to be inefficient and unscalable on large RDBs. Moreover, SQLNet [124] is a mapping algorithm that does not use a reinforcement learning algorithm. SQLNet showed only small improvements, by training an MLA sequence-to-sequence-style model to generate SQL queries when order does not matter, as a solution to the "order-matters" problem. Xu et al. [124] used Dependency Graphs [116] and the Column Attention Mechanism [159] for performance improvement. Though this work combined many novel techniques, the model has to be frequently and periodically retrained to reflect the latest dataset updates, which increases the system's maintenance costs and computational complexity.
The work in [157] overcomes the shortcomings of sequence-to-sequence models through a Deep-Learning-Based Model [124] for SQL generation that predicts and generates the SQL directly for any given NLQ. Then, the model edits the SQL with an Attentive-Copying Mechanism [160], a Recover Technique [3] and Task-Specific Look-Up Tables [161]. Though this recent work proved its flexibility and efficiency, the authors had to create their own NLQ/SQL pairs manually. Besides, they also had to customize the RDB used, which amounts to over-customization to the framework and environment applied. Hence, the results are highly questionable in terms of generalizability, applicability and adaptability to other domains. On the other hand, the current research work used publicly available RDBs, namely Zomato and WikiSQL.
Intermediate Representations
The current research work tries to preserve every piece of information given by the NLQ tokens so that each of them is used and represented in the SQL clauses
DOI: 10.1201/b23367-4 47
[Figure 6: NLQ-to-SQL translation pipeline: NLQ Input → Disambiguation (e.g., "Is Adam a patient or a physician?") → Matcher/Mapper (matching NLQ tokens with the RDB schema MetaTables (lexicon) and the general-purpose English language ontology) → SQL Execution → Result]
is not affected by the update because every added record will be automatically
annotated by the framework to include necessary annotations and metadata.
The NLQ is inserted through an NLI screen as input data of up to 200 characters with the help of two Python libraries, namely, "server.bot", which accepts the input NLQ, and "text_processing.text_nlp", which initially processes the NLQ by separating the words and passing them as arguments to the next module. The NLI identifies NLQ words as arguments, which will later help prepare them for identifying their semantic and syntactic roles. Figure 6 briefly summarizes the steps taken to transform an NLQ into an SQL statement. Those steps will be further clarified throughout this chapter.
POS RECOGNITION
The multilayered translation algorithm framework splits the NLQ into its constituent tokens. Then, these tokens are compared with the RDB MetaTables'
4 • Implementation Plan 49
contents to single out keywords in the NLQ sentence. With the tokens matching schema data, a.k.a. the lexica, the NLQ can be parsed semantically to identify the tokens' semantic-role frames (i.e., noun, verb, etc.), which helps the translation process. Semantic parsing is done by generating the parsing tree using the Stanford CoreNLP library, with input from the English language ontology, WordNet, which feeds the system with NLQ word meanings (semantics).
The first process performed on the NLQ string is lemmatizing and stemming its words into their broken-down original root forms. This is done by deleting the words' inflectional endings and returning them to their base forms, such as transforming 'entries' into 'entry'. Lemmatizing eases the selection and mapping of equivalent RDB elements. It also facilitates the recognition of the tokens' syntactic and semantic meanings. Then come the steps of parsing and tokenizing the word stems into tokens according to the predefined grammatical rules and the built-in syntactic roles. Those syntactic roles will be mapped to specific RDB elements; for instance, NLQ verbs are mapped with RDB relationships.
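The lemmatizing step can be pictured with a toy suffix-stripping rule that reproduces the 'entries' → 'entry' example; the real pipeline would use a dictionary-backed lemmatizer rather than these two hand-written rules.

```python
# A toy suffix-stripping sketch of the lemmatizing step, illustrating
# 'entries' -> 'entry'; the real system uses a dictionary-backed lemmatizer.

def lemmatize(word: str) -> str:
    """Strip common inflectional endings to recover a base form."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # entries -> entry
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # tables -> table
    return word                          # class stays class

print(lemmatize("entries"))  # entry
```

Reducing tokens to base forms this way is what lets a token such as "entries" match an RDB element named "entry".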
Begin
Split NLQ text to individual ordered words and store
into string array A
Delete any escape words from A
Map words in array A with RDB elements E
Replace words in array A by their matching synonyms
and type from E
If there is an ambiguous word W in A then
Ask user “What is W?” and match word W with E
End If
If there is a conditional phrase C in A
Replace C with equivalent conditional operator
in O
Attach O to conditioned attribute name as a
suffix and store in A
End If
Do
For any NLQ translation process, both the parsed tokens and their subsequent POS tags must be clearly and accurately identified. This is performed by an underlying multilayered pipeline which starts with tagging the NLQ's POS. Then, the tokenizer, annotator, and semantic and syntactic (rule-based) parsers are applied, and any punctuation marks are removed. Part of this step is omitting the meaningless excess escape words that are predefined in the system (i.e., a, an, to, of, in, at, are, whose, for, etc.) from the NLQ word group. After parsing, a parse tree is generated, and a dictionary of the tokens' names, syntactic roles and synonyms is maintained in the NLQ MetaTable. Also, the NLQ's subjects, objects, verbs and other linguistic roles are identified. Hence, each tokenized word is registered into the NLQ MetaTable by the syntactic analyzer. Tokens are then passed to the semantic analyzer for further processing.
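The punctuation removal and escape-word filtering described above can be sketched in a few lines; the escape-word list is taken from the text, while the sample question is invented.

```python
# Sketch of the pre-processing step: strip punctuation, then drop the
# predefined escape words before tokens are registered for analysis.
import string

ESCAPE_WORDS = {"a", "an", "to", "of", "in", "at", "are", "whose", "for"}

def preprocess(nlq: str) -> list:
    """Lowercase, remove punctuation, split, and drop escape words."""
    cleaned = nlq.lower().translate(str.maketrans("", "", string.punctuation))
    return [t for t in cleaned.split() if t not in ESCAPE_WORDS]

print(preprocess("Who are the patients of Dr. Adam?"))
# ['who', 'the', 'patients', 'dr', 'adam']
```

The surviving tokens are the ones a real pipeline would hand to the POS tagger and register in the NLQ MetaTable.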
The semantic analyzer employs a word-type identifier using a language vocabulary dictionary or ontology such as WordNet. The word-type identifier identifies what semantic role a word or a phrase (i.e., common or proper noun) plays in a sentence and what its role assigner (the verb) is. Furthermore, the semantic analyzer is able to identify conditional or symbolic words and map them to their relative representations from the language ontology. For example, the phrase "bigger than" will be replaced by the operator ">". In other words, the semantic analyzer's entity annotator detects the conditional or symbolic words amongst the input NLQ entities. Then, it replaces them with their equivalent semantic types identified previously by the schema annotator. The entity replacement creates a new form of the same NLQ that is easier for the SQL generator or pattern matcher to process.
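The substitution step can be sketched as a phrase-table rewrite over the NLQ string; the phrase table below is illustrative and much smaller than the system's actual list.

```python
# Sketch of the entity annotator's substitution: conditional phrases in the
# NLQ are replaced by operator equivalents, producing a new form of the
# same NLQ. The phrase table is illustrative, not the system's full list.

CONDITION_OPERATORS = {
    "bigger than": ">",
    "greater than": ">",
    "smaller than": "<",
    "equal to": "=",
}

def replace_conditions(nlq: str) -> str:
    """Return a rewritten NLQ with conditional phrases turned into operators."""
    out = nlq.lower()
    for phrase, op in CONDITION_OPERATORS.items():
        out = out.replace(phrase, op)
    return out

print(replace_conditions("salary bigger than 5000"))  # salary > 5000
```

The rewritten string is what the SQL generator or pattern matcher then consumes.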
The entity annotator is not the only annotator the NLQ deals with. There
are other annotators the NLQ gets passed through such as the numerical
for values(a, v)
if a ϵ attributes()
remove a from attributes()
end if
end for
FIGURE 7 Detailed research organization pipeline (light gray boxes are Python
libraries; dark gray boxes are tasks & gray blobs are passed-on data).
insert_synonyms()
for s ϵ synonyms and e ϵ elements
if s is similar to e and similarity > 0.75 then
merge (s, e) as (s)-[IS_LIKE]->(e)
end if
end for
sql_tagging()
for attributes and tables and conditions
if sql ≠ Ø then
apply semantics_dict[synonyms]
add attributes synonyms to select
add tables synonyms to from
add conditions synonyms to where
end if
end for
return sql_tagging(tags)
Other SQL keywords, such as aggregate functions (e.g., AVG, SUM, etc.) and comparison operators (e.g., >, <, =, etc.), defined in the Python "unicodedata" library, are also tagged with their synonyms for easy and accurate mapping, as illustrated in PseudoCode 7 (Appendix 3).
After all of the NLQ and SQL words are tagged with their synonyms, the algorithm starts the testing module to validate the similarity of the RDB lexica and NLQ tokens compared with their tagged synonyms. If the similarity is greater than or equal to 75% (the least-acceptable similarity variance), it is considered a matching synonym. In this case, lexica or tokens are tagged with their matching synonyms according to their semantics using the WordNet synonym datasets.
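The 75% threshold test can be sketched as follows. Here difflib's string ratio merely stands in for the WordNet-based semantic similarity the actual system uses, so the sketch captures only the thresholding logic, not the real similarity measure.

```python
# Sketch of the 75% similarity check; difflib's string ratio stands in
# for the WordNet-based semantic similarity used by the actual system.
from difflib import SequenceMatcher

THRESHOLD = 0.75  # the least-acceptable similarity variance

def is_matching_synonym(lexicon: str, candidate: str) -> bool:
    """Tag the candidate as a matching synonym when similarity >= 75%."""
    ratio = SequenceMatcher(None, lexicon.lower(), candidate.lower()).ratio()
    return ratio >= THRESHOLD

print(is_matching_synonym("physician", "physicians"))  # True
```

Any candidate scoring below the threshold is simply not tagged, so near-matches like plural forms pass while unrelated words are rejected.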
DISAMBIGUATION
NLQ input disambiguation is an intermediate process done through contextual analysis. When the system cannot make a decision due to some ambiguity, it asks the user for further input. This occurs when there is more than one match for a particular NLQ token (e.g., "Is Adam a patient or a physician?"). However, engaging the user is solely for clarifying a certain ambiguity in the NLQ input by choosing from a list of suggested similar words or synonyms present in the lexica list. In future work, and as a further disambiguation step to guarantee generated SQL accuracy, a feedback system could be applied after the NLQ analysis. This feedback system would ask the user to confirm the translated SQL query by asking "Is this the desired SQL?". However, since we assume the user's ignorance of any programming abilities, including SQL, this feedback system is not applied in the current research work.
The RDB elements with identical names are carefully managed according to the NLQ MetaTable (Table 1) and the RDB elements' MetaTable (Table 2). Hence, the ambiguity-checking module will eventually have a list of all identically named elements and their locations in the RDB.
Every entered NLQ goes through a syntactic rules checker for any grammatical mistakes. This module checks the NLQ's validity or the need for user clarification of any ambiguity or spelling mistakes using the Python libraries "unittest" and "textblob". The algorithm proceeds to the next step if the NLQ is valid. Otherwise, the algorithm looks for a clarification or spelling-correction response from the user by asking them to choose from a few potential corrections. Then, the user's response is classified as either positive (i.e., Yes) or negative (i.e., No). This classification happens using the Naïve
MATCHER/MAPPER
In this phase, synonyms of NLQ tokens are replaced with their equivalent names from the embedded lexica list. Then, SQL keywords are mapped and appended with their corresponding RDB lexica. The Matcher/Mapper module applies all the mapping conditions listed later in Table 4, which covers NLQ tokens, their associated RDB lexica, SQL clauses, conditional or operational expressions and mathematical symbols. This module has access to MetaTables (data dictionaries) of all attributes, relationships, tables and unique values (mapping tables). Both mappers in Figure 8 can refer to an embedded linguistic semantic-role frame schema, a data or language dictionary, or the underlying RDB schema. This layer uses RDB schema knowledge (the semantic data models, MetaTables) and related syntactic knowledge to properly map NLQ tokens to the related RDB structure and contents.
In regard to unique RDB values, and since it is a storage crisis to store all RDB values in RAM or cache memory, only unique values, PKs and FKs are stored in a mapping table. The unique values' hosting attributes and tables are specified, and a hashing function is used to access them. For smaller RDBs (i.e., Zomato), and as explained in Table 3, the mapping table is built using the Python dictionary "server.map", which finds associations
between NLQ tokens and RDB elements that are often queried together. For larger RDBs (i.e., WikiSQL), data profiling is performed on the RDB elements' statistics to automatically compile the mapping table. This compilation is based on which RDB elements are queried more often, and the results are then stored in the mapping table accessed via a hashing function. The mapping table is expressed as mapping_table[unique_value] = corresponding_attribute. Compared to the great value the mapping table adds to the algorithm's accuracy, there is no significant overhead from integrating it. Yet the bigger the RDB, the bigger the mapping table, which affects resource usage in terms of storage capacity.
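The mapping-table idea can be sketched directly: a Python dict already provides the hashed constant-time access described above. The entry below ("Zomato-1042" and its hosting source) is invented for illustration.

```python
# Sketch of the unique-values mapping table: a Python dict already provides
# the hashed constant-time access described above. Entries are illustrative.

mapping_table = {}  # unique_value -> (hosting_table, hosting_attribute)

def register_unique(value, table, attribute):
    """Store a unique value together with its hosting source."""
    mapping_table[value] = (table, attribute)

def locate(value):
    """Hash-based lookup of a unique value's hosting table and attribute."""
    return mapping_table.get(value)

register_unique("Zomato-1042", "restaurants", "restaurant_id")
print(locate("Zomato-1042"))  # ('restaurants', 'restaurant_id')
```

Storing only unique values, PKs and FKs keeps the table small enough to hold in memory while still resolving a value to its hosting source in one lookup.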
/* register relationships */
for attributes in rdbSchema do
for attribute1(lexicon1, attribute1) and
attribute2(lexicon2, attribute2) do
relationships ← relation(attribute1, attribute2)
end for
end for
/* pick the relationship: by verb if present, else by attributes */
if nlq(verb) = True
check relationships(synonyms)
else
check relation(attributes) in relationships
end if
Now that the RDB relationships have been defined and registered, the algorithm is able to retrieve the matching RDB lexica and their hosting attribute or table. This matching happens in accordance with the matching NLQ lexica and the relationships built between them. The retrieved data is then passed on to the next step to be used in the SQL clause mapping, as explained in PseudoCode 10.
for rdbSchema(lexicon) do
find parent and relationship
return parent(attribute), parent(table),
Relationship(verb)
NLQ tokens are mapped with their internal representation in the RDB
schema via the MetaTables and synonyms, and then mapped to the SQL
clauses. Each input token is mapped with its associated RDB element (lexicon)
category (e.g., value, column, table or relationship).
The mapper translates the NLQ literal conditions and constraints, whether
they are temporal or event-based, into the SQL query clauses such as translat-
ing “Older than 30” to “Age > 30”. The mapper also extracts matches of func-
tion or structure words (i.e., linking words or comparison words) and search
tokens (i.e., Wh-question words) from the annotated NLQ. Function words
could be prepositions (e.g., of, in, between, at), pronouns (e.g., he, they, it),
determiners (e.g., the, a, my, neither), conjunctions (e.g., and, or, when, while),
auxiliaries (e.g., is, am, are, have, got) or particles (e.g., as, no, not).
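The function-word extraction above can be sketched as a simple lookup. The word lists below are small samples taken from the categories just listed, not an exhaustive lexicon:

```python
# Illustrative sketch: classify NLQ function words into the categories
# described in the text. The word sets are samples, not complete lists.
FUNCTION_WORDS = {
    "preposition": {"of", "in", "between", "at"},
    "pronoun": {"he", "they", "it"},
    "determiner": {"the", "a", "my", "neither"},
    "conjunction": {"and", "or", "when", "while"},
    "auxiliary": {"is", "am", "are", "have", "got"},
    "particle": {"as", "no", "not"},
}

def classify_function_word(token):
    """Return the function-word category of a token, or None for
    content words (which proceed to lexicon matching instead)."""
    token = token.lower()
    for category, words in FUNCTION_WORDS.items():
        if token in words:
            return category
    return None

print(classify_function_word("between"))  # preposition
print(classify_function_word("Age"))     # None
```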
This module checks for the presence of any NLQ conditional, operational
or mathematical expressions (e.g., min, max, avg) in the NLQ to custom-
ize the WHERE statement accordingly to retrieve only relevant data from the
RDB, as explained in PseudoCode 11.
The first step is building the main SQL clauses, the SELECT, FROM and
WHERE clauses. The attribute names will be fed into the SELECT clause.
Hence, the SELECT keyword is appended with the table attributes. Attributes
are identified by semantically analysing the Wh-word’s main noun phrase or
head noun (main noun in a noun phrase). The WHERE keyword is mapped
with the attribute-value pairs derived from the NLQ semantics. The FROM
keyword is mapped with all involved tables’ names referenced in the SELECT
and WHERE clauses. If there is more than one table, tables will be joined and
added to the FROM clause. If there is a data retrieval condition, a WHERE
clause will be added, and conditions will be joined as illustrated in PseudoCode
13 (Appendix 6).
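The clause-building steps above can be roughly sketched as follows. This is a simplified illustration with an assumed lexica representation (attribute list, table list and condition triples), not the book's actual implementation:

```python
# Sketch of SELECT/FROM/WHERE assembly from identified lexica.
# The (attribute, operator, value) condition shape is an assumption.
def build_sql(attributes, tables, conditions=None):
    """Assemble the three main SQL clauses from identified lexica."""
    select = "SELECT " + ", ".join(attributes)
    # When more than one table is referenced, the tables are joined
    frm = " FROM " + " NATURAL JOIN ".join(tables)
    sql = select + frm
    if conditions:
        where = " AND ".join(f"{a} {op} '{v}'" for a, op, v in conditions)
        sql += " WHERE " + where
    return sql

print(build_sql(["Med_Name"], ["Medication", "Patient"],
                [("P_Name", "=", "Adam")]))
# SELECT Med_Name FROM Medication NATURAL JOIN Patient WHERE P_Name = 'Adam'
```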
In this phase, the key mapping function maps SQL clauses and keywords
to the NLQ's identified lexica, and then builds the SQL query. The list of
tables whose names should be selected from must be identified. The list of
relationships, attributes and values with their associated attributes should
also be identified in the form (attribute, value). The identified lexica are
then used either in the query constraints (e.g., WHERE, IN, etc.) if they
already have values, or as part of the SELECT statement if their values need
to be retrieved from the RDB. After that, the input values and necessary
operators are used to construct
the query constraints in the proper SQL template. An example of assigning a
suitable operator for every WHERE conditional pair (attributes and values) is
converting the NLQ string “equal” to the SQL keyword “LIKE” or the operator
“=” or converting “smaller or equal” to the operator “<=”.
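The operator-assignment step just described can be sketched as a phrase-to-operator lookup. The phrase table below is an illustrative sample, not the system's full rule set:

```python
# Sketch of translating NLQ comparison phrases into SQL operators,
# following the "Older than 30" -> "Age > 30" example in the text.
NLQ_OPERATORS = {
    "equal": "=",
    "smaller or equal": "<=",
    "greater or equal": ">=",
    "older than": ">",
    "younger than": "<",
}

def to_condition(attribute, phrase, value):
    """Rewrite an NLQ constraint phrase as a SQL WHERE condition."""
    op = NLQ_OPERATORS[phrase.lower()]
    return f"{attribute} {op} {value}"

print(to_condition("Age", "Older than", "30"))  # Age > 30
```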
In this work, only the following SQL main clauses are considered, in addi-
tion to other supplementary clauses (e.g., AS, COUNT, etc.):
------------------
| AVERAGE (*)    |
------------------
| 23 Square Feet |
------------------
5 Implementation Use Case Scenario
To match an NLQ to a proper SQL template, NLQ text will be analyzed and
tokenized to be matched against the RDB index. The NLQ goes through a full-
text search after it has been tokenized, which is different from the common
keyword search.
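The tokenize-then-match step can be sketched as below. The RDB index contents here are invented for illustration; in the actual system the index is built from the RDB MetaTables:

```python
# Sketch of matching NLQ tokens against an RDB index. The index maps
# lowercased lexica to RDB elements and is an assumed structure.
import re

RDB_INDEX = {
    "patient": ("table", "Patient"),
    "birth date": ("attribute", "P_BY"),
    "adam": ("value", "Patient.P_Name"),
}

def tokenize(nlq):
    return re.findall(r"[a-z]+", nlq.lower())

def match_tokens(nlq):
    tokens = tokenize(nlq)
    matches = []
    # Scan bigrams before unigrams so multi-word lexica like
    # "birth date" are matched as a unit
    for i, tok in enumerate(tokens):
        bigram = " ".join(tokens[i:i + 2])
        for candidate in (bigram, tok):
            if candidate in RDB_INDEX:
                matches.append((candidate, RDB_INDEX[candidate]))
                break
    return matches

print(match_tokens("What is the birth date of Adam?"))
```

This differs from a plain keyword search in that the whole token stream, including multi-word phrases, is compared against the index rather than individual keywords only.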
Figure 9 shows the directed RDB chart diagram for the Post-Traumatic
Stress Disorder (PTSD) RDB. RDB representation is used here instead of
ERD because the RDB relationships are richer in information than ERDs [13].
Tables 5–8 are the RDB tables. Table 9 is the NLQ MetaTable and Table 10 is
the PTSD RDB elements MetaTable that stores all the entities in the RDB and
their metadata.
All definitions of the RDB entities are stored in Table 10 to describe the
tables and attributes. Using Tables 9 and 10, the current automatic mapping
algorithm can produce considerably accurate mapping results.
RDB keywords related to different tables and attributes are stored together.
Hence, the algorithm is able to map the NLQ tokens to their internal represen-
tation of source attributes and tables in the RDB. To reduce ambiguity, the
relationships between attributes are controlled in the RDB design to be in the
form of verbs only (Figure 10).
[Figure: taxonomy of supported SQL query structures, covering column selections (one, multiple or all columns, DISTINCT selects and aggregate functions), table selections (one or multiple tables, disjunctions and cross-table conditions), nested and junction conditions, and condition operators and functions such as Less-Than, Max, Min-Select, Max-Select, In, Like and Between.]
The examples discussed in the below use case scenarios are as follows:
Q5: What medications did John prescribe for his patients? (Cascaded Query)
RA5: Π Medication.Med_Name, Medication.Med_Code (σPhysician.Ph_Name = “John” (Medication ⋈
Patient) ⋈ Physician)
SQL5: SELECT Med_Name, Med_Code FROM Medication WHERE P_ID
IN
(SELECT P_ID FROM Patient WHERE Ph_ID IN
(SELECT Ph_ID FROM Physician WHERE Ph_Name = 'John'));
Or
SELECT Medication.Med_Name, Medication.Med_Code FROM
(Medication INNER JOIN Patient
ON Medication.P_ID = Patient.P_ID) INNER JOIN Physician ON
Patient.Ph_ID = Physician.Ph_ID
WHERE Physician.Ph_Name = 'John';
Example 1:
The NLQ words or phrases considered as tokens are those that present a
particular meaning. Such tokens will eventually participate in the iden-
tification of the RDB tables, attributes, relationships, operators (MAX,
AVG) or values. This is because any given token may have 1 of 5 possible
matches: a table, an attribute, a value, a relationship or an operator.
After searching the RDB for the instance “Adam”, it was found under
Patients.P_Name. So, the hosting attribute name is found.
The second valuable token is “Birth Date”. Since every RDB ele-
ment (e.g., attribute) has a list of synonyms, BirthDate was matched
with Patients.P_BY. The noun phrase “Birth Date” is also a synonym of
the physician’s birth Year (Ph_BY), hence, the system must determine
the best RDB element match among all possible matches. This is done
using knowledge from other tokens’ processing. As such, since “Adam”
was found under “Patient” table, then the winning Birth Date match is
“P_BY”. Other match-determination mechanisms involve technical proce-
dures such as statistical similarity measures (e.g., the N-gram vector
comparison method). The “P_BY” here will be fed to the WHERE
clause. If there are no WHERE clauses, all DB relations and attributes will
be considered to find valid conditions. Some NLQs might not have condi-
tions, meaning there would not be a WHERE clause in the SQL template.
Generally speaking, any tables mentioned in the SELECT or WHERE
clauses should, by default, be included in the FROM clause to avoid any
SQL execution failure. The efficiency of this approach will be evaluated later
using accuracy measures.
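The match-selection step in this example, where "Birth Date" matches attributes in two tables and the table hosting another resolved token breaks the tie, can be sketched as follows (a simplified illustration, not the book's full similarity machinery):

```python
# Sketch: when a token matches attributes in several tables, prefer the
# attribute whose table was already confirmed by another resolved token.
def pick_best_match(candidates, resolved_tables):
    """candidates: list of (table, attribute) pairs; resolved_tables:
    set of tables confirmed by other NLQ tokens."""
    for table, attribute in candidates:
        if table in resolved_tables:
            return table, attribute
    # Fall back to the first candidate when no table evidence exists
    return candidates[0]

# "Adam" was resolved under the Patient table, so P_BY wins over Ph_BY
candidates = [("Patient", "P_BY"), ("Physician", "Ph_BY")]
print(pick_best_match(candidates, {"Patient"}))  # ('Patient', 'P_BY')
```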
• Table = Patient
• Attribute 1 = P_BY
• Attribute 2 = P_Name
The benefit of Figure 12, the token breakdown analysis diagram, is to
show that the source attribute, table and related RDB elements can be
reached from the NLQ tokens. Finding them helps feed the SQL template
with its necessary arguments.
[Figure 12: token breakdown analysis for Example 1. Token 1, the instance “Adam”, is searched for as a value and found under the attribute P_Name, sourced from the Patient table. Token 2, the noun phrase “Birth Date”, is searched for as an attribute; its synonyms are found under two tables, Patient (P_BY) and Physician (Ph_BY). Since Patient matches Adam's table, it is closer to the proper answer, and the resulting attribute is P_BY, sourced from the Patient table.]
Example 2:
• Table 1 = Patient
• Table 2 = Physician
• Attribute 1 = P_Name
• Attribute 2 = Ph_Name
This attribute was chosen because there is only one table containing the
attribute “Physician” as a synonym to the attribute stored in its metadata,
“Ph_Name”.
Although the word Physician also exists in the Ph_BD attribute meta-
data, the WH-Word in the NLQ (Who) refers to a human name instance
(value), not consecutive digits or a number as in the Physician Birth Date
(Ph_BD) attribute values.
Therefore, the acquired information is:
• Table 1 = Patient
• Table 2 = Physician
• Attribute 1 = P_Name
• Attribute 2 = Ph_Name
Since we have more than one table, the suitable SQL Template here is:
Example 3:
Based on our main assumptions, any comparative expression will help iden-
tify the SQL comparative clause. In this case it is a MAX, and to calculate
the maximum of any range, we have to know the values within that range.
From the token “Disease”, and after a search around the RDB, we found
only one table, called Disease, with a synonym of “Illness”. Under that table,
there is one attribute containing the word disease, which is Disease_
Name. This concludes all the required information to use the following
SQL template:
• Table = Disease
• Attribute = Disease_Name
[Figure: token breakdown analysis for Example 3. Token 1, “What”, is categorized as an SQL clause indicator; Token 2, “Most”, is categorized as a MAX value indicator; Token 3, “Illness”, is a common noun categorized as an attribute, resulting in the attribute Disease_Name, sourced from the Disease table.]
Example 4:
• What = Value Indicator
• Ahmed = Instance = Value
• Drug = Common Noun = Attribute
• Taking = Verb = Relationship
Same as the previous examples, except that the word “drug” has more
than one match. We have one table, called Medication, with a synonym of
“Drug”, but there are two attributes under the Medication table with synonyms
of “Drug”, namely Med_Name and Med_Code. Since the NLQ has no fur-
ther tokens to decide which attribute the user is referring to, we output
both of them.
For complex and nested queries like this example, the mapping and
translation algorithm can be applied recursively.
Following the same steps as the previous examples, we reach the follow-
ing acquired information:
• Table 1 = Medication
• Table 2 = Patient
• Attribute 1 = Med_Name
• Attribute 2 = Med_Code
• Attribute 3 = P_ID
• Attribute 4 = P_Name
[Figure: token breakdown analysis for Example 4. “What” is a Wh-word; the instance is searched for as a value and found under the attribute P_Name; the verb “taking” is matched to the relationship “Patient take Medication”; and the common noun “Drug” is searched for as an attribute, with synonyms found under the Medication table, resulting in the matching attributes Med_Name and Med_Code.]
Example 5:
Using the word “prescribe” helps us identify who John is; since “John”
is a very common name, it could be in the Patient table as well as the
Physician table. Searching the RDB, we find only one “prescribe” relationship,
pointing from Physician to Patient. Hence, we look for the value “John” in the
Physician table, with proper join clauses across the three tables, following the
below SQL template:
• Table 1 = Medication
• Table 2 = Patient
• Table 3 = Physician
• Attribute 1 = Med_Name
• Attribute 2 = P_ID
• Attribute 3 = Ph_ID
• Attribute 4 = Ph_Name
[Figure: token breakdown analysis for Example 5. “What” is a Wh-word; the instance “John” is searched for as a value; the verb “prescribe” resolves to the relationship “Physician Prescribe Patient”, placing John under the Physician table; and the common noun resolves to the attribute Med_Name.]
6 Implementation Testing and Performance Measurements

IMPLEMENTATION ENVIRONMENT AND SYSTEM DESCRIPTION
The machine used for this experiment is a MacBook Pro running macOS
Mojave, version 10.14.2 (18C54). The processor is a 2.9 GHz Intel Core i7
(64-bit architecture) with one processor, two cores and 256 KB of cache per
core. The memory is 8 GB of RAM (distributed over two memory slots, each
accepting a 1600 MHz Double Data Rate 3 (DDR3) memory module), with
750 GB of disk space over a SATA physical interconnect.
For the implementation coding and execution, Python 3.7 [164] was
chosen as the programming language due to its clear syntax and popular
NLP libraries for RDB processing tasks. The Integrated Development
Environment (IDE) PyCharm CE, Xcode and XQuartz were used to develop
and compile the source code, as they provide a Python unit-testing frame-
work that allows for unit-testing automation consistent with the Python
Software Foundation [130]. The system's required dependencies include
essential tools and supportive tools. All of the tools are downloaded and
installed locally on the experiment machine. The essential tools are declared
in Figure 16, including:
• IDE PyCharm CE [168]: The Python IDE for code development and
unit testing.
• XQuartz 2.7.11 [169]: A development environment designed for
Apple OS X with supportive libraries and applications.
• Xcode 11 [170]: An application development tool for Apple OS X,
used in this implementation to check the code's syntactic correctness.
• MySQL Workbench [171]: An SQL development and administra-
tion tool used mainly for visual modeling.
DATABASE
The current implementation uses MySQL DBMS as a backend environment.
The implementation testing uses two RDBs, Zomato RDB [172] for algorithm
testing, and the WikiSQL RDB [173] for algorithm validation. The testing
process using a small RDB confirms the framework’s functionality, while the
framework validation process evaluates the framework’s accuracy, efficiency
and productivity.
Results from both Zomato (small RDB) and WikiSQL (large RDB) will be
compared based on RDB size. Table 11 compares the two RDBs
in terms of their number of instances or records, the number of tables and the
public data source where they were published.
Zomato RDB [172], published in 2008, is a small RDB of 2.5 MB,
containing 9,552 NLQ and SQL pairs stored in three comma-separated
values (csv) file tables. Zomato RDB is about a restaurant search engine sup-
plied by the public data platform “Kaggle”. Zomato RDB has the schema dem-
onstrated in Figure 17.
The WikiSQL_DEV RDB [174], published in 2017, was chosen for its
large size. It holds 200.5 MB of data: 80,654 manually annotated NLQ
and SQL pairs over 24,241 tables from Wikipedia. This RDB is used for
developing NLIs for RDBs. Moreover, WikiSQL is considered the largest
web-based, hand-annotated NLQ/SQL dataset available to date.
IMPLEMENTATION TESTING
AND VALIDATION
The proposed mapping algorithm is tested by running a randomized
shuffling of the NLQ/SQL pairs from the Zomato RDB. This step uses four
library functions, namely “random.shuffle”, “collections.defaultdict”, “tqdm”
and “sql_parse.get_incorrect_sqls”. First, the underlying NLP tools and the
Matcher/Mapper module are tested by feeding the system the NLQ lemma-
tized tokens. Then, the tokens go through the Matcher/Mapper module to
match the tokens with their synonyms built into the NLQ MetaTable. After
that, tokens and their synonyms will be mapped to their adjacent RDB values,
attributes, tables or relationships, each based on their syntactic role. To test
the SQL template generator module, a set of RDB lexica will be passed to this
module and the generated SQL will be examined for correctness, accuracy and
other performance metrics discussed in the next section.
PERFORMANCE EVALUATION
MEASUREMENTS
The purpose of the proposed algorithm is to generate SQLs from NLQs auto-
matically. It is important to obtain a reliable estimate of performance for this
language translation algorithm. However, the algorithm's accuracy perfor-
mance may rely on factors besides the learning algorithm itself, such as
class distribution, the effect (cost) of misclassification and the size of the
training and test sets. Therefore, to validate the algorithm's performance
and efficiency, more detailed accuracy measures are used to test the
generated SQLs' accuracy, precision and recall.
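These measures follow the standard confusion-matrix definitions: Accuracy = (TP + TN)/(TP + TN + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F-measure = 2PR/(P + R). A minimal sketch, with illustrative counts rather than the experiments' actual numbers:

```python
# Standard confusion-matrix measures used to evaluate generated SQLs.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Illustrative counts only, not the experiments' confusion matrix
print(metrics(tp=93, fp=7, tn=0, fn=4))
```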
FIGURE 18 ROC curve for the first experiment with Zomato RDB.
For the second experiment with the WikiSQL RDB, the implementation
resulted with the following performance metrics declared in Table 14 and
Figure 19.
Compared with other similar research works on WikiSQL, the proposed
work still achieves the highest accuracy measure as illustrated in Table 15 and
Figure 20.
While the aforementioned performance measurements are sufficient to
answer the current research question, the average time translating each query
remains 1.5 minutes. This could be mainly due to the humble computer system
FIGURE 19 ROC curve for the second experiment with WikiSQL RDB.
used to run the testing and validation processes. However, this processing time
could be enhanced when analyzing the exact reasons of delay using further
performance analysis. For example, each server executing the NLQ into SQL
translation requests could be examined using the following specific perfor-
mance metrics:
• Area under graph, W: the total time the server used to translate
all queries, where T = Total Period, C = Completed queries and
B = Busy Time, so that, for example, throughput X = C/T, utilization
U = B/T and mean service time S = B/C.
After running the analysis procedure, and as per Figure 21, it turns out that the
process phase that took the longest time is the matching and mapping phase.
This is to be expected since it does most of the tasks executed by the transla-
tion algorithm. A surprising discovery is the amount of time spent on the query
execution and results retrieval from the MySQL RDB as follows:
This time consumption breakdown represents the average time taken by each
module to execute a translation task. They were computed after running a
group of translation tasks and calculating the average time consumed by them
combined.
[Figure 21: average translation-time breakdown per module — Matcher/Mapper 42%, SQL Execution and Results 30%, POS Recognition 17%, SQL Template Generator 5%, Disambiguation 4%, NL Interface 2%.]
The Queueing Model [188] could be used to enhance the translation perfor-
mance by grouping together the similar SQL types in the queue. However,
queueing models have dependency side effects considering the relationships
between the SQLs in the queue and the corresponding service times for each
SQL execution process.
Those effects could be mitigated by using similar calculations based on
predicted workload intensity [189] and service requirements [190]. The work-
load intensity [189] is a measure of the number of translation requests made in
a given time interval. The service requirements [190] represent the amount of
time each query translation request requires from the server in the processing
system.
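Under the operational-law definitions commonly paired with these quantities, the per-server metrics can be sketched as below. The numbers are invented for illustration; only the symbols T, C and B come from the text:

```python
# Operational-law sketch for per-server queueing metrics, using
# T (total period), C (completed requests) and B (busy time).
def server_metrics(T, C, B):
    throughput = C / T     # completed translations per unit time
    utilization = B / T    # fraction of the period the server was busy
    service_time = B / C   # average time per completed translation
    return throughput, utilization, service_time

# e.g. a 60-minute period, 30 completed translations, 45 busy minutes
print(server_metrics(T=60, C=30, B=45))  # (0.5, 0.75, 1.5)
```

With these invented numbers the mean service time works out to 1.5 minutes per query, matching the average translation time reported for the current environment.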
If we assume that the system is fast enough to handle the arriving transla-
tion requests, the queries' translation completion rate (throughput) would equal
the arrival rate. In this ideal case, the implementation environment would be
in “jobs-flow balance” [191], where each query's queueing delay approaches
zero minutes, rather than adding to the 1.5-minute translation time observed
in the current implementation environment.
7 Implementation Results Discussion
In terms of precision, Zomato RDB scored 93% while WikiSQL scored 91%.
These figures are the proportion of correctly generated queries. Precision
also measures the algorithm's efficiency in identifying and retrieving matching
RDB lexica, since retrieving the wrong lexica would lower the accuracy mea-
sures through wrong SQL generations.
From the aforementioned precision and recall measures, the F-measure can
be derived where F-measure = 2PR/ (P + R). The F-measure is the average
performance measure of the matching RDB lexica retrieved as a result of an
accurate matching and mapping process during the NLQ into SQL translation.
Hence, the F-measure basically measures the accuracy of the data retrieved
from an RDB as a result of applying an algorithm. The implementation execu-
tion using Zomato RDB had an F-measure of 94.5%, while WikiSQL had an
F-measure of 92%. In Table 16, a comparison between the two RDB experi-
ments' performance measures is summarized in a confusion matrix. The num-
bers in the table represent averages over all runs of both experiments
considering all queries in a run.
Figure 22 illustrates and summarizes all aforementioned performance
metrics measures. With regard to the peaks of accuracy, recall, precision and
F-measure bars in Figure 22, and in addition to the error rates (FPR and FNR)
comparison, it can be concluded that the proposed algorithm is functioning
properly and as needed. Hence, the mapping algorithm does indeed select the
correct RDB elements successfully and map them with the correct SQL clauses
using the novel mapping mechanism. This mapping is based on linguistic
studies of sentence structure: breaking the sentence down to the word
level and studying the words' inter- and intra-relationships.
Moreover, the area under the ROC curve (AUC), shown in Figures 18
and 19, is almost 100% for Zomato RDB, while the WikiSQL AUC is 93%.
We conclude from this AUC comparison that the proposed algorithm
produces accurate results for smaller RDBs, but is less accurate for RDBs
that cover big data.
[Figure 22: bar chart comparing Accuracy, Recall, Precision, F-Measure, TPR, FPR, TNR and FNR (0–100%) for Zomato DB versus WikiSQL DB.]
IMPLEMENTATION LIMITATIONS
Mapping Limitations
The reason for the lack of accuracy in bigger RDBs, according to the pro-
posed algorithm’s experiments, is the system’s confusion between the actual
RDB elements’ names in the RDB MetaTable and the synonyms table as a
whole. Thus, if a field is actually named “Birth_Day”, and another field is
named “BD” but has a synonym of “Birthday”, the system will give prior-
ity to the field named “Birth_Day”, which is the main source of confusion.
However, the adoption of the synonyms table in NLP is quite immature
and could be improved using appropriate machine learning techniques,
such as classifying the synonyms and recognizing the actual column
names.
Another cause of inaccuracy in the proposed framework is the mapping
table. When NER or data profiling is used to import RDB’s unique values and
fields and tables’ names, the algorithm will be obstructed from correctly map-
ping an NLQ token to an RDB value. This occurs when the NLQ token is not a
unique value and therefore not included in the mapping table. Though the algo-
rithm is supposed to search for the mentioned NLQ value in the whole RDB, it
still starts with the RDB’s unique values table (the mapping table) to minimize
the searching time. This precedence prioritizes the RDB’s unique values list,
stored in the mapping table, over the entire RDB elements which increases the
chances that the included unique values are mistakenly selected as a matching
lexicon. Yet the value retrieval accuracy would still not be guaranteed, since
this depends greatly on how clearly the data clerk has entered the data and
whether it has adequate synonyms attached to it.
i. Limited HCI interaction with users to assure the most natural way
of communication, that is, direct questioning and answering. This
way the user does not need to identify any NLQ tokens' semantic
roles.
ii. Does not need an annotated NLQ/SQL pairs corpus for training,
making it domain-independent and adaptable to any environment.
To the best of our knowledge, this proposed research work for NLQ into SQL
mapping and translation presents a novel mechanism. This work bridges the
gap between RDBs and nontechnical DB administrators through a simple
language translation algorithm using strong underlying NLP techniques. This
work enables nontechnical users with no knowledge of RDB semantics to have
the capability to retrieve information from employed RDBs.
The validation of the proposed research experiments and results has
shown promising NLQ into SQL transformation and translation performance.
As such, the smaller RDB achieved 95% accuracy, higher than the
larger RDB, which scored about 93%. This conclusion is in accordance with
the applied performance metrics and measures such as accuracy, precision,
recall and F-measure.
However, the larger RDB in this experiment identified clear areas of improve-
ment to push its language transformation accuracy above 93%. Another big
area of improvement is further simplifying the algorithm's code and testing
it on a better implementation environment and technical resources. The aim
is to minimize the translation time, as it currently takes an average of
1.5 minutes to return a well-formed SQL, given an NLQ.
FUTURE WORK
Since the research around NLIDBs is only a few years old, there are so many
future work opportunities to expand this work, including but not limited to:
for word in NLQ(tokens) do
    rslt = match_label[Table, Attribute, Value,
        Relationship]
    if rslt[0] then
        label token as Table
    end if
    if rslt[1] then
        if rslt[2] then
            label token as Value
            return (rslt[1], rslt[2])
        else
            label token as Attribute
        end if
    end if
    if rslt[3] then
        label token as Relationship
    end if
end for
Appendix 2
keywords_synonyms()
if keyword is average then
    add synonyms['average', 'avg']
elif keyword is great then
    add synonyms['greater', 'gt', '>', 'larger', 'more
        than', 'is greater than']
elif keyword is small then
    add synonyms['smaller', 'st', '<', 'lesser
        than', 'less than', 'is less than']
elif keyword is greater_or_equal then
    add synonyms['greater or equal', 'gt or eq',
        '>=', 'larger or equal', 'more than or equal']
elif keyword is smaller_or_equal then
    add synonyms['smaller or equal', 'st or eq',
        '<=', 'lesser than or equal', 'less than or equal']
elif keyword is equal then
    add synonyms['equal', 'eq', '=', 'similar',
        'same as', 'is']
elif keyword is sum then
    add synonyms['what is the total', 'sum']
elif keyword is max then
    add synonyms['what is the maximum', 'max',
        'maximum']
elif keyword is min then
    add synonyms['what is the minimum', 'min',
        'minimum']
elif keyword is count then
    add synonyms['how many', 'count']
elif keyword is junction then
    add synonyms['and', 'addition', 'add',
        'junction']
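The same registry can be transcribed directly into a Python dictionary with a reverse lookup, which is how a mapper would consume it (a sketch; the function name is illustrative):

```python
# Keyword-synonym registry transcribed from the pseudocode above.
KEYWORD_SYNONYMS = {
    "average": ["average", "avg"],
    "great": ["greater", "gt", ">", "larger", "more than",
              "is greater than"],
    "small": ["smaller", "st", "<", "lesser than", "less than",
              "is less than"],
    "greater_or_equal": ["greater or equal", "gt or eq", ">=",
                         "larger or equal", "more than or equal"],
    "smaller_or_equal": ["smaller or equal", "st or eq", "<=",
                         "lesser than or equal", "less than or equal"],
    "equal": ["equal", "eq", "=", "similar", "same as", "is"],
    "sum": ["what is the total", "sum"],
    "max": ["what is the maximum", "max", "maximum"],
    "min": ["what is the minimum", "min", "minimum"],
    "count": ["how many", "count"],
    "junction": ["and", "addition", "add", "junction"],
}

def keyword_for(phrase):
    """Reverse lookup: find the keyword an NLQ phrase stands for."""
    for keyword, synonyms in KEYWORD_SYNONYMS.items():
        if phrase.lower() in synonyms:
            return keyword
    return None

print(keyword_for("how many"))  # count
```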
Appendix 3
input = nlq(words)
output = correct_nlq(input)
while input ≠ Ø do
    if Spellcheck(input) = error then
        print ('Sorry, there is an error in your NLQ.',
            nlq)
        reset input = user_response(nlq(words))
        return input
    end if
    if ambiguitycheck(input) = true then
        print ('What did you mean by',
            ambiguate(word), '?')
        classify user_response()
        if user_response = true then
            set input ← user_response(clarification)
        else
            set input = user_response(originalNLQ)
        end if
    end if
end while
Appendix 5
for nlq(token) do
    /* mapping tokens with their equivalent lexica or
       their synonyms */
    if lexica[matching_lexicon(table, attribute,
        value, relationship), synonym] ← token then
        find (matching_lexicon(table) -
            [HAS_ATTRIBUTE]-> matching_lexicon(attribute) -
            [HAS_VALUE]-> matching_lexicon(value)) -
            [HAS_RELATIONSHIP]->
            matching_lexicon(relationship)
        compare token with matching_lexicon and
            synonym
        spanTag matching_lexicon where
            matching_lexicon is similar to token and
            similarity > 0.75
        return matching_lexicon
    elif matching_lexicon > 1 then
        print ('Which word did you mean to use?',
            lexicon[0], 'or', lexicon[1])
        matching_lexicon ← user_response()
        return matching_lexicon
    else /* matching_lexicon[] ↚ token */
        matching_lexicon(table) or
        matching_lexicon(attribute) or
        matching_lexicon(value) or
        matching_lexicon(relationship) = False
        return error
    end if
    /* find the corresponding RDB elements from the
       identified ones */
end for
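The similarity comparison with the 0.75 threshold can be sketched as follows. The pseudocode does not name a specific similarity measure, so difflib's ratio is used here purely as a stand-in:

```python
# Sketch of the "similarity > 0.75" lexicon match, using difflib's
# SequenceMatcher ratio as an assumed similarity measure.
from difflib import SequenceMatcher

def best_lexicon(token, lexica, threshold=0.75):
    """Return the RDB lexicon most similar to the token, provided it
    clears the similarity threshold from the pseudocode."""
    scored = [(SequenceMatcher(None, token.lower(), lex.lower()).ratio(),
               lex) for lex in lexica]
    score, lexicon = max(scored)
    return lexicon if score > threshold else None

# A misspelled token still resolves to its intended lexicon
print(best_lexicon("medicaton", ["Medication", "Patient", "Physician"]))
```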
Appendix 7
Included SQL Query Types:
• Simple Queries:
– SELECT 1 column in a table (or more) without conditions to
present all data under selected column.
• Nested Queries (Subqueries):
– SELECT 1 column in a table (or more) with WHERE condition(s).
• Cascaded Queries:
– Join 2 or more columns FROM 2 or more tables in the SELECT/
FROM statement without conditions like:
table-name1 JOIN table-name2 ON attribute1(PK of table1) =
attribute2 (attribute in table2 and also FK of table1)
– Join 2 or more columns FROM 2 or more tables in the SELECT/
FROM statement with WHERE conditions. The WHERE clause
is a single condition or a joint of several conditions.
• Negation Queries:
– Using the NOT Operator with SQL syntax to negate a WHERE
condition.
• Simple WHERE Conditions:
– 1 Simple operational condition (=,>, <, etc.)
– 1 Aggregation condition (max, min, etc.)
– 1 Negation Condition (NOT)
• Complex WHERE Conditions:
– 2 or more operational conditions (=,>, <, etc.)
– 2 or more Aggregation conditions (max, min, etc.) concatenated
with “=” and a value specified by the end user.
– 2 or more Negation conditions (NOT AND)
– Including subordinates and conjunctions
• Order/group by:
– Asc. (alphabetical, numeric).
– Desc. (alphabetical, numeric).
Appendix 8
from string import Template

class Templates:
    # zero attributes, one table
    temp100 = Template('SELECT DISTINCT * FROM $table')
    # zero attributes, one table, one attribute-value pair
    temp101 = Template("SELECT DISTINCT * FROM $table "
                       "WHERE $attribute='$value'")
    # one attribute, one table
    temp110 = Template('SELECT DISTINCT $attribute FROM $table')
    # one attribute, one table, two attribute-value pairs (AND)
    temp112 = Template("SELECT DISTINCT $attribute FROM $table "
                       "WHERE $attribute1='$value1' AND "
                       "$attribute2='$value2'")
    # two attributes, one table
    temp120 = Template('SELECT DISTINCT $attribute1, '
                       '$attribute2 FROM $table')
    # zero attributes, two tables
    temp200 = Template('SELECT DISTINCT * FROM $table1 '
                       'NATURAL JOIN $table2')
    # zero attributes, two tables, one attribute-value pair
    temp201 = Template("SELECT DISTINCT * FROM $table1 "
                       "NATURAL JOIN $table2 WHERE $attribute='$value'")
    # zero attributes, three tables, one attribute-value pair (AND)
    temp301 = Template("SELECT DISTINCT * FROM $table1 "
                       "NATURAL JOIN $table2 NATURAL JOIN $table3 "
                       "WHERE $attribute='$value'")
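Filling one of these templates is a plain `string.Template` substitution, with the lexica gathered by the mapper supplying the placeholder values. A brief usage sketch (names follow the book's example schema):

```python
# Usage sketch: $-placeholders are filled with the mapper's lexica.
from string import Template

temp110 = Template('SELECT DISTINCT $attribute FROM $table')

sql = temp110.substitute(attribute='P_Name', table='Patient')
print(sql)  # SELECT DISTINCT P_Name FROM Patient
```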
Appendix 9

TABLE 17 Literature works comparison
EXISTING HOW THESIS SYSTEM
# AREA SOLUTIONS ADVANTAGE DISADVANTAGE DIFFERS?
1 NLQ into SQL Authoring • Uses semantic grammar • Relies heavily on end- • Only involves end users
mapping Interface Based specification, which is a user input throughout in the case of any NLQ
Approaches Systems language definition that multiple interface screens words spelling mistakes
provides accurate rules to modify the used or ambiguous phrases.
for linguistic expressions keywords or phrases. • For linguistic expressions
semantic parsing. • Requires extensive semantic parsing,
expertise time and NLP tools are used to
efforts to identify and lemmatize, tokenize,
specify RDB elements define and tag each NLQ
and concepts. token.
2 Enriching • Widely used in MLA • Requires extensive • The rule-based
the NLQ/ problems. manual rules defining observational algorithm
SQL Pairs via • Provides logical and customizing in case implemented is totally
Inductive Logic knowledge and of any DB change to domain-independent and
Programming reasoning. maintain accuracy. portable on any natural
language translation
framework.
• Adds extra metadata
to the NLQ/SQL pairs to
easily find a semantic
interpretation for NLQ’s
ambiguous phrases for
accurate mapping.
EXISTING HOW THESIS SYSTEM
# AREA SOLUTIONS ADVANTAGE DISADVANTAGE DIFFERS?
3 Using MLA • NLQ/SQL pairs’ corpora • Requires a huge domain • Uses simple
Algorithms induces semantic specific NLQ/SQL algorithmic rules and is
grammar parsing to map translation pairs’ corpora domain independent.
NLQs into their SQLs. that is manually written. • It does not assume
• Used by training a • Data preparation is prior knowledge of the
Support Vector Machine time consuming and a adopted RDB schema or
(SVM) classifier, which tedious task. require any annotated
is an efficient MLA • Requires a domain corpora for training
for high dimensional expert to train and test • NLQ/SQL pairs are only
datasets. the system. used for algorithm
• System is over- testing and validation
customized and purposes.
unfunctional on any • Focus is on
other domain. understanding the NLQ
• Assumes the user is to avoid potential future
familiar with the DB errors or jeopardize
schema, data and accuracy.
contents.
• Relying heavily on
MLAs are not effective
in decreasing the
translation error rates or
Appendix 9
increasing accuracy.
• SVM algorithm needs a
lot of memory space.
• SVM is not scalable to
larger DBs.
(Continued)
121
122
Appendix 9
TABLE 17 (Continued) Literature works comparison
EXISTING HOW THESIS SYSTEM
# AREA SOLUTIONS ADVANTAGE DISADVANTAGE DIFFERS?
4. Restricted NLQ Input
   Solution and advantages:
   • Uses a simple keyword-based search structure.
   • Uses a user-friendly form-based, template-based or menu-based NLI to facilitate the mapping process.
   Disadvantages:
   • Restricts the user to certain domain-specific keywords.
   • Insignificant in terms of accuracy and recall.
   • Has portability problems even with advanced algorithms such as the similarity-based Top-k algorithm.
   How the thesis system differs:
   • The current work facilitates the interaction between humans and computers without NLQ restrictions.
   • Has a limited interaction with the user to assure the most natural way of communication, direct questioning and answering, without needing the user to identify any NLQ tokens' semantic roles.
   • Provides high accuracy and recall.
   • Compatible with any RDB domain and translation environment.
5. Lambda Calculus
   Solution and advantages:
   • Uses the NLQ's meaning representation for the mapping process.
   • A simple high-level language model of computation.
   Disadvantages:
   • Has some complicated language logic.
   • Too abstract in many cases.
   • Very slow in execution.
   • Hard to define rules with its logical expressions.
   How the thesis system differs:
   • Uses a compatible programming language, Python, that can be translated to any other language using grammatical parse trees and language compilers.
   • The current speed is an average of 1.5 mins per query.
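The lambda-calculus idea above can be sketched in Python, the thesis system's own language. This is an illustrative sketch only, not the book's implementation; the attribute name, threshold and table are hypothetical:

```python
# Illustrative lambda-calculus-style meaning representation:
# "patients older than 50" ~ λrow. age(row) > 50.
def gt(attr, value):
    pred = lambda row: row[attr] > value   # executable meaning (a predicate)
    pred.sql = f"{attr} > {value}"         # its SQL surface form
    return pred

p = gt("age", 50)
print(p({"age": 63}))                          # True
print(f"SELECT * FROM patient WHERE {p.sql}")  # SELECT * FROM patient WHERE age > 50
```

The same combinator is both runnable (the lambda) and renderable (the `sql` attribute), which is the appeal of meaning representations over ad hoc string assembly.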
6. Tree Kernels Models
   Solution and advantages:
   • Applies kernel functions on NLQ/SQL pairs' syntactic trees to learn their grammar.
   • Applies linear kernels on a "bag-of-words" to train the classifier to select the correct SQLs for a given NLQ.
   Disadvantages:
   • Requires a fully annotated NLQ/SQL pairs corpus.
   • Unable to recognize structural similarities and syntactic relations in an NLQ.
   • Has lower performance when scaled up to larger DBs.
   How the thesis system differs:
   • Does not require an NLQ/SQL pairs corpus to develop.
   • The employed NLP tools carefully understand the NLQ and recognize its structural similarities and syntactic relations.
   • Has an insignificantly lower performance with larger RDBs as well, but it is still acceptable.
7. Unified Modeling Language (UML)
   Solution and advantages:
   • Used to model the DB's static relationships and data models.
   • Refers to the DB's conceptual schema.
   Disadvantages:
   • Limited to a few class diagram concepts (e.g., classes, attributes, associations, aggregation and generalization).
   • The end user has to identify classes and their constituents.
   • UML model visualization requires compatible environments.
   How the thesis system differs:
   • MetaTables and mapping tables are used; they accommodate any type and kind of data.
   • Only NLQ input is required from the user.
   • The rule-based algorithm is compatible with all computational environments.
8. Weighted Links
   Solution and advantages:
   • Uses the highest-weight meaningful joins for mapping between NLQ tokens, RDB lexica and SQL clauses.
   Disadvantages:
   • Compromises accuracy with complexity.
   • Requires a huge annotated training dataset.
   • Computationally expensive.
   How the thesis system differs:
   • Accuracy and simplicity are both the main focus of the current work.
   • No training dataset is needed.
9. Morphological and Word Group Analyzers [area: NLQ Tokens into RDB Lexica Mapping (NLQ Tokens Extraction)]
   Solution and advantages:
   • Used for token extraction.
   • Analyses words' morphology.
   Disadvantages:
   • Requires a huge annotated training dataset.
   • Mapping accuracy is considerably low.
   How the thesis system differs:
   • The English word semantics dictionary (WordNet) is used to extract words' semantic information.
10. Pattern Matching
   Solution and advantages:
   • Used to find keyword types.
   • Facilitates learning other domains' features.
   Disadvantages:
   • Requires a huge annotated training dataset.
   • Hard to analyse the causes of NLQ/SQL pairs' mismatching.
   How the thesis system differs:
   • NLQ token extraction and token-type identification happen through NLP computational linguistics processes, mainly the lemmatizer and the tokenizer.
   • The assumption-based rules make it easy to find the causes of NLQ/SQL pairs' mismatching.
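Keyword-type pattern matching of this sort can be sketched with plain regular expressions. The token types and patterns below are hypothetical illustrations, not the thesis's actual rule set:

```python
import re

# Classify NLQ tokens by simple, ordered regular-expression patterns.
PATTERNS = [
    ("number", re.compile(r"^\d+(\.\d+)?$")),       # 42, 3.14
    ("date",   re.compile(r"^\d{4}-\d{2}-\d{2}$")), # 2020-01-15
    ("word",   re.compile(r"^[A-Za-z]+$")),         # alphabetic tokens
]

def token_type(token: str) -> str:
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return "unknown"

print([token_type(t) for t in "show patients admitted 2020-01-15".split()])
# ['word', 'word', 'word', 'date']
```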
11. Named Entity Recognizer (NER) Alone with Coltech-Parser in GATE
   Solution and advantages:
   • Used to tokenize and extract the NLQ's semantic information.
   Disadvantages:
   • Restricted to recognizing only the NLQ tokens that already exist in the NER resource.
   • Integrating data from external resources is computationally expensive.
   How the thesis system differs:
   • While NER tagging is part of the underlying NLP tools, the main data source is acquired from the NLQ and RDB MetaTables.
   (entry 11, continued)
   Disadvantages:
   • Mapping accuracy is considerably low.
   How the thesis system differs:
   • MetaTables are used to check for tokens' existence as a first goal, then to map them to their logical role as a relationship, table, attribute or value.
   • WordNet is used to support the MetaTables with words' synonyms, meanings and lexical analysis.
12. Java Annotation Patterns Engine (JAPE) Grammars
   Solution and advantages:
   • Used for NLQ tokenization and NER tagging.
   • Generates a tag probability distribution.
   • Applies a rich feature representation.
   Disadvantages:
   • Less expressive than SQL-like languages.
   • Memory-intensive, in that it creates a whole structured source tree for every DB element.
   How the thesis system differs:
   • No source trees are required except for the NLQ tokens' relations analysis step.
13. Porter Algorithm
   Solution and advantages:
   • Used to extract tokens' stems.
   • Does not require reprocessing of knowledge structures.
   Disadvantages:
   • Only supports a few languages.
   • Not a practical approach, as it requires a huge amount of memory to process.
   • Has a high false-positive rate.
   • Hard to implement in other languages.
   How the thesis system differs:
   • The current work is language-independent.
   • Does not mandate the availability of huge memory, except for storage of the RDB and the MetaTables.
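To illustrate suffix-stripping stemming in the spirit of the Porter algorithm: the real algorithm applies five phases of context-sensitive rules, while this deliberately simplified sketch strips only a few common English suffixes:

```python
# Simplified suffix-stripping stemmer (illustrative only; not the full
# Porter algorithm). Strips the first matching suffix, keeping a stem of
# at least three characters.
SUFFIXES = ["ing", "edly", "ed", "es", "s"]

def stem(word: str) -> str:
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([stem(w) for w in ["mapping", "queries", "parsed", "tokens"]])
# ['mapp', 'queri', 'pars', 'token']
```

Even this toy version shows why stemmers over-strip ("mapping" → "mapp"), which relates to the high false-positive rate noted above.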
14. Unification-Based Learning (UBL) Algorithm
   Solution and advantages:
   • Extracts NLQ tokens using restricted lexical items and Combinatory Categorial Grammar (CCG) rules.
   Disadvantages:
   • Long processing time.
   • Complicated nature of the stemmer.
   • Has a high error rate in recognizing NLQ noun phrases.
   • Difficulty in analyzing token relations.
   How the thesis system differs:
   • Average processing time of 1.5 mins per query.
   • The NLP tools easily identify and recognize tokens' semantic roles and their lexical relations.
15. Dependency Syntactic Parsing
   Solution and advantages:
   • Used to extract tokens and their lexical relations.
   • Replaces parse trees with dependency structures.
   • Captures meaningful dependency relations directly.
   Disadvantages:
   • Potential data loss during parse tree generation and expansion.
   • Eventual error propagation while applying greedy parsing.
   • Language-dependent.
   How the thesis system differs:
   • The NLTK parser parses the NLQ tokens according to built-in semantic roles that are mapped to specific RDB elements.
   • A parse tree is generated, a dictionary of table names, attributes and tokens is maintained, and the NLQ's subjects, objects and verbs are identified.
16. Separate Value and Table Extractor Interfaces
   Solution and advantages:
   • A compromise approach that does not support the RDB schema elements' MetaTables and synonyms.
   • Ideal for complex NLQ/SQL pairs.
   Disadvantages:
   • Requires a big annotated training dataset.
   • Does not provide information on NLQ tokens' semantic relationships.
   • Requires a long time to process.
   How the thesis system differs:
   • Supports RDB schema elements' MetaTables and synonyms for tokens' semantic information.
   • No need for a rich annotated corpus of NLQ/SQL pairs for algorithm training.
   • Domain-independent and configurable on any working environment.
17. Spider System [area: NLQ Tokens into RDB Lexica Mapping (RDB Lexica Mapping)]
   Solution and advantages:
   • Uses a rich corpus created with complex, cross-domain semantic parsing and SQL pattern coverage.
   • An incremental approach; new experiences affect processing.
   Disadvantages:
   • Uses a huge human-labeled NLQ/SQL corpus for training and testing.
   • Mapping accuracy is not significantly high.
   How the thesis system differs:
   • Focuses on the simplicity and accuracy of the algorithm's mapping outcome with the highest priority.
   • Uses the NLQ MetaTable to map NLQ tokens into RDB lexica.
   • The implemented MetaTables fill the low-accuracy gap in language translation algorithms that use no deep DB schema data dictionaries, or only a limited Data Dictionary.
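A MetaTable-style token lookup can be sketched as a simple dictionary. The hospital-flavored schema entries below are hypothetical, chosen only to illustrate the token-to-lexicon mapping idea:

```python
# Minimal MetaTable sketch: each known token (or synonym) maps to an RDB
# lexicon together with its role (table, attribute, value or relationship).
META_TABLE = {
    "patient":  {"lexicon": "patient",        "role": "table"},
    "patients": {"lexicon": "patient",        "role": "table"},
    "age":      {"lexicon": "patient.age",    "role": "attribute"},
    "treats":   {"lexicon": "doctor-patient", "role": "relationship"},
}

def map_token(token: str):
    # Case-insensitive lookup; returns None for tokens outside this RDB.
    return META_TABLE.get(token.lower())

print(map_token("Patients"))  # {'lexicon': 'patient', 'role': 'table'}
print(map_token("velocity"))  # None
```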
18. WordNet Alone
   Solution and advantages:
   • Efficient at expanding NLQ predicate arguments to their meaning interpretations and synonyms.
   • Handles complex NLQs without an ontological distinction.
   Disadvantages:
   • Generalizes the relation arguments and does not guarantee the NLQ's freedom from ambiguity and noise, which significantly affects its meaning interpretation.
   How the thesis system differs:
   • Supportive techniques are employed in the current research work, such as the disambiguation module.
   • To avoid confusion around the RDB's unique values, data profiling is performed on RDB statistics to automatically compile the mapping table of unique values, PKs and FKs.
   • Unique values, PKs and FKs are stored in a mapping table by specifying their hosting attributes and tables, while a hashing function is used to access them instantly.
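The unique-values mapping table with hash-based access can be sketched with a Python dict, which is itself a hash table with average constant-time lookup. The schema and values are hypothetical:

```python
# Sketch of a unique-values mapping table: each unique value, PK or FK is
# stored with its hosting table and attribute; dict hashing gives O(1) access.
mapping_table = {}

def register(value, table, attribute, kind):
    mapping_table[value] = {"table": table, "attribute": attribute, "kind": kind}

register("P-1001", "patient", "patient_id", "PK")
register("John Smith", "patient", "name", "unique value")
register("P-1001", "visit", "patient_id", "FK") if False else None  # FKs would need per-table keys

# An NLQ token that matches a stored value resolves instantly to its host:
print(mapping_table["John Smith"])
# {'table': 'patient', 'attribute': 'name', 'kind': 'unique value'}
```

A real table would key on (value, table) pairs so the same value can appear as a PK in one table and an FK in another; the single-key dict here is only the access-pattern illustration.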
19. Stanford Dependencies Parser [area: NLQ Tokens into RDB Lexica Mapping (RDB Lexica Relationships)]
   Solution and advantages:
   • Can parse any language in any free word order.
   • Displays all sentence structure and token dependencies.
   • Uses parsing trees to represent syntax and semantics.
   Disadvantages:
   • An outdated lexicalized parser, which leads to unnecessary errors.
   • Sentences must follow the Chomsky Normal Form (CNF) style.
   How the thesis system differs:
   • The NLQ can be in any form as long as it has correct spellings and no ambiguous tokens.
20. Dependency Syntactic Parsing
   Solution and advantages:
   • Simple and expressive.
   • Displays each token in the NLQ at a high level.
   Disadvantages:
   • Does not show any semantic information.
   • Some parsing trees are erroneous, in that they never lead to the targeted RDB elements.
   • Potential false early prune-out.
   How the thesis system differs:
   • The simplest and most effective way of representing RDB element relationships is to restrict the RDB schema relationships to the form of a verb, for easy mapping between NLQ verbs and RDB relationships.
21. Dependency-Based Compositional Semantics (DCS) System Enriched with Prototype Triggers
   Solution and advantages:
   • Parsing and learning are done using logical forms (trees).
   • Has a rich set of computational, statistical and linguistic properties.
   Disadvantages:
   • Requires manual annotation of logical forms and semantic parsing.
   • Complex implementation.
   How the thesis system differs:
   • A simple rule-based algorithm uses the semantic role of a verb to link RDB lexica with each other.
   • No manual annotation is needed.
22. Named Entity Tagger [area: NLQ Tokens into RDB Lexica Mapping (NLP syntax and semantics)]
   Solution and advantages:
   • Recognizes a wide range of literal values of named or numerical entity sets.
   Disadvantages:
   • Does not remember previously tagged entity sets.
   • Supports limited languages.
   • Does not show dependencies between named entity sets.
   How the thesis system differs:
   • Previously tagged entity sets are saved (temporarily, in case of limited storage) in the NLQ MetaTable.
   • Supports the NLQ's syntactic and semantic grammar analysis with computational linguistics algorithms, in the form of RDB and NLQ MetaTables, to assist the mapping of tokens into RDB lexica.
   • The NLP syntactic and semantic tools show the source tables of each token, which explains the tokens' relationships.
23. Dependency Parser
   Solution and advantages:
   • Produces parallel syntactic dependency trees.
   • Constructs dependency trees directly, without any parse tree conversions.
   Disadvantages:
   • Does not recognize complex language phenomena.
   How the thesis system differs:
   • Considers understanding the NLQ, by finding the combination of its tokens' meanings, the most essential part of the mapping and translation process.
   (entry 23, continued)
   How the thesis system differs:
   • Employs computational linguistic studies at the word-processing level.
   • The current research discovered common semantics between NLQ and SQL by analyzing the languages' syntax roles.
24. LIFER/LADDER Method
   Solution and advantages:
   • Uses NLQ syntactic and semantic analysis alone.
   • Simple and easy to implement.
   Disadvantages:
   • Insufficient; produces substantially low precision, FPR and TNR.
   How the thesis system differs:
   • This research framework overcomes any poor performance of the underlying linguistic tools meant to analyse NLQ syntax and semantics (such as the named entity tagger, the tokenizer or the dependency parser) by using the supportive MetaTables and the WordNet ontology.
25. NLQ/SQL Syntax Trees Encoded via Kernel Functions
   Solution and advantages:
   • Learns multiple NLQ syntactic features.
   • Represents unrestricted features of domain-specific knowledge.
   Disadvantages:
   • Limited to the available data resources.
   • Expensive development and high time consumption.
   • Does not show dependencies between named entity sets.
   How the thesis system differs:
   • RDB schema knowledge, the semantic data models in the form of MetaTables, and syntactic-analysis knowledge are used to generate parse trees from the identified tokens to properly map NLQ tokens to the related RDB elements.
26. The Probabilistic Context-Free Grammar (PCFG) Method
   Solution and advantages:
   • Models NLQ features using production rules with estimated probabilities.
   • Uses overlapping and interdependent features to build its probability models.
   Disadvantages:
   • Proved challenging in terms of finding the right grammar for optimization.
   • Iterative production rules lead to inherited computational complexity.
   • Requires an annotated training and testing dataset.
   How the thesis system differs:
   • The current research discovered common semantics between NLQ and SQL by analyzing the languages' syntax roles.
   • Does not require annotated datasets.
27. The Extended UML Class Diagrams Representations [area: RDB Lexica into SQL Clauses Mapping (SQL clauses mapping)]
   Solution and advantages:
   • Extracts fuzzy tokens' semantic roles in the form of a validation sub-graph or tree of the Self-Organizing Maps (SOM) diagram representation, which transforms class diagrams into SQL clauses using fuzzy set theory.
   • More flexible than the MLA approaches.
   • Provides higher measures of recall.
   Disadvantages:
   • Has a high False Positive Ratio.
   How the thesis system differs:
   • Uses computational-linguistics mapping constraints to transform lexica into SQL clauses and keywords.
   • Computational linguistics is used here in the form of linguistics-based mapping constraints, using a manually written rule-based algorithm; those algorithms are mainly observational assumptions.
   • The MetaTable specifies RDB schema categories (value, relationship, attribute, etc.) to map the identified RDB lexica into SQL clauses and keywords.
   • Provides high measures of accuracy and recall.
28. RDB Relationships and Linguistic Analysis
   Solution and advantages:
   • Relationships are used to map the lexica into the NLQ's linguistic semantic-role classes as a conceptual data model.
   • The mapping results are derived from the connectivity matrix by searching for any existing logical path between the source (objects) and the target (attributes), to eventually map the logical path to an equivalent SQL.
   Disadvantages:
   • The user has to identify the source, its associations or relationships, and the target in the fuzzy NLQ to connect them for UML class diagram extraction.
   • Extraction is not thorough or exhaustive.
   How the thesis system differs:
   • The current work uses RDB relationships and NLP tools, which are more capable of "understanding" the NLQ statement before translating it into an SQL query.
   • This method highly contributes to the increase in translation accuracy.
   • Regarding the linguistic inter-relationships within the RDB schema, the current work uses not only WordNet but also NLTK and NLP tools. Besides, a manual rule-based algorithm defines how NLQ linguistic roles match the RDB elements, which explains the variance in translation accuracy and precision in comparison.
   (entry 28, continued)
   How the thesis system differs:
   • Assures a seemingly natural interaction between the user and the computer; the user does not have to identify any semantic roles in their NLQ, as the underlying NLP tools do this for them.
   • The relationships are identified by the NLQ verbs, so the user communicates more information in their NLQ with the current research algorithm than with the other literature works; hence, it is considered more advanced and user-friendly.
   • Not only objects and attributes are extracted from the NLQ; the proposed research work extracts much lower-level linguistic and semantic roles, such as gerunds and prepositions, which help select the matching RDB lexica with higher accuracy and precision.
29. L2S System [area: RDB Lexica into SQL Clauses Mapping (Complexity vs Performance)]
   Solution and advantages:
   • Compares all existing NLQ tokens with existing DB elements using NLP tools, a token semantic mapper and a graph-based matcher.
   Disadvantages:
   • A complicated system that consumes a lot of time to run through all DB elements for comparison with the NLQ tokens.
   • Computationally expensive.
   How the thesis system differs:
   • Considered significantly simpler than most complex mapping approaches, as it relies on fewer, but more effective, underlying linguistic tools and mapping rules.
30. Bipartite Tree-Like Graph-Based Processing Model
   Solution and advantages:
   • Employs sophisticated semantic and syntactic analysis of the input NLQ.
   Disadvantages:
   • A complicated system that requires domain-specific background knowledge and thorough training and testing datasets.
   How the thesis system differs:
   • The current work is the best in terms of performance, simplicity and adaptability to different framework environments and RDB domains.
   • Uses MetaTables to define the lexica's semantic roles and their adjacent SQL slots for better mapping accuracy.
31. Ellipsis Method
   Solution and advantages:
   • Deals with instances of ellipsis (less than a sentence).
   • Uses a computationally cheap and robust approach.
   Disadvantages:
   • Requires that the NLQ be explained by the user.
   • A memory-based learning method.
   • Produces a few mismatched SQLs.
   How the thesis system differs:
   • Employs a lightweight approach for query translation, with no storage needs other than the MetaTables and the mapping tables.
32. The Highest Possibility Selection
   Solution and advantages:
   • Automatically discards any features that are not necessary.
   • Memorizes and searches previous NLQ encounters to find relatable features.
   Disadvantages:
   • Relies heavily on labeled training and testing data, which is expensive and tedious to create.
   • Not generalizable across other domains.
   How the thesis system differs:
   • No labeled data is needed for development or execution.
   • Generalizable across domains.
33. Weighted Neural Networks and Stanford Dependencies Collapsed (SDC)
   Solution and advantages:
   • Generates ordered and weighted SQL schemata.
   • Uses linguistics in the algorithm.
   Disadvantages:
   • Computationally expensive.
   • Unscalable to bigger RDBs.
   How the thesis system differs:
   • The translation accuracy of this algorithm still falls behind the proposed algorithm, because of the latter's use of further semantic roles and linguistic categories (i.e., adjectives, pronouns, etc.).
   (entry 33, continued)
   Solution and advantages:
   • Uses the NLQ's subject or object to search the DB for matching attributes with weighted projection-oriented stems, and generates the SQL clauses accordingly.
   Disadvantages:
   • Prioritizes SQLs based on probability of correctness instead of accuracy and precision.
   How the thesis system differs:
   • Uses the verbs to find the attributes' and values' relationships, instead of using a heavyweight tool such as the weighted projection-oriented stems.
34. Pattern Matching of SQL
   Solution and advantages:
   • Used as a mapping algorithm.
   • Easy to develop and execute.
   • Handles mixed data types of NLQ tokens.
   Disadvantages:
   • Does not apply any NLQ interpretation modules or parsing elaborations.
   • Its translation accuracy and overall performance are highly jeopardized.
   How the thesis system differs:
   • Both implemented mappers have access to an embedded linguistic semantic-role frame schema (WordNet and the Stanford CoreNLP Toolkit), a data dictionary (the MetaTables) and the RDB schema; those resources are essential for accurate SQL query formation and generation.
35. NLQ Conceptual Abstraction [area: RDB Lexica into SQL Clauses Mapping (SQL Formation vs SQL Templates)]
   Solution and advantages:
   • A concept-based query language to facilitate SQL formulation.
   • Scalable to large datasets.
   Disadvantages:
   • SQLs are constructed from scratch, which adds extra computational complexity to the language translation system.
   • Adds an additional, unnecessary layer on top of the original system architecture.
   How the thesis system differs:
   • Simplifies SQL query generation by using ready SQL templates.
   • SQL construction constraints are used in the mapping algorithm to guarantee accurate SQL template selection.
   • This approach is considered a simple and accurate method of generating SQLs.
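Ready-SQL-template selection and filling can be sketched as follows; the template set and slot names are hypothetical illustrations of the idea, not the thesis's actual templates:

```python
# Two ready SQL templates with named slots; a construction constraint picks
# the template that matches the available slots instead of building SQL
# from scratch.
TEMPLATES = {
    "select_where": "SELECT {columns} FROM {table} WHERE {condition};",
    "select_all":   "SELECT {columns} FROM {table};",
}

def build_sql(columns, table, condition=None):
    key = "select_where" if condition else "select_all"
    return TEMPLATES[key].format(
        columns=", ".join(columns), table=table, condition=condition or "")

print(build_sql(["name", "age"], "patient", "age > 50"))
# SELECT name, age FROM patient WHERE age > 50;
```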
36. Semantic Grammar Analysis
   Solution and advantages:
   • Used to store all grammatical words to be used for mapping the NLQ's intermediate semantic representation into SQL clauses.
   Disadvantages:
   • Due to this system's complexity, the architecture can only translate simple NLQs.
   • Not flexible with nested or cascaded SQLs.
   How the thesis system differs:
   • Accommodates more complex SQL types, such as nested or cascaded SQLs.
37. Kernel Functions, SVM Classifier, and the Statistical and Shallow Charniak's Syntactic Parser
   Solution and advantages:
   • Used to classify NLQ/SQL pairs as correct or incorrect.
   • The mapping algorithm is at the syntactic level.
   • Uses NLQ semantics to build syntactic trees and select SQLs according to their probability scores.
   • A parser is applied to compute the number of shared high-level semantics and common syntactic substructures between two trees, to produce the union of the shallow feature spaces.
   Disadvantages:
   • Achieves low recall of correctly retrieved SQL answers.
   • Requires labeled training and testing datasets of NLQ/SQL pairs.
   • Such exclusive domain-specific systems are highly expensive.
   • Its performance is subject to the accuracy and correctness of the training and testing datasets, which are manually written by a human domain expert.
   How the thesis system differs:
   • No need to develop training and testing datasets of NLQ/SQL pairs for every new domain.
   • High recall, due to the use of an accurate mapping algorithm mapped to ready SQL templates.
38. Heuristic Weighting Scheme
   Solution and advantages:
   • NLQ/SQL pairs' syntactic trees are used as an SQL compiler to derive NLQ parsing trees.
   • NLQ tokens' lexical dependencies, the DB schema and some synonym relations are used to map DB lexica to the SQL clauses.
   Disadvantages:
   • No NLQ annotated meaning resources or manual semantic interpretation and representation are used to fully understand the NLQ.
   • The SQL generator's performance is considerably low.
   • The SQL generator is built from scratch, which adds high complexity to the language translation algorithm.
   How the thesis system differs:
   • Uses simple algorithmic rules based on computational linguistics to fully understand the input NLQ and ensure the highest translation accuracy.
   • The RDB MetaTable is used for lexical-relations disambiguation.
   • A mapping table is also used, which includes the RDB lexica's data types, PKs and FKs, and names of entity sets (unique values), in addition to other rule-based mapping constraints.
   • Uses SQL templates, with extra focus on passing accurate RDB lexica into the SQL template generator for better performance and correct output.
39. A Deep Sequence-to-Sequence Neural Network
   Solution and advantages:
   • Generates an SQL from NLQ semantic parsing.
   • Uses reinforcement learning and rewards from in-the-loop query execution to learn an SQL generation policy.
   Disadvantages:
   • Incompatible with cross-entropy loss optimization training tasks.
   • Requires manually annotated NLQ/SQL pairs for generating the SQL conditions.
   • Execution accuracy is as low as 59.4%, and logical form accuracy is 48.3%.
   • Proved inefficient and unscalable on large RDBs.
   How the thesis system differs:
   • Uses a manually written rule-based grammar for the mapping.
   • Produces high measures of accuracy and recall.
   • No need for labeled training and testing data.
40. MLA Sequence-to-Sequence-Style Model
   Solution and advantages:
   • Employs a mapping algorithm without reinforcement learning.
   • Showed small improvements in generating SQL queries when order does not matter.
   • Solves the "order-matters" design problem.
   Disadvantages:
   • Has very low performance measures.
   • Had to use dependency graphs and the column attention mechanism for performance improvement.
   • The model has to be frequently and periodically retrained to reflect the latest dataset updates, which increases the system's maintenance costs and computational complexity.
   How the thesis system differs:
   • Translates NLQs into SQLs while maintaining high simplicity and performance.
   • The system does not need to be updated or maintained periodically.
41. A Deep-Learning-Based Model
   Solution and advantages:
   • Predicts and generates the SQL directly for any given NLQ.
   • Uses an attentive-copying mechanism, a recovery technique and task-specific look-up tables to edit the generated SQL.
   • Overcomes the shortcomings of sequence-to-sequence models.
   • Proved its flexibility and efficiency.
   Disadvantages:
   • The NLQ/SQL pairs were manually written for model training and testing.
   • The used RDB is specifically customized to the framework and environment it is applied on.
   • Highly questionable in terms of generalizability, applicability and adaptability to other domains.
   How the thesis system differs:
   • Uses public-source RDBs, namely Zomato and WikiSQL, only for algorithm validation and testing.
   • Does not need labeled data for development.
   • The translator algorithm is domain-independent and configurable in any other environment.
42. Regular Expressions (regexps) [area: RDB Lexica into SQL Clauses Mapping (Intermediate Representation)]
   Solution and advantages:
   • Represent NLP tokens' phonology and morphology.
   • Use NLQ intermediate semantic representation layers to represent NLQ lexica as SQL clauses.
   • Token representation happens by applying First-Order Predicate Calculus Logic, resembled by the DB-Oriented Logical Form (DBLF) and the Conceptual Logical Form (CLF), with some SQL operators and functions to build and generate SQLs.
   Disadvantages:
   • Regexp collections in NLQ sentences are not clearly articulated in the literature.
   • Proved not as effective as the NLP tools, MetaTables and mapping tables in terms of accuracy and precision.
   How the thesis system differs:
   • Uses NLQ MetaTables and RDB MetaTables to increase the accuracy of mapping NLQ tokens into RDB lexica and then into SQL clauses.
   • Tries to preserve every possible piece of information given by the NLQ, so that each NLQ token is used and represented in the produced SQL clauses and expressions.
   • Uses multiple NLP tools, MetaTables and mapping tables for unique values to fully understand the NLQ and map its tokens to their corresponding RDB elements.
43. The Similarity-Based Top-K Algorithm
   Solution and advantages:
   • Processes NLQ tokens to map them to their internal conceptual representation layer.
   • Uses Entity-Attribute-Value (EAV) DB metadata and grammatical parse trees.
   Disadvantages:
   • Proved ineffective because of the high-complexity, time-consuming approaches applied.
   How the thesis system differs:
   • No conceptual representation is needed for the mapping.
   • The identified attributes are automatically mapped into the SQL SELECT clause, the tables are extracted from the SELECT clause to generate the SQL FROM clause, and the values are used as conditional statements in the WHERE clause.
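The SELECT/FROM/WHERE assembly rule described above can be sketched in a few lines; the attribute-to-table map stands in for the MetaTables and is hypothetical:

```python
# Assemble an SQL string from identified pieces: attributes go to SELECT,
# their host tables to FROM, and value conditions to WHERE.
def assemble_sql(attributes, attr_tables, conditions):
    select = ", ".join(attributes)
    # FROM is derived from the tables hosting the SELECT attributes.
    tables = sorted({attr_tables[a] for a in attributes})
    sql = f"SELECT {select} FROM {', '.join(tables)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

print(assemble_sql(
    ["name", "age"],
    {"name": "patient", "age": "patient"},
    ["age > 50"]))
# SELECT name, age FROM patient WHERE age > 50
```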
44. Lambda-Calculus
   Solution and advantages:
   • Uses lambda calculus to map tokens to their corresponding meaning representations.
   Disadvantages:
   • A supervision-extensive system.
   How the thesis system differs:
   • No meaning representation is needed for the mapping.
   • Does not require any human supervision to function properly.
45. An Intermediate Tree-Like Graph
   Solution and advantages:
   • Transforms DB lexica into an intermediate tree-like graph.
   • Extracts the SQL using the maximum bipartite matching algorithm.
   Disadvantages:
   • Computationally expensive, and processing is time-consuming.
   How the thesis system differs:
   • No internal graphical representation is needed for the mapping.
   • Processing the mapping has an average speed of 1.5 mins.
Glossary
AI Artificial Intelligence
API Application Program Interface
CSV Comma-Separated Values
DAC Data Administration Commands
DB DataBase
DBMS Database Management System
DCL Data Control Language
DDL Data Definition Language
DML Data Manipulation Language
DQL Data Query Language
EAV Entity-Attribute-Value
ERD Entity-Relational Diagram
FKs Foreign Keys
FNR False Negative Ratio
FPR False Positive Ratio
IDE Integrated Development Environment
MLA Machine Learning Algorithm
NER Named Entity Recognition
NL Natural Language
NLI Natural Language Interface
NLIDB Natural Language Interface for DataBase
NLP Natural Language Processing
NLQ Natural Language Question
NLTK Natural Language Toolkit
NLTSQLC NL into SQL Convertor
NoSQL Not Only Structured Query Language
OLTP Online Transactional Processing
PKs Primary Keys
POS Part of Speech
PTSD Post-Traumatic Stress Disorder
QAS Question Answering Systems
QL Query Language
RA Relational Algebra
RDB Relational DataBase
147
148 Glossary
149
150 References
[15] Queralt, A., & Teniente, E. (2006, November). Reasoning on UML class dia-
grams with OCL constraints. In International Conference on Conceptual
Modeling (pp. 497–512). Springer, Berlin, Heidelberg.
[16] Grosz, B. J., Appelt, D. E., Martin, P. A., & Pereira, F. C. (1987). TEAM: An
experiment in the design of transportable natural-language interfaces. Artificial
Intelligence, 32(2), 173–243.
[17] Owei, V., Rhee, H. S., & Navathe, S. (1997). Natural language query filtra-
tion in the conceptual query language. In Proceedings of the Thirtieth Hawaii
International Conference on System Sciences (Vol. 3, pp. 539–549). IEEE.
[18] Nguyen, D. T., Hoang, T. D., & Pham, S. B. (2002). A Vietnamese natural lan-
guage interface to database. In 2012 IEEE Sixth International Conference on
Semantic Computing (pp. 130–133). IEEE, China.
[19] Miller, G. A. (1995). WordNet: A lexical database for English. Communications
of the ACM, 38(11), 39–41.
[20] Sleator, D. D., & Temperley, D. (1995). Parsing English with a link grammar.
arXiv preprint cmp-lg/9508004.
[21] Stanford CoreNLP. (2014). Stanford CoreNLP – Natural language software.
Last accessed December 23, 2019. Retrieved from: https://stanfordnlp.github.io/
CoreNLP/
[22] Johnstone, B. (2018). Discourse analysis. John Wiley & Sons.
[23] Leech, G. N. (2016). Principles of pragmatics. Routledge.
[24] Giordani, A., & Moschitti, A. (2009, June). Semantic mapping between natu-
ral language questions and SQL queries via syntactic pairing. In International
Conference on Application of Natural Language to Information Systems
(pp. 207–221). Springer, Berlin, Heidelberg.
[25] Zhang, J., Scardamalia, M., Reeve, R., & Messina, R. (2009). Designs for collec-
tive cognitive responsibility in knowledge-building communities. The Journal of
the Learning Sciences, 18(1), 7–44.
[26] Gallè, F., Mancusi, C., Di Onofrio, V., Visciano, A., Alfano, V., Mastronuzzi, R.,
… & Liguori, G. (2011). Awareness of health risks related to body art practices
among youth in Naples, Italy: A descriptive convenience sample study. BMC
Public Health, 11(1), 625.
[27] Safari, L., & Patrick, J. D. (2019). An enhancement on Clinical Data Analytics
Language (CliniDAL) by integration of free text concept search. Journal of
Intelligent Information Systems, 52(1), 33–55.
[28] Kando, N. (1999, November). Text structure analysis as a tool to make retrieved
documents usable. In Proceedings of the 4th International Workshop on Information Retrieval with Asian Languages (pp. 126–135).
[29] Zhang, M., Zhang, J., & Su, J. (2006, June). Exploring syntactic features for relation
extraction using a convolution tree kernel. In Proceedings of the Main Conference
on Human Language Technology Conference of the North American Chapter
of the Association for Computational Linguistics (pp. 288–295). Association for
Computational Linguistics.
[30] Sagar, R. (2020, June 3). OpenAI releases GPT-3, the largest model so far.
Analytics India Magazine. Retrieved October 14, 2020.
[31] Chalmers, D. (2020, July 30). GPT-3 and general intelligence. Daily Nous.
Last accessed October 14, 2020. Retrieved from: https://dailynous.com/2020/07/30/philosophers-gpt-3/#chalmers
References 151
[32] OpenAI. (2020). Discovering and enacting the path to safe artificial general
intelligence. Last accessed October 13, 2020. Retrieved from: https://openai.com/
[33] Bybee, J. L. (1985). Morphology: A study of the relation between meaning and
form (Vol. 9). John Benjamins Publishing.
[34] Smith, N. V. (1973). The acquisition of phonology: A case study. Cambridge
University Press.
[35] Stetson, R. H. (2014). Motor phonetics: A study of speech movements in action.
Springer.
[36] Zhang, D., & Lee, W. S. (2003, July). Question classification using support vec-
tor machines. In Proceedings of the 26th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 26–32).
ACM.
[37] Iftikhar, A., Iftikhar, E., & Mehmood, M. K. (2016, August). Domain specific
query generation from natural language text. In 2016 Sixth International
Conference on Innovative Computing Technology (INTECH) (pp. 502–506).
IEEE.
[38] Kumar, R., & Dua, M. (2014, April). Translating controlled natural language
query into SQL query using pattern matching technique. In International
Conference for Convergence for Technology–2014 (pp. 1–5). IEEE.
[39] Boyd-Graber, J., Fellbaum, C., Osherson, D., & Schapire, R. (2006, January).
Adding dense, weighted connections to WordNet. In Proceedings of the Third
International WordNet Conference (pp. 29–36).
[40] NLTK 3.4.5 Documentation. (2019). Natural language toolkit. Last accessed
December 23, 2019. Retrieved from: http://www.nltk.org/
[41] Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI
2001 Workshop on Empirical Methods in Artificial Intelligence (Vol. 3, No. 22,
pp. 41–46).
[42] Safari, L., & Patrick, J. D. (2018). Complex analyses on clinical information
systems using restricted natural language querying to resolve time-event depen-
dencies. Journal of Biomedical Informatics, 82, 13–30.
[43] Ganti, V., He, Y., & Xin, D. (2010). Keyword++: A framework to improve
keyword search over entity databases. Proceedings of the VLDB Endowment,
3(1–2), 711–722.
[44] Woods, W. A. (1981). Procedural semantics as a theory of meaning. Bolt Beranek
and Newman Inc.
[45] Kaur, S., & Bali, R. S. (2012). SQL generation and execution from natural lan-
guage processing. International Journal of Computing & Business Research,
ISSN 2229-6166.
[46] Bhadgale, A. M., Gavas, S. R., Pati, M. M., & Pinki, R. (2013). Natural
language to SQL conversion system. International Journal of Computer
Science Engineering and Information Technology Research (IJCSEITR), 3(2),
161–166. ISSN 2249-6831.
[47] Popescu, A. M., Etzioni, O., & Kautz, H. (2003, January). Towards a theory of
natural language interfaces to databases. In Proceedings of the 8th International
Conference on Intelligent User Interfaces (pp. 149–157). ACM.
[48] Parlikar, A., Shrivastava, N., Khullar, V., & Sanyal, S. (2005). NQML: Natural
query markup language. In 2005 International Conference on Natural Language
Processing and Knowledge Engineering (pp. 184–188). IEEE.
[49] Peng, Z., Zhang, J., Qin, L., Wang, S., Yu, J. X., & Ding, B. (2006, September).
NUITS: A novel user interface for efficient keyword search over databases. In
Proceedings of the 32nd International Conference on Very Large Data Bases
(pp. 1143–1146). VLDB Endowment.
[50] Feijs, L. M. G. (2000). Natural language and message sequence chart representa-
tion of use cases. Information and Software Technology, 42(9), 633–647.
[51] Karande, N. D., & Patil, G. A. (2009). Natural language database interface
for selection of data using grammar and parsing. World Academy of Science,
Engineering and Technology, 3, 11–26.
[52] El-Mouadib, F. A., Zubi, Z. S., & Almagrous, A. A. (2009). Generic interactive
natural language interface to databases (GINLIDB). International Journal of
Computers, 3(3).
[53] Li, H., & Shi, Y. (2010, February). A WordNet-based natural language interface
to relational databases. In 2010 The 2nd International Conference on Computer
and Automation Engineering (ICCAE) (Vol. 1, pp. 514–518). IEEE.
[54] Enikuomehin, A. O., & Okwufulueze, D. O. (2012). An algorithm for solving
natural language query execution problems on relational databases. International
Journal of Advanced Computer Science and Applications, 3(10), 180–182.
[55] Chen, P. P. S. (1983). English sentence structure and entity-relationship dia-
grams. Information Sciences, 29(2–3), 127–149.
[56] QUEST: A natural language interface to relational databases. (2018). In Proceedings
of the Eleventh International Conference on Language Resources and Evaluation
(LREC).
[57] Desai, B., & Stratica, N. (2004). Schema-based natural language semantic map-
ping. In Proceedings of the 9th International Conference on Applications of
Natural Language to Information Systems.
[58] Becker, T. (2002, May). Practical, template-based natural language generation
with TAG. In Proceedings of the Sixth International Workshop on Tree Adjoining
Grammar and Related Frameworks (TAG+ 6) (pp. 80–83).
[59] Androutsopoulos, I., Ritchie, G. D., & Thanisch, P. (1995). Natural language
interfaces to databases–an introduction. Natural Language Engineering, 1(1),
29–81.
[60] Patrick, J., & Li, M. (2010). High accuracy information extraction of medica-
tion information from clinical notes: 2009 i2b2 medication extraction challenge.
Journal of the American Medical Informatics Association, 17(5), 524–527.
[61] Chaudhari, P. (2013). Natural language statement to SQL query translator.
International Journal of Computer Applications, 82(5), 18–22.
[62] Gao, K., Mei, G., Piccialli, F., Cuomo, S., Tu, J., & Huo, Z. (2020). Julia
language in machine learning: Algorithms, applications, and open issues. Computer
Science Review, 37, 100254.
[63] Hasan, R., & Gandon, F. (2014, August). A machine learning approach to SPARQL
query performance prediction. In 2014 IEEE/WIC/ACM International Joint
Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)
(Vol. 1, pp. 266–273). IEEE.
[64] Wang, H., Ma, C., & Zhou, L. (2009, December). A brief review of machine
learning and its application. In 2009 International Conference on Information
Engineering and Computer Science (pp. 1–4). IEEE.
[65] Boyan, J., Freitag, D., & Joachims, T. (1996, August). A machine learning archi-
tecture for optimizing web search engines. In AAAI Workshop on Internet Based
Information Systems (pp. 1–8).
[66] Chen, H., Shankaranarayanan, G., She, L., & Iyer, A. (1998). A machine learn-
ing approach to inductive query by examples: An experiment using relevance
feedback, ID3, genetic algorithms, and simulated annealing. Journal of the
American Society for Information Science, 49(8), 693–705.
[67] Hazlehurst, B. L., Burke, S. M., & Nybakken, K. E. (1999). Intelligent query
system for automatically indexing information in a database and automatically
categorizing users. U.S. Patent No. 5,974,412. Washington, DC: U.S. Patent and
Trademark Office. WebMD Inc.
[68] Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic
analysis. Machine Learning, 42(1–2), 177–196.
[69] Popov, B., Kiryakov, A., Ognyanoff, D., Manov, D., & Kirilov, A. (2004). KIM –
A semantic platform for information extraction and retrieval. Journal of Natural
Language Engineering, 10(3–4), 375–392.
[70] Alexander, R., Rukshan, P., & Mahesan, S. (2013). Natural language web inter-
face for database (NLWIDB). arXiv preprint arXiv:1308.3830.
[71] Zhou, Q., Wang, C., Xiong, M., Wang, H., & Yu, Y. (2007). SPARK: Adapting
keyword query to semantic search. In The semantic web (pp. 694–707). Springer,
Berlin, Heidelberg.
[72] Bergamaschi, S., Domnori, E., Guerra, F., Orsini, M., Lado, R. T., & Velegrakis,
Y. (2010). Keymantic: Semantic keyword-based searching in data integration
systems. Proceedings of the VLDB Endowment, 3(1–2), 1637–1640.
[73] Strötgen, J., & Gertz, M. (2010, July). HeidelTime: High quality rule-based
extraction and normalization of temporal expressions. In Proceedings of the 5th
International Workshop on Semantic Evaluation (pp. 321–324). Association for
Computational Linguistics.
[74] Sohn, S., Wagholikar, K. B., Li, D., Jonnalagadda, S. R., Tao, C., Komandur
Elayavilli, R., & Liu, H. (2013). Comprehensive temporal information detection
from clinical text: Medical events, time, and TLINK identification. Journal of
the American Medical Informatics Association, 20(5), 836–842.
[75] Zhou, L., Friedman, C., Parsons, S., & Hripcsak, G. (2005). System architec-
ture for temporal information extraction, representation and reasoning in clinical
narrative reports. In AMIA Annual Symposium Proceedings (Vol. 2005, p. 869).
American Medical Informatics Association.
[76] Giordani, A., & Moschitti, A. (2012, June). Generating SQL queries using natu-
ral language syntactic dependencies and metadata. In International Conference
on Application of Natural Language to Information Systems (pp. 164–170).
Springer, Berlin, Heidelberg.
[77] Giordani, A., & Moschitti, A. (2010, May). Corpora for Automatically Learning
to Map Natural Language Questions into SQL Queries. In LREC.
[78] Kate, R. J., & Mooney, R. J. (2006, July). Using string-kernels for learn-
ing semantic parsers. In Proceedings of the 21st International Conference on
Computational Linguistics and the 44th Annual Meeting of the Association
for Computational Linguistics (pp. 913–920). Association for Computational
Linguistics.
[79] Tseng, F. S., & Chen, C. L. (2006, September). Extending the UML concepts to
transform natural language queries with fuzzy semantics into SQL. Information
and Software Technology, 48(9), 901–914.
[80] Booch, G. (2005). The unified modeling language user guide. Pearson Education
India.
[81] Oestereich, B. (2002). Developing software with UML: Object-oriented analysis
and design in practice. Pearson Education.
[82] Higa, K., & Owei, V. (1991, January). A data model driven database query tool.
In Proceedings of the Twenty-Fourth Annual Hawaii International Conference
on System Sciences (Vol. 3, pp. 53–62). IEEE.
[83] Muller, R. J. (1999). Database design for smarties: Using UML for data model-
ing. Morgan Kaufmann.
[84] Winston, P. H. (1992). Artificial intelligence. Addison-Wesley Longman
Publishing Co., Inc.
[85] Schmucker, K. J., & Zadeh, L. A. (1984). Fuzzy sets, natural language computa-
tions, and risk analysis. Rockville, MD: Computer Science Press.
[86] Zadeh, L. A. (1975). The concept of a linguistic variable and its application to
approximate reasoning-I. Information Sciences, 8, 199–249.
[87] Isoda, S. (2001). Object-oriented real-world modeling revisited. Journal of
Systems and Software, 59(2), 153–162.
[88] Moreno, A. M., & Van De Riet, R. P. (1997, June). Justification of the equivalence
between linguistic and conceptual patterns for the object model. In Proceedings
of the International Workshop on Applications of Natural Language to Information Systems, Vancouver.
[89] Métais, E. (2002). Enhancing information systems management with natu-
ral language processing techniques. Data & Knowledge Engineering, 41(2–3),
247–272.
[90] Yager, R. R., Reformat, M. Z., & To, N. D. (2019). Drawing on the iPad to input
fuzzy sets with an application to linguistic data science. Information Sciences,
479, 277–291.
[91] Owei, V., Navathe, S. B., & Rhee, H. S. (2002). An abbreviated concept-based
query language and its exploratory evaluation. Journal of Systems and Software,
63(1), 45–67.
[92] Hoang, D. T., Le Nguyen, M., & Pham, S. B. (2015, October). L2S: Transforming
natural language questions into SQL queries. In 2015 Seventh International
Conference on Knowledge and Systems Engineering (KSE) (pp. 85–90).
IEEE.
[93] Costa, P. D., Almeida, J. P. A., Pires, L. F., & van Sinderen, M. (2008,
November). Evaluation of a rule-based approach for context-aware services. In
IEEE GLOBECOM 2008 – 2008 IEEE Global Telecommunications Conference
(pp. 1–5). IEEE.
[94] Garcia, K. K., Lumain, M. A., Wong, J. A., Yap, J. G., & Cheng, C. (2008,
November). Natural language database interface for the community based moni-
toring system. In Proceedings of the 22nd Pacific Asia Conference on Language,
Information and Computation (pp. 384–390).
[95] International Conference on Applications of Natural Language to Information
Systems (13th: 2008: London, England). (2008). Natural Language and Information Systems.
[112] Minock, M., Olofsson, P., & Näslund, A. (2008, June). Towards building robust
natural language interfaces to databases. In International Conference on Application of Natural Language to Information Systems (pp. 187–198). Springer,
Berlin, Heidelberg.
[113] Zettlemoyer, L. S., & Collins, M. (2012). Learning to map sentences to logi-
cal form: Structured classification with probabilistic categorial grammars. arXiv
preprint arXiv:1207.1420.
[114] Tang, L. R., & Mooney, R. J. (2001, September). Using multiple clause con-
structors in inductive logic programming for semantic parsing. In European
Conference on Machine Learning (pp. 466–477). Springer, Berlin, Heidelberg.
[115] Giordani, A., & Moschitti, A. (2012, December). Translating questions to SQL
queries with generative parsers discriminatively reranked. In Proceedings of
COLING 2012: Posters (pp. 401–410).
[116] De Marneffe, M. C., MacCartney, B., & Manning, C. D. (2006, May). Generating
typed dependency parses from phrase structure parses. In LREC (Vol. 6,
pp. 449–454).
[117] Joachims, T. (1998). Making large-scale SVM learning practical (No. 1998, 28).
Technical Report.
[118] Moschitti, A. (2006, September). Efficient convolution kernels for dependency
and constituent syntactic trees. In European Conference on Machine Learning
(pp. 318–329). Springer, Berlin, Heidelberg.
[119] Giordani, A., & Moschitti, A. (2009, September). Syntactic structural kernels
for natural language interfaces to databases. In Joint European Conference
on Machine Learning and Knowledge Discovery in Databases (pp. 391–406).
Springer, Berlin, Heidelberg.
[120] Lu, W., Ng, H. T., Lee, W. S., & Zettlemoyer, L. (2008, October). A generative
model for parsing natural language to meaning representations. In Proceedings
of the 2008 Conference on Empirical Methods in Natural Language Processing
(pp. 783–792).
[121] Liang, P., Jordan, M. I., & Klein, D. (2013). Learning dependency-based compo-
sitional semantics. Computational Linguistics, 39(2), 389–446.
[122] Clarke, J., Goldwasser, D., Chang, M. W., & Roth, D. (2010, July). Driving
semantic parsing from the world’s response. In Proceedings of the Fourteenth
Conference on Computational Natural Language Learning (pp. 18–27). Association for Computational Linguistics.
[123] Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., & Steedman, M. (2010, October).
Inducing probabilistic CCG grammars from logical form with higher-order
unification. In Proceedings of the 2010 Conference on Empirical Methods in
Natural Language Processing (pp. 1223–1233). Association for Computational
Linguistics.
[124] Xu, X., Liu, C., & Song, D. (2017). SQLNet: Generating structured queries
from natural language without reinforcement learning. arXiv preprint arXiv:
1711.04436.
[125] Thompson, B. H., & Thompson, F. B. (1985). ASK is transportable in half a
dozen ways. ACM Transactions on Information Systems, 3(2), 185–203.
[126] Kudo, T., Suzuki, J., & Isozaki, H. (2005, June). Boosting-based parse reranking
with subtree features. In Proceedings of the 43rd Annual Meeting on Association
for Computational Linguistics (pp. 189–196). Association for Computational
Linguistics.
[127] Toutanova, K., Markova, P., & Manning, C. (2004). The leaf path projection view
of parse trees: Exploring string kernels for HPSG parse selection. In Proceedings
of the 2004 Conference on Empirical Methods in Natural Language Processing
(pp. 166–173).
[128] Kazama, J. I., & Torisawa, K. (2005, October). Speeding up training with tree
kernels for node relation labeling. In Proceedings of the Conference on Human
Language Technology and Empirical Methods in Natural Language Processing
(pp. 137–144). Association for Computational Linguistics.
[129] Gaikwad, M. P. (2013). Natural language interface to database. International
Journal of Engineering and Innovative Technology (IJEIT), 2(8), 3–5.
[130] Papalexakis, E., Faloutsos, C., & Sidiropoulos, N. D. (2012). ParCube: Sparse
parallelizable tensor decompositions. In Joint European Conference on Machine
Learning and Knowledge Discovery in Databases (pp. 521–536). Springer,
Berlin, Heidelberg.
[131] Safari, L., & Patrick, J. D. (2014). Restricted natural language based querying of
clinical databases. Journal of Biomedical Informatics, 52, 338–353.
[132] Chandra, Y., & Mihalcea, R. (2006). Natural language interfaces to databases
(Master's thesis). University of North Texas.
[133] Shen, L., Sarkar, A., & Joshi, A. K. (2003, July). Using LTAG based features in parse
reranking. In Proceedings of the 2003 Conference on Empirical Methods in Natural
Language Processing (pp. 89–96). Association for Computational Linguistics.
[134] Collins, M., & Duffy, N. (2002, July). New ranking algorithms for parsing
and tagging: Kernels over discrete structures, and the voted perceptron. In
Proceedings of the 40th Annual Meeting on Association for Computational
Linguistics (pp. 263–270). Association for Computational Linguistics.
[135] Kudo, T., & Matsumoto, Y. (2003, July). Fast methods for kernel-based text anal-
ysis. In Proceedings of the 41st Annual Meeting on Association for Computational
Linguistics-Volume 1 (pp. 24–31). Association for Computational Linguistics.
[136] Cumby, C. M., & Roth, D. (2003). On kernel methods for relational learning.
In Proceedings of the 20th International Conference on Machine Learning
(ICML-03) (pp. 107–114).
[137] Culotta, A., & Sorensen, J. (2004, July). Dependency tree kernels for relation
extraction. In Proceedings of the 42nd Annual Meeting on Association for
Computational Linguistics (p. 423). Association for Computational Linguistics.
[138] Ghosal, D., Waghmare, T., Satam, S., & Hajirnis, C. (2016). SQL query for-
mation using natural language processing (NLP). International Journal of
Advanced Research in Computer and Communication Engineering, 5(3).
[139] Zhang, J., Tang, J., Ma, C., Tong, H., Jing, Y., & Li, J. (2015, August). Panther:
Fast top-k similarity search on large networks. In Proceedings of the 21st ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining
(pp. 1445–1454).
[140] Ghosh, P. K., Dey, S., & Sengupta, S. (2014). Automatic SQL query formation
from natural language query. International Journal of Computer Applications,
ISSN 0975-8887.
[141] Choudhary, N., & Gore, S. (2015, September). Impact of intellisense on the
accuracy of natural language interface to database. In 2015 4th International
Conference on Reliability, Infocom Technologies and Optimization (ICRITO)
(Trends and Future Directions) (pp. 1–5). IEEE.
[142] Willett, P. (2006). The Porter stemming algorithm: Then and now. Program.
[143] Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., … & Zhang, Z.
(2018). Spider: A large-scale human-labeled dataset for complex and cross-
domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.
[144] Nelken, R., & Francez, N. (2000, July). Querying temporal databases using
controlled natural language. In Proceedings of the 18th Conference on Computational Linguistics – Volume 2 (pp. 1076–1080). Association for Computational
Linguistics.
[145] Naumann, F. (2014). Data profiling revisited. ACM SIGMOD Record, 42(4), 40–49.
[146] Singh, G., & Solanki, A. (2016). An algorithm to transform natural language into
SQL queries for relational databases. Selforganizology, 3(3), 100–116.
[147] Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338–353.
[148] Zadeh, L. A. (1978). PRUF – A meaning representation language for natural
languages. International Journal of Man-Machine Studies, 10(4), 395–460.
[149] Dalrymple, M., Shieber, S. M., & Pereira, F. C. (1991). Ellipsis and higher-order
unification. Linguistics and Philosophy, 14(4), 399–452.
[150] Tanaka, H., & Guo, P. (1999). Portfolio selection based on upper and lower expo-
nential possibility distributions. European Journal of Operational Research,
114(1), 115–126.
[151] Kang, I.-S., Bae, J.-H., & Lee, J.-H. (2002). Database semantics representation
for natural language access. In Proceedings of the First International Symposium
on Cyber Worlds (CW '02). ISBN 0-7695-1862-1.
[152] De Marneffe, M. C., & Manning, C. D. (2008). Stanford typed dependencies
manual (pp. 338–345). Technical Report, Stanford University.
[153] Zeng, J., Lin, X. V., Xiong, C., Socher, R., Lyu, M. R., King, I., & Hoi, S. C.
H. (2020). Photon: A robust cross-domain text-to-SQL system. In Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics:
System Demonstrations (pp. 204–214). ACL.
[154] Poole, D. L., & Mackworth, A. K. (2010). Artificial intelligence: Foundations of
computational agents. Cambridge University Press.
[155] Warren, D. H., Pereira, L. M., & Pereira, F. (1977). Prolog – the language and its
implementation compared with Lisp. ACM SIGPLAN Notices, 12(8), 109–115.
[156] Wang, P., Shi, T., & Reddy, C. K. (2019). A translate-edit model for natural lan-
guage question to SQL query generation on multi-relational healthcare data.
arXiv preprint arXiv:1908.01839.
[157] Yao, K., & Zweig, G. (2015). Sequence-to-sequence neural net models for
grapheme-to-phoneme conversion. arXiv preprint arXiv:1506.00196.
[158] Zhang, Z., & Sabuncu, M. (2018). Generalized cross-entropy loss for training
deep neural networks with noisy labels. In Advances in Neural Information
Processing Systems (pp. 8778–8788).
[159] Seo, M., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2016). Bidirectional atten-
tion flow for machine comprehension. arXiv preprint arXiv:1611.01603.
[160] Ke, N. R., Goyal, A. G. A. P., Bilaniuk, O., Binas, J., Mozer, M. C., Pal, C., &
Bengio, Y. (2018). Sparse attentive backtracking: Temporal credit assignment
through reminding. In Advances in neural information processing systems
(pp. 7640–7651).
[161] Michaelsen, S. M., Dannenbaum, R., & Levin, M. F. (2006). Task-specific train-
ing with trunk restraint on arm recovery in stroke: Randomized control trial.
Stroke, 37(1), 186–192.
[162] Löb, M. H. (1976). Embedding first order predicate logic in fragments of intu-
itionistic logic. The Journal of Symbolic Logic, 41(4), 705–718.
[163] Yu, B., Lin, X., & Wu, Y. (1991). The tree representation of the graph used in
binary image processing. Information Processing Letters, 37(1), 55–59.
[164] Python. (2018, June). Python 3.7.0 Home Page. Last accessed December 23,
2019. Retrieved from: https://www.python.org/downloads/release/python-370/
[165] MySQL Community Downloads. (2019). MySQL Community Server 8.0.18
Home Page. Last accessed December 23, 2019. Retrieved from: https://dev.
mysql.com/downloads/mysql/
[166] MySQLdb. (2012). Welcome to MySQLdb’s documentation! Last accessed
December 23, 2019. Retrieved from: https://mysqldb.readthedocs.io/en/latest/
[167] TextBlob. (2015). TextBlob: Simplified Text Processing. Last accessed December
23, 2019. Retrieved from: https://textblob.readthedocs.io/en/dev/
[168] JetBrains. (2019). IDE PyCharm C Home Page. Last accessed December 23,
2019. Retrieved from: https://www.jetbrains.com/pycharm/
[169] XQuartz. (2016, October). XQuartz 2.7.11 Home Page. Last accessed December
23, 2019. Retrieved from: https://www.xquartz.org/index.html
[170] Apple Developer. (2019). Xcode 11 Home Page. Last accessed December 23,
2019. Retrieved from: https://developer.apple.com/xcode/
[171] MySQL. (2019). MySQL Workbench Home Page. Last accessed December 23,
2019. Retrieved from: https://www.mysql.com/products/workbench/
[172] Kaggle. (2017). Zomato Restaurants Data. Last accessed December 23,
2019. Retrieved from: https://www.kaggle.com/shrutimehta/zomato-restaurants-
data
[173] GitHub. (2017). WikiSQL RDB – A large annotated semantic parsing corpus
for developing natural language interfaces. Last accessed December 23, 2019.
Retrieved from: https://github.com/salesforce/WikiSQL
[174] Zhong, V., Xiong, C., & Socher, R. (2017). Seq2SQL: Generating structured
queries from natural language using reinforcement learning. arXiv preprint
arXiv:1709.00103.
[175] Zhong, V., Xiong, C., & Socher, R. (2017). Seq2SQL: Generating structured
queries from natural language using reinforcement learning. arXiv preprint
arXiv:1709.00103.
[176] Streiner, D. L., & Cairney, J. (2007). What’s under the ROC? An introduction to
receiver operating characteristics curves. The Canadian Journal of Psychiatry,
52(2), 121–128.
[177] Original implementation code extended from “Couderc, B., & Ferrero, J. (2015,
June). fr2SQL: Interrogation de bases de données en français”.
[178] Yu, T., Li, Z., Zhang, Z., Zhang, R., & Radev, D. (2018). TypeSQL:
Knowledge-based type-aware neural text-to-SQL generation. arXiv preprint
arXiv:1804.09769.
[179] Hwang, W., Yim, J., Park, S., & Seo, M. (2019). A comprehensive exploration
on WikiSQL with table-aware word contextualization. arXiv preprint arXiv:
1902.01069.
[180] He, P., Mao, Y., Chakrabarti, K., & Chen, W. (2019). X-SQL: Reinforce schema
representation with context. arXiv preprint arXiv:1908.08113.
[181] Yavuz, S., Gur, I., Su, Y., & Yan, X. (2018). What it takes to achieve 100%
condition accuracy on WikiSQL. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing (pp. 1702–1711).
[182] Gur, I., Yavuz, S., Su, Y., & Yan, X. (2018, July). DialSQL: Dialogue based
structured query generation. In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Vol. 1: Long Papers pp. 1339–1349).
[183] Zhekova, M., & Totkov, G. (2021). Question patterns for natural language trans-
lation in SQL queries. International Journal on Information Technologies &
Security, 13(2), 43–54.
[184] Brunner, U., & Stockinger, K. (2021, April). ValueNet: A natural language-to-SQL
system that learns from database information. In 2021 IEEE 37th International
Conference on Data Engineering (ICDE) (pp. 2177–2182). IEEE.
[185] Xu, X., Liu, C., & Song, D. (2017). SQLNet: Generating structured queries
from natural language without reinforcement learning. arXiv preprint arXiv:
1711.04436.
[186] Talreja, R., & Whitt, W. (2008). Fluid models for overloaded multiclass many-
server queueing systems with first-come, first-served routing. Management
Science, 54(8), 1513–1527.
[187] Tosirisuk, P., & Chandra, J. (1990). Multiple finite source queueing model with
dynamic priority scheduling. Naval Research Logistics (NRL), 37(3), 365–381.
[188] Carbonell, J. R., Ward, J. L., & Senders, J. W. (1968). A queueing model of
visual sampling experimental validation. IEEE Transactions on Man-Machine
Systems, 9(3), 82–87.
[189] Hoi, S. Y., Ismail, N., Ong, L. C., & Kang, J. (2010). Determining nurse staff-
ing needs: The workload intensity measurement system. Journal of Nursing
Management, 18(1), 44–53.
[190] Robinson, W. N. (2003, September). Monitoring web service requirements. In
Proceedings 11th IEEE International Requirements Engineering Conference,
2003 (pp. 65–74). IEEE.
[191] Kim, C., & Kameda, H. (1990). Optimal static load balancing of multi-class
jobs in a distributed computer system. IEICE Transactions (1976–1990), 73(7),
1207–1214.
Index
B
Bali, R. S., 28
Booch's OO Analysis and Design, 34
Boyan, J., 21

C
CHAT-80, 19
Chen, C. L., 25
Chen, H., 21
Chen, P. P. S., 2
Clinical Data Analytics Language (CliniDAL), 18–20, 22–23, 45
Clinical Information System (CIS), 20
Coltech-parser, 27, 36, 46, 125
Conceptual Logical Form (CLF), 30, 45, 145

D
DB-Oriented Logical Form (DBLF), 29–30, 45, 145
DBXplorer, 18
Dependency-Based Compositional Semantics (DCS) Parser, 38
Deshpande, A. K., 28
disambiguation, 8, 10, 12–13, 31, 37, 43, 47, 54–55, 89, 129, 142

E
English Slot Grammar (ESG), 20
English Wizard, 19

F
False Positive Ratio (FPR), 38, 84, 134
Foreign Keys (FKs), 31, 36
Freitag, D., 21

G
Generative Pre-trained Transformer 3 (GPT-3), 40
generic interactive natural language interface to databases (GINLIDB), 19
Giordani, A., 23–24, 29, 31, 43
Graphical User Interface (GUI), 7

I
IDE PyCharm C, 79, 81
Integrated Development Environment (IDE), 79
Intelligent Query Engine (IQE) system, 21
Isoda, S., 26
Iyer, A., 21

J
Java Annotation Patterns Engine (JAPE), 27, 36, 46, 126
Joachims, T., 21
S
Satav, A. G., 28
Self Organizing Maps (SOM), 39, 134
Semantic Analysis Module, 28
Shankaranarayanan, G., 21
She, L., 21
Song, D., 44
SPARK, 22
SQL, see Structured Query Language
Stanford CoreNLP 3.9.2, 80–81
Stanford Dependencies Collapsed (SDC), 23
Structured Object Model (SOM), 25

T
TEAM, 17–18
TextBlob, 51, 54, 80
To, N. D., 26
Top-k Algorithm, 18, 20, 34, 45, 46, 122, 146
True Negative Ratio (TNR), 38, 84
Tseng, F. S., 25

U
Unified Modeling Language (UML), 25, 34–35, 46, 124
V
Van De Riet, R. P., 26

W
weighted links, 34–35
WikiSQL, 93
  confusion matrix with, 86
  NLQ to SQL translation, 87–88
  RDB, 81
  ROC curve for, 87
  to SQL Translation Work on WikiSQL RDB, 88
WordNet, 81

X
Xcode, 11, 79, 81
XQuartz, 79, 81
Xu, X., 44

Y
Yager, R. R., 26

Z
Zomato RDB, 81, 85, 93
  confusion matrix with, 85
  mapping table, 96
  ROC curve, 86
  schema, 82