Formation of SQL from Natural Language Query using NLP

Uma M, Sneha V, Sneha G, Bhuvana J, Bharathi B
Computer Science and Engineering, SSN College of Engineering, Chennai, India

Second International Conference on Computational Intelligence in Data Science (ICCIDS-2019)

Abstract—Today, everyone has a personal device that connects to the internet, and users look to the internet for the information they require. Most of this information is stored in databases. A user who wants to access a database but has limited or no knowledge of database languages faces a difficult situation. Hence, there is a need for a system that enables such users to access the information in a database. This paper aims to develop such a system using NLP: it takes a structured natural language question as input and produces an SQL query as output, so that the related information can be retrieved from a railway reservation database with ease. The steps involved in this process are tokenization, lemmatization, parts of speech tagging, parsing and mapping. The dataset used for the proposed system is a set of 2880 structured natural language queries on train fare and seats available. We have achieved 98.89 per cent accuracy. The paper gives an overall view of the usage of Natural Language Processing (NLP) and of regular expressions to map a query in English to SQL.

Index Terms—Natural Language Processing (NLP), Structured Query Language (SQL), Tokenization, POS tagging, Chunking, Parsing, Regular Expression

I. INTRODUCTION

In this fast technologically advancing world, it has become important for humans to interact with computers, which provide assistance in many fields like medicine, education, space research, etc. Retrieval of the required information from a database is a tedious process. In order to extract information from a database, the user must have prior knowledge of the Database Management System (DBMS), which is software developed to store and manipulate the data in a database. Hence a non-technical user faces difficulty in extracting the data. To solve this problem and facilitate human interaction with computers, Natural Language Processing (NLP) techniques are used. Natural Language Processing has applications in various sectors like tourism, where a tourist can get information about the famous tourist spots in a particular city, the hotels available, the best restaurants nearby, etc. Another important application of NLP is the chatbot (Chat Robot), which can be used for voice or textual interactions. Our work focuses on a Railway Reservation System, where a user can enquire about the trains that are available from a source to a destination and the fare of a ticket in various classes. The objective of this paper is to convert a natural language query into an SQL query to simplify data extraction. The use of natural language interfaces to access databases has been a topic of research since the beginning of NLP and is still an interesting problem.

II. LITERARY STUDY

In 1972, W.A. Woods [1] developed a system that provided a search interface for a database system that stored information about the rock samples brought from the moon for research. This system used two databases, the chemical analyses and the literature references. It used an Augmented Transition Network (ATN) parser and the semantics of Woods. This system was demonstrated informally at the Second Annual Lunar Science Conference in 1971. The Lifer/Ladder system (1978) [2] was one of the good search interface techniques (i.e. NLP systems); it used semantic grammar for parsing the input query, and the generated query was given as input to a distributed database system. This system supports single table queries and simple join queries in case of multiple tables. Akshay et al. [7] proposed a system which provides a search interface for users to pose questions in their natural language. The primary goal of this system is to generate a database language query from an NL query. This system includes an additional feature of eliminating spelling errors from user queries, for which it uses the Word Pair Mining Technique. The query in English is then mapped to an equivalent SQL query. Prasun Kanti et al. [11] proposed a system for interfacing a college database that transforms an English query to SQL using semantic grammar. The system goes through morphological, syntactic and semantic phases. The user may ask the question in speech format, which is then converted to text using Scripting Layer for Android (SL4A). The natural language query is then parsed using a parser. A data dictionary stores all the attributes and tables of the database. The attribute identifier then finds the attributes that are present in the natural language query. With the identified attributes, the SQL query is generated [9], [10]. K. Javubar et al. [3] proposed a user-friendly interface for accessing data from various web sources such as Facebook, Twitter, etc. The architectural layout consists of tokenizing, stemming, parsing and mapping stages. The input natural language query initially undergoes morphological analysis, then semantic analysis, which is followed by a mapping phase. The three main keywords SELECT, FROM and TO are looked for in the input query. Once these are found, the SQL query is formed [8].

III. PROPOSED SYSTEM

Retrieving the required information from a database is quite difficult for a lay user: it requires considerable effort and knowledge of the database structure. A DBMS cannot deal with queries framed in any language other than the standard database languages. To make retrieval more effortless and interactive for a naive user, our proposed work provides a facility through which a user is free to pose a query in English, which is processed by several modules to form an equivalent SQL query.

A. Overview of Query Formation

The user submits an English query in text form, which is then sent into several natural language processing (NLP) modules. This NLP phase is followed by a mapping phase in which the attributes are detected in the English query and mapped to form the final SQL query, which may then be fed into the database to retrieve the required information and provide it to the user. The overview of our proposed work is shown in Fig. 1. As of now, the proposed work focuses on generation of an equivalent SQL query from a natural language question in English. Once the SQL query is generated, the retrieval of data from the database is an easy task.

Fig. 1: Proposed System Model

B. Algorithm

Algorithm: Formation of SQL Query from Natural Language Query
Input: Natural Language query in English text.
Output: SQL query
1) Tokenize the input into list of words
2) Lemmatize the list of words
3) Perform POS tagging
4) Parsed sentence = Parse using regular expressions
5) If table. attribute ∈ Parsed sentence
   a) Extract them
   b) Call SQLmap()

C. Proposed System Model

Our proposed system consists of several modules that extract only the key words and leave out the redundant data. This is critical because the presence of redundant data decreases the overall performance of the system. Input data initially goes through an NLP phase followed by a mapping phase. The NLP phase consists of processes such as tokenization, lemmatization, Parts Of Speech tagging (POS tagging) and parsing. The mapping phase identifies the attributes in the processed input, and finally the SQL query is formed from the key information that was obtained [4], [6]. The detailed workflow of our proposed work is given in Fig. 1.

1) Tokenization: It is the first step, used to break a sentence into smaller meaningful tokens; in most cases these are words. In the proposed system, we apply tokenization as soon as the text input is received from the user, and the tokens obtained are stored in a list. We have used the word_tokenize function of the nltk.tokenize module in Python.

2) Lemmatization: This process is similar to stemming: the root word, or lemma, of each token from the previous step is obtained and stored in another list. Lemmatization is chosen over stemming because stemming is not always accurate, since it simply removes the prefix or suffix of a word, whereas in lemmatization the words are matched with their lemmas contained in a dictionary, and hence more accurate results were obtained.
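The contrast between stemming and lemmatization drawn above can be illustrated with a minimal sketch. It is not the authors' code: the paper uses NLTK, whereas this sketch uses only the Python standard library so that it runs without any corpus downloads. The function names and the tiny `LEMMA_DICTIONARY` are hypothetical stand-ins for a real tokenizer and a real lexical dictionary.

```python
import re

# Hypothetical mini-dictionary standing in for a real lexical resource.
LEMMA_DICTIONARY = {"trains": "train", "leaves": "leave",
                    "travelling": "travel", "are": "be", "fares": "fare"}


def tokenize(sentence):
    """Break a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", sentence)


def naive_stem(word):
    """Chop a common suffix off the word without consulting any dictionary."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the dictionary; unknown words come back lowercased."""
    return LEMMA_DICTIONARY.get(word.lower(), word.lower())


tokens = tokenize("Which train leaves Delhi?")
stems = [naive_stem(token) for token in tokens]
lemmas = [lemmatize(token) for token in tokens]
print(tokens)
print(stems)
print(lemmas)
```

For the token "leaves", suffix stripping yields the non-word "leav", while the dictionary lookup yields "leave". This is the kind of inaccuracy the text attributes to stemming.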

978-1-5386-9471-8/19/$31.00 2019 IEEE



Fig. 2: Stages of proposed System: (a) tokenizing, lemmatizing and parts of speech tagging for question type 1; (b) parse tree for question type 1; (c) values for question type 1; (d) Structured Query Language query for question type 1.
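The four stages named in Fig. 2 can be traced in a small end-to-end sketch for the example question of Section IV ("What are the available trains that travel from Delhi to Goa on 1st April 2019?"). It makes several assumptions and is not the authors' implementation. The POS tags are written by hand rather than produced by a tagger. The chunk pattern is a simplified stand-in for the paper's expression. The column names are assumed to contain underscores, and the date is assumed to be stored in ISO format. All function names are hypothetical. The chunker matches an ordinary regular expression against the string of tags, which is one simple way to emulate tag-pattern chunking.

```python
import re
from datetime import date

# Stage (a): lemmatized tokens with POS tags. The tags are written by hand
# for this illustration; the paper obtains them from a POS tagger.
TAGGED = [("available", "JJ"), ("train", "NN"), ("travel", "VB"),
          ("from", "IN"), ("Delhi", "NNP"), ("to", "TO"), ("Goa", "NNP"),
          ("on", "IN"), ("1st", "JJ"), ("April", "NNP"), ("2019", "CD")]

# Stage (b): simplified chunk pattern (not the paper's full expression):
# preposition + place, an optional noun, then preposition + place.
CHUNK = re.compile(r"(<IN>|<TO>)<NNP>(<NN>)?(<IN>|<TO>)<NNP>")
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def find_chunk(tagged):
    """Return the (word, tag) pairs covered by the first chunk match."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    match = CHUNK.search(tag_string)
    if match is None:
        return []
    first = tag_string[:match.start()].count("<")  # tokens before the match
    size = match.group(0).count("<")               # tokens inside the match
    return tagged[first:first + size]


def find_places(chunk):
    """Stage (c): proper noun after 'from' is the source, after 'to' the destination."""
    source = destination = None
    for (word, _), (following, tag) in zip(chunk, chunk[1:]):
        if tag == "NNP" and word.lower() == "from":
            source = following
        elif tag == "NNP" and word.lower() == "to":
            destination = following
    return source, destination


def find_date(words):
    """Stage (c): recognise only the [DAY][MONTH][YEAR] form, e.g. '1st April 2019'."""
    for i in range(len(words) - 2):
        day = re.fullmatch(r"(\d{1,2})(st|nd|rd|th)?", words[i])
        month = words[i + 1][:3].lower()
        if day and month in MONTHS and words[i + 2].isdigit():
            return date(int(words[i + 2]), MONTHS.index(month) + 1,
                        int(day.group(1)))
    return None


def build_query(source, destination, travel_date, available):
    """Stage (d): assemble the SQL text from the identified values."""
    conditions = [f"source_stn = '{source}'",
                  f"destination_stn = '{destination}'"]
    if travel_date is not None:
        conditions.append(f"next_date_source = '{travel_date.isoformat()}'")
    if available:
        conditions.append("(available_seats1 > 0 OR available_seats2 > 0 "
                          "OR available_seats3 > 0)")
    # Plain string building keeps the sketch short; real code should bind
    # parameters instead of interpolating user input into SQL.
    return ("SELECT train_no, train_name FROM railways.train WHERE "
            + " AND ".join(conditions))


words = [word for word, _ in TAGGED]
source, destination = find_places(find_chunk(TAGGED))
query = build_query(source, destination, find_date(words), "available" in words)
print(query)
```

The query text printed here is only one plausible rendering. The exact query of Fig. 2(d) is not reproduced in this text, so the AND connectives, the quoting and the date format are assumptions of the sketch.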

3) Syntactical Analysis: In syntactic analysis, each of the lemmatized tokens is analyzed and, according to its context of appearance, tagged with a POS. Each word and its tag are packed into a tuple, and a list of all such tuples is obtained. In the proposed system, the Stanford POS Tagger is used for POS tagging. This tagger is preferred over the POS tagger present in the NLTK package as it provides more accurate tagging.

4) Semantic Analysis: In semantic analysis, we try to make sense of the tokens so that the system can proceed with the SQL query formation. This is achieved by parsing (or chunking). In the proposed system, the RegexpParser() (regular expression parser) is used for parsing the POS tagged input data. This parser chunks the data based on a regular expression. In our work, a regular expression is framed such that a phrase that has the source and destination information is classified into a separate chunk, which is extracted by means of a rule-based paradigm. The regular expression that is used in the project is:

CH: {(<VB.?>|<NN.?>|<JJ.?>)<NN>*<RB.?>*(<IN>|<TO>)?<NNP><NN.?>*<NN>*<VB.?>?(<IN>|<TO>)?<NNP><NNS>?<VB.?>?}

POS tags expansion: <> Enclosing brackets for all POS tags; VB Verb; NN Noun; JJ Adjective; RB Adverb; IN Preposition; TO to; NNP Proper Noun; NNS Noun (Plural form). Regular expression syntax overview: . any character; ? occurrence of 0 or 1 times of the preceding character; * occurrence of 0 or more times of the preceding character; | OR condition; () Group; CH Name given to the string (or chunk) that matches the above regular expression [5].

According to this regular expression, we detect a pattern that contains the source and destination in an extractable form. The pattern is such that the chunk begins with either a verb or a noun (the expression also admits an adjective). The condition includes a check for the noun tag because the POS tagger sometimes tags certain verbs as nouns. This may be followed by a noun (optional, denoted by *) or an adverb (optional, denoted by *). Next comes either the from keyword (POS tag IN) or the to keyword (POS tag TO). When these keywords appear, it is predicted that the user gives the source information immediately after from and the destination information immediately after to. Hence the regular expression specifies a check for a proper noun following from or to (since all locations are proper nouns). The proper noun may be followed by a noun (optional) or verb (optional) to include information such as station or junction (which are nouns). The same pattern then repeats: from or to, followed by a location, followed by an optional noun or verb. If the first pattern detected the source, the second pattern detects the destination, and vice versa. Finally, this regular expression is fed into the RegexpParser(), which looks for this pattern in the POS tagged input data. When the pattern is detected, it is classified as a separate chunk named CH, which appears as a subtree in the tree diagram of the parsed data.

5) Attribute Identification: An attribute is a column in the database table. A rule-based paradigm is followed for extracting the attributes from the parsed data.

a) Source - Destination Identifier: For finding the source and destination (if any), we make use of the list that was created in the last step. For each element in the list, if the element is a preposition (from or to), we extract the immediately following proper noun (location) as source (if from is encountered) or destination (if to is encountered). In some cases, the user may not use a preposition to specify the location details. In such cases, we check if each list element is a verb. If it is a verb, we make the system understand the meaning of the verb using the WordNet module. Initially we create a list of all possible verbs that the user may use, and then we compile the list of all possible synonyms of those words. These synonyms can be
obtained by using the synsets() function of WordNet. For each verb encountered, we check if the word lies in the synonyms of any of the predefined words we created. If it matches, then the proper noun following that verb is extracted as source or destination accordingly. If the verb encountered is similar to starting or departing, the proper noun following it will be the source. If the verb is similar to reaching, stopping or arriving, the proper noun following it will be the destination. If there is no source or destination, or if the system could not detect any source and destination, then the variables on the left-hand side of the function call will have None value.

b) Date Identifier: For extracting the date (if any), we make use of the datetime module. As the CH subtree of the parsed tree diagram contains only the source and destination, the date will be present outside the CH chunk. We look for the tag CD (cardinal number) for any number. Regular expressions are used to find certain patterns of date. The date can be of any format where the month is in the middle, such as DD/MM/YYYY. Apart from this, the user may give the date in formats like [MONTH][DAY][YEAR] or [DAY][MONTH][YEAR], such as December 29th, 2018 or 29th December 2018. If an element in the list of tuples is 29 or 29th, it is tagged as CD, NN or JJ. If it is detected, then the tuples at the index before and after, or two index positions after, the tuple in which the day (e.g. 29) is found are compared to find the date. If th, rd, or st is present in the word, it is replaced with an empty string. Using the regular expression (jan+)|(feb+)|(mar+)|(apr+)|(may+)|(jun+)|(jul+)|(aug+)|(sep+)|(oct+)|(nov+)|(dec+), we can detect the month in the list. Then datetime() is used to separately get the day, month and year in a regularized format, since it returns a formatted datetime object. If no date is detected, nothing is returned to the calling function, and hence the variable on the left-hand side of the function call has None value.

c) Fare Identifier: The proposed system can handle two types of queries: one in which the user wants to know the train number and train name, and the other in which the user wants to know the fare. To detect if the user wants to know about the fare, we search the set of lemmatized words for the keyword fare. If it is present, we search the succeeding and preceding words for the class name of which the fare is needed. The class names may be class1, class2 or class3. Regular expressions are used for this purpose. The regular expression class[.][123] is used to find out the class name. Once the class name is identified, its corresponding fare such as fare1, fare2 or fare3 is returned. If the user has not posed any fare-related question, then the variable on the left of fare_identifier() will have None value.

d) Train Name Identifier: In some cases, the users give the train name with or without the source, destination and date information. To detect whether a train name is present in the input data, we use the regular expression '(express)|(mail)|(passenger)'. According to this regular expression, the input data is searched for the words express, mail and passenger. If they are found, their preceding word along with the matched word is taken as the train name and it is returned. If no such train name could be found, then the variable on the left-hand side of the function call has None value.

e) Available Identifier: When the user asks if a train is available, the system needs to display only the trains which have seats available for booking. Hence the system must check if the trains that are to be displayed have at least one empty seat. For this, the input data is searched for the keyword available. If it is found, an additional part is added to the SQL query being formed. This part makes sure that there is at least one seat in either class1 or class2 or class3. If the available keyword is not found, then the variable on the left-hand side of the function call will have None value. The algorithm for SQL query formation is given below.

Algorithm: SQLmap()
query = "SELECT"
if (value = Fare_Identifier())
    query = query + value
else
    query = query + train_no + train_name
query = query + "FROM railways.train WHERE"
if (value = Train_Name_Identifier())
    query = query + "train_name=" + value
if (source, destination = Source_Destination_Identifier())
    query = query + "source_stn=" + source + "destination_stn=" + destination
if (date = Date_Identifier())
    query = query + "next_date_source" + date

IV. EXPERIMENTS AND RESULTS

We have taken the domain of railway reservation for our query translation. We have considered only a single table, railways.train, for which the SQL query would be produced. There are 24 attributes in the table: train_no, train_name, source_stn, destination_stn, arrival_time, departure_time, available_days, next_date_source, next_date_destination, total_seats1, total_seats2, total_seats3, booked_seats1, booked_seats2, booked_seats3, waiting_seats1, waiting_seats2, waiting_seats3, available_seats1, available_seats2, available_seats3, fare1, fare2, fare3. We have implemented the proposed system in Python 3 using the following packages: nltk, re, datetime, nltk.tokenize, nltk.tag, nltk.stem, nltk.corpus. The questions that are given as input to the proposed system are of two types: 1. What are the trains that are available from source to destination on a given date? 2. What is the fare of a specific class of the train travelling from source to destination on a given date? The dataset containing 2880 NL queries was formed using the above two formats of questions. The dataset was collected from friends and family by asking them to pose the same question in their own words. Let us consider an example to illustrate the working of the algorithm. Question: What are the available trains that travel from Delhi to Goa on 1st April 2019? The input sentence is split into words and stored in a list, i.e. tokenized, followed by removing unnecessary words by using stop words which is a list containing words that are of
not much importance. Then the list is lemmatized. Next, the words are tagged using StanfordPOSTagger(). The output of the above steps is shown in Fig. 2(a). During the POS tagging, sometimes the words get tagged incorrectly. Hence some of the tags are changed to keep them regularized. Then, the list is parsed using the RegexpParser(), to which the regular expression is given as argument. The regular expression is written so that it chunks the list such that the chunked part contains the source and destination. The regular expression is formed by observing the POS tagging of various sentences. The parse tree after parsing is given in Fig. 2(b). The above step is followed by mapping. There are five modules/functions in the program code to identify source-destination, fare, date, train name and the word available from the list. If any of them is not available in the list, their respective values would be None. For this example, the values are as shown in Fig. 2(c). The values in Fig. 2(c) are mapped to form the query in SQL. If source, destination, available and date are not None, then the SQL query would display the train number and train name of trains that have seats available in any of the three classes of the train travelling from that source to destination on the given date. The query for the given example is shown in Fig. 2(d). The total number of queries used as test cases is 2880, of which 2848 were correctly generated and 32 were incorrectly generated.

TABLE I: PERFORMANCE METRICS

Metric       Result obtained
Precision    0.5
Recall       0.494
F1 Score     0.497
Accuracy     0.9889

V. CONCLUSION

Although several methodologies are employed to extract information from a database, Natural Language Processing has set a new standard in doing the same. This work presents a clear picture of the steps that are involved in NLP. Various processes like tokenization, lemmatization, syntactic and semantic analysis are carried out to generate an equivalent SQL query from a natural language query. We have obtained an accuracy of 98.89 per cent.

The following future improvements can be incorporated: the input could be received in audio form and converted into textual format; the SQL query could be of greater complexity; the database could be larger in terms of attributes and tuples, and there could be multiple tables of related data accessed using the JOIN keyword; the system could be used to create chatbots for various sectors which handle large databases, helping users to access them with greater ease; the output could be converted into a sentence and then into audio format to make the system more interactive; and this work could be extended to other languages. To conclude, NLP is a boon to any ordinary person having no knowledge of database management. In short, NLP proves to be a promising tool for future advancements in numerous fields like medicine, business, education and so on.

REFERENCES

[1] Woods, William, (1972) “The lunar sciences natural language information system,” BBN report, Bolt Beranek and Newman.
[2] Hendrix, Gary G, Sacerdoti, Earl D and Sagalowicz, Daniel and Slocum, Jonathan, (1978) “Developing a natural language interface to complex data,” ACM Transactions on Database Systems (TODS), vol. 3, Issue 2, pp. 105–147.
[3] Sathick, K Javubar and Jaya, A, (2015) “Natural language to SQL generation for semantic knowledge extraction in social web sources,” Indian Journal of Science and Technology, vol. 8, Issue 1, pp. 1–10.
[4] Singh, Garima and Solanki, Arun, (2016) “An algorithm to transform natural language into SQL queries for relational databases,” Selforganizology, Directory of Open Access Journals, vol. 3, Issue 3, pp. 100–116.
[5] Huang, Bei-Bei, Zhang, Guigang et al., (2008) “A natural language database interface based on a probabilistic context free grammar,” IEEE International workshop on Semantic Computing and Systems, pp. 155–162.
[6] Rao, Gauri, Agarwal, Chanchal, Chaudhry, Snehal, et al., (2010) “Natural language query processing using semantic grammar,” International journal on computer science and engineering, vol. 2, Issue 2, pp. 219–223.
[7] Satav, Akshay G, Ausekar, Archana B and Bihani, et al., (2014) “A Proposed Natural Language Query Processing System,” International Journal of Science and Applied Information Technology, vol. 3, Issue 2, pp. 219–223.
[8] Iftikhar, Anum, Iftikhar, Erum, Mehmood and Muhammad Khalid, (2016) “Domain specific query generation from natural language text,” IEEE Sixth International Conference on Innovative Computing Technology (INTECH), pp. 502–506.
[9] El-Mouadib, Faraj A, Zubi, Zakaria S, Almagrous, Ahmed A, El-Feghi and Irdess S, (2009) “Generic interactive natural language interface to databases (GINLIDB),” International journal of computers, vol. 3, Issue 3, pp. 301–310.
[10] Bhadgale, Anil M, Gavas, Sanhita R, Patil, Meghana M and Pinki, R, (2013) “Natural language to SQL conversion system,” International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR), Vol. 3, Issue 2, pp. 161–166.
[11] Ghosh, Prasun Kanti, Dey, Saparja, Sengupta and Subhabrata (2014) “Automatic sql query formation from natural language query,” International Journal of Computer Applications (0975-8887), International Conference on Microelectronics, Circuits and Systems (MICRO-2014)
