

(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 14, No. 2, 2023

BERT Model-based Natural Language to NoSQL Query Conversion using Deep Learning Approach

Kazi Mojammel Hossen1, Mohammed Nasir Uddin2, Minhazul Arefin3, Md Ashraf Uddin4
Department of CSE, Jagannath University
Dhaka, Bangladesh1,2,3,4

Abstract—Databases are commonly used to store complex and distinct information. With the advancement of database systems, non-relational databases have been used to store vast amounts of data, as traditional databases are not sufficient for making queries over a wide range of massive data. However, storing data in a database is always challenging for non-expert users. We propose a conversion technique that enables non-expert users to access and filter data from a NoSQL database in a form as close to human language as possible. Researchers have already explored a variety of technologies in order to develop more precise conversion procedures. This paper proposes a generic NoSQL query conversion learning method to generate a Non-Structured Query Language from natural language. The proposed system includes natural language processing-based text preprocessing and the Levenshtein distance algorithm to extract the collection and attributes even if there are spelling errors. The analysis of the results shows that our suggested approach is more efficient and accurate than other state-of-the-art methods in terms of bilingual evaluation understudy scoring on the WikiSQL dataset. Additionally, the proposed method outperforms the existing approaches because our method utilizes a bidirectional encoder representations from transformers multi-text classifier. The classifier extracts database operations, which may increase the accuracy. The model achieves state-of-the-art performance on WikiSQL, obtaining 88.76% average accuracy.

Keywords—Natural language processing; NoSQL query; BERT model; Levenshtein distance algorithm; artificial neural network

I. INTRODUCTION

In today's digital age, non-relational databases are utilized in almost every industry to store information. Non-Structured Query Language (NoSQL) databases [1], [2] are increasingly being used for large-scale data sets, search engines, and real-time web applications [3]. Nowadays, NoSQL databases work as an alternative to relational databases [4] and other conventional databases [5].

With the growth of technology, NoSQL databases store a large amount of data in document stores, key-value stores, wide-column stores, and graph stores. As opposed to relational databases, MongoDB, CouchDB, Cassandra, etc. are designed on the architecture of distributed systems to store massive data [6]. Many organizations are gradually looking into approaches to understand and analyze this enormous unstructured data. The current approaches to data management, organization, and storage are being changed by "Big Data" [7], in particular open-source frameworks used to store vast amounts of structured, unstructured, and semi-structured data [8]. Normal users thus require knowledge of the query syntax and table schema to access and store a large amount of data. However, finding a reliable approach to generate the NoSQL query from Natural Language (English) is challenging. Using the NoSQL approach, amateur users can interact with the database system. The model facilitates communication between humans and computers without recalling the query syntax for non-relational databases. Natural Language Processing (NLP) [9], [10], [11] is a branch of linguistics, information engineering, computer science, and artificial intelligence that studies how computers and humans interact with natural language [12]. Traditional machine translation is applied to translate text from one language to another by NLP [13].

This research aims to develop a feasible tool for searching databases where natural language can be used without needing complex database queries developed by experts. Generating NoSQL from natural language has a wide range of applications. Tools with AI knowledge [14] such as Google Assistant or Alexa use the NLIDB system for non-technical users. Filling out a lengthy online form can be tedious, and users might need to navigate through the screen, scroll, and look up values in scroll boxes, whereas with NLIDB the users only need to type a question similar to a sentence. Consequently, such a tool has a wide range of usage and applications. The NoSQL approach has been researched both in academia and in industry [15]. In this paper, we implement a neural machine translation model which consists of four steps. First, we use the Natural Language Tool-Kit (NLTK) to perform text preprocessing. Second, collections and attributes are extracted using the Levenshtein Distance (LD) [16], [17] algorithm. Third, we use a bidirectional encoder representations from transformers (BERT) model-based multi-text classification [18] to extract the operations, including find, insert, update, and remove. The last step of the proposed approach is generating the query.

Many research works have used the WikiSQL dataset for converting natural language to Structured Query Language. The BERT model generates the NoSQL operational command from the WikiSQL task. The contributions of this research paper can be summarized as follows:

• Designing several algorithms to come up with a standard machine translation model for converting natural language into NoSQL queries.

• Resolving syntax errors for novice users using the Levenshtein Distance algorithm, which can extract the collection and attributes from the text even if users make spelling mistakes or use synonyms.

• Employing the latest contextual word representation BERT transformer model to extract the operations with a higher accuracy rate.

The remainder of the paper is organized as follows: related works conducted with the same and different technologies by other researchers are illustrated in Section II. Section III describes the proposed methodology and workflow. Section IV shows the experimental evaluation and results of the proposed system. Conclusions with future directions are detailed in Section V.

II. RELATED WORK

Research in natural language interfaces for databases goes back as far as the twentieth century, and interest in Natural Language Processing has grown tremendously since. In the early 1970s LUNAR [19], the first Natural Language Interface for a relational database (NLIDB), was introduced. LUNAR was a Question Answering (QA) system connected with the moon rock sample database; information on rock samples brought back from the moon was used to build the LUNAR database. The NLP-to-NoSQL query conversion field has seen very little research. This section discusses various works on natural language to query conversion.

In 2021, Minhazul et al. [20] suggested a machine learning-based NLP2SQL translation system. They used the Naive Bayes algorithm for command extraction and decision tree regression for condition extraction. Their proposed method lacks accuracy because of using the bag-of-words technique in the derivation of the condition from SQL. An advanced deep learning solution can mitigate this problem. Alternatively, they could use the neural translation technique for this machine translation approach; the system could also use the statistical translation method.

Mallikarjun et al. [21] proposed an automated NLP-based text processing approach. Their approach can successfully convert an Excel datasheet into a DBMS. Their system has a user authentication system that prevents unwanted users. The system has a limitation of 16,384 columns and 1,048,576 rows for an Excel worksheet. This may be massive for average purposes but not enough for big data.

An intelligent processing system for document-based NoSQL databases was proposed by Benymol et al. [22] in 2021. They used state-of-the-art algorithms and technologies to convert text into NoSQL. They used different types of TF-IDF schemes for information retrieval, machine learning algorithms for modeling, and hyperparameter tuning for model selection. The system may be vulnerable with stream and batch data on a Big Data processing platform. The proposed model also has a problem with dynamic processing strategies; in this situation, the system fails to find any possible solution.

Fatma et al. [23] proposed an automatic UML/OCL model for the NoSQL database converter. Their system mainly focuses on the big data platform, because NoSQL databases are widely used there. After creating the NoSQL database, the system automatically checks the OCL constraints of the model. There are different types of NoSQL databases, and most of them have a problem with integrity constraint checking; this is the most challenging task in the system.

In [24], M. T. Majeed et al. designed a fully automated framework that, using an AI technique, can recognize keywords, symbols, attributes, values, and relationships among various types of queries. Additionally, they proposed a graphical user interface where users could enter NL queries and have a NoSQL query created from those queries. For complex queries, the proposed framework didn't offer a solution.

S. Mondal et al. [25] introduced a query-response model that can respond to a variety of queries, including assertive, interrogative, imperative, compound, and complicated forms. This NoSQL system's primary task is to retrieve knowledge data from the default MongoDB database. This paper didn't give any solution for time-related queries such as "What is the age of x after 10 years?".

T. Pradeep et al. [26] presented a deep learning-based approach that converts English questions to MongoDB queries. They applied an encoder-decoder machine-translation method for this conversion. The encoder turns the NLQ text input into a vector and sends it to the decoder. The decoder uses a deep neural network to predict NoSQL queries. Their system uses ten different deep learning models to handle ten types of MongoDB queries, with one model providing the best possible answer for each type of problem.

Sebastian Blank et al. [27] suggested an end-to-end Question Answering (QA) system. It allows a user to ask a question in natural language on the Elasticsearch database. They solve the homogeneous operation problem of the database by using policy-based reinforcement learning. For that, they used Facebook's bAbI Movie Dialog dataset. They also designed KBQueryBot, an agent for translating a natural language query into the domain-specific query language based on a sequence-to-sequence model [28]. It gives every single answer with the help of an external knowledge base.

Some classic NLIDB systems can solve the spelling correction of misspelled words automatically [29]. The module provides the interface between computer and user via the database query language. Consequently, they discuss the overall system architecture of the NLIDB, some implementation details, and experimental results. The proposed work only focuses on automatic spelling and grammar correction.

Z. Farooqui et al. [30] recommended the conversion of English to SQL. For example, their system converts English questions or text queries into SQL queries, which are later run on databases. Their suggested technique and method are generic and smooth. It can handle both small and large applications for generic NLIDB systems. There are four types of input NLQ text: Normal, Linear Disjoint, Linear Coincident, and Non-Linear Model. It focuses on simple SQL query clauses such as SELECT, FROM, WHERE, and JOIN. Their system can handle complex queries resulting from ambiguous NL queries.

Tanzim Mahmud et al. [31] proposed a system based on Context-Free Grammar (CFG). Any input token of appropriate terminals found in the input NLQ will replace the corresponding attribute in the relational table or applicable operators of SQL. The interface can be configured easily and automatically by the user. It relies on the Metadata set and Semantic sets for tables and attributes. It can handle ambiguities in the input NLQ; for example, the system can solve the same attribute name clashing problem within a table. The limitation of the proposed CFG system is that it can only convert limited queries. Other than that, the system is dependent on a specific language.
Xiaojun Xu et al. proposed SQLNet [32], an NLP-to-SQL conversion approach which is order-independent and an alternative to the traditional sketch-based program synthesis approaches [33]. An out-of-order input NLQ text is not a problem in that case. It uses a sequence-to-set model with a column attention mechanism that generates SQL queries. It represents pseudo-tasks with the help of a relevance function and works on the WikiSQL task.
Victor Zhong et al. [34] considered the availability of the query ground truth (intermediate labels) and the database response. They proposed Seq2SQL, a modular approach that translates NLQs into SQL queries. Their suggested system also generalizes across different table schemas. There are three modules in Seq2SQL. The first module tries to identify an aggregator function like MIN() or MAX(). The second module extracts column names from the NLQ and uses them in the select operator. Both modules work on question-answer pairs. The third module extracts the condition or where-clause from the NLQ. There is a possibility of swapped arguments in the WHERE clause. This ambiguity problem can be solved by policy-based reinforcement learning [35] on question-answer pairs.

III. PROPOSED METHODOLOGY

The main concept behind this method is to transform Natural Language (NL) into Non-Structured Query Language (NoSQL) using the Natural Language Tool Kit (NLTK) and a deep learning model. The concept and its description are formalized in the following sections. The proposed architecture is shown in Fig. 1.

Fig. 1. Proposed methodology

A. Input Natural Language Query (NLQ)

An NLQ consists of only the normal terms of the user's language, without any special format or syntax. A natural language query (NLQ) in English is given as input. This input text will be processed to extract information and later converted into NoSQL queries.

1) Text preprocessing: Beyond the inventory of individual words, text can take many forms, ranging from sentences to many paragraphs with special characters. In NLP, text preprocessing is an important task and the first step toward building a model. It is a data mining technique that transforms plain text into a machine-readable format. Real-world data is frequently inadequate, inconsistent, or deficient in specific behaviors and is likely to contain various errors. This step is needed to transfer input text from human language to a machine-readable format for further processing.

In this paper, we have used NLTK for text preprocessing. NLTK is the most widely used and well-known of the NLP libraries in the Python ecosystem. It is used for all sorts of tasks, from lowercase conversion to tokenization, removing escape words, part-of-speech tagging, and beyond. The input text will be processed to extract information from the natural language query. From the processed text, the system will extract the collection, attribute, and operation for making a NoSQL query.

2) Lowercase conversion: Lowercase conversion is the first step of text preprocessing. In this step, the input NLQ is converted into lowercase format. Although the uppercase and lowercase forms of words are supposed to have no difference in meaning, all uppercase characters are usually changed into lowercase form before classification.

3) Tokenization: Tokenization splits the natural language query, phrase, string, or entire text document into smaller units such as individual words or tokens. Sentence Boundary Disambiguation (SBD) is often used first to form a list of individual sentences. It depends on a pre-trained, language-specific algorithm similar to the Punkt models from NLTK. The text is divided into a list of words using an unsupervised algorithm that builds a model for abbreviated words. For the English language, a pre-trained Punkt tokenizer is included in the NLTK data package.

4) Removing escape words: Escape or extra words are words that frequently appear within the text without carrying much information or content. The escape words are removed because they are not needed in the analysis of the query. For the purpose of building queries, several sets of escape words have been developed. In this paper, we propose a new set of escape words. Auxiliary verbs and prepositions are mainly used in this context as escape words, such as 'a', 'an', 'the', 'is', 'of', 'with', 'to', 'for', 'and', 'all', etc.
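To make the preprocessing stage concrete, here is a minimal sketch that chains the steps above (lowercasing, Punkt tokenization, escape-word removal, and PoS tagging) with NLTK. The escape-word list is a small illustrative subset rather than the paper's full set, and the exact tags returned can vary with the tagger model.

# Sketch of the NLTK-based preprocessing pipeline (Section III-A).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Illustrative subset of the escape words listed above.
ESCAPE_WORDS = {"a", "an", "the", "is", "of", "with", "to", "for", "and", "all"}

def preprocess(nlq):
    tokens = nltk.word_tokenize(nlq.lower())            # lowercase + tokenize
    tokens = [t for t in tokens if t.isalnum() and t not in ESCAPE_WORDS]
    tagged = nltk.pos_tag(tokens)                       # PoS tagging
    # Keep only noun and verb tags (see Algorithm 2 below).
    return [(w, t) for w, t in tagged if t.startswith("NN") or t.startswith("VB")]

print(preprocess("Find the name of all student"))
# e.g. [('find', 'VB'), ('name', 'NN'), ('student', 'NN')]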

TABLE I. ESCAPE WORDS

Escape Words | With Escape Words | Without Escape Words
all, the | Find all the students | Find students
is, the, of, all | What is the name of all student? | What name student
the, and | Insert the student name x and age 20 | Insert student name x age 20

Besides the words in Table I, this step eliminates punctuation from the input natural query. The detailed process of eliminating escape words is illustrated in Algorithm 1.

Algorithm 1: Removing Extra Words
Input: I = input words; E = list of extra words
Output: L = list of words after removing the extra words
cw = CountWord(I)
for each index i in 1..cw do
    s = I[i]
    TOKENS = Tokenization(s)
    TOKEN = EMPTY
    for t in TOKENS do
        if t not in E then PUSH(TOKEN, t) end
    end
    PUSH(L, TOKEN)
end
return L

Removing escape words is a simple but essential aspect of many text mining applications because it reduces memory overhead. It can reduce noise and false positives. This method can potentially improve the predictive power of any text mining application.

5) Parts of Speech (PoS) tagger: PoS tagging helps in text-to-speech conversion, information retrieval, and word sense disambiguation. It is used for the classification of words into their PoSs and for labeling them according to the tagset, the collection of tags used for PoS tagging. PoS tags are also referred to as word classes or lexical categories. However, not all PoS tags are necessary to analyze. All PoS tagging attributes are provided by the NLTK toolkit. The PoSs are defined as the following:

• Noun Tags = ['NN', 'NNS', 'NNP', 'NNPS']
• Adjective Tags = ['JJ', 'JJS', 'JJR']
• Verb Tags = ['VB', 'VBP', 'VBD', 'VBG', 'VBZ']
• Adverb Tags = ['RB', 'RBR', 'RBS']

Adverb and adjective tags do not have much significance in generating NoSQL queries. Only noun and verb tags are considered for the next steps of PoS tagging, because verb and noun tags may indicate the command and the attributes or table name, respectively. Algorithm 2 illustrates the process.

Algorithm 2: Keeping Necessary Tags
Input: W = all the words after removing stop words
Output: T = necessary tags with the appropriate words
c = CountWord(W)
for each index i in 1..c do
    w = W[i]
    TAGS = Tagging(w)
    TAG = EMPTY
    for t in TAGS do
        if t in VERB then PUSH(TAG, t) end
        if t in NOUN then PUSH(TAG, t) end
    end
    PUSH(T, TAG)
end
return T

B. Collection and Attribute Extraction

The Levenshtein distance (LD) algorithm is used in a specific solution to extract collections and attributes from natural language queries. The approach starts by counting how many words in the list are similar to one another. Afterward, it compares every single similar word with every attribute using the LD, and the synonym list obtained from WordNet plays a crucial role in extracting attribute and collection names. This method keeps a list of words that are synonyms for each noun tag. Using WordNet, a list of noun tag synonyms is generated. The aim of making a synonym list is to find a specific collection and attribute from an input query. Every user formulates their query in a different way; they also use different words to describe the attribute or collection names. So, this approach checks synonyms of the words from the user query in the WordNet library. We give some analogies in Table II.

TABLE II. ANALOGY BETWEEN TEXT AND INTENTION DETECTION WHEN APPLYING THE LD ALGORITHM

Text | Detection
Find the name of all student | Collection(all student)
What is the accommodation of student id 01 | Attribute(address)
Find the Fastname of the students | Attribute(name)

In this paper, the LD algorithm works with a threshold. Each sentence word is compared with the collection name; if the value is within the threshold, then the collection and attributes are saved with the appropriate name in a list. Finally, this approach gives as output the matched collection and attribute. The designed algorithm for collection and attribute extraction is described in Algorithm 3.

The Levenshtein distance formula is used to measure the distance between two strings a and b with lengths |a| and |b|, respectively:

LD_{a,b}(m, n) =
    max(m, n)                                   if min(m, n) = 0,
    min { LD_{a,b}(m-1, n) + 1,
          LD_{a,b}(m, n-1) + 1,
          LD_{a,b}(m-1, n-1) + 1_{(a_m ≠ b_n)} }   otherwise.

Here 1_{(a_m ≠ b_n)} is the indicator function that is equal to 0 when a_m = b_n and 1 otherwise, and LD_{a,b}(m, n) is the distance between the first m characters of a and the first n characters of b.
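The recurrence above translates directly into the dynamic-programming sketch below, together with the threshold matching used to recover collection and attribute names from possibly misspelled query words. The schema dictionary and the threshold of 1 are illustrative assumptions; in the full system, candidate words would also be expanded with WordNet synonyms as described above.

# Sketch of Levenshtein distance plus threshold-based schema matching.
def levenshtein(a, b):
    m, n = len(a), len(b)
    prev = list(range(n + 1))                       # LD(0, j) = j
    for i in range(1, m + 1):
        cur = [i] + [0] * n                         # LD(i, 0) = i
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1     # indicator 1_(a_m != b_n)
            cur[j] = min(prev[j] + 1,               # deletion
                         cur[j - 1] + 1,            # insertion
                         prev[j - 1] + cost)        # substitution
        prev = cur
    return prev[n]

# Hypothetical schema: collection name -> attribute names.
SCHEMA = {"all_student": ["name", "age", "address"]}

def match_attribute(word, max_distance=1):
    for collection, attributes in SCHEMA.items():
        for attr in attributes:
            if levenshtein(word.lower(), attr) <= max_distance:
                return collection, attr
    return None

print(match_attribute("nane"))    # ('all_student', 'name') despite the typo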

Fig. 2. Operation extraction using BERT model

Algorithm 3: Attribute Extraction
Input: W = list of attributes from the database; C = list of collection names from the database; S = set of similar words
Output: A = attribute names; B = table names
t = CountWord(S)
for i ← 1 to t do
    for j ∈ S do
        LD-THRESHOLD = 1
        THRESHOLD = LD-Algorithm(S[j], W[j])
        if LD-THRESHOLD > THRESHOLD then
            PUSH(A[i], W[j])
            PUSH(B[i], C[i])
        end
    end
end
return A, B

C. Operation Extraction

Operation extraction is a particular solution that uses the BERT model to extract operations from natural language queries. In this approach, we use the BERT model to classify the specific operation. In machine learning, classification is the assignment of a category to an observation on the basis of a training set of data containing observations (or instances) whose categorical membership is known [36]. A classification model tries to make some inferences from the observed data. To predict one or more outcomes from the dataset, one or more data items are provided as inputs to the categorization model.

On the dataset, BERT employs a novel technique known as Masked Language Modeling (MLM), in which it masks words in the sentence at random and then attempts to predict them. It doesn't use common left-to-right or right-to-left language models. Instead, it uses the bidirectionally trained sequence with a deeper sense of language context. The BERT model applies two stages:

• Pre-training the BERT to understand language.
• Fine-tuning the BERT to learn a specific task.

BERT depends on a Transformer (the self-attention mechanism that learns contextual relationships between words in a text). A simple Transformer consists of an encoder that reads the text input and a decoder that generates a task prediction. Since BERT is used to generate a language representation model, it only requires the encoder part. There are two main models of BERT:

• BERT base has 12 transformer blocks, a hidden size of 768, 12 attention heads, and 110M parameters.
• BERT large has 24 transformer blocks, a hidden size of 1024, 16 attention heads, and 340M parameters.

In this paper, we used the BERT base model, which has enough pre-trained data to help bridge the gap in data.
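As a quick check of the dimensions quoted above, the configuration of a pre-trained BERT-base checkpoint can be inspected; the Hugging Face transformers library is assumed here, since the paper does not name its implementation.

# Inspecting BERT-base dimensions (assumes the transformers package).
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)     # 12 transformer blocks
print(config.hidden_size)           # 768 hidden size
print(config.num_attention_heads)   # 12 attention heads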

The model for operation extraction is shown in Fig. 2. Given the input text, the model tokenizes the text using the BERT tokenizer and then generates the input masks with the input IDs of the sentence. The input mask uses WordPiece [37] for tokenizing, which splits a token like "going" into "go" and "ing". This is mainly to cover a broad spectrum of Out-Of-Vocabulary (OOV) words. After tokenization, the output class goes as input into the classification model. We used a neural network for classification to get the highest accuracy. After classifying, we get the output of the operation. Here we work on four types of operations: FIND, INSERT, UPDATE, and REMOVE.

D. Build Syntax Tree & Generate Query

After tokenization, the collection, attribute, and operation are extracted from the sentence, and we map the syntax tree with key-value pairs to build the query sequentially with the logical expression. If there is no logical expression in the sentence, it will be nulled. Fig. 3 shows the syntax tree.

Fig. 3. Mapping syntax tree

Finally, we concatenate the whole steps part-by-part and generate a NoSQL query. Fig. 4 shows the architecture of NoSQL query generation and gives the output of the result.

Fig. 4. Architecture of query generation

IV. EXPERIMENTAL ANALYSIS AND RESULT

In this section, we evaluate our proposed model with the dataset. Firstly, we present the analysis of our dataset, then set up the evaluation. In the end, we compare our proposed model with the existing works and mention the differences, weak points, and strong points of our proposed model.

A. Dataset

We reshuffle the WikiSQL dataset for a better understanding of our model's performance. WikiSQL is a massive crowd-sourced dataset for creating NLIDBs. The model is retrained periodically by reflecting the latest dataset. Our proposed model has used two types of data: (1) the Natural Language Query column, which represents the natural language query, and (2) the Operations column. The description of the dataset is illustrated in Table III.

TABLE III. DESCRIPTION OF DATASET

Dataset | WikiSQL
Language | NLQ
Total number of cases | 80,654
Length of the text (average) | 61.09
Word count of the text (average) | 11.66
Granularity of text description | line
Number of validation texts | 8,421
Number of test cases (total) | 15,878
Number of train cases (total) | 56,355

To avoid overfitting, we split the dataset into a training set and a testing set. We train our model on 70:30, 60:40, and 80:20 ratios and get the optimal result from the 80:20 ratio on our dataset. The data fields are the same among all splits. WikiSQL is a collection of hand-annotated SQL tables, questions, and query examples from Amazon Mechanical Turk crowd workers. It is orders of magnitude larger than earlier datasets, with 87,000 samples as of this writing. The number of validation queries is 8,421. We build queries for the tables and then ask crowd workers to paraphrase them. Each paraphrase is then double-checked by independent personnel to ensure that it does not alter the meaning of the original inquiry. We anticipate that making WikiSQL available will aid the community in developing the next generation of natural language interfaces (Fig. 5).

Fig. 6 illustrates a histogram which shows the word and text distribution of the dataset. It is a hand-annotated semantic parsing dataset that contains logical and normal forms, respectively. In the dataset, the data is extracted from the web.

B. Text Pre-Processing

Text pre-processing is the first step of our proposed system. This step involves removing noise from our dataset. We apply several pre-processing steps to the data to convert words into numerical features. An example of tokenization is:

Input: 'find the name of all student'
Output: ['find', 'the', 'name', 'of', 'all', 'student']

C. Collection and Attribute Extraction from WordNet

We used NLTK WordNet to find synonyms and antonyms of words. WordNet is a lexical database that contains semantic relationships between words and their meanings. Our proposed model can successfully extract the collection and attributes from the WordNet library even if spelling errors occur or synonyms are used. Fig. 5 shows how the collection and attribute are extracted from WordNet using the Levenshtein distance. For example:

Collection extraction: 'all student': ['student', 'students']
Attribute extraction: 'name': ['name', 'title', 'label']
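Once the collection, attributes, and operation have been extracted as above, the generation step of Section III-D concatenates them into a MongoDB-style query string. The helper below is an illustrative sketch; the function and argument names are assumptions, not the paper's code.

# Sketch of the final query-generation step.
def generate_query(collection, operation, attributes=None, condition=None):
    parts = []
    if condition:
        parts.append(condition)                    # e.g. "{name:'x'}"
    if attributes:
        parts.append("{" + ", ".join(f"{a}:True" for a in attributes) + "}")
    return f"db.{collection}.{operation}({', '.join(parts)})"

print(generate_query("all_student", "find", attributes=["name"]))
# db.all_student.find({name:True})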

Fig. 5. Collection and attribute extraction

D. BERT Tokenizer

In operation extraction, our proposed system starts with the BERT tokenizer step. Instead of relying only on sinusoidal positional encoding, the model itself learns the positional embeddings during the training phase. It uses the word-piece tokenizer concept, which breaks some words into sub-words.

This often helps to break unknown words into known sub-words and tokenize our text into tokens that correspond to BERT's vocabulary. An example of BERT tokenization is:

Input: 'find the name of all student'
Output 1: [101, 1023, 12334, 15233, 2033, 2435, 24353, 102]
Output 2: ['[CLS]', 'find', 'the', 'name', 'of', 'all', 'student', '[SEP]']

Output 1 gives the indices of the input tokens from the vocab file, and Output 2 is the reverse, a human-readable form of the input IDs. Apart from the input tokens, we also get two special tokens, '[CLS]' and '[SEP]'. The BERT model is designed in such a way that the sentence has to start with the [CLS] token and end with the [SEP] token.

E. Split Data for Training and Testing

The training phase is the first step for the BERT model. This model is a transformer design based on an encoder stack. We trained on the WikiSQL dataset using this model. The model uses the semi-supervised learning approach for translating a natural language query into an operation. The training sub-dataset contains all of the features required to turn a natural language query into an operational query. To partition the WikiSQL dataset into two sub-datasets, we use the scikit-learn library's "train_test_split" method. The suggested system is built using the dataset's training sub-dataset. The training dataset is a fraction (80%) of the whole data set; the rest (20%) is considered test data. This information is imported as a .csv file. Table IV shows a portion of the training data set.

Fig. 6. Word and text distribution

TABLE IV. A SAMPLE OF TRAINING DATASET

Line No. | Natural Language Query | Operation
1 | What's Dorain Anneck's pick number? | find
2 | Find the student whose name is x. | find
3 | Insert the arrival time of greenbat. | insert
4 | Put the status of the trains at location Museum | insert
5 | Update the record for september 15, 1985. | update
6 | Re-equip the student | update
7 | Remove the brighton cast for jerry cruncher | remove
8 | Delete the all student | remove
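The sketch below shows both steps described in this subsection: BERT word-piece tokenization with the [CLS]/[SEP] markers, and the 80:20 train/test partition via scikit-learn. The Hugging Face BertTokenizer and the tiny in-line dataset are assumptions for illustration.

# Sketch of tokenization and the 80:20 split (Sections IV-D and IV-E).
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer.encode("find the name of all student")
print(ids)                                   # token ids, e.g. [101, ..., 102]
print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', 'find', ..., '[SEP]']

queries = ["find the name of all student", "insert the student name x",
           "update the record for september", "remove all the students"]
operations = ["find", "insert", "update", "remove"]
X_train, X_test, y_train, y_test = train_test_split(
    queries, operations, test_size=0.20, random_state=42)   # 80:20 split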

Fig. 7. Comparison of the four types of operations

F. Model Building

BERT is an architecture that uses a transformer encoder to process each token of input text in the context of all other tokens. After splitting the dataset, we start with the pre-trained BERT model to classify the find, insert, update, and remove operations. In our model we use 12 layers of Transformer encoder. After running the model we get two variables: the first variable contains the embedding vectors of all of the tokens in a sequence, and the second variable contains the embedding vector of the [CLS] token. We then pass that vector into a linear layer with a ReLU activation function. We have a vector of size 4 at the end of the linear layer, each element of which corresponds to a category of our labels (find, insert, update, and remove). We use Adam as the optimizer and train the model for 10 epochs. Because we are dealing with multi-class classification, we need to use categorical cross-entropy as our loss function. Fig. 8 depicts the operation. For example:

Input: 'find the name of all student'
Output: 'find'
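A minimal sketch of the classifier just described, assuming PyTorch and the Hugging Face BertModel: the [CLS] embedding is passed through a ReLU and a size-4 linear layer over {find, insert, update, remove}, optimized with Adam and cross-entropy. The learning rate is an assumed value; the paper only fixes the optimizer, loss, and 10 epochs.

# Sketch of the BERT-based operation classifier (Section IV-F).
import torch
import torch.nn as nn
from transformers import BertModel

class OperationClassifier(nn.Module):
    def __init__(self, n_labels=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # 12 layers
        self.relu = nn.ReLU()
        self.out = nn.Linear(self.bert.config.hidden_size, n_labels)

    def forward(self, input_ids, attention_mask):
        # pooler_output is derived from the [CLS] token embedding.
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        return self.out(self.relu(pooled))                  # logits of size 4

model = OperationClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)   # lr is assumed
loss_fn = nn.CrossEntropyLoss()     # categorical cross-entropy; 10 epochs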

Fig. 8. Model building

The model improves the accuracy rate for classification over the previous model. For the classification task, the model achieves 81.45% average class detection compared with previous research. One of the reasons is that BERT uses a pre-trained model which is based on transfer learning. It can tune the data on a specific NoSQL language. Fig. 7 illustrates the accuracy rate of the four types of operations separately.

G. Model Accuracy

Accuracy evaluates how well our model's forecasts compare with the original values. With low accuracy and high loss, the model makes huge mistakes on the data. Low loss together with low accuracy means that the model produces smaller errors on most of the data; however, if both are high, it makes huge mistakes in some cases. The ideal scenario for any model is high accuracy and small loss. Fig. 9 illustrates the accuracy of the proposed model.

Fig. 9. Model accuracy

H. Model Loss

Loss is the total of our model's errors. It evaluates how well our model does (or how badly it does). When there are a lot of mistakes, the loss is high and the model doesn't work properly; the better our model works, the lower it is. However, the greatest conclusion we can draw from it is whether the loss is big or low. If we plot losses over time, we can evaluate if and how quickly our model is learning. This is because the loss function is utilized by the model for learning, in the shape of approaches like gradient descent, which modify the parameters of the model using information on the loss outcome. Fig. 10 illustrates the loss of the proposed model.

Fig. 10. Model loss

I. Output

In the output, we get the collection and attribute name, such as all student and name. From the operation extraction, we get the find operation, and then we concatenate all the extraction outputs part-by-part to generate a NoSQL query.

We have classified the wrong outputs into two categories: (a) sometimes the query contains an incomplete logical expression in the condition part; (b) the query is incorrect. Analysis of the conversion results reveals the following:

• Observing all the NoSQL outputs, we notice that the suggested model can work with natural language queries of different lengths. After a successful NoSQL query output, the number of input and output tokens might be distinct. The accuracy of the proposed model does not depend on the length of the query.

• The BERT model successfully predicts the operation using a pre-trained model. It also tunes the NoSQL command from input text of varying size.

• The BERT model can process a large amount of data. The WikiSQL dataset covers different types of query statements, so there is no problem for the BERT model to work with the WikiSQL dataset.

• The BERT model understands the semantic relationship between natural language and NoSQL queries. As a result, the decoder output is logically correct for most queries.

• The model can generate "contextualized" word embeddings, but it is compute-intensive at inference time and needs to compute the vectors every time.

• In collection and attribute extraction, we use the Levenshtein Distance algorithm. The algorithm can extract attributes from natural language queries and furthermore check for spelling errors. The run time complexity of this algorithm is lower than O(n²).

The test results shown in Table V have been translated into the NoSQL syntax. The test data contains the natural language query as well as the appropriate output. The output contains each converted NoSQL query with the original query and test query, along with the percentage match of the converted NoSQL query.

J. Evaluation Setup

In this work, we evaluate the results on our dataset using three notions of query synthesis accuracy:

• Normal form accuracy is the accuracy of a NoSQL query that has no attribute. We analyze the synthesized NoSQL query against the ground truth to verify whether they match each other.

• Logical form accuracy is the accuracy of a NoSQL query that has attributes or any logical expression in the query.

• Query match is the comparison accuracy against the original query for find, insert, update, and remove operation queries. We use a canonical representation of the synthesized NoSQL query and the ground truth to determine whether two NoSQL queries are identical.

We also compute the F1 score for operation extraction, which combines the precision and recall values. Finally, we present the comparison of our model with previous work on NoSQL conversion tasks. We implemented our model using Python [38].

The F1-score measures the accuracy of the operation (find, insert, update, remove) by applying the precision and recall values of the test. This test looks at whether the system can process the sentences entered by the user, so that the operation can be measured accurately with the F1-score method. Table VI shows the accuracy values. The equations for F1-score, precision, and recall are given below:

• Precision: the true positive relevance rate, defined as the ratio tp/(tp + fp), where fp indicates the number of false positives;

• Recall: the true positive rate, defined as the ratio tp/(tp + fn), where tp and fn are the numbers of true positives and false negatives, respectively;

• F1-score: the harmonic mean of Precision and Recall, defined as 2 * (precision * recall) / (precision + recall).

Next, we find the accuracy of the normal and logical forms. Let X be the total number of queries in our dataset and X_ex be the number of correctly executed queries. We evaluate every clause (find, insert, update, and remove) using the accuracy metric, for normal form Acc_nf = X_ex / X and for logical form Acc_lf = X_ex / X. Table VII shows the accuracy of normal and logical queries. After that, the overall result is evaluated by BLEU (Bilingual Evaluation Understudy), which was developed to evaluate machine translation systems.
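The metrics in this subsection reduce to a few ratios; the sketch below computes them from raw counts. The counts are illustrative placeholders chosen so the outputs land near the Table VI values, not numbers taken from the experiments.

# Sketch of the evaluation metrics (Section IV-J).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall

def form_accuracy(x_ex, x_total):
    # Acc_nf or Acc_lf: correctly executed queries over total queries.
    return x_ex / x_total

p, r = precision(tp=83, fp=10), recall(tp=83, fn=29)
print(round(p, 3), round(r, 3), round(f1(p, r), 3))   # 0.892 0.741 0.81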

TABLE V. THE ACCURACY FOR CONVERTING NATURAL LANGUAGE INTO NON-STRUCTURED QUERY LANGUAGE

Input Text | Original query | Test query | Accuracy (%)
Find all the students | db.all_student.find() | db.all_student.find() | 100
What is the name of all student? | db.all_student.find({name:True}) | db.all_student.find({name:True}) | 100
Find the student whose age greater than 70 | db.all_student.find({ age: { $gt: 70 }) | db.all_student.find({ age: { 70 }) | 75.0
Insert the student whose name is x | db.all_student.insert({name:'x'}) | db.all_student.insert({name:'x'}) | 100
Insert student whose name is x, age 22 | db.all_student.insert({name:'x'},{age:22}) | db.all_student.insert({name:'x'},{age:22}) | 83.33
Update the name y who is x in student table | db.all_student.update({name:'x'},{$set:{name:'y'}}) | db.all_student.update({name:'x'},{$set:{name:'y'}}) | 100
Update name z and age 40, whose name is x | db.all_student.update({name:'x'},{$set:{name:'y'}}) | db.all_student.update({name:'x'},{$set:{name:'x'}}) | 65.5
Remove all the students | db.all_student.drop() | db.all_student.drop() | 100
Delete student whose name is x and age 20 | db.all_student.removeMany({name:'x', age: 20}) | db.all_student.remove({name:'x', age: 20}) | 75.0

K. Result

The article presents an efficient approach to transform the natural language query into a NoSQL query effectively. This model achieves a competitive result on our dataset. The following tables represent the experimental results of each classifier.

TABLE VI. EXPERIMENTAL RESULTS OF EACH CLASSIFIER

F1-score | 0.808
Precision | 0.892
Recall | 0.74

Bilingual Evaluation Understudy (BLEU) is a score for comparing a candidate translation of the NoSQL query to one or more reference translations. To predict the accuracy of automatic machine translation systems, Kishore Papineni et al. [39] proposed this score in 2002. We used the BLEU score to evaluate the output.

BLEU is not entirely effective but offers several interesting benefits: it is quick, easy to calculate, language-independent, correlates highly with human interpretation, and is widely used.

P = m / w_t    (1)

where m is the number of tokens from the candidate query that are found in the reference text, and w_t is the total number of words in the candidate query. The accuracy is calculated using equation (2):

Accuracy = P × 100%    (2)
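Equations (1)-(3) can be exercised directly on a candidate/reference pair; the sketch below uses an illustrative pair, not a row reproduced from Table V.

# Sketch of equations (1)-(3): unigram precision, accuracy, error rate.
def bleu_precision(candidate, reference):
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    m = sum(1 for tok in cand_tokens if tok in ref_tokens)  # matched tokens
    return m / len(cand_tokens)                             # P = m / w_t

reference = "db.all_student.find ( { name : True } )"
candidate = "db.all_student.find ( { title : True } )"

p = bleu_precision(candidate, reference)
accuracy = p * 100                 # equation (2)
error_rate = 100 - accuracy        # equation (3)
print(p, accuracy, error_rate)     # 0.875 87.5 12.5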

The performance analysis of our model is given in Table VII.

TABLE VII. PERFORMANCE ANALYSIS OF OUR MODEL. Acc_nf AND Acc_lf INDICATE THE NORMAL FORM AND LOGICAL FORM QUERY ACCURACY

Operation clause | Acc_nf (%) | Acc_lf (%)
Find | 100 | 87.5
Insert | - | 91.67
Update | - | 82.75
Remove | 100 | 75.0

Accounting for concept identification errors and domain dictionary errors, the average accuracy achieved by our system is 88.76%. We define the error rate as:

ErrorRate = (100 − AccuracyRate)%    (3)

TABLE VIII. ANALOGY OF DIFFERENT TYPES OF MODEL

Model | Accuracy (%) | Error Rate (%)
Encoder-Decoder Model | 71.5 | 28.5
REINFORCE-algorithm Model | 84.2 | 15.8
Proposed Model | 88.76 | 11.24

Table VIII represents the BLEU-based efficiency for forecasting the correct NoSQL query. Using the reshuffled WikiSQL dataset, the proposed model is compared with the existing models. Fig. 11 illustrates the three models' estimated efficiency and error rates. It demonstrates that the accuracy of converting the natural language query into the non-structured query language (88.76%) is better than, or at least competitive with, the earlier results.

Fig. 11. Performance factor between previous and our proposed model

V. CONCLUSION

In the age of digitalization, internet users have been increasing continuously, so a large amount of data needs to be stored in databases. Relational databases face some challenges in search engines and social networking services. Here, the NoSQL database helps maintain a broad range of hierarchical data models. The proposed model deals with NLP to non-relational query conversion. Initially, the text (English) is preprocessed by NLTK; then the LD algorithm is used for collection and attribute extraction and the BERT model for operation extraction; and finally, the query is generated. Our model can generate queries for the Find, Insert, Update, and Remove clauses with an average accuracy of 88.76%. In the future, we intend to support more complex NoSQL queries, such as logical function queries, using other incentive mechanisms for better performance.

ACKNOWLEDGMENT OF FUNDING

This work was supported by the UGC Jagannath University Research Branch, Dhaka, Bangladesh, under grant JnU/research/rp/2020-21/science/44.

REFERENCES

[1] J. Han, E. Haihong, G. Le, and J. Du, "Survey on NoSQL database," in 2011 6th International Conference on Pervasive Computing and Applications. IEEE, 2011, pp. 363–366.
[2] A. Nayak, A. Poriya, and D. Poojary, "Type of NoSQL databases and its comparison with relational databases," International Journal of Applied Information Systems, vol. 5, no. 4, pp. 16–19, 2013.
[3] R. S. Al Mahruqi, "Migrating web applications from SQL to NoSQL databases," Ph.D. dissertation, Queen's University (Canada), 2020.
[4] S. Batra and C. Tyagi, "Comparative analysis of relational and graph databases," IJSCE, vol. 2, no. 2, pp. 509–512, 2012.
[5] R. Alexander, P. Rukshan, and S. Mahesan, "Natural language web interface for database (NLWIDB)," arXiv preprint arXiv:1308.3830, 2013.
[6] Z. Wei-ping, L. Ming-xin, and C. Huan, "Using MongoDB to implement textbook management system instead of MySQL," in IEEE-ICCSN, 2011, pp. 303–305.
[7] P. Chen and C. Zhang, "Data-intensive applications, challenges, techniques and technologies: A survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.
[8] B. Jose and S. Abraham, "Unstructured data mining for customer relationship management: A survey," International Journal of Management, Technology and Engineering, vol. 8, no. 7, 2018.
[9] O. Ferschke, J. Daxenberger, and I. Gurevych, "A survey of NLP methods and resources for analyzing the collaborative writing process in Wikipedia," in The People's Web Meets NLP. Springer, 2013, pp. 121–160.
[10] I. Garrido-Muñoz, A. Montejo-Ráez, F. Martínez-Santiago, and L. A. Ureña-López, "A survey on bias in deep NLP," Applied Sciences, vol. 11, no. 7, p. 3184, 2021.
[11] S. Srivastava, A. Shukla, and R. Tiwari, "Machine translation: From statistical to modern deep-learning practices," arXiv preprint arXiv:1812.04238, 2018.
[12] P. Kłosowski, "Deep learning for natural language processing and language modelling," in 2018 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA). IEEE, 2018, pp. 223–228.
[13] U. K. Acharjee, M. Arefin, K. M. Hossen, M. N. Uddin, M. A. Uddin, and L. Islam, "Sequence-to-sequence learning-based conversion of pseudo-code to source code using neural translation approach," IEEE Access, vol. 10, pp. 26730–26742, 2022.
[14] B. Jose and S. Abraham, "Intelligent processing of unstructured textual data in document based NoSQL databases," Materials Today: Proceedings, 2021.
[15] N. Yaghmazadeh, X. Wang, and I. Dillig, "Automated migration of hierarchical data to relational tables using programming-by-example," Proceedings of the VLDB Endowment, vol. 11, no. 5, pp. 580–593, 2018.
[16] S. Zhang, Y. Hu, and G. Bian, "Research on string similarity algorithm based on Levenshtein distance," in 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC). IEEE, 2017, pp. 2247–2251.
[17] T. Ho, S.-R. Oh, and H. Kim, "A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations," PLoS ONE, vol. 12, no. 10, p. e0186251, 2017.
[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," CoRR, vol. abs/1810.04805, 2018.
[19] W. A. Woods, "Progress in natural language understanding: An application to LUNAR geology," in AFIPS National Computer Conference and Exposition, Conference Proceedings 42, 1973, pp. 441–450.
[20] M. Arefin, K. M. Hossen, and M. N. Uddin, "Natural language query to SQL conversion using machine learning approach," in 2021 3rd International Conference on Sustainable Technologies for Industry 4.0 (STI). IEEE, 2021, pp. 1–6.
[21] B. Mallikarjun, K. Annapoorneshwari, M. Yadav, L. R. Rakesh, and S. Suhaas, "Intelligent automated text processing system - an NLP based approach," in 2020 5th International Conference on Communication and Electronics Systems (ICCES). IEEE, 2020, pp. 1026–1030.
[22] B. Jose and S. Abraham, "Intelligent processing of unstructured textual data in document based NoSQL databases," Materials Today: Proceedings, 2021.
[23] F. Abdelhedi, A. A. Brahim, and G. Zurfluh, "OCL constraints checking on NoSQL systems through an MDA-based approach," International Journal of Data Warehousing and Mining (IJDWM), vol. 17, no. 1, pp. 1–14, 2021.
[24] M. T. Majeed, M. Ahmad, and M. Khalid, "Automated XQuery generation for NoSQL," in 2016 Sixth International Conference on Innovative Computing Technology (INTECH). IEEE, 2016, pp. 507–512.
[25] S. Mondal, P. Mukherjee, B. Chakraborty, and R. Bashar, "Natural language query to NoSQL generation using query-response model," in 2019 International Conference on Machine Learning and Data Engineering (iCMLDE). IEEE, 2019, pp. 85–90.
[26] T. Pradeep, P. C. Rafeeque, and R. Murali, "Natural language to NoSQL query conversion using deep learning," in Proceedings of the International Conference on Systems, Energy & Environment (ICSEE), GCE Kannur, Kerala, July 2019. Available at SSRN: https://ssrn.com/abstract=3436631.
[27] S. Blank, F. Wilhelm, H.-P. Zorn, and A. Rettinger, "Querying NoSQL with deep learning to answer natural language questions," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 9416–9421.
[28] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," Advances in Neural Information Processing Systems, vol. 27, 2014.
[29] M. D. Gadekar, B. M. Jadhav, A. S. Shaikh, and R. B. Kokare, "Natural language (English) to MongoDB interface," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 4, no. 3, 2015.
[30] P. Anand and Z. Farooqui, "Rule based domain specific semantic analysis for natural language interface for database," International Journal of Computer Applications, vol. 164, no. 11, 2017.
[31] T. Mahmud, K. A. Hasan, M. Ahmed, and T. H. C. Chak, "A rule based approach for NLP based query processing," in 2015 2nd International Conference on Electrical Information and Communication Technologies (EICT). IEEE, 2015, pp. 78–82.
[32] X. Xu, C. Liu, and D. Song, "SQLNet: Generating structured queries from natural language without reinforcement learning," arXiv preprint arXiv:1711.04436, 2017.
[33] J. Bornholt, E. Torlak, D. Grossman, and L. Ceze, "Optimizing synthesis with metasketches," in Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016, pp. 775–788.
[34] V. Zhong, C. Xiong, and R. Socher, "Seq2SQL: Generating structured queries from natural language using reinforcement learning," arXiv preprint arXiv:1709.00103, 2017.
[35] K. Guu, P. Pasupat, E. Z. Liu, and P. Liang, "From language to programs: Bridging reinforcement learning and maximum marginal likelihood," arXiv preprint arXiv:1704.07926, 2017.
[36] G. B. Boullanger and M. Dumonal, "Search like a human: Neural machine translation for database search," Technical report, 2019.
[37] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[38] M. K. Chakravarthy and S. Gowri, "Interfacing advanced NoSQL database with Python for Internet of Things and big data analytics," Materials Today: Proceedings, 2021.
[39] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
