Question Answering System On Education Acts Using NLP Techniques

ABSTRACT— Question Answering (QA) in information retrieval is the task of automatically providing a correct answer to a question asked by a human in natural language, using either a pre-structured database or a collection of natural language documents. It presents only the requested information instead of returning full documents the way a search engine does. As the amount of information in day-to-day life keeps increasing, retrieving the exact fragment of information even for a simple query requires large and expensive resources. This paper describes the different methodologies and implementation details of question answering systems for general language, and also proposes a closed-domain QA system for handling documents related to education act sections in order to retrieve more precise answers using NLP techniques.

Index Terms— Question Answering, NLP, Information Retrieval, Education Acts
I. INTRODUCTION

Nowadays there are many search engines available. These search engines have had great success and have remarkable capabilities, but the problem with them is that instead of giving a direct, accurate and precise answer to the user's query or question, they usually provide a list of documents or websites which might contain the answer to that question. Although the list of documents retrieved by a search engine holds a lot of information about the search topic, it may not contain the relevant information the user is looking for [11].

The basic idea behind a QA system is that the user just has to ask a question and the system will retrieve the most appropriate and correct answer for that question and give it to the user. For example, for questions such as:
“Which vitamin is present in milk?”
“What is the birth place of Shree Krishna?”
users are more interested in answers such as “Vitamins A, E, D and K” and “Mathura” rather than a large document in which they have to find the answer they are actually looking for. To solve this problem, most researchers working in areas such as web mining, NLP, information retrieval and information extraction have so far focused on open-domain and closed-domain QA systems. Systems can be divided into two types on the basis of the domain.

• Closed-domain question answering:
Closed-domain question answering refers to questions from a specific domain and can be seen as an easier task because the NLP system can exploit domain-specific knowledge. It has very high accuracy but is limited to a single domain. Examples of such systems are those for medicine or automotive maintenance.

• Open-domain question answering:
Open-domain question answering deals with questions related to every domain. Such systems usually have more data available from which to extract the answer. They can answer a question about any domain, but with much lower accuracy because the domain is not specific.

General Architecture of QA system:

In a QA system, the user poses a query as input in natural language. This query is then used to search the documents and extract all possible answers for the user's query. The basic architecture of a Question-Answering system is shown in Figure 1.

Fig.1 Architecture of Question-Answering System
Query Generation: In query generation, Query Logic Language (QLL) is used for expressing the input question.

Database Search: The possible results are searched for in the stored database; the related data that satisfy the given query with the selected keywords are sent to the next process.

Related Document: The result generated by the previous stage is stored as a document.

Answer Display: The stored result is converted into the exact text the user is looking for, and that answer is displayed to the user.

II. RELATED WORK

In paper [1], the author developed a geographical-domain question answering system which answers user questions about various cities. To design the system, the author first created a knowledge base document and performed document pre-processing, which involved noise removal, tokenization, sentence splitting and document tagging using a named entity tagger, a parser and the WordNet tool. Question processing, document processing and answer processing are the main elements of this system. Question processing deals with sub-classification and reformulation of the question; question classification is done by a plain pattern-matching algorithm. Passage retrieval is then performed over the pre-processed and indexed corpus. The retrieval module produces candidate answers which become the input for answer extraction, where ranking is performed based on semantic relations with the help of the WordNet tool. After ranking, the final answer with the maximum rank is displayed.

In paper [2], the author proposed IPedagogy, a question answering system which works with natural language queries and retrieves answers from selected information clusters by reducing the search space of information retrieval. In addition, IPedagogy is empowered by several natural language processing techniques which direct the system to extract the exact answer for a given query. The system was evaluated using mean reciprocal rank, and it was noted that it has an average accuracy of 0.73 for 10 sets of questions, where each set consists of 35 questions.

The author in paper [3] developed a new architecture for a question answering system in Malayalam, which finds answers to Malayalam questions by analyzing documents in the Malayalam language. It handles four classes of Malayalam questions for a closed domain. The system is divided into three modules: Question Analysis, Text Retrieval and answer snippet extraction, and Answer identification.

b. Text Retrieval and answer snippet extraction:
After identifying the query words, the answer candidates are retrieved from the collected documents for answer identification. The collected documents are indexed, and the ones with the highest keyword overlap with the question keywords are selected for answer extraction. For this, the count of matches between the query keywords and each sentence is checked. The sentences which have some match with the query keywords are selected as answer candidates and are represented as a triplet containing the sentence, its index and the count of the match. The index is used to extract the actual sentence; the index term is assigned at the time of text splitting, and the count of the match gives the number of terms that match the question. These candidate answers are passed to the next module, selection of answer candidates.

c. Answer identification:
Answer identification has two sub-modules: Scoring and Ranking, and Answer Extraction. In Scoring and Ranking of the candidate answers, selection of the winner candidate is done using matching window sizes. The candidate answer which has the highest score is selected as the winner candidate, and the result of the winner candidate is processed for answer extraction. Answer Extraction is done using Named Entity Recognition: the expected named entity of the question is found by analyzing the question word, and then the words nearest to the question word are analyzed to identify the expected answer entity.

The Restricted Domain Question Answering System [4] describes a new architecture which is divided into four modules: Question Processing, Document Processing, Paragraph Extraction and Answer Extraction.

a. Question Processing Module:
In the Question Processing module, the user question is processed to find some important information from it. The steps of the Question Processing module are given below.
Steps in Question Processing Module:
1) Find out the type of question using the Wh word.
2) Find out the expected type of answer.
3) Get the keywords from the question.

b. Document Processing Module:
In the Document Processing module, the documents which are relevant to the given question are retrieved and processed.
Steps in Document Processing Module:
1) After processing the question, search for documents related to that question using a reliable search engine.
2) Take the top ten relevant documents.
3) Extract the relevant content from these documents.
4) Save these contents in a text file.
d. Answer Extraction Module:
This module presents algorithms for extracting the potential answer for all three categories of questions, that is, definition-type, descriptive-type and factoid-type questions. Using the Stanford parser toolkit, the author finds the grammatical structure of the question and, on the basis of this structure, finds the potential answer in the dataset, i.e. those sentences which have the same head words as the question.

In paper [5], the author proposed a Chinese question answering system which uses real-time network information retrieved by a search engine. This system consists of three modules: question analysis, information retrieval and answer extraction. The author also used several NLP technologies such as question classification, syntactic dependency parsing and named entity recognition. In the experiments, many correct answers were not the best answers with the highest scores; the main reason is the lack of enough evidence for the correct answer score, so the answer scoring module is the main future work.

III. PROPOSED APPROACH

From the literature survey of question answering systems, we can conclude that a closed-domain QA system gives more accurate answers than an open-domain QA system. Looking at the current scenario, there is no QA system that exactly answers queries on education act documents and ensures correct answers. So the idea of developing a Question Answering System on education acts is proposed, as shown in Figure 4.

The input to the system will be a query related to education acts or other information related to education, for example, “What is the duty of a parent to secure the children's education?” or “What are the funding authorities of a school?”. The question keywords are obtained by removing stop words and performing stemming on the question in order to extract the answer. A dataset of education-act-related documents is used to build an index term dictionary, a metadata knowledge base storing the related keywords of each document. Using these keywords, the original passages or sentences are tagged to give candidate answers from the answer extractor. According to the given question, the constraints and candidate answers are matched against each other, and the probable answer with the highest score is retrieved as the final answer. The system should produce accurate answers for trained questions and will then be tested to measure its accuracy on untrained questions.
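To make the proposed pipeline concrete, the outline below sketches its stages as a minimal Java skeleton. It is illustrative only: the interface and method names (QuestionPreprocessor, IndexTermDictionary, and so on) are our own assumptions, since the paper only states that the implementation uses Java with a MySQL-backed index term dictionary.

import java.util.List;
import java.util.Set;

/*
 * Minimal sketch of the proposed QA pipeline for education acts.
 * All type and method names are illustrative assumptions, not the
 * paper's actual code.
 */
interface QuestionPreprocessor {
    // POS tagging, stop word removal and stemming of the user query
    Set<String> extractKeywords(String question);
}

interface IndexTermDictionary {
    // term -> posting list of section text files (one file per education act section)
    List<String> filesContaining(String term);
}

interface DocumentRanker {
    // Jaccard score between the query keywords and each file's keywords
    List<String> rankByJaccard(Set<String> queryKeywords, List<String> candidateFiles);
}

interface AnswerExtractor {
    // POS-based sense matching between the query keywords and the filtered document
    String extractAnswer(Set<String> queryKeywords, String topRankedFile);
}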
IV. DETAILS OF IMPLEMENTATION METHODS

1. Collection and Study of the relevant data set:
The first design module of the proposed work is the collection and study of relevant data. For this work the relevant data set consists of records related to the education acts. The required data set is collected and studied from different websites related to education acts. Based on this study, the next module of the system is to create the corpus.
A. Creation of Dataset (Corpus):
The second phase of the Question Answering System is to create the corpus. As we have to design a closed-domain QA system, the most important task is to decide the domain. Many QA systems already exist for different closed domains, so we are dealing with a new domain: answering user queries on education acts and information about education. Different users need to consult the documents of the individual education act sections in different ways. Information about education-related documents is available on various websites for interested users, but exact answers to different questions about these documents can only be given by a QA system. We have therefore gone through different websites and taken the text data from the website www.legislation.gov.uk. For each section of the education acts there is one text file; thus, for 583 sections, 583 text files are stored as the corpus. The education-act text files contain information about each section, such as information about schools, funding authorities of schools, areas of schools, duties of teachers towards students, duties of parents to secure their children's education, and much more. In total, 583 sections related to the education acts are available.
B. Preprocessing:
After creating the corpus, some preprocessing operations are performed on each text file of the corpus. The major tasks in pre-processing are stop word removal and stemming.

a. Stop word Removal:
Removing stop words reduces the dimensionality of the term space. The most frequent words in text documents are prepositions, articles, pronouns, etc., which do not convey the meaning of the documents. Stop words are eliminated because such words are not considered keywords in information retrieval applications. For example, English stop words such as “is, for, the, in” are removed from each text file of the dataset by maintaining an English stop word dictionary.

b. Stemming:
Stemming is an important process by which the different forms of a word are replaced by the basic root word, e.g. automate(s), automatic and automation are all reduced to the root automat. For stemming, an English stem word dictionary (for example, a file containing the set of stem words) is maintained for extracting keywords.
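A minimal sketch of these two preprocessing steps is given below. It assumes a plain-text stop word list and a stem word dictionary file in which each line maps a surface form to its root (e.g. "parents parent"); the file names and the exact dictionary format are assumptions for illustration, since the paper only states that such dictionaries are maintained.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of corpus preprocessing: stop word removal followed by
// dictionary-based stemming. File names and formats are illustrative.
public class Preprocessor {
    private final Set<String> stopWords = new HashSet<>();
    private final Map<String, String> stemDictionary = new HashMap<>();

    public Preprocessor(Path stopWordFile, Path stemDictFile) throws IOException {
        // one stop word per line, e.g. "is", "for", "the", "in"
        for (String line : Files.readAllLines(stopWordFile)) {
            stopWords.add(line.trim().toLowerCase());
        }
        // each line: "<surface form> <root word>", e.g. "parents parent"
        for (String line : Files.readAllLines(stemDictFile)) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                stemDictionary.put(parts[0].toLowerCase(), parts[1].toLowerCase());
            }
        }
    }

    // Tokenize a section text, drop stop words and symbols, and stem each token.
    public List<String> extractKeywords(String text) {
        List<String> keywords = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;                                          // stop word removal
            }
            keywords.add(stemDictionary.getOrDefault(token, token)); // stemming
        }
        return keywords;
    }
}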
2. Index Term Dictionary

After pre-processing, the extracted keywords are stored in an index term dictionary. The extracted keywords contain only stem words, obtained after performing stemming. The index term dictionary is created using Java and stored as a table in MySQL. The dictionary has a structure like an inverted index containing two columns, term and posting: the term is an extracted keyword and the posting is the name of a file. For each term t, we store a list of all documents that contain t. The following figure shows the structure of the index term dictionary.

Figure 5.1: Structure of index term dictionary
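The sketch below shows one way such an inverted index could be built in memory before being written to the MySQL table; the class name and the in-memory map are illustrative assumptions, as the paper only specifies the two-column term/posting layout.

import java.util.*;

// Sketch of the index term dictionary as an inverted index:
// each stemmed keyword (term) maps to the set of section files (postings)
// that contain it. In the actual system this structure is stored in MySQL.
public class IndexTermDictionary {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Add all keywords extracted from one section file, e.g. "section11.txt".
    public void addDocument(String fileName, List<String> keywords) {
        for (String term : keywords) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(fileName);
        }
    }

    // For a term t, return the set of all documents that contain t.
    public Set<String> filesContaining(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }
}

Each row of the MySQL table would then correspond to one (term, file) pair from this map.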
3. Question Preprocessing:

The given input query is preprocessed by performing some preprocessing operations on it, i.e. POS tagging, stop word removal and stemming.

a. User Query:
The user enters a query related to the education system. For example, the user can ask “What is the duty of a parent to secure children's education?” or “What is a primary school?” or any other query related to the education system.

b. POS tagging:
First we perform POS (part-of-speech) tagging on the input query to tag each word of the user query with its type, such as verb, noun, etc. The Stanford POS tagger is used for tagging each word.

c. Extracted Keywords:
The keywords are extracted from the user query. They are obtained by removing the symbols and stop words from the query; stemming is also applied so that the keywords match the index term dictionary terms during document retrieval. An English stop word dictionary and a stem word dictionary are maintained to extract the keywords.

For example:
• In case of POS tagging:
Input: What is the duty of parents to secure education of children?
Output: What/WP is/VBZ the/DT duty/NN of/IN parents/NNS to/TO secure/VB education/NN of/IN children/NNS

• After removing stop words from the question:
Input: What is the duty of parents to secure education of children?
Output: duty parents secure education children

• After stemming:
Output: keywords [duty parent secure education children]
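A short sketch of how the query could be tagged with the Stanford POS tagger is shown below. The model file path is an assumption (any English tagger model can be used), and by default the tagger's tagString output joins each word and tag with an underscore rather than the slash used in the example above.

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

// Sketch of POS tagging a user query with the Stanford POS tagger.
// The model path below is an assumption, not the paper's configuration.
public class QueryTagger {
    public static void main(String[] args) {
        MaxentTagger tagger =
            new MaxentTagger("models/english-left3words-distsim.tagger");
        String query = "What is the duty of parents to secure education of children?";
        // Prints the query with a POS tag attached to every token,
        // e.g. "What_WP is_VBZ the_DT duty_NN ..."
        System.out.println(tagger.tagString(query));
    }
}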
4. Document Retrieval:

The extracted query keywords are looked up in the index term dictionary, and the files listed in their postings are retrieved as candidate documents.

5. Keywords Ranking:

The candidate files are ranked by the Jaccard similarity between the query keywords and each file's keywords:

Score = |A ∩ B| / |A ∪ B|

where A = the set of extracted query keywords and B = the set of a file's keywords. The ranking typically leaves one or two files that contain the accepted answer.

E.g. Input Query: What is the duty of a parent to secure children's education?
Output: document extracted after Jaccard ranking.
Fig. File name: section11.txt
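A minimal sketch of this ranking step is given below; obtaining each file's keyword set is assumed to reuse the preprocessing sketched earlier, which is an illustrative assumption rather than the paper's exact procedure.

import java.util.*;

// Sketch of ranking candidate files by Jaccard similarity between
// the query keywords (A) and each file's keywords (B):
//   score = |A ∩ B| / |A ∪ B|
public class JaccardRanker {

    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);                 // A ∩ B
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                           // A ∪ B
        return (double) intersection.size() / union.size();
    }

    // Returns the candidate file names sorted by decreasing Jaccard score.
    public static List<String> rank(Set<String> queryKeywords,
                                    Map<String, Set<String>> fileKeywords) {
        List<String> files = new ArrayList<>(fileKeywords.keySet());
        files.sort(Comparator.comparingDouble(
                (String f) -> jaccard(queryKeywords, fileKeywords.get(f))).reversed());
        return files;
    }
}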
6. Answer Extraction:

For answer extraction, POS tagging is applied to all the filtered documents obtained from keywords ranking. After applying POS tagging, we check the sense (part-of-speech usage) of the extracted keywords in the query against their usage in the filtered document.
E.g. suppose the extracted keywords from the query are [duty, parent, education, children]. We check where
[duty, parent] are used as nouns in the document,
[education] is used as a noun in the document,
[children] is used as a plural noun (NNS) in the document.
After checking the sense correspondence between the query and the document, we extract the paragraph which carries the same sense as the query.
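The sketch below illustrates this sense check under a simplifying assumption: a keyword's "sense" is taken to be its POS tag, and a paragraph is accepted when every query keyword appears in it with the same tag. The tags are assumed to come from the Stanford tagger sketched earlier; the exact matching rule the authors use is not spelled out in the paper.

import java.util.*;

// Sketch of POS-based sense matching for answer extraction.
// A paragraph is selected when each query keyword occurs in it
// with the same POS tag as in the question (a simplifying assumption).
public class SenseMatcher {

    // queryTags: keyword -> POS tag in the question, e.g. "duty" -> "NN"
    // paragraphTags: for each candidate paragraph, keyword -> POS tag in that paragraph
    public static String extractAnswer(Map<String, String> queryTags,
                                       Map<String, Map<String, String>> paragraphTags) {
        for (Map.Entry<String, Map<String, String>> entry : paragraphTags.entrySet()) {
            String paragraph = entry.getKey();
            Map<String, String> tags = entry.getValue();
            boolean sameSense = true;
            for (Map.Entry<String, String> kw : queryTags.entrySet()) {
                // the keyword must occur in the paragraph with the same tag as in the query
                if (!kw.getValue().equals(tags.get(kw.getKey()))) {
                    sameSense = false;
                    break;
                }
            }
            if (sameSense) {
                return paragraph;   // paragraph carrying the same sense as the query
            }
        }
        return null;                // no paragraph matched
    }
}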
7. Answer:

The paragraph selected by answer extraction is returned as the final answer and displayed to the user.