An Automated System For Question Generation and Answer Evaluation
Abstract—An innovative educational tool called the "Automated Question Generation and Answer Evaluation System" was created to make the assessment process more efficient. By automating the creation of questions, it conforms to curriculum standards for a broad range of subjects and ability levels. Moreover, it makes use of sophisticated machine learning and Natural Language Processing (NLP) methods to thoroughly assess student responses and offer tailored feedback that surpasses traditional grading. This system promises to be a useful tool in contemporary education because it increases educational efficiency, encourages personalised learning, maintains fairness, and continuously adapts through machine learning. It signifies a noteworthy progression in the mechanisation of question formulation and response evaluation. Through the utilisation of natural language processing and machine learning, this system provides educators, content creators, and learners with an adaptable tool that enhances learning, content creation, and assessment procedures in a variety of fields.

Index Terms—Artificial Intelligence (AI), Natural Language Processing (NLP), Bloom's Taxonomy, Automated Text Scoring, Web-based Platform, Automatic Question Generation, Artificial Neural Network (ANN), Term Frequency-Inverse Document Frequency (TF-IDF), Optical Character Recognition (OCR), Convolutional Neural Networks (CNN), Handwritten Answer Evaluation System (HAES), Named Entity Recognizer (NER), Large Language Models (LLM).

I. INTRODUCTION

The need for quick and easy tools to create questions and measure responses has increased dramatically in the ever-changing fields of education, training, and evaluation. Automated Question Generation and Answer Evaluation Systems have been made possible by advancements in Artificial Intelligence (AI) and Natural Language Processing (NLP). These smart technologies have the ability to completely change how we develop, manage, and evaluate instructional materials. They combine the power of natural language processing to generate questions automatically from a variety of content sources with the capacity to precisely evaluate answers, providing real-time feedback. These systems exhibit versatility and adaptability, finding application in a wide range of fields such as language learning, corporate training, and education. This paper explores the inner workings of an Automated Question Generation and Answer Evaluation System, providing information about its capabilities, possible uses, difficulties, and supporting technologies. Teachers, trainers, and content producers can maximise learning and assessment experiences for students around the world by learning how this technology works.

II. LITERATURE SURVEY

In [1], the authors provide a safe, automated system that addresses the problems of question paper generation and subjective answer evaluation in the educational setting. To generate question papers, administrators use a Bloom's Taxonomy-based database, which allows for one-click question generation. Security is ensured through cryptography, and answer evaluation is automated using a question and keyword database with a keyword-matching algorithm to improve accuracy and save time. This system meets the demands of modern education while maintaining individualised student-teacher communication. It simplifies examination procedures, providing better accuracy and efficiency in question paper creation and answer evaluation, and ultimately improving the educational experience.

In [2], the authors tackle the challenge of automated text scoring (ATS) in academics, particularly for lengthy or handwritten answers.
They present a comprehensive approach, including keyword-based evaluation, similarity metrics (e.g., cosine, Jaccard, synonyms, bigrams), and a deep learning model trained on the ASAP-AES dataset for assessing longer responses. By offering these three methods, the paper aims to address varied evaluation criteria across domains. It emphasizes the time- and resource-intensive nature of manual answer evaluation and its potential for bias. The overarching goal is to enhance efficiency and impartiality in evaluation while accommodating domain-specific needs.

In [3], the authors emphasize the difficulty of creating questions automatically, a task that is ideally suited for Natural Language Processing (NLP) methods. Their goal is to produce questions from input documents or text in a variety of languages and circumstances. In question bank-based paper generation, dialogue systems, educational games, and natural language summarization, Automatic Question Generation (AQG) is an essential component. NLP is a branch of artificial intelligence that makes it possible for computers to understand and use human language. The paper also explores applications, taxonomies, question types, question generation, and related studies.

In [4], the authors address the automation of classifying exam questions using Bloom's Taxonomy, a crucial component of evaluating students' cognitive skills. Cross-domain question categorization accuracy reaches 85.2% when an Artificial Neural Network (ANN) technique is combined with Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction. The trend in education toward student-centered learning emphasizes how crucial it is to develop questions that are in line with learning objectives and accommodate a range of cognitive abilities. Although Bloom's Taxonomy provides a framework for classifying problems according to cognitive difficulty, manual classification is laborious and prone to errors. This study uses machine learning (ML) models, well known for their effectiveness in classification tasks, to automate the procedure.

In [5], an automated solution to the issue of grading handwritten answer sheets using optical character recognition (OCR) is presented. Convolutional Neural Networks (CNN) are used by the system to recognise handwritten letters and numbers from scanned images. With 250 student photos in the dataset, it demonstrated 92.86% testing accuracy. Handwritten recognition is a significant field of image processing that enhances the efficiency and accuracy of grading. The system uses a segmentation method to extract questions and answers from scanned images, with a focus on offline recognition. This split data is then used to train the CNN models, enabling automated grading of handwritten answer sheets.

In [6], the authors present the Handwritten Answer Evaluation System (HAES), an automated tool for marking student exam papers. Traditional evaluation methods are time-consuming, resource-intensive, and often rely on human assessors whose conclusions can vary depending on gaps in knowledge and scheduling conflicts. HAES extracts text from handwritten answer sheets using optical character recognition (OCR). It uses natural language processing (NLP) algorithms and machine learning for grading. Cosine similarity techniques are used to score the sentences on the assessed response sheets. Machine learning, which uses answer key files from human evaluators and training data from assessed answer texts, is primarily responsible for predicting student scores. During testing, unscored response texts are fed into the model, offering a more reliable and efficient way to assess students.

In [7], the authors discuss the difficulties in automating natural language understanding for Question Generation (QG) and Question Answering (QA). They present a Sentence-to-Question generation approach in which parts-of-speech tagging and named entity recognition are used to encode pertinent information after complicated sentences are broken down into simpler ones. To identify question types, sentences are classified according to their subject, verb, object, and preposition. The TREC-2007 dataset is used in the study's experiments. The paper also emphasizes the function of QG in intelligent tutoring systems and the significance of active question generation in learning. To help students ask pertinent questions and enhance quality assurance, the research focuses on producing factoid-type questions based on a specified target, such as "What," "Where," "When," "Who," and "How many/How much" questions.

In [8], the author focuses on addressing online evaluation problems, in particular with explanatory answers, which are more difficult to evaluate than multiple-choice questions. Evaluating explanations offered online is an arduous process, due to factors such as eye strain and uneven grading resulting from depression or lack of interest. To address this challenge, the paper proposes a machine learning approach. Initially, Natural Language Processing (NLP) is applied to extract the keywords of both student answers and answer keys, which includes tasks such as tokenization, stopword removal, and lemmatization. The system then uses cosine similarity to measure the similarity between a student's answer and an answer key, which can be used to determine the marks awarded. The proposed system is trained and tested on 100 different student answer scripts. This automated approach streamlines evaluation, offering benefits such as instant feedback, flexibility, and reduced storage needs compared to manual assessment, especially in the context of online education prompted by the Covid-19 pandemic.

In [9], the authors address the creation of an automated essay grading system that incorporates Bloom's Taxonomy's cognitive levels while creating essay questions. By assessing the efficacy of the learning process, the study seeks to improve students' scores. The article outlines the many methods used in this investigation to identify the best machine learning strategy for categorizing essay test problems. 362 questions from past teacher-graded software engineering essay examinations were compiled to create the training and test datasets, and exam
question banks pertaining to software engineering questions for the Bachelor of Science in Computer Science program were utilized. The results demonstrate that, in comparison to exam questions created by teachers, essay questions constructed in accordance with BT cognitive levels yield greater scores using EQG.

In [10], the authors introduce an Automatic Question Generation System (AQGS) that generates questions based on a given material by utilizing Natural Language Processing (NLP) libraries. Using a Named Entity Recognizer (NER) and syntax trees, the system generates grammatically sound questions and maps them to the appropriate level of Bloom's Taxonomy. Examiners can save a significant amount of time by using the AQGS system, which can produce a huge number of questions quickly. Exam question recurrence can be prevented with the use of the technology, enhancing the evaluation process's impartiality. Based on Bloom's Taxonomy, the method creates questions that effectively assess a student's knowledge and guarantee goal-based learning. Human bias is eliminated and each test paper is unpredictable due to the random question selection process employed in the vast question bank. With the ability to create new questions whose solutions are not readily available online, the technique can be helpful for online exams, particularly in pandemic situations, as it can discourage students from cheating. A discussion of this technology's possible uses in education rounds out the study.

In [11], the authors describe an Automatic Question Generator System that generates test questions from a large question bank database using a randomization technique to handle multi-constraint problems in autonomous institutes. The article emphasizes the difficulties instructors encounter in covering every facet of the learning objectives and preventing question duplication in later assessments. These issues are addressed by the suggested system, which creates question papers automatically from a store of semantically tagged questions. In addition, the article addresses the significance of student assessment and the ways in which technology may help educators design a wide range of test questions without having to worry about plagiarism.

III. PROPOSED SYSTEM

The automatic question generation and answer evaluation system is intended to make the process of formulating questions and grading responses in a certain area or topic more efficient. In order to generate questions and answers, it starts with a sizable text corpus pertinent to the field of interest. Language interpretation and processing are done using Natural Language Processing (NLP) frameworks like spaCy or NLTK, or sophisticated models like Transformers (e.g., GPT-4). The Question Generation Module and the Answer Evaluation Module are the two main modules that make up the system. Tokenizing the material and eliminating superfluous components are the first steps in the Question Generation Module's preprocessing of the text. After that, it dives into content understanding, using methods like Named Entity Recognition (NER) and Part-of-Speech (POS) tagging to pinpoint important details and subjects. It chooses suitable question templates based on this analysis and creates questions by adding pertinent keywords from the content. Questions like "What is photosynthesis?" and "How does photosynthesis work?" might arise, for example, if the material is about photosynthesis. A set of expected responses, which can be made manually or acquired from reliable sources, is needed by the Answer Evaluation Module for every question that is generated. Answers are extracted from the text corpus or predetermined sources, frequently with the use of named entity recognition and regular expressions. Accuracy is assessed by comparing the generated responses to the expected replies using similarity metrics like cosine similarity or BLEU score. Answers are given scores via a scoring system based on the comparison results; erroneous answers can be eliminated by setting a threshold for an acceptable score.

Fig. 1. Architectural diagram of Automatic Question Generation and Answer Evaluation System

A. Automatic question paper generation

One useful application for automatic question production is in education, content creation, and assessment. Using Natural Language Processing (NLP) techniques, this system first analyzes a relevant text corpus to determine its structure and content. The Question Generation Module, which includes text preprocessing to clean and tokenize the text, content understanding to pinpoint important parts, and question template selection, is the central component of the system. The content determines which question templates, like "What is," "Why is," or "How does," are used, and questions are created by adding relevant keywords from the text. As a result of this procedure, questions that are relevant to the content's subject matter are automatically created. Because of its adaptability, the system can create questions across a range of subjects, which makes it a valuable resource for educators, content producers, and examiners. When users enter text or documents, the system can be integrated into an intuitive user interface to provide pertinent questions based on the content. In addition, the system has the capacity to adjust and advance in response
to user input and, potentially, machine learning methodologies, thereby augmenting the caliber of the questions and producing questions that address specific learning goals.
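As a rough illustration of the template-filling approach described above, the sketch below uses spaCy's named entities and noun chunks to instantiate a few question templates from an input passage. The template table, function name, and fallback heuristic are illustrative assumptions, not details taken from this paper.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with POS, NER, parsing

# Hypothetical template table keyed by entity type (assumed, not from the paper).
TEMPLATES = {
    "PERSON": "Who is {}?",
    "DATE": "What happened in {}?",
    "GPE": "What is significant about {}?",
}

def generate_questions(text):
    """Generate simple template-based questions from entities and noun phrases."""
    doc = nlp(text)
    questions = []
    for ent in doc.ents:
        template = TEMPLATES.get(ent.label_)
        if template:
            questions.append(template.format(ent.text))
    # Fall back to noun chunks for definition-style "What is X?" questions.
    for chunk in doc.noun_chunks:
        if chunk.root.pos_ == "NOUN":
            questions.append(f"What is {chunk.text}?")
    return list(dict.fromkeys(questions))  # de-duplicate while keeping order

print(generate_questions("Photosynthesis converts sunlight into chemical energy."))
```

A production system would add template selection per Bloom's Taxonomy level and keyword filtering, but the core pattern of NER/POS analysis followed by template filling is the same.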
Fig. 2. Architectural diagram of Automatic Question Generation

B. Answer generation

The process of creating an accurate and contextually relevant computational model to generate answers to prompts or questions is known as answer generation. To accomplish its goals, this system makes use of deep learning models, a sizable text corpus, and Natural Language Processing (NLP) techniques. To ensure a clear understanding of the input text or questions, text preprocessing, which includes tokenization, cleaning, and text structure analysis, is one of the essential elements. The answer generation module, which uses sophisticated NLP and LLM models like GPT (Generative Pre-trained Transformer) or its equivalents, is the brains of the system. By employing probabilistic language modeling and conditioning on the input text, these models are able to produce answers that are both coherent and appropriate for the given context. To improve the model's performance in a given domain, it can be fine-tuned on particular datasets.
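To make the conditioning step concrete, here is a minimal sketch assuming the Hugging Face transformers library. The text names GPT-style models; a freely downloadable FLAN-T5 checkpoint is used here purely as a stand-in, and the prompt format is an illustrative assumption.

```python
from transformers import pipeline

# Any instruction-tuned text2text model can stand in for the GPT-style models
# mentioned above; FLAN-T5 is used here only as a freely available example.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

def generate_answer(question, context):
    """Condition the model on the source text, then ask it the question."""
    prompt = (
        "Answer the question using only the context.\n"
        f"context: {context}\nquestion: {question}"
    )
    result = generator(prompt, max_new_tokens=64)
    return result[0]["generated_text"]

context = ("Photosynthesis converts sunlight, water, and carbon dioxide "
           "into glucose and oxygen.")
print(generate_answer("What does photosynthesis produce?", context))
```

Swapping in a larger checkpoint, or fine-tuning on domain data as the text suggests, requires no change to this calling pattern.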
C. Answer evaluation and score generation

The purpose of answer evaluation is to evaluate the accuracy and caliber of responses given by people in a specific context, like automated customer service or educational tests. There are multiple essential parts to this system. First and foremost, it requires a thorough comprehension of the expected or proper responses. These responses, which serve as the foundation for comparison, can be manually entered or obtained from reliable sources. Natural Language Processing (NLP) techniques are used by the answer evaluation system to process and comprehend both the provided and expected answers. To find important details, entities, or facts in the responses, it may be necessary to perform operations like tokenization, linguistic analysis, and text preprocessing. The goal of score generation is to evaluate an entity's performance, be it a process, a product, or a student, and then assign scores based on predetermined criteria. Usually, the system consists of multiple essential parts. First and foremost, it needs a precise set of standards and measurements by which the scoring will be conducted. Depending on the context, these standards may change. For instance, in the evaluation of product quality, standards may center on user-friendliness, durability, or creativity; in education, they might include accuracy, creativity, or completeness. Second, this system relies heavily on data collection. Data can be collected in a variety of ways, depending on the situation, including tests, surveys, user feedback, and observations. Assignments, tests, or teacher evaluations are typical sources of student performance data in education, whereas product evaluation may instead draw on user feedback or surveys.
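A minimal sketch of the similarity-based scoring idea, assuming scikit-learn; the threshold and mark scale are illustrative values, not taken from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_answer(student_answer, reference_answer, max_marks=5.0, threshold=0.3):
    """Score an answer by TF-IDF cosine similarity against the reference.

    `threshold` plays the role of the acceptability cutoff described above;
    both it and `max_marks` are assumed example values.
    """
    vectors = TfidfVectorizer().fit_transform([student_answer, reference_answer])
    similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
    if similarity < threshold:  # treat very dissimilar answers as wrong
        return 0.0
    return round(similarity * max_marks, 1)

print(score_answer("Plants make food from sunlight.",
                   "Photosynthesis lets plants make food using sunlight."))
```

Scaling similarity linearly into marks is only one possible mapping; a banded rubric or a learned regression over similarity features would fit the same interface.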
IV. METHODOLOGY

A. For question generation

Step I: Gather the text or documents provided by users as the source material for question generation.

Step II: Simplify the text data by eliminating any unnecessary characters, special symbols, and formatting. Lowercasing, eliminating punctuation, and managing line breaks or special characters may all be part of this.

Step III: Tokenize the text into words or subunits and divide it into sentences or phrases (if it isn't already in that format). To generate questions, the text must be divided into comprehensible sections using tokenization.

Step IV: Use NER and part-of-speech tagging to find relationships, entities (such as names, dates, and locations), and parts of speech in the text. You can use this knowledge to help you formulate questions (see the preprocessing sketch after this subsection's result).

Step V: To break up the text into smaller informational units, divide it into sentences. You can use each statement as a starting point to ask questions.

Step VI: Determine the text's main subjects or themes. This stage might assist in producing questions that highlight particular elements of the text.

Step VII: Locate crucial terms or keywords in the text. You can use these keywords to create questions about the text's major ideas.

Step VIII: Create questions by utilizing NLP techniques.

Step IX: Evaluate how well the questions were generated. Consider things like difficulty level, clarity, and relevancy. To assess the quality, you can utilize human assessors or NLP models.

RESULT: This module will provide a set of questions that are generated from the documents provided by users. The questions are generated using NLP techniques by taking crucial terms and keywords from the document. The user can get both subjective and objective-type questions, and it also considers things like difficulty level, clarity, and relevancy.
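The following is a condensed sketch of Steps II-VII, assuming spaCy; the regular-expression cleanup and the keyword heuristic are illustrative choices, not prescribed by the steps above.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenization, sentences, POS, NER

def preprocess(raw_text):
    """Steps II-VII: clean, segment, tag, and extract keywords from text."""
    # Step II: strip special symbols and collapse whitespace. Lowercasing
    # follows the step as written, though it can hurt NER accuracy in practice.
    text = re.sub(r"[^A-Za-z0-9.,?'\s]", " ", raw_text).lower()
    text = re.sub(r"\s+", " ", text).strip()
    doc = nlp(text)
    # Steps III and V: tokenization and sentence segmentation.
    sentences = [sent.text for sent in doc.sents]
    # Step IV: entities and part-of-speech tags to anchor questions on.
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    # Steps VI-VII: crude keyword/topic selection -- non-stopword nouns.
    keywords = {t.lemma_ for t in doc
                if t.pos_ in ("NOUN", "PROPN") and not t.is_stop}
    return sentences, entities, keywords

sents, ents, keys = preprocess("Photosynthesis occurs in chloroplasts. "
                               "It produces oxygen.")
print(keys)  # e.g. {'photosynthesis', 'chloroplast', 'oxygen'}
```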
B. For answer generation

Step I: Make sure the questions and topics in your test dataset are varied enough to cover a broad range of possible answer types and complexities.

Step II: Create a set of standard responses for every query in the test dataset. These reference responses, which are best supplied by real experts or taken from reliable sources, ought to be regarded as the gold standard.

Step III: Choose the right measures to assess the response. Exact Match (EM) determines the percentage of system-generated responses that precisely match the reference answers; the F1-score combines precision and recall on answer matching; BLEU, ROUGE, and METEOR, adjusted for response evaluation, take n-gram overlap between the system and reference replies into account. (A sketch of EM and F1 follows this subsection's result.)

Step IV: Establish a scoring system to rate the accuracy of responses provided by the system. Scores can be assigned by this technique according to predetermined criteria or metric values. For example, you may establish a cutoff point for an acceptable F1-score or mandate a minimum word overlap for responses to be accepted.

Step V: Take into account giving partial credit to answers that are only half accurate or that include some accurate information. This is especially important for lengthy, intricate responses.

Step VI: Handle situations where there are several valid answers or when the response depends on the context. Establish policies or procedures for dealing with this kind of uncertainty when evaluating.

Step VII: Clearly define annotation guidelines for potential human annotators who will participate in the assessment procedure. Make sure they are aware of the standards used to evaluate responses provided by the system.

Step VIII: To gauge the consistency and dependability of the assessments made by human annotators, compute inter-rater agreement. Cohen's Kappa and Fleiss' Kappa are common measures for this purpose.

Step IX: Examine how confidence scores can be used to show how confident the system is in its responses. This can enhance the comprehension of system behavior and help prioritize responses for assessment.

Step X: To determine the kinds of mistakes the system makes, do an error analysis. This can assist in identifying areas in need of development and offer solutions to persistent problems.

Step XI: Take into consideration performing a subjective assessment in which human evaluators rank the general caliber, applicability, and informativeness of responses. This offers information beyond mechanized measurements.

Step XII: To put your findings in a more comprehensive context, assess how well your system performs in comparison to the most advanced and current systems, if any are available.

Step XIII: To provide a comprehensive assessment of your system's answer generation performance, summarize and aggregate the results, taking into account any subjective evaluations, partial credit, and quantitative metrics.

Step XIV: Present the answer evaluation results to your audience in an understandable format by using tables, clear reporting, and visuals.

RESULT: The module generates answers for the given document, consisting of several questions provided by users. The user will provide two files as input, i.e., a document containing questions and a textbook. The answers to the questions are generated by incorporating both NLP and LLM models, like GPT, and the provided file, like a textbook. The user will get a file that consists of questions and their corresponding answers as the output.
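A small sketch of the Exact Match and token-level F1 metrics from Step III, following the common SQuAD-style normalization; the normalization details are an assumption, not specified in the text.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("in 1945", "In 1945."))           # 1.0
print(round(f1_score("the year 1945", "1945"), 2))  # 0.5
```

BLEU, ROUGE, and METEOR can be layered on top of the same normalized token streams using standard libraries such as nltk or rouge-score.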
C. For answer evaluation and score generation

Step I: Compile a dataset containing reference answers, system-generated answers, and questions. Make sure there are a variety of question kinds and subjects for a varied assessment.

Step II: Based on your objectives and the type of answers, select the evaluation metrics that are most appropriate for the job, such as METEOR, F1-score, BLEU, ROUGE, or Exact Match (EM).

Step III: Using the selected metrics, compare each question's system-generated response to the reference responses.

Step IV: Using the evaluation metrics as a guide, determine the scores for each answer-question pair. You might also take confidence scores or partial credit into account.

Step V: To get an overall system performance score, add up the scores of all of the question-answer pairs.

Step VI: Examine mistakes and recurring themes in the system's responses to pinpoint areas that require enhancement.

Step VII: Carry out a subjective assessment in which human annotators assign a score to each response's general quality, relevancy, and informativeness.

Step VIII: Using tables, graphs, or visualizations, clearly and understandably present the results and scores.

RESULT: This module will evaluate the given answer paper, generate scores for each answer, and give the grand total of the corresponding answer sheet. The user will provide an answer key and a handwritten answer sheet as input. The handwritten answer sheet is converted into a text document using OCR, and the text document is then compared with the provided answer key to generate a score accordingly.
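An end-to-end sketch of this module under stated assumptions: pytesseract for OCR (it requires the external Tesseract engine to be installed), one scanned image per answer, and a similarity-based scorer such as the TF-IDF sketch in Section III-C. All names here are illustrative.

```python
from PIL import Image
import pytesseract  # wraps the external Tesseract OCR engine

def evaluate_answer_sheet(images_by_question, answer_key, score_answer):
    """OCR each scanned answer, score it against the key, and total the marks.

    images_by_question: {question_id: path to a scanned image of the answer}
    answer_key:         {question_id: reference answer text}
    score_answer:       similarity-based scorer, e.g. the TF-IDF sketch above
    """
    grand_total = 0.0
    for qid, image_path in images_by_question.items():
        extracted_text = pytesseract.image_to_string(Image.open(image_path))
        marks = score_answer(extracted_text, answer_key[qid])
        print(f"Q{qid}: {marks} marks")
        grand_total += marks
    print(f"Grand total: {grand_total}")
    return grand_total
```

In practice a segmentation step would first split a full sheet into per-question regions, as in the CNN-based system of [5], before the per-answer OCR and scoring shown here.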
V. CONCLUSION
This study provides an in-depth investigation into the cre-
ation of an automated system for generating questions and
evaluating responses, enhanced by a strong grading system. By
combining machine learning and natural language processing
methods, the suggested system shows that it can produce a
wide range of contextually appropriate and varied questions
in different fields, which reduces the workload associated
with creating questions by hand. Additionally, the system
guarantees impartiality and consistency in evaluating user
responses by utilizing sophisticated algorithms for answer
evaluation. The integration of an advanced grading system
augments the usefulness of the system by furnishing users
with constructive feedback, thus promoting their education
and competency growth. Overall, the research signifies a
significant advancement in educational technology, offering
educators and learners alike a powerful tool for enhancing
engagement, efficiency, and effectiveness in the learning pro-
cess.
REFERENCES
[1] Ragasudha, I., and Saravanan, M. (2021, March). Secure Automatic Question Paper with Reconfigurable Constraints. In 2021 Sixth International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET) (pp. 16-20). IEEE.
[2] Kumar, A., Kharadi, A., Singh, D., and Kumari, M. Subjective Answer Evaluation System.
[3] Patil, P. M., Bhavsar, R. P., and Pawar, B. V. (2022, November). A Review on Natural Language Processing based Automatic Question Generation. In 2022 International Conference on Augmented Intelligence and Sustainable Systems (ICAISS) (pp. 1-6). IEEE.
[4] Ifham, M., Banujan, K., Kumara, B. S., and Wijeratne, P. M. A. K. (2022, March). Automatic Classification of Questions based on Bloom's Taxonomy using Artificial Neural Network. In 2022 International Conference on Decision Aid Sciences and Applications (DASA) (pp. 311-315). IEEE.
[5] Shaikh, E., Mohiuddin, I., Manzoor, A., Latif, G., and Mohammad, N. (2019, October). Automated grading for handwritten answer sheets using convolutional neural networks. In 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS) (pp. 1-6). IEEE.
[6] Sanuvala, G., and Fatima, S. S. (2021, February). A study of automated evaluation of student's examination paper using machine learning techniques. In 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (pp. 1049-1054). IEEE.
[7] Ali, H., Chali, Y., and Hasan, S. A. (2010, July). Automatic question generation from sentences. In Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles: Articles courts (pp. 213-218).
[8] Rani, A. M. (2021). Automated explanatory answer evaluation using machine learning approach. Design Engineering, 1181-1190.
[9] Contreras, J. O., Hilles, S., and Bakar, Z. A. (2021, June). Essay Question Generator based on Bloom's Taxonomy for Assessing Automated Essay Scoring System. In 2021 2nd International Conference on Smart Computing and Electronic Enterprise (ICSCEE) (pp. 55-62). IEEE.
[10] Joshi, S., Shah, P., and Shah, S. Automatic Question Paper Generation, according to Bloom's Taxonomy, by generating questions from text using Natural Language Processing. International Research Journal of Engineering and Technology.
[11] Phulmogare, M. P., Ankar, M. R., Daware, M. S., and Shedge, M. K. N. (2018). Automatic Generation of Question Paper Using Bloom's Taxonomy.