


NLP-based Automatic Answer Script Evaluation


Md. Motiur Rahman¹ and Fazlul Hasan Siddiqui²*

¹ Dept. of Physical and Mathematical Sciences, Chittagong Veterinary and Animal Sciences University, Chittagong, Bangladesh
² Dept. of Computer Science and Engineering, Dhaka University of Engineering & Technology, Gazipur, Bangladesh
* Corresponding author's email: [email protected]

ABSTRACT

Answer script evaluation is an important part of assessing students’ performance. Typically, answer script evaluation is done manually, which can sometimes be biased: the outcome depends on various factors like the evaluator’s mood swings and the relationship between the student and the evaluator. Additionally, evaluation is a very tedious and time-consuming task. In this paper, a natural language processing-based method for automatic answer script evaluation is presented. Our experiment consists of text extraction from the answer script, measuring various similarities between the summarized extracted text and the stored correct answers, and then assigning a weight value to each calculated parameter to score the answer script. For summary generation from the extracted text, we have used a keyword-based summarization technique. Four similarity measures (Cosine, Jaccard, Bigram, and Synonym) are used as parameters for generating the final mark. Automatic evaluation of answer scripts has been found very useful in our experiments, and the assigned marks are often the same as the manually scored marks.
Keywords: Automatic Evaluation, NLP, Text Summarization, Similarity Measure, Marks Scoring

1. INTRODUCTION

There are various assessment strategies that are used to evaluate a student’s performance. The most widely used technique is descriptive question answering, in which a student expresses his/her opinion in response to the question in a long textual form. An automatic descriptive answer evaluation system would be very helpful for universities and educational institutions in assessing a student’s performance effectively [1]. A student may answer a question following different grammatical styles, and may choose different words that are similar to the actual answer. The motivation behind automated answer script evaluation comes from less time consumption, less manpower involvement, elimination of the human evaluator’s psychological changes, and easy record keeping and retrieval [2]. It also assures that mood swings or a change in perspective of the human assessor will not affect the evaluation process.

Automatic answer script evaluation based on Natural Language Processing (NLP) will help us to overcome the difficulties faced in manual evaluation. Here a student’s written answer is provided as input and the system automatically scores marks after the evaluation. The system considers all possible factors, like spelling errors, grammatical errors, and various similarity measures, for scoring marks. The natural language processing technique is used to make the handling of the English language much easier.

Natural language processing is an area of artificial intelligence which deals with the interaction between human languages and computers [3]. The most challenging tasks in natural language processing involve speech recognition, natural language understanding, and natural language generation. NLP is widely used in machine translation, question answering, automatic text summarization, answer script evaluation, etc. [3-4]. Text summarization helps to find the precise data in a longer text document, and speeds up the evaluation process.

Text summarization is the process of creating a short, accurate summary of a longer text. It is very time consuming to generate a summary of a long article manually. Hence an NLP-based automatic text summarization technique is used to facilitate and speed up the text processing. Two types of text summarization techniques are used for generating summaries. The extractive text summarization technique selects phrases and sentences from the source document and generates a new summary from them [5]. The abstractive text summarization technique is the opposite of the extractive technique: it generates entirely new phrases and sentences to hold the meaning of the source document [6]. The NLP-based strategies are very well suited for generating summaries compared with the manual process. The summarized text is fed as input to compute various similarity measures.


A similarity measure is a technique to find how much two sentences are similar in terms of semantics, syntax, and structure. Similarity measures enable us to decide the marks to score for an answer script [7]. For measuring similarity, different algorithms, like cosine similarity, Jaccard similarity, bigram similarity and synonym similarity, are used [8]. Each individual similarity measure captures a separate aspect of similarity. The cosine similarity between two documents generates a metric which tells how two documents are related by looking at the angle instead of the magnitude. The Jaccard similarity defines the similarity between two sets of documents, and it is computed by dividing the length of the intersection by the length of the union of the two document sets. The bigram similarity deals with the structure of two sentences and tells whether the two are similar or not with respect to structure [9]. The synonym similarity tells how much two sentences are similar with respect to synonyms.

To ease the manual evaluation process, automatic marks scoring has become very popular. Automatic marks scoring can be accomplished with the help of machine learning: some parameters are used to train a machine learning algorithm, and after training it can automatically assign a score [10]. Another approach is assigning a weight value to each parameter, based on its importance, and then multiplying the parameter value by the weight value. The summation of these products defines the marks of the corresponding answer.

To make the answer script evaluation system faster and more effective, a digital method based on NLP is presented in this paper for automatic answer script evaluation.

2. BACKGROUND

Answer script evaluation is a very crucial part of student assessment. A teacher follows various approaches, like short question answering, descriptive question answering and multiple choice questions, to assess students [11]. The evaluation of multiple choice questions and short questions is easy and less time consuming, while descriptive question answering takes more time to evaluate. Several methods have been developed for automatic answer script evaluation. Some of them are mentioned in the following subsections.

2.1 Automatic Short Question Evaluation System

A vector-based technique for short question evaluation was proposed by Ahmed Magooda et al. They investigated sentence representation techniques and a wide range of similarity measures for automatic grading of questions. For similarity measures, they considered string similarity, knowledge-based similarity, and corpus-based similarity. They used two different datasets to evaluate their proposed method, which was able to evaluate short questions with an accuracy of up to 86 percent [12].

A simple short question evaluation method was developed by Md Arafat Sultan et al. They gave the short question and its correct answer as input and computed only the semantic similarity of the student response with respect to the correct answer. They also focused on short text similarity and augmented similarity. They compared the performance of their model against the Mohler et al. dataset and a simpler bag-of-words model, and observed that their proposed model works better than the bag-of-words model [13].

Michael Mohler et al. developed a model for automatic short answer grading. They used unsupervised techniques for automatic short answer grading, and considered knowledge-based and corpus-based similarity as well as the effect of the domain and size of the corpus [14]. They added automatic feedback from student answers in order to improve the performance. Their model outperformed the previously proposed models. However, they did not take grammatical and spelling errors into account for grading.

Jonathan Nau et al. described a method for automatic short question answering for the Portuguese language [15]. They combined latent semantic analysis and a WordNet path-based similarity measure using linear regression to predict the score for short questions. They compared the predicted scores to human scores, and the combined method was found very useful.

P. Selvi et al. introduced a method for an automatic short answer grading system, which is based on simple lexical matching [16]. They performed some comparisons with existing methods and found that their proposed model worked well in a few cases. It can grade short questions with 59 percent accuracy.

2.2 Automatic Descriptive Question Evaluation System

The evaluation of descriptive questions is quite difficult in comparison with short question evaluation. It takes more time to evaluate, and the accuracy depends on various factors [17]. Hence, many researchers have proposed methods for automatic descriptive answer evaluation. Some are presented below.

Shehta et al. developed a model for automatic descriptive answer evaluation [18]. They divided their proposed system into a student module and a tutor module. Their model takes the student answer and the tutor answer as input and calculates the semantic similarity between the two answers, which helps to score marks. They used full NLP to implement their model.


Their developed model does not fit all types of data, since they focused only on semantic similarity; there were other factors that influenced the scored marks.

A pattern matching algorithm based method was proposed by Pranali Nikam et al. for the assessment of descriptive answers [19]. In their study, they represented the student answer and the true answer in the form of graphs, and then matched the pattern between the two graphs. They match each word of the student answer with the true answer. If any word does not match the true answer, they find the synonyms of that word and match those synonyms with the true answer. If a match is found, they replace the original word with the synonym and compute the similarity. Here, if two sentences contain the same words but out of order, the method gets confused and produces a wrong score.

A text similarity-based method for automatic scoring of descriptive type tests was developed by Izuru Nogaito et al. [20]. They measured n-gram and word-level matching similarity with BLEU and RIBES respectively. They also calculated Doc2Vec based cosine similarity. They found that the most effective similarity measure depends on the type of question: based on the question, the effectiveness of the similarity measurement techniques varies.

Marcelo Loor et al. [21] proposed a method with a combination of LSA, BLEU, WMD and fuzzy logic. They used LSA to find the semantic similarity between two documents. They used WMD to calculate the cumulative distance that a word needs to travel to reach the reference word; the cumulative distance measures distance even if there is no common word. Finally, they used fuzzy logic to score the marks. They applied their proposed model to various datasets and found that the accuracy varies between 0.71 and 0.85.

Most of the researchers focused on semantic similarity for scoring marks. They did not consider all the other similarity parameters for deciding the score. In this experiment, a novel approach is proposed with different similarity measures, and these similarity measures are used as parameters. Finally, a weight value is assigned to each parameter based on its importance to calculate the marks of that question.

3. METHODOLOGY

The aim of this study is to evaluate the descriptive answer script automatically and assign marks to the respective question. In order to accomplish this, we take the answer script as input. The Python programming language is used for implementing every algorithm. NLP is then used to extract text from the answer script and process the data. Various similarity measures are calculated and used as the parameters for assigning marks.

3.1 Text Extraction

The captured image of the answer script has been used as input for text extraction. For extracting text from the image, the Python class pytesseract has been used. Before extracting text, the noise in the image is removed to increase the extraction accuracy. Pytesseract is a class-based OCR, has Unicode (UTF-8) support, and can recognize more than 100 languages. The result of pytesseract is shown in Fig. 1 and Fig. 2. The extracted text has been used for further processing and for computing the various similarity measures.

"Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of recognition accuracy for most fonts are now common, and with support for a variety of digital image file format inputs. Some systems are capable of reproducing formatted output that closely approximates the original page including images, columns, and other non-textual components."

Fig. 1: Input image

======RESTART: C:\Python36\imtotext.py======
Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degr ee of recognition accuracy for most fonts are now common, and with support for a variety of digital image file format inputs. Some systems are capable of reproducing formatted output tha t closely approximates the original page including images, columns, and other non-textual compon ents.

Fig. 2: Output text
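To illustrate the extraction step, the following is a minimal sketch of pulling text from an answer-script image with pytesseract and OpenCV. The file name and the simple median-blur denoising step are illustrative assumptions, not the exact preprocessing used by the authors.

```python
import cv2
import pytesseract

def extract_text(image_path):
    """Read an answer-script image, reduce noise, and run Tesseract OCR on it."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # OCR works best on a grayscale image
    denoised = cv2.medianBlur(gray, 3)               # simple noise removal (assumed step)
    return pytesseract.image_to_string(denoised)     # recognized text as a string

if __name__ == "__main__":
    print(extract_text("answer_script.png"))         # hypothetical file name
```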
3.2 Summary Generation

From the image, the text is extracted in text format, and natural language processing is used to make an automatic summary of the long text. Summary generation helps to speed up the text processing task by ignoring the less important sentences in the long text document. Several techniques are available for generating an automatic summary. In order to generate the summary of the long text, some keywords are selected from the long text based on the occurrence of the words. Here the words of average frequency have been selected as keywords, while the most frequent and least frequent words are ignored. Then the weight of each sentence in the text is calculated as the number of keywords in the sentence squared, divided by the window size.


The window size is the maximum distance between two significant words in a sentence. The sentences are then sorted in descending order based on their weight values, and finally the first n sentences are taken as the summary of the long text.

Pseudocode of the text summarization algorithm:

1. Take text as input
2. Tokenize the text into words
3. Remove duplicates from the word list
4. Count the frequency of each word
5. Calculate the word percentage by dividing the word frequency by the length of the word list
6. Remove the most frequent and least frequent words by comparing the word percentage with a max and min threshold value, and select the average frequent words as keywords
7. Count the window size for each sentence with the help of the keywords
8. Calculate the weight of each sentence by dividing the square of the number of keywords in the sentence by the window size
9. Sort the sentences in descending order based on the weight values and select the first n sentences as the summary
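A minimal sketch of the keyword-based summarizer described by the pseudocode above is given below. The frequency thresholds and the definition of a sentence's window size (the span between its first and last keyword) are illustrative assumptions; the sentence weight follows step 8, i.e. (number of keywords in the sentence)² / window size.

```python
from nltk.tokenize import sent_tokenize, word_tokenize  # requires nltk 'punkt' data

def summarize(text, n=3, min_pct=0.001, max_pct=0.05):
    """Keyword-based extractive summary following the pseudocode above."""
    words = [w.lower() for w in word_tokenize(text) if w.isalpha()]
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    total = len(words)
    # Keep "average frequent" words: drop the most and least frequent ones (thresholds assumed).
    keywords = {w for w, c in freq.items() if min_pct < c / total < max_pct}

    scored = []
    for sent in sent_tokenize(text):
        tokens = [w.lower() for w in word_tokenize(sent)]
        positions = [i for i, w in enumerate(tokens) if w in keywords]
        if not positions:
            continue
        window = positions[-1] - positions[0] + 1        # span covered by the keywords
        scored.append((len(positions) ** 2 / window, sent))

    # Highest-weight sentences first; the top n form the summary.
    return " ".join(s for _, s in sorted(scored, reverse=True)[:n])
```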
Another approach, based on the bag-of-words and ignoring keywords, is also used. In order to find the more effective technique for summary generation, we have calculated Precision, Recall and F-score. The precision defines how much of the system summary (machine generated) is in fact relevant:

Precision (P) = Number of overlapping sentences / Number of sentences in the system summary   (1)

The recall specifies how much of the reference summary (human generated) is recovered by the system summary:

Recall (R) = Number of overlapping sentences / Number of sentences in the reference summary   (2)

The F-score is a measure that combines precision and recall. The basic way to calculate the F-score is to compute the harmonic mean of precision and recall:

F-score = 2 × P × R / (P + R)   (3)

Here, the F-score of the keyword-based summary generation technique is greater than that of the bag-of-words based summary generation. The generated summary is then compared with the true answer to find the various similarity measures. The summary generation techniques and findings are discussed in detail in the Result and Discussion section.
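The three summary-quality measures of Eqs. (1)-(3) can be computed directly from the sentence overlap; a small sketch is shown below, where the sentence lists are placeholders for the system and reference summaries.

```python
def summary_scores(system_sents, reference_sents):
    """Precision, recall and F-score of a system summary against a reference summary (Eqs. 1-3)."""
    overlap = len(set(system_sents) & set(reference_sents))          # overlapping sentences
    precision = overlap / len(system_sents) if system_sents else 0.0
    recall = overlap / len(reference_sents) if reference_sents else 0.0
    f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f_score
```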
3.3 Text Preprocessing

The summarized text contains some words which carry little information and can be ignored to facilitate the further text processing task. The process of converting data into a form that a computer can understand is known as preprocessing. Natural language processing is a very effective way to deal with text preprocessing. Text preprocessing includes tokenizing the text into words, removing stopwords, lemmatizing words, removing duplicate words, etc. To accomplish this preprocessing with NLP, the Natural Language Toolkit (NLTK), a leading platform for building Python programs that work with human language data, is used. It has many built-in functions that handle text preprocessing with few commands. The NLTK built-in function word_tokenize is used to split the text into words and store them in a list. The most important text preprocessing step is filtering out the useless words. NLTK has a StopWord corpus which contains frequently occurring words that are useless for defining the meaning of a sentence. The StopWord corpus has been used to filter out the unnecessary words.

Another text preprocessing step is word lemmatization. A word may appear in different forms in many languages; for example, the word walk may appear as walking, walked, and walks. Lemmatization is the process of converting a word into its base form, which is known as the lemma. It compresses the length of the word list and saves processing time. In order to lemmatize each word, the NLTK built-in class WordNetLemmatizer is used, which converts every word into its corresponding base form.

For carrying out some operations over the data, the data need to be formatted in a common format. One such format is the bigram (or digram), which is a sequence of two adjacent elements from a string of tokens. The bigram frequency distribution is commonly used to analyze the structural similarity of text. To generate bigrams, the bigrams function of NLTK is used, and it returns the list of bigrams over all words. Here the frequency of each word is also counted and stored in a dictionary, where the word is used as the key and the number of occurrences is stored as the value. Then the word dictionary with frequencies and the bigrams are used for measuring the various similarities.
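The preprocessing pipeline described here (tokenization, stopword removal, lemmatization, bigram generation and frequency counting) can be sketched with NLTK as follows. This is a simplified illustration of the listed steps, not the authors' exact code.

```python
from nltk.corpus import stopwords           # requires nltk 'stopwords' data
from nltk.stem import WordNetLemmatizer     # requires nltk 'wordnet' data
from nltk.tokenize import word_tokenize     # requires nltk 'punkt' data
from nltk.util import bigrams

def preprocess(text):
    """Tokenize, drop stopwords, lemmatize, and build word frequencies and bigrams."""
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))
    tokens = [w.lower() for w in word_tokenize(text) if w.isalpha()]
    words = [lemmatizer.lemmatize(w) for w in tokens if w not in stop]

    freq = {}                                # word -> number of occurrences
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return freq, list(bigrams(words))        # bigrams = pairs of adjacent tokens
```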
3.4 Similarity Measure

In many cases, it is necessary to determine whether two sentences are similar or not. A similarity measure tells how similar two sentences are by considering different angles of similarity. Several similarity measure techniques are available. In this experiment, cosine similarity, Jaccard similarity, bigram similarity, and synonym similarity are computed.


Cosine similarity is a very interesting similarity measure technique which looks at the angle between two documents and tells how similar they are:

Cosine-similarity (A, B) = (A · B) / (||A|| · ||B||)   (4)

where A and B are word vectors and each component of a vector contains a word frequency or TF-IDF value. Here the cosine similarity measure is computed between the student answer and the true answer. The cosine similarity measure provides a very prominent result in terms of similarity. Cosine similarity has been implemented in this experiment in the Python language. The pseudocode is shown below.

Pseudocode of Cosine Similarity

1. Take the dictionaries of words and frequencies as input
2. Create two word vectors, one for the student answer and another for the true answer. The length of each vector should be the length of the total word list
3. Calculate the dot product of the two vectors
4. Compute the norm of the first vector
5. Compute the norm of the second vector
6. Multiply the first and second norms
7. Divide the dot product by the product of the norms; this gives the cosine similarity
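A minimal sketch of the cosine similarity step, following the pseudocode above: the two frequency dictionaries are turned into vectors over the combined vocabulary and Eq. (4) is applied.

```python
import math

def cosine_similarity(freq_student, freq_true):
    """Cosine similarity between two word-frequency dictionaries (Eq. 4)."""
    vocab = set(freq_student) | set(freq_true)
    a = [freq_student.get(w, 0) for w in vocab]   # student-answer vector
    b = [freq_true.get(w, 0) for w in vocab]      # true-answer vector
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```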
The Jaccard similarity is another similarity measure technique, which tells the degree of similarity by measuring the intersection and union of two word lists:

Jaccard-similarity (A, B) = |A ∩ B| / |A ∪ B|   (5)

where A and B are two word lists. The Jaccard similarity is measured by dividing the intersection of the two word lists by their union. The intersection defines how many words are common between the two word lists, and the union defines the total words in both lists.

Pseudocode of Jaccard Similarity

1. Take two word lists as input
2. Perform the intersection operation between the two word lists. The AND (&) set operation performs the intersection
3. Perform the union operation between the two word lists. Here, add the lengths of the two word lists and subtract the length of the intersection; that is the union of the two word lists
4. Divide the intersection result by the union result; this produces the Jaccard similarity
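The Jaccard step of Eq. (5) can be sketched as below, using Python set operations as in the pseudocode.

```python
def jaccard_similarity(words_student, words_true):
    """Jaccard similarity between two word lists (Eq. 5)."""
    a, b = set(words_student), set(words_true)
    intersection = len(a & b)                  # number of common words
    union = len(a) + len(b) - intersection     # total number of distinct words
    return intersection / union if union else 0.0
```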
In this study, the structural similarity between two documents is also taken into account. In order to compute the structural similarity, the bigram similarity measure has been performed. The pseudocode is presented below.

Pseudocode of Bigram Similarity

1. Take two word lists as input
2. Generate bigrams from the two word lists. A bigram is a sequence of two adjacent tokens in a string
3. Compute the number of common bigrams in the two bigram lists
4. Divide the number of common bigrams by the average bigram-list length of the two lists
5. The division produces the bigram similarity
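A small sketch of the bigram similarity pseudocode: bigrams are generated from each word list, the shared bigrams are counted, and the count is divided by the average number of bigrams. Comparing distinct bigrams via sets is a simplification assumed here.

```python
from nltk.util import bigrams

def bigram_similarity(words_student, words_true):
    """Structural similarity based on shared bigrams, per the pseudocode above."""
    bg_student = set(bigrams(words_student))
    bg_true = set(bigrams(words_true))
    common = len(bg_student & bg_true)
    avg_len = (len(bg_student) + len(bg_true)) / 2   # average bigram-list length
    return common / avg_len if avg_len else 0.0
```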
In many languages, a word has many synonyms that hold a similar meaning. Hence, during the evaluation of an answer script, the synonyms of a word have to be considered for scoring marks. In this study, each word of the student answer is matched with the true answer. If no matching word is found in the true answer, then all synonyms of that word are retrieved and matched against the true answer. To generate the synonyms of a word, the NLTK WordNet function synsets is used. The synonym similarity is measured as the number of actual and synonym words of the student answer matched with the true answer, divided by the average word length of the two documents.

Pseudocode of Synonym Similarity

1. Take two word lists as input
2. Match each word of the student answer with the true answer and count the number of matches
3. If there is no matching word in the true answer, then generate the synonyms of that word
4. Match each synonym of that word with the true answer and count the number of matches
5. Divide the number of matches by the average length of the two documents
6. The division generates the synonym similarity value
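A sketch of the synonym similarity measure using the WordNet synsets function mentioned above. Matching on lemma names and dividing by the average document length follow the pseudocode, while the lowercase normalization and exact matching rules are assumptions.

```python
from nltk.corpus import wordnet   # requires the NLTK 'wordnet' corpus

def synonym_similarity(words_student, words_true):
    """Count student words matching the true answer directly or via a WordNet synonym."""
    true_set = {w.lower() for w in words_true}
    matches = 0
    for word in (w.lower() for w in words_student):
        if word in true_set:
            matches += 1
            continue
        # All synonym lemmas of the word, gathered from every WordNet synset.
        synonyms = {lemma.name().lower() for syn in wordnet.synsets(word)
                    for lemma in syn.lemmas()}
        if synonyms & true_set:
            matches += 1
    avg_len = (len(words_student) + len(words_true)) / 2
    return matches / avg_len if avg_len else 0.0
```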
The efficient evaluation of an answer script also depends on grammatical and spelling correctness. In this experiment, grammatical and spelling mistakes are also taken into consideration. To count the spelling and grammar errors, the Python package language_check is used. The four computed similarity measures and the grammatical-spelling error are used as the parameters for automatic marks scoring.
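The paper counts grammar and spelling issues with the language_check package; a minimal sketch of that idea is shown below. Normalizing the error count by the word count to obtain a 0-1 style parameter is an assumption for illustration.

```python
import language_check   # pip install language-check (wraps LanguageTool, requires Java)

def grammar_error_parameter(text):
    """Grammar/spelling parameter: share of words flagged by LanguageTool (normalization assumed)."""
    tool = language_check.LanguageTool('en-US')
    matches = tool.check(text)                 # one Match object per detected issue
    words = max(len(text.split()), 1)
    return min(len(matches) / words, 1.0)      # clamp to the 0-1 range used for parameters
```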
3.5 Marks Scoring

One purpose of this study is to automatically score marks after evaluation. This is the final step of the experiment, and the accuracy of this step will enhance the overall impact of this study.


Here, a weight value is assigned to each parameter based on the importance of the parameter. To improve the accuracy of the assigned weight values, a survey study over 50 samples has been carried out. The average weight value estimated from the survey is accepted and applied.

Marks = Σk (Pk × Wk)   (6)

where Pk is the kth parameter value and Wk is the weight value of the kth parameter. After assigning the weight value to each parameter, the weight value and the parameter value are multiplied. All the products are then added, which gives the final marks of that answer script.
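Eq. (6) amounts to a weighted sum of the parameter values. The sketch below illustrates this; the parameter values and weight values are hypothetical placeholders rather than the survey-derived weights of Table II, which is not reproduced in this text.

```python
def score_marks(parameters, weights):
    """Final marks as the weighted sum of Eq. (6): sum of P_k * W_k over all parameters."""
    return sum(parameters[name] * weights[name] for name in parameters)

# Hypothetical parameter values and weights for a 10-mark (M10) question;
# the real weights come from the authors' survey (Table II).
params = {"cosine": 0.8, "jaccard": 0.7, "bigram": 0.6, "synonym": 0.9, "grammar": 0.95}
weights = {"cosine": 2.0, "jaccard": 1.5, "bigram": 2.0, "synonym": 3.5, "grammar": 1.0}
print(score_marks(params, weights))   # illustrative score out of 10
```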
In order to test our experiment, a total of thirty sample descriptive questions and the student answers to those questions have been evaluated in a manual way. Three types of question, in terms of marks, are considered for this experiment: 5-mark questions (M5), 10-mark questions (M10) and 15-mark questions (M15). It has been seen that in most cases, our proposed method scores marks very near to the manual judgment.
4. RESULT AND DISCUSSION

The goal of this study is to evaluate descriptive answer scripts automatically and score marks. This will reduce the time for evaluating answer scripts and bring equality to the evaluation. To satisfy these requirements, we used a weighted-parameter technique for automatic evaluation. The summary generation from the extracted text plays an important role in the effectiveness of this experiment. To select an efficient technique for summary generation, we calculated the F-score of the summaries generated by the two techniques against a reference summary. The estimated F-scores of the keyword-based summarization and the bag-of-words based summarization are shown in Table 1. Table 1 indicates that the F-score of our chosen summarization technique is greater than that of the bag-of-words based summarization technique.

Table 1: F-score Calculation

            Keyword-based Summarization    Bag-of-words based Summarization
Precision   0.90                           0.83
Recall      0.83                           0.41
F-score     0.86                           0.53

In this experiment, five parameters have been considered for scoring marks. These are synonym similarity, bigram similarity, grammatical-spelling error, cosine similarity and Jaccard similarity. These parameters are used to automatically evaluate the three types of question (M5, M10, and M15) in terms of marks. A different weight value is assigned to each parameter based on the question type. The weights assigned to each parameter are shown in Table II. Each weight value is taken by averaging the survey values for that parameter. From Table II, it is found from the survey that the importance of the synonym parameter is the highest and that of the grammatical-spelling error parameter is the lowest for the evaluation of an answer script. A high weight value indicates that the parameter is more important for deciding the marks. The value of each parameter lies between zero and one, based on the similarity and the presence of errors. A higher parameter value means the similarity between the two documents is greater, and vice versa. In this study, thirty answer scripts for the three types of question are evaluated and the marks are used for testing the accuracy of the proposed model. Additionally, the above-mentioned five parameters are calculated from those thirty answer scripts and used for automatic marks scoring. The manually evaluated marks and the auto-scored marks are shown in Table III. From Table III, we see that our proposed automatic answer script evaluation system scores marks very near to the manually scored marks. The comparison of the automated scored marks and the manually computed marks is shown in Fig. 3. From Fig. 3, we have found that there is only a slight difference between the automated scored marks and the manually scored marks. In most cases the automatically assigned marks and the manually assigned marks are very close. When the student answer and the true answer contain more structural similarity as well as synonym similarity, the automated scored marks are very close to the manually scored marks. On the other hand, a notable difference between the automated scored marks and the manually scored marks exists when the student answer and the true answer have less structural similarity but more Jaccard and cosine similarity. It is also noticed from Table III and Fig. 3 that the difference between the manually scored marks and the automated scored marks is small for short questions (M5), and the opposite happens for descriptive questions (M15).

5. CONCLUSION AND FUTURE WORK

In this experiment, we have developed a natural language processing-based method for automatic answer script evaluation and marks scoring. Our system consists of the following steps: (1) text extraction from the image, (2) text summarization using a keyword-based technique, (3) text preprocessing for further analysis, (4) finding various similarity measures, and (5) marks scoring. In the first step, the text is extracted using pytesseract, which works based on OCR. Then the extracted text is summarized using the keyword-based summarization technique.


Here we accept the average frequent words as keywords and ignore the most frequent and least frequent words. The summarized text is preprocessed with the aid of NLTK, which is a leading platform for building Python programs. Here tokenization, stopword removal, lemmatization, bigram generation, and word frequency counting are performed as preprocessing. We also consider grammatical and spelling errors for answer script evaluation. After preprocessing, four similarity measures – synonym similarity, bigram similarity, cosine similarity and Jaccard similarity – are computed, which are used as the parameters for final marks scoring. In order to score marks, a weight value is assigned to each parameter after doing a survey on best weight estimation. The weight value is multiplied with the parameter value to score the final marks for that question. In this system, we have considered three types of questions based on marks, and the answer scripts for those questions were also evaluated in a manual way. The manual marks are compared with the automatically scored marks to validate our developed method. In most cases, we have found that our proposed method scores marks similar to the manually assigned marks. Only in very few cases are the automatically assigned marks slightly higher or lower than the manually assigned marks. The limitation of our research is that we assign a weight value to each parameter manually by doing a survey. Therefore, our next goal is to introduce a machine learning algorithm that will be trained on the various calculated parameters and will predict the marks of the answer script. Also, in the future, we will introduce new techniques for effective and precise summary generation.

References

[1] V. Nandini and P. Uma Maheswari, “Automatic assessment of descriptive answers in online examination system using semantic relational features,” The Journal of Supercomputing, 2018.

[2] D. V. Paul and J. D. Pawar, “Use of Syntactic Similarity Based Similarity Matrix for Evaluating Descriptive Answer,” 2014 IEEE Sixth International Conference on Technology for Education, Clappana, 2014, pp. 253-256.

[3] K. Meena and L. Raj, “Evaluation of the descriptive type answers using hyperspace analog to language and self-organizing map,” 2014 IEEE International Conference on Computational Intelligence and Computing Research, Coimbatore, 2014, pp. 1-5.

[4] Y. Oganian, M. Conrad, A. Aryani, K. Spalek, and H. R. Heekeren, “Activation Patterns throughout the Word Processing Network of L1-dominant Bilinguals Reflect Language Similarity and Language Decisions,” Journal of Cognitive Neuroscience, vol. 27, no. 11, pp. 2197-2214, Nov. 2015.

[5] S. R. Rahimi, A. T. Mozhdehi, and M. Abdolahi, “An overview on extractive text summarization,” 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran, 2017, pp. 0054-0062.

[6] P. K. Rachabathuni, “A survey on abstractive summarization techniques,” 2017 International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, 2017, pp. 762-765.

[7] V. U. Thompson, C. Panchev, and M. Oakes, “Performance evaluation of similarity measures on similar and dissimilar text retrieval,” 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), Lisbon, 2015, pp. 577-584.

[8] A. R. Lahitani, A. E. Permanasari, and N. A. Setiawan, “Cosine similarity to determine similarity measure: Study case in online essay assessment,” 2016 4th International Conference on Cyber and IT Service Management, Bandung, 2016, pp. 1-6.

[9] Y. Oganian, M. Conrad, A. Aryani, K. Spalek, and H. R. Heekeren, “Activation Patterns throughout the Word Processing Network of L1-dominant Bilinguals Reflect Language Similarity and Language Decisions,” Journal of Cognitive Neuroscience, vol. 27, no. 11, pp. 2197-2214, Nov. 2015.

[10] L. Gao and H. Chen, “An automatic extraction method based on synonym dictionary for web reptile question and answer,” 2018 13th IEEE Conference on Industrial Electronics and Applications (ICIEA), Wuhan, 2018, pp. 375-378.

[11] T. Bluche, C. Kermorvant, C. Touzet, and H. Glotin, “Cortical-Inspired Open-Bigram Representation for Handwritten Word Recognition,” 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, 2017, pp. 73-78.

[12] A. Magooda, M. A. Zahran, M. Rashwan, H. Raafat, and M. B. Fayek, “Vector Based Techniques for Short Answer Grading,” Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, 2014, pp. 238-243.

[13] M. A. Sultan, C. Salazar, and T. Sumner, “Fast and Easy Short Answer Grading with High Accuracy,” Proceedings of NAACL-HLT 2016, pp. 1070-1075.

[14] M. Mohler and R. Mihalcea, “Text-to-text Semantic Similarity for Automatic Short Answer Grading,” International Journal of Artificial Intelligence in Education 25 (2015), pp. 118-125.

[15] J. Nau, A. H. Filho, and G. Passero, “Evaluating Semantic Analysis Methods For Short Answer Grading Using Linear Regression,” PEOPLE: International Journal of Social Sciences (2017), Volume 3, Issue 2, pp. 437-450.


[16] P. Selvi and A. K. Banerjee, “Automatic Short-Answer Grading System (ASAGS),” InterJRI Computer Science and Networking (2010), Vol. 2, Issue 1, pp. 18-23.

[17] S. K. Chowdhury and R. J. R. Sree, “Dimensionality reduction in automated evaluation of descriptive answers through zero variance, near zero variance and non-frequent words techniques - a comparison,” 2015 IEEE 9th International Conference on Intelligent Systems and Control (ISCO), Coimbatore, 2015, pp. 1-6.

[18] M. A. G. Mohler, R. Bunescu, and R. Mihalcea, “Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments,” International Journal of Artificial Intelligence in Education 27 (2016), pp. 83-89.

[19] P. Nikam, M. Shinde, R. Mahajan, and S. Kadam, “Automatic Evaluation of Descriptive Answer Using Pattern Matching Algorithm,” International Journal of Computer Sciences and Engineering (2015), Vol. 3(1), pp. 69-70.

[20] M. S. M. Patil and M. S. Patil, “Evaluating Student Descriptive Answers Using Natural Language Processing,” International Journal of Engineering Research & Technology (IJERT) 2014, Vol. 3, Issue 3.

[21] M. Loor and G. De Tré, “Choosing suitable similarity measures to compare intuitionistic fuzzy sets that represent experience-based evaluation sets,” 2015 7th International Joint Conference on Computational Intelligence (IJCCI), Lisbon, 2015, pp. 57-68.

