ABSTRACT In order to achieve quality education, defined as one of the sustainable development goals, it is necessary to provide information about the education system according to the stakeholders’ requirements. Obtaining information about a university/institute is a critical stage in the academic journey of prospective students, who seek information about the specific courses that make that university/institute unique. This process begins with the exploration of general information about universities through websites, rankings, and brochures from various sources. Often, the information available from different sources is inconsistent, and these discrepancies influence students’ decisions. By addressing inquiries promptly and providing valuable information, universities can guide individuals in making informed choices about their academic future. To address this, a chatbot deployed on the university’s official website is an effective tool. A chatbot is an artificially intelligent tool that can interact with humans and mimic a conversation. It can be implemented using advanced Natural Language Processing (NLP) models to provide pre-defined answers to students’ queries. A chatbot is very helpful for query resolution during the counseling process of the institute, as it provides official/uniform information and can be accessed 24 × 7. Therefore, the aim of this research work was to implement a chatbot using various NLP models and compare them to identify the best one. In this work, five chatbot models were implemented using neural networks, TF-IDF vectorization, sequential modeling, and pattern matching. From the results, it was observed that the neural network-based models had better accuracy than the TF-IDF and pattern matching models, and that the sequential model is the most accurate because it prevents over-fitting. Furthermore, adding an optimizer to a chatbot can improve the results, and pattern matching and semantic analysis should be parts of a chatbot for real-time scenarios.
INDEX TERMS Conversational AI, natural language processing, artificial intelligence, chatbots, neural
networks, sequential modeling, pattern matching, semantic analysis.
IV. METHODOLOGY
The development of university information chatbots involves defining objectives, understanding user requirements, selecting a suitable platform, integrating with university systems, and deploying across various channels. The overview of the developed solution is explained step-wise as follows:
• Preparation of Questions Related to the Counseling Process: The solution begins with a meticulous preparation of questions related to the counseling process. This involves understanding the varied needs and concerns of users, including prospective students and parents. The question preparation phase includes input from counseling experts to ensure the chatbot is equipped to address a wide range of inquiries related to admissions, academic programs, career guidance, and support services.
• Handling Various Forms of the Same Query Using Semantic Analysis: To enhance the effectiveness of the chatbot, semantic analysis is employed to handle various forms of similar queries. Through natural language processing techniques, the chatbot is trained to recognize the semantic meaning behind different expressions of the same question. This enables the chatbot to provide consistent and accurate responses regardless of how users phrase their queries, ensuring a more user-friendly and efficient interaction.
• Capability to Process All Types of Questions, Simple to Complex: The developed solution ensures the chatbot’s versatility in processing questions of varying complexity. Whether users have straightforward queries about admission deadlines or complex inquiries regarding academic policies, the chatbot is designed to comprehend and respond appropriately. The system is equipped with an extensive knowledge base and advanced algorithms to tackle a diverse set of questions, providing comprehensive support across the counseling spectrum.
• Implementation and Analysis of Chatbots Using Various Technologies: The implementation of the chatbot solution is characterized by the use of various technologies to optimize performance and user experience. Technologies such as natural language processing (NLP), machine learning algorithms, and possibly deep learning models are integrated to enhance the chatbot’s understanding of user intent and context. The solution embraces a comparative analysis of different technologies to select the most suitable ones, considering factors like accuracy, scalability, and ease of integration with existing systems.

FIGURE 1. Methodology for chatbot development.

Figure 1 depicts the methodology for developing a chatbot. The dataset is created by collecting all the questions asked about a particular technical university from various social media portals and from the university’s students and faculty. Answers are obtained for these questions from authorized sources at the university. The dataset has around 250 questions formed in different ways. The following methodology is used to turn these questions and answers into a chatbot. In the first step, raw data is pre-processed and converted into a format that is easier and more effective for further processing steps. Pre-processing also normalizes the raw data in the dataset and reduces the number of features in the feature set. This leads to a decrease in the complexity of fitting the data to each classification model.
The pre-processing steps are explained below:
• Converting to Lowercase: The raw text is changed to lowercase to avoid numerous variants of the same word; all the terms, regardless of their casing, are standardized/normalized to lowercase so they can be counted together.
• Tokenization: Tokenization is dividing a text stream into meaningful elements called tokens. Tokens can be words, sentences, or any other part of the sentence.
• Bag of Words: Neural networks cannot understand words and require numbers as input. Therefore, the Bag of Words model is used to convert words into machine-recognizable vectors of numbers. There are two kinds of Bag of Words: the first is a list of zeros and ones that signify whether each word is present in the sentence; the other kind counts the number of occurrences of the most frequently used words. The size of the Bag of Words is the number of unique root words. Here, the Bag of Words is represented as a list of 0s and 1s, where each position in the list indicates whether the corresponding word exists in the sentence. For example:
sentence = hello, how are you?
words = [‘‘are’’, ‘‘bye’’, ‘‘hello’’, ‘‘hi’’, ‘‘how’’, ‘‘i’’, ‘‘thank’’, ‘‘you’’]
bag = [1,0,1,0,1,0,0,1]
• Removing Stop Words: Stopwords are the words that occur most frequently in a document and contain very little useful information. For example, in the English language, words such as ‘‘a,’’ ‘‘about,’’ ‘‘above,’’ ‘‘after,’’ ‘‘again,’’ and ‘‘against’’ all carry meagre information, and thus they are called stopwords. Removing stopwords reduces the vector space and improves the model’s performance by increasing accuracy and reducing the training time and the number of calculations.
• Stemming: Stemming is the process of removing prefixes and suffixes (affixes) from words. When several variants of a feature are stemmed into a single feature, the number of features in the feature space is reduced and the performance of the classifier increases. For example, ‘‘play,’’ ‘‘playing,’’ and ‘‘played’’ are all reduced to the same root.
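To make these steps concrete, the following is a minimal Python sketch of the pre-processing pipeline described above (an illustration rather than the paper’s own code; nltk and LancasterStemmer are chosen here because later sections name them):

import nltk
from nltk.stem import LancasterStemmer
# nltk.download('punkt')  # one-time download of the tokenizer data

stemmer = LancasterStemmer()

def bag_of_words(sentence, vocabulary):
    # lowercase and tokenize the sentence, then stem each alphabetic token
    tokens = nltk.word_tokenize(sentence.lower())
    stems = {stemmer.stem(t) for t in tokens if t.isalpha()}
    # encode against the vocabulary of unique root words: 1 if present, else 0
    return [1 if root in stems else 0 for root in vocabulary]

# in the real pipeline the vocabulary is built from the whole dataset
vocabulary = sorted({stemmer.stem(w) for w in
                     ['are', 'bye', 'hello', 'hi', 'how', 'i', 'thank', 'you']})
print(bag_of_words('Hello, how are you?', vocabulary))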
FIGURE 2. Structure of a neural network.

A feed-forward neural network passes information in only one direction. Contrary to feed-forward neural networks, Long Short-Term Memory (LSTM) uses recurrent neural networks, where the information flow is non-linear. While dealing with sequential data or data with a temporal link, LSTMs are favored. However, LSTMs have disadvantages: they are comparatively slow and require a sizeable high-quality dataset to get acceptable results.
Once the model is ready, the next step is to design a chatbot. There are many ways of creating a chatbot, and all of them will have different performances. Even if the same query is fed to all the chatbots, their responses might be different. In this section, the following five chatbots are explained:
• Smart Bot: In this chatbot, the neural network is created using TensorFlow (TfLearn)
• Sam: In this chatbot, the neural network is created using PyTorch
• Big Mouth: This chatbot is created using TF-IDF vectorization
• Hercules: This chatbot is created using sequential modeling
• Alice: This chatbot is created using AIML pattern matching

A. SMART BOT: NEURAL NETWORK USING TFLEARN
Now, all the words in the ‘‘words’’ list are stemmed using the LancasterStemmer().stem() function, and all the duplicate words are removed using set(words). ‘‘words’’ and ‘‘labels’’ are then sorted. A bag of words named ‘‘bag’’ (an empty list) is created, where the size of ‘‘bag’’ is the number of root words in the database. For every word in the ‘‘words’’ list, if that word exists in the sentence, then 1 is appended to ‘‘bag’’; else, 0 is appended to ‘‘bag’’.
A Deep Neural Network is created using TfLearn. The size of the input layer is equal to the size of the ‘‘bag,’’ and the input to the neural network is ‘‘bag.’’ Two fully connected hidden layers of eight neurons each are added to the network. Fully connected layers mean that all possible connections are present, wherein every input of the input vector influences every output of the output vector. An output layer of size equal to the number of tags in the dataset is added to the network. The softmax activation function is applied to each neuron in the output layer. ‘‘n_epochs’’ is the number of times the model will see the same training data; in this model, ‘‘n_epochs’’ is set to 1000. The softmax activation function converts the output to a list of probabilities, with each value denoting the possibility that the sentence belongs to the corresponding tag, and the sum of all probabilities equals 1.
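This description maps onto TfLearn as in the following sketch (a minimal illustration, not the paper’s code; train_x and train_y are assumed names for the bag-of-words vectors and one-hot tag vectors, and TfLearn requires TensorFlow 1.x):

import tflearn

# train_x: list of bag-of-words vectors; train_y: one-hot tag vectors (illustrative names)
net = tflearn.input_data(shape=[None, len(train_x[0])])  # input layer sized to ''bag''
net = tflearn.fully_connected(net, 8)                    # first hidden layer, 8 neurons
net = tflearn.fully_connected(net, 8)                    # second hidden layer, 8 neurons
net = tflearn.fully_connected(net, len(train_y[0]), activation='softmax')  # one neuron per tag
net = tflearn.regression(net)

model = tflearn.DNN(net)
model.fit(train_x, train_y, n_epoch=1000, batch_size=8, show_metric=True)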
B. SAM: NEURAL NETWORK USING PYTORCH
The explanation of Algorithm 2 is given in the following paragraphs. The data is stored in the intents.json file, and it contains a list of intents. Each intent or class has a tag, a pattern, and a response. The ‘‘tag’’ defines the intent or class. The ‘‘pattern’’ is a list of possible questions for the corresponding class. The ‘‘response’’ is a list of possible answers to the questions of that ‘‘tag.’’ The chatbot will take the message from the user, identify the ‘‘tag’’ of the message, and give the corresponding response.
Pre-processing steps are applied to the data. Every question of every intent is tokenized using nltk.word_tokenize() and is appended to the ‘‘all_words’’ list. Every unique tag is stored in the ‘‘tags’’ list. Now, ‘‘all_words’’ is a list that contains all the tokenized words of the dataset, and ‘‘tags’’ is a list that contains all the tags of the database. All the punctuation tokens are removed; every word in the dataset is converted to lowercase using the lower() function, and the words are stemmed using the PorterStemmer().stem() function from nltk. The ‘‘all_words’’ list is sorted using the sorted(all_words) function, and all the duplicate words are removed using the set(all_words) function. The ‘‘tags’’ list is also sorted. To create a bag of words, a ‘‘bag’’ list is created. The ‘‘bag’’ list is of length equal to the size of the list ‘‘all_words,’’ i.e., the number of unique stemmed words in the database. The ‘‘bag’’ list is initialized with 0. For every word in ‘‘sentence,’’ its corresponding ‘‘index’’ in ‘‘all_words’’ is used to set bag[index] to 1.
A feed-forward neural network is created using torch.nn.Module, the base class for all neural network modules. A feed-forward neural network is an artificial neural network in which the connections between nodes do not form a cycle. One linear input layer of size equal to the ‘‘bag’’ list is created. Two hidden linear layers having eight neurons each are created. One output layer of size equal to the size of the ‘‘tags’’ list is formed. A ReLU activation function is defined. Training data is passed through the input layer, and then the activation function is applied. This data is fed to the first hidden layer and the activation function is applied again; the result is provided to the second hidden layer, the activation function is applied once more, and this data is fed to the output layer. The learning rate is a vital hyperparameter that determines how fast the neural network converges to an optimum value. In this model, the value of the learning rate is 0.001. When the user query is passed through the neural network, the ‘‘tag’’ with the highest probability is chosen, and the response is given to the user.
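A minimal PyTorch sketch matching this description is given below (the layer sizes and the 0.001 learning rate follow the text, while the CrossEntropyLoss/Adam choices and the concrete dimensions are illustrative assumptions, since the paper does not print Sam’s code):

import torch
import torch.nn as nn

class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.l1 = nn.Linear(input_size, hidden_size)   # input layer -> first hidden layer
        self.l2 = nn.Linear(hidden_size, hidden_size)  # second hidden layer
        self.l3 = nn.Linear(hidden_size, num_classes)  # output layer: one score per tag
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.l1(x))
        x = self.relu(self.l2(x))
        return self.l3(x)  # raw scores; softmax is applied inside the loss function

model = NeuralNet(input_size=54, hidden_size=8, num_classes=7)  # sizes are illustrative
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate from the text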
C. BIG MOUTH: USING TF-IDF VECTORIZATION
The term ‘‘Term Frequency’’ (TF) is used to count how many times a term appears in a document [33]. Suppose there are 5000 words in document ‘‘T1,’’ and the word ‘‘alpha’’ appears ten times. Then the term frequency of ‘‘alpha’’ in document ‘‘T1’’ will be
TF = t/s
where t is the number of occurrences of the term in the document and s is the total number of words in the document, so
TF = 10/5000 = 0.002
The inverse document frequency gives less weight to frequently occurring words and more weight to infrequently occurring words. For example, if we have ten documents and the term ‘‘alpha’’ appears in five of them, we may calculate the inverse document frequency as
IDF = log(M/m)
where M is the total number of documents in the corpus and m is the number of documents containing the required term, so
IDF = log(10/5) = 0.301
The detailed steps of TF-IDF vectorization are shown in Algorithm 3 and Algorithm 4.

Algorithm 3 Algorithm for TF-IDF Vectorization
Data = Query input from the user
Output = Response from chatbot
sentTokens ← Sentence tokenized data from database
Append the query to sentTokens
tfidf ← TF-IDF vectorized data with stop words removed
vals ← Cosine similarity between the user query (tfidf[−1]) and tfidf
reqTFIDF ← Maximum cosine similarity in vals
if reqTFIDF > 0 then
  return the corresponding response
else
  return ‘‘I do not understand..’’
end if

Algorithm 4 Algorithm for Greeting in TF-IDF Vector Chatbot
Data = Query from the user
Output = Response from the chatbot if the query is a greeting
greetingInput ← List of greeting input words
greetingOutput ← List of greeting output words
while word ∈ query do
  if lowerCase(word) ∈ greetingInput then
    return random greetingResponse
  end if
end while

In 2020, Abhishek Jaglan et al. [34] wrote that textual data cannot be employed in the model directly; instead, it has to be converted to numerical vectors. This can be done by assigning a unique number to each word, so that the given data can be encoded as a vector with the length of the vocabulary of known words. The Bag-of-Words model is a way of representing whether the words exist in the ‘‘sentence’’ or not, regardless of their sequence of appearance.
Term Frequency-Inverse Document Frequency (TF-IDF) stores the components of the resulting scores assigned to each word. The goal of the TF-IDF vector is to calculate word frequency scores for text that is more interesting (less common). Term Frequency is used to calculate the frequency of each word, whereas Inverse Document Frequency down-scales the score of frequently occurring words.
The explanation of Algorithm 5 is given in the following paragraph. The corpus consists of full-stop-separated answers in the form of a text file. The corpus is loaded and is tokenized using nltk.word_tokenize and nltk.sent_tokenize. The tokens are lemmatized using nltk.stem.WordNetLemmatizer().lemmatize(). Words that are in string.punctuation (the set of punctuation marks) are removed. Input is taken from the user in the form of a ‘‘query’’. Greeting is implemented using the pattern matching algorithm: if the query contains any words from GREETING_INPUT (a list of predefined greeting inputs), then the chatbot will return a random response from GREETING_OUTPUT (a list of predefined greeting outputs). The ‘‘query’’ is tokenized and lemmatized, and the result is appended to the list ‘‘sent_tokens’’. sklearn is the library used for TF-IDF vectorization and to calculate the cosine similarity. Cosine similarity is calculated between every sentence from the corpus and the user query, and the sentence having the highest cosine similarity is given as the output (cosine similarity values are sorted in descending order and the first value is selected) [35].

Algorithm 5 Algorithm for Chatbot Using TF-IDF Vectorization
data = load data.txt
senttokens = sentence tokenization of data
wordtokens = word tokenization of the text
Remove punctuations from the text
Initialize lists GREETINGINPUT, GREETINGOUTPUT with sample greeting inputs and outputs
while word ∈ sentence.split do
  if word ∈ GREETINGINPUT then
    Print out a random GREETINGOUTPUT
  end if
end while
userInput = Input from the user
userresponse = tokenized user query
Append userresponse to senttokens
tfidfVec = Create tfidf vector of senttokens
vals = Cosine similarity between ‘‘senttokens’’ and ‘‘userresponse’’
idx = Id of the sentence with the highest cosine similarity
if reqtfidf = 0 then
  Print ‘‘I am sorry, I didn’t understand you’’
else
  Print senttokens[idx]
end if
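The core retrieval step of Big Mouth can be sketched with sklearn as follows (a minimal illustration assuming the corpus is already split into sentences; the corpus content and variable names are illustrative, not the paper’s):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sent_tokens = ['admissions open in june.',          # corpus sentences (illustrative)
               'the hostel fee is paid yearly.']
sent_tokens.append('when do admissions open?')      # user query appended as the last sentence

tfidf = TfidfVectorizer(stop_words='english').fit_transform(sent_tokens)
vals = cosine_similarity(tfidf[-1], tfidf[:-1])     # similarity of the query to every sentence

idx = vals.argmax()                                 # index of the most similar sentence
if vals[0, idx] > 0:
    print(sent_tokens[idx])                         # best-matching answer
else:
    print('I am sorry, I did not understand you')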
D. HERCULES: USING SEQUENTIAL MODELING
The data is stored in the intents.json file and contains a list of intents. Each intent or class has a tag, a pattern, and a response. The ‘‘tag’’ defines the intent or class. The ‘‘pattern’’ is a list of possible questions of the corresponding class. The ‘‘response’’ is a list of possible answers to the questions of that ‘‘tag.’’ The chatbot will take the message from the user, identify the ‘‘tag’’ of the message, and give the corresponding response.

Algorithm 6 Algorithm for Chatbot Using Sequential Modeling
data = load data from JSON file
Implement lemmatization
Initialize lists words, labels, docx, docy
while intent ∈ data do
  while pattern ∈ intent do
    wrds = tokenize words in the pattern
    Append ‘‘wrds’’ to words
    Append ‘‘wrds’’ to docx
    Append ‘‘tag’’ to docy
  end while
  if tag ∉ labels then
    Append ‘‘tag’’ to labels
  end if
end while
Remove punctuations from ‘‘words’’
words = stemmed ‘‘words’’ converted to lowercase
Sort ‘‘words’’ and ‘‘labels’’
Initialize lists: training, output
while sentence ∈ docx do
  Initialize ‘‘bag’’ (bag of words)
  wrds = stem every word in ‘‘sentence’’
  while word ∈ words do
    if word ∈ wrds then
      Append 1 to bag
    else
      Append 0 to bag
    end if
  end while
  Append ‘‘bag’’ to training
  Initialize outputrow with 0s, one per label
  outputrow[labels.index(docy[x])] = 1
  Append outputrow to output
end while
Create a Sequential model with softmax activation function
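For illustration, one entry of the intents.json structure described above might look like the following (a hypothetical example; the paper does not reproduce its dataset):

{
  "intents": [
    {
      "tag": "admission_deadline",
      "patterns": ["When is the last date to apply?",
                   "What is the admission deadline?"],
      "responses": ["Applications close on 30 June."]
    }
  ]
}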
The explanation of Algorithm 6 is given in the following paragraphs. Every question of every intent is tokenized using nltk.word_tokenize() and is appended to the ‘‘words’’ list. All the tags are stored in the ‘‘labels’’ list. ‘‘words’’ contains all the words in the database. Every word in ‘‘words’’ is converted to lowercase using the lower() function. Now, all the words in the ‘‘words’’ list are lemmatized using the WordNetLemmatizer().lemmatize() function, and all the duplicate words are removed using set(words). Both ‘‘words’’ and ‘‘labels’’ are sorted. A bag of words is created with the variable name ‘‘bag,’’ where the size of ‘‘bag’’ is the number of root words in the database. For every word in the ‘‘words’’ list, if that word exists in the sentence, then 1 is appended to ‘‘bag’’; else, 0 is appended to ‘‘bag’’.
In 2014, Srivastava et al. [36] wrote that dropout is a technique that prevents overfitting and provides a way of efficiently combining different neural networks. The term ‘‘dropout’’ refers to dropping out units randomly from the hidden or visible layers of the neural network. By dropping a unit, it is temporarily removed from the network along with its incoming and outgoing connections, by setting its weight to zero.
A neural network with three layers is created. A dense or fully connected input layer of size equal to the ‘‘bag’’ size is created with a ReLU activation function. Dropout is used, which will drop 50% of the units. A fully connected hidden layer of 64 neurons is created, and the ReLU activation function is applied. Dropout is used again, which will drop 30% of the units. A dense output layer of size equal to the number of tags in the database is created with the softmax activation function. The softmax activation function converts the output to a list of probabilities, wherein each value denotes the likelihood of the sentence belonging to the corresponding tag, and the sum of all probabilities equals 1. The model is optimized using ‘‘Adam.’’ The Adam optimizer is an adaptive learning rate method, which computes individual learning rates for different parameters. When the user query is passed through the neural network, the tag with the highest probability is chosen, and the response is given to the user.
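This architecture translates naturally into a Keras Sequential model; the following is a minimal sketch under the assumption that Keras is the sequential-modeling library used (the dropout rates, 64-neuron hidden layer, and Adam optimizer follow the text, while the first layer’s unit count, the epoch count, and the training/output variable names are illustrative):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# training: list of bag-of-words vectors; output: one-hot tag rows (built as in Algorithm 6)
X = np.array(training, dtype='float32')
y = np.array(output, dtype='float32')

model = Sequential([
    Dense(128, input_shape=(X.shape[1],), activation='relu'),  # input dimension = bag size
    Dropout(0.5),                                               # drop 50% of the units
    Dense(64, activation='relu'),                               # hidden layer of 64 neurons
    Dropout(0.3),                                               # drop 30% of the units
    Dense(y.shape[1], activation='softmax'),                    # one probability per tag
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=200, batch_size=8, verbose=1)            # epoch count illustrative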
E. ALICE: USING AIML
In AIML, categories are the basic unit of knowledge. Each category has a pattern and a template. The pattern describes the query, and the template describes the chatbot’s responses. The template tag can hold a list of possible responses for the chatbot to choose from, and it will randomly give one response.
There are two types of AIML categories:
• Atomic category: An AIML category where the query is an exact match. This type of category does not contain any wildcards.
<category>
<pattern>Good Morning</pattern>
<template>Good Morning to you too!</template>
</category>
• Default category: wildcard symbols such as ∗ and ∧ are used in the pattern. The ∗ wildcard captures one or more words, and the ∧ wildcard captures zero or more words.
<category>
<pattern>Hi, ∧</pattern>
<template>Hi, Good to see you</template>
</category>
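To show how such categories are loaded and queried, here is a minimal sketch using the community python-aiml package (an assumption on my part; the paper does not name its AIML interpreter, and the file name is illustrative):

import aiml  # pip install python-aiml

kernel = aiml.Kernel()
kernel.learn('university.aiml')  # AIML file containing the categories (illustrative name)

while True:
    query = input('You: ')
    print('Bot:', kernel.respond(query))  # matches the query against the loaded patterns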
These five chatbots were created to see how different technologies and algorithms impact a chatbot’s performance. Confusion matrices were calculated using the sklearn library. Furthermore, all five chatbots were implemented, and the number of epochs was varied. Lastly, 250 queries were executed on all the chatbots, and all the responses were observed.
V. RESULT ANALYSIS
The results of all the implemented chatbots are represented in terms of accuracy and validation. The first section contains the confusion matrices and accuracy, calculated based on a sample test dataset. In the second section, 150 queries were implemented on all the chatbots, and their responses were observed to check whether they categorized the query correctly or not. Lastly, the last section contains screenshots of the conversations with the bots.

A. CONFUSION MATRICES AND ACCURACY BASED ON SAMPLE TEST DATASET
A test dataset having 144 queries about one university is used to test the chatbot models. Confusion matrices and accuracies are calculated using the sklearn metrics library (using the confusion_matrix and accuracy_score functions).
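A minimal sketch of this evaluation step is given below (the label lists are made up purely for illustration):

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = ['fees', 'admission', 'hostel', 'admission']    # expected tags (illustrative)
y_pred = ['fees', 'admission', 'admission', 'admission'] # tags predicted by a chatbot

print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))  # 0.75 for this toy example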
In the Smart Bot model, the neural network was created using TensorFlow, and multiple pre-processing steps were applied. The Lancaster stemming algorithm, which is more accurate, was used in the pre-processing phase. Furthermore, the softmax activation function is applied to the output layer, increasing the neural network’s performance. Table 1 is the confusion matrix of Smart Bot.

TABLE 1. Confusion matrix of all the chatbots.

In the Sam model, the neural network was created using PyTorch, and the ReLU activation function was applied to the input and hidden layers. This impacted the performance of the model. Table 1 is the confusion matrix of Sam.
In the Big Mouth model, no neural network is created. Instead, TF-IDF vectorization converts every sentence into a vector, and cosine similarity calculates the similarity between every sentence and the query. This model does not need to understand the meaning of the query; it simply finds the most similar sentence. Table 1 is the confusion matrix of Big Mouth.
In the Hercules model, sequential modeling is used while creating the neural network, which is designed to prevent the problem of overfitting; this improves the model’s performance. Table 1 is the confusion matrix of Hercules.
In the ALICE model, AIML is used to create pattern-matching rules. No computation is done here, and the query is matched to the predefined rules. The programmer needs to understand the AIML functionalities deeply to get acceptable results. Table 1 is the confusion matrix of ALICE.

B. QUERY ANALYSIS ON CHATBOTS
One hundred fifty simple queries were created, of which 15 had spelling mistakes. All these queries were implemented on all the chatbots, and their responses were observed to check whether they categorized the question correctly or not.

FIGURE 3. Query-wise accuracy of all the chatbots.

Figure 3 shows the number of queries correctly answered by each chatbot, along with the accuracy of each model. Table 2 depicts the various queries and how many questions were correctly answered by each model.
TABLE 2. Question-wise performance of chatbots.

1) SMART BOT
As shown in Figure 4, at the 1000th epoch the training loss of the model is 0.35567 and the accuracy is 0.9738. If we increase the number of epochs to 1500, the training loss of the model reduces to 0.15079, and the accuracy increases to 0.9949. On the contrary, if the number of epochs becomes 500, the training loss becomes 0.24558, and the accuracy becomes 0.9817.
It can be observed that increasing the number of epochs increases the accuracy of the chatbot and decreases the training loss.

2) SAM
At the 1000th epoch, the training loss of the chatbot is 0.0003, as shown in Figure 6. If the number of epochs is increased to 1500, the training loss decreases to 0.0001. On the contrary, if the number of epochs is reduced to 500, the training loss rises to 0.0017.
It can be observed that by increasing the number of epochs, the training loss decreases.

FIGURE 6. Training of Sam.