
A Conditional Generative Chatbot using Transformer Model

Nura Esfandiari a, Kourosh Kiani a,*, Razieh Rastgoo a

a Electrical and Computer Engineering Faculty, Semnan University, Semnan, 3513119111, Iran
* Corresponding author ([email protected])
Email: [email protected], [email protected], [email protected]

ARTICLE INFO

Keywords: Generative Chatbot; Deep Learning; Conditional generative model; Transformer model; Dialog system

ABSTRACT

A Chatbot serves as a communication tool between a human user and a machine to obtain an appropriate answer to the human input. In more recent approaches, a combination of Natural Language Processing and sequential models is used to build a generative Chatbot. The main challenge of these models is their sequential nature, which leads to less accurate results. To tackle this challenge, in this paper, a novel end-to-end architecture is proposed using conditional Wasserstein Generative Adversarial Networks and a transformer model for answer generation in Chatbots. While the generator of the proposed model consists of a full transformer model to generate an answer, the discriminator includes only the encoder part of a transformer model followed by a classifier. To the best of our knowledge, this is the first time that a generative Chatbot is proposed using an embedded transformer in both the generator and discriminator models. Relying on the parallel computing of the transformer model, the results of the proposed model on the Cornell Movie-Dialog corpus and the Chit-Chat datasets confirm its superiority over state-of-the-art alternatives on different evaluation metrics.

1. Introduction
With the increasing use of social networks, a new technology called the Chatbot has been developed as a tool for human-computer interaction. Chatbots have been widely applied in various fields such as e-commerce (Miklosik et al., 2021; Tran et al., 2021), education (Okonkwo & Ade-Ibijola, 2021), banking and insurance (Mogaji et al., 2021), data collection and management (Tsai et al., 2020), and health care (Ayanouz et al., 2020). Chatbots automatically give more engaging answers to users through easy and efficient communication. The key factor in the suitable design of Chatbots is to provide understandable answers to the user (Adamopoulou & Moussiades, 2020). To this end, various approaches have been recently developed to build Chatbots. In general, Chatbot development approaches can be classified into two categories: open and closed domain (see Fig. 1). Chatbots with the ability to answer questions in more than one domain are called open-domain. In contrast, closed-domain Chatbots can only answer questions concerning a particular domain.

Fig. 1. Classification of Chatbot approaches.

Open- and closed-domain Chatbots can be categorized into Rule-based and Artificial Intelligence (AI)-based Chatbots. Rule-based Chatbots match the user's input against base templates. This type of Chatbot chooses a predefined answer from a set of answers (Ramesh et al., 2017). However, they are inflexible and limited to some predefined answers. ELIZA and ALICE are based on this approach (Masche & Le, 2017). AI-based Chatbots are trained to have human-like conversations using a hybrid of NLP and deep learning approaches. More concretely, AI-based Chatbots are classified into three categories: Retrieval-, Learning-, and Generative-based methods. The first category, the retrieval-based method, chooses the most similar answer from a dataset using a functional scoring metric (Y. Zhu et al., 2021). The second category, the learning-based method, usually learns patterns from training data, containing questions with answers, through deep learning models to generate relevant answers. The last category, the generative-based method, is usually based on Sequence to Sequence (Seq2Seq) learning models (Dhyani & Kumar, 2021; Y. Peng et al., 2019; Wang et al., 2019; Yang et al., 2018). However, the main challenge of these models is their sequential nature, which leads to less accurate results. To tackle this challenge, in recent years, researchers have developed models based on transformers (Lin et al., 2022; Masum et al., 2021; B. Peng et al., 2022; Shao et al., 2019; Shengjie et al., 2020). Relying on the parallel computing as well as the self-attention mechanism of the transformer, the performance of transformer-based models has improved in comparison with Seq2Seq models (Vaswani et al., 2017). However, learning-based models are not able to generate varied answers. To overcome this shortcoming, generative-based models have been developed to learn the answer distribution.
Some models such as Sequence Generative Adversarial Networks (SeqGAN) (L. Yu et al., 2017) and stepwise GAN (StepGAN) (Tuan & Lee, 2019) have been developed in the third category, the generative-based models. The sequential nature of these models requires reinforcement learning techniques for completing the generator's answers. These models suffer from slow convergence due to their high variance and low processing speed. To tackle the challenge of their sequential nature as well as to improve their accuracy, in this paper, a novel architecture is proposed based on conditional Wasserstein Generative Adversarial Networks (cWGAN) using a transformer model, which processes in parallel during the training phase. This architecture enhances efficiency while generating human-like answers.
The main contributions of this paper can be listed as follows: 1) Model: We propose a novel model using the transformer model in both the generator and discriminator of a cWGAN. To the best of our knowledge, this is the first time that such a model is proposed for Chatbots. 2) Extensive experiments conducted on two challenging datasets show that our architecture outperforms state-of-the-art methods in the field.
The rest of this paper is organized as follows: Section 2 reviews related works on Chatbots. The proposed architecture is explained in Section 3. The experimental results and discussion are presented in Sections 4 and 5, respectively. Finally, the conclusion and future work are provided in Section 6.

2. Literature review
In this section, we briefly review recent works in four categories: Rule-based, Retrieval-based, Learning-based, and Generative-based.

2.1. Rule-based models


In the rule-based models, the characteristic variables of the input expression are first specified. Then, a predefined answer is provided based on the variables and rules (Adamopoulou & Moussiades, 2020). Rule-based approaches can be divided into two categories: pattern matching methods and standard task-oriented systems (Z. Peng & Ma, 2019). In pattern matching methods, Chatbots match the user's input to the pattern of the rules and select a predetermined answer from the set of answers using pattern matching algorithms. Task-oriented systems guide the user to complete certain tasks. Since the 1990s, a great deal of research has been conducted on the design of Chatbots based on similar rules for providing services in specific domains. These Chatbots are known as task-oriented modular chat systems that guide the user through structured tasks, such as restaurant and film reservations. Development of the ELIZA Chatbot began in 1966 with a pattern-based approach. This Chatbot analyzed the input sentence based on parsing rules triggered by the keywords in a sentence (Weizenbaum, 1966). PARRY added influential variables such as "fear", "anger", and "mistrust" to more complex rules, which made the conversation more human-like (Colby, 1975). ALICE uses Artificial Intelligence Markup Language (AIML), in which a category constitutes the unit of knowledge, combining a template with an optional field (Wallace, 2009). In addition, several platforms, such as Microsoft LUIS, IBM Watson Assistant, and Dialogflow, have been developed to assist users in building Chatbots. These types of Chatbots have drawbacks despite their simplicity, quick implementation, and cost-effectiveness. Inflexibility, lack of learning, inability to create new answers, and being limited to some predefined answers are the most important challenges of rule-based Chatbots.

2.2. Retrieval-based


The retrieval-based Chatbots select the best matching answer for the user's question by searching a pre-constructed conversational repository (Yan et al., 2016). Lowe et al. (Lowe et al., 2015) developed a retrieval-based Chatbot using the Term Frequency - Inverse Document Frequency (TF-IDF) method. The TF-IDF vectors of each candidate question and answer are computed by concatenating all TF-IDF scores. The candidate answers with the highest cosine similarity to the question vector are selected as the final answers. Lu et al. (Lu & Li, 2013) proposed an architecture to overcome the short-text matching problem of earlier models. Later, the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and its extensions such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) were widely used in this field (Z. Peng & Ma, 2019). For instance, Zhou et al. proposed an approach that adds an attention mechanism to the deep network to provide question and answer matching (Zhou et al., 2018). In another work, Shu et al. (Shu et al., 2022) proposed a retrieval-based model that searches for answers related to the question by combining keyword extraction modules and a two-stage transformer. Overall, while the retrieval-based approach is widely used by researchers, no new answer is generated and only the most probable answer is retrieved from the database. This can restrict the Chatbot's answers. Moreover, a database is required in the inference phase.

2.3. Learning-based


This approach is usually based on the Seq2Seq learning model. In the case of long sentences and conversations (more than 20 words), all essential information of a source sentence must be compressed into a fixed-length vector, which is challenging (Vaswani et al., 2017). To tackle this challenge, different approaches have been suggested by researchers. For instance, structural changes, such as adding a word embedding matrix (Serban et al., 2017), modifying the encoder or decoder (Z. L. Gu, 2016; W.-N. Zhang et al., 2019), and adding an attention mechanism (Dhyani & Kumar, 2021; Palasundram et al., 2021; Y. Peng et al., 2019; Yang et al., 2018) are some of the suggested solutions. In addition, some researchers have used transformer-based models, which replace the recurrent layers commonly used in encoder-decoder architectures with multi-headed self-attention. Rao et al. (Rao K Yogeswara & Rao, 2022) have developed a hybrid model for deep transfer learning-based text generation using the Elmo language model for embedding, a Variational Autoencoder (VAE), and Bi-directional Long Short-Term Memory (BiLSTM). A transformer-based answer generation model, named DIALOGPT, was also presented as a pretrained model by Zhang et al. (Y. Zhang et al., 2019). This model was publicly released to facilitate the development of more intelligent dialogue systems. Another suggested model is a Chatbot based on the Bidirectional Encoder Representations from Transformers (BERT) model, which only has an encoder (S. Yu et al., 2021). While the learning-based models have obtained promising results, they cannot generate varied answers because they lack the ability to learn the answer distribution.

2.4. Generative-based
Generative-based methods learn the answer distribution using generative models, such as SeqGAN (L. Yu et al., 2017) and StepGAN (Tuan & Lee, 2019). These models use Reinforcement Learning (RL) techniques. In SeqGAN, the RL reward signal comes from the discriminator, which judges a complete sequence, and is fed back to the intermediate state-action steps using Monte Carlo search. However, the Monte Carlo Tree Search (MCTS) used in SeqGAN has a high computational cost. In the StepGAN approach, the Generative Adversarial Network (GAN) is evaluated in a step-wise manner. In this way, the discriminator is modified to automatically assign scores and determine the suitability of each generated subsequence. Wu and Wang added a new loss function (truth guidance) to bring the generated text closer to the real data (Wu & Wang, 2020). They also designed a discriminator network based on the self-attention mechanism to obtain richer semantic features. In another work, a Transformer-based Implicit Latent GAN (TILGAN) model was proposed (Diao et al., 2021), which combines a transformer autoencoder and a GAN in the latent space with a novel design and distribution matching based on the Kullback-Leibler (KL) divergence.
To generate more relevant answers, some researchers used hybrid models combining the generative and retrieval-based methods. For example, Zhang et al. (J. Zhang et al., 2019) fed the answers retrieved by a retrieval-based model to a generative-based model as additional information for the discriminator and generator. Zhu et al. (Q. Zhu et al., 2018) also used the N-best retrieved answers as evidence for calculating the reward for the generator model. Zhang et al. (Zhang et al., 2020) employed the answers generated by retrieval-based models as additional information for training the generative-based models and placed a filter to select the best answer. Furthermore, an ensemble-based deep reinforcement learning approach was also developed for generative Chatbots by Cuayáhuitl et al. (Cuayáhuitl et al., 2019). While generative-based Chatbots are able to generate more relevant answers, these models face challenges such as slow convergence due to their high variance and low processing speed, especially for long sentences. To overcome the challenge of their sequential nature as well as to improve their accuracy, in this paper, we propose a novel architecture based on a cWGAN using a transformer model, which processes in parallel during the training phase.

3. Proposed model
A vanilla GAN model consists of a generator (G) and a discriminator (D). Considering the scope of this paper, the GAN input is a question in the form of a sequence x. The discriminator learns to maximize the score D(x, y*) and minimize the score D(x, ŷ), while the generator learns to generate an answer ŷ that maximizes D(x, ŷ), as expressed in Eq. (1):

min_G max_D  E[log D(x, y*)] + E[log(1 − D(x, ŷ))]   (1)

where (x, y*) ~ P_R(x, y) is the joint probability distribution of (x, y), and x ~ P_R(x) denotes the probability distribution of x from the training data. As GAN models may never converge and suffer from mode collapse, different variants of the GAN model have been suggested by researchers (Brownlee, 2019). One of these variants is the Wasserstein GAN
(WGAN), which is an extension of the GAN that seeks an alternate way of training the generator model to better approximate
the distribution of data observed in a given training dataset. Instead of using a discriminator to classify or predict the
probability of generated data as real or fake, WGAN changes or replaces the discriminator model with a critic that scores
the reality or fakeness of a given data. The goal is to minimize the distance between the data distribution in the training
dataset and the generated examples. This method can promote stable training while working with gradients. One of the
reasons for this convention is that there is no Sigmoid activation function to limit the values to 0 or 1 corresponding to real
or fake. Considering the superiority of the WGAN compared to the GAN model, we propose a novel architecture based on
cWGAN and transformer model for generating answers in Chatbot. As shown in Fig. 2, the proposed architecture consists
of two modules: generator and discriminator which are connected in a single network; thus, the proposed system is trained
end-to-end.
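As noted above, a WGAN critic differs from a standard discriminator essentially in its output head. The following PyTorch sketch is ours, not taken from the paper's implementation: the feature size of 64 follows the dimensionality reduction described in Section 3.1, and everything else is illustrative.

import torch
import torch.nn as nn

d_model = 64  # hypothetical; matches the 64-dim reduction in Section 3.1

# Standard GAN discriminator head: Sigmoid squashes the score into a
# probability of the input being real.
gan_head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

# WGAN critic head: no Sigmoid, so the output is an unbounded scalar
# score interpreted as the "realness" of the input.
wgan_head = nn.Linear(d_model, 1)

features = torch.randn(8, d_model)  # a batch of 8 encoded answers
prob_real = gan_head(features)      # values in (0, 1)
realness = wgan_head(features)      # unbounded scores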

Fig. 2. General architecture of the proposed approach.

3.1. Generator modules


The Generator is a full transformer model that generates fake answers in the test phase. The architecture of the Generator in the training phase is illustrated in Fig. 3. In the training phase, the encoder and decoder input embeddings need to be prepared. For the encoder input, the real questions are tokenized and embedded by the pretrained BERT model, with 12 layers, 768 features, and 12 heads. To increase the training speed, a linear model is adopted to reduce the dimensionality of the features to 64, as shown in Eq. (2):

y = f(Σ_{i=0}^{n} w_i x_i + θ)   (2)

Subsequently, the output of the linear model is concatenated with the positional encoding, preserving the word positions in the sentence. The positional encoding is a matrix that gives context based on the word position in a sentence, as in Eq. (3) and Eq. (4):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))   (3)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))   (4)

where "pos" refers to the position of the word in the sequence, "d_model" is the size of the word/token embedding, and "i" refers to each dimension of the embedding. "d_model" is fixed, while "pos" and "i" vary.
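As a concrete sketch of Eqs. (2)-(4), the snippet below (a minimal illustration, not the authors' code) projects hypothetical 768-dimensional BERT features down to 64 and builds the sinusoidal positional encoding matrix. The text above says the projection is concatenated with the positional encoding; the additive combination standard for transformers (Vaswani et al., 2017) is assumed here instead.

import math
import torch
import torch.nn as nn

def positional_encoding(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), Eq. (3)
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), Eq. (4)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

reduce_dim = nn.Linear(768, 64)          # the linear model of Eq. (2)
bert_features = torch.randn(1, 30, 768)  # hypothetical (batch, seq_len, 768)
x = reduce_dim(bert_features)            # -> (1, 30, 64)
x = x + positional_encoding(30, 64)      # additive combination assumed here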
The same process, with some variations, is repeated for the decoder. In the decoder, instead of a question, a real answer is given as the input. Furthermore, the encoder output is concatenated with the Z vector to increase the variety of the generated answers. The transformer works slightly differently during the training and inference phases. During inference, only a question is presented as the input sequence; there is no real answer as a target sequence that could be passed to the decoder. Since the decoder aims to generate an answer ŷ as close as possible to the real answer, the output is generated in a loop, feeding the output sequence from the previous timestep to the decoder at the next timestep until the end token is produced.
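The inference loop described above can be sketched as follows. This is a minimal greedy-decoding illustration under our own assumptions: the special-token ids, the maximum length of 30 (Table 1), and the model(src, tgt) -> logits call signature are all placeholders, not the paper's actual interface.

import torch

BOS, EOS, MAX_LEN = 101, 102, 30   # hypothetical special-token ids and limit

@torch.no_grad()
def generate(model, question_ids):
    # Feed each predicted token back to the decoder until the end token
    # or the length limit is reached.
    generated = [BOS]
    for _ in range(MAX_LEN):
        tgt = torch.tensor([generated])       # answer tokens so far
        logits = model(question_ids, tgt)     # assumed shape (1, len, vocab)
        next_token = int(logits[0, -1].argmax())
        generated.append(next_token)
        if next_token == EOS:
            break
    return generated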
Fig. 3. Architecture of the proposed Generator in training phase.

In the Generator, a full transformer with N = 8 identical layers and 16 heads is used. The Generator module is first trained separately by Maximum Likelihood Estimation (MLE) to increase the convergence probability. Then, the model is fine-tuned by the adversarial network to learn the answer distribution. A linear transformation and the Gumbel-SoftMax function are utilized to convert the decoder output to the predicted next-token probabilities. The Gumbel-SoftMax distribution is a continuous distribution capable of approximating samples from a categorical distribution and providing a hard output by using an argmax function. As a result, the output of the decoder is an index vector, which is given as the input of the discriminator. In the adversarial phase, the generator is updated against the obtained discriminator model. In the GAN model, this is achieved by the gradient of the objective function, which can be derived by Eq. (5):

L_G = ∇_{θ_G} (1/m) Σ_{i=1}^{m} log[D(G(Q_i + Z))]   (5)
where L_G is the loss function of the generator, m is the number of generated sequences, and Q+Z denotes the question, regarded as the condition data, concatenated with the Z vector. As discussed before, we employ the WGAN, aiming to overcome the challenges of the GAN model. In this way, the objective function of the WGAN, L_wG, can be calculated by Eq. (6):

L_wG = ∇_{θ_G} (1/m) Σ_{i=1}^{m} f(G(Q_i + Z))   (6)

where f has to be a 1-Lipschitz function.
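The hard Gumbel-SoftMax step and the WGAN generator objective of Eq. (6) can be sketched in PyTorch as below. The vocabulary size (BERT's 30522), the tensor shapes, and the critic placeholder are our assumptions for illustration; F.gumbel_softmax with hard=True gives a one-hot sample with straight-through gradients.

import torch
import torch.nn.functional as F

# Hypothetical decoder output: (batch, seq_len, vocab).
decoder_logits = torch.randn(4, 30, 30522, requires_grad=True)

# Hard Gumbel-SoftMax: a one-hot sample approximating the categorical
# next-token distribution, differentiable via the straight-through trick.
one_hot = F.gumbel_softmax(decoder_logits, tau=1.0, hard=True)
fake_ids = one_hot.argmax(dim=-1)  # index vector fed to the discriminator

# WGAN generator objective per Eq. (6): raise the critic score f on fake
# answers, i.e. minimize its negative. `critic` is a placeholder for the
# encoder+classifier model described in Section 3.2:
# loss_wG = -critic(question_ids, fake_ids).mean()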
3.2. Discriminator modules
The Discriminator includes only the encoder part of a transformer followed by a classifier. It provides the probability of real or fake answers at time step t and is trained for k epochs. In the GAN model, the discriminator is updated by the gradient of the objective function, as expressed by Eq. (7):

L_D = ∇_{θ_D} (1/m) Σ_{i=1}^{m} (−log[D(A_i, Q_i)] − log[1 − D(G(Q_i))])   (7)

where L_D is the loss function of the discriminator, A is the real answer, and Q represents the question that is considered as the condition data. In the WGAN model, the discriminator is considered as a critic model; its loss L_C can be calculated by Eq. (8):

L_C = ∇_{θ_C} (1/m) Σ_{i=1}^{m} [f(q_i) − f(G(q_i))]   (8)

where f has to be a 1-Lipschitz function.
Fig. 4. Architecture of the Discriminator in the proposed model.
As shown in Fig. 4, in each iteration step, the discriminator is trained once with fake pair data and once with real pair data. Real question and answer pairs are tokenized and indexed by the pretrained BERT model. BERT cannot be used for embedding here due to the computational complexity it would add to the computation graph. Therefore, the index matrix of the fake answer is concatenated with the index matrix of the question to form fake pairs. Afterward, a linear model is adopted to reduce the dimensionality of the input matrix and produce the sentence embedding. Then, the output of the linear model is concatenated with the positional encoding matrix to serve as the input matrix for the first layer of the encoder. The output of the last layer of the encoder is fed to a classifier, which is a linear network. This network has two score outputs, indicating the reality or fakeness of a given answer. In the inference phase, only the generator module is kept, while the discriminator is removed.
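A rough PyTorch sketch of this discriminator follows. The layer and head counts come from Table 1; the embedding layer standing in for the linear reduction, the mean pooling, the vocabulary size, and the omission of positional encoding are our simplifying assumptions, not the authors' exact design.

import torch
import torch.nn as nn

class Critic(nn.Module):
    # Encoder-only transformer over a (question, answer) index pair,
    # followed by a linear classifier with two score outputs.
    def __init__(self, vocab_size=30522, d_model=64, n_heads=16, n_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # stands in for the
        # linear reduction over index matrices (positional encoding omitted)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, 2)         # real/fake scores

    def forward(self, question_ids, answer_ids):
        pair = torch.cat([question_ids, answer_ids], dim=1)  # real or fake pair
        hidden = self.encoder(self.embed(pair))
        return self.classifier(hidden.mean(dim=1))      # pooled pair scores

critic = Critic()
q = torch.randint(0, 30522, (2, 30))   # hypothetical question index matrix
a = torch.randint(0, 30522, (2, 30))   # hypothetical answer index matrix
print(critic(q, a).shape)              # torch.Size([2, 2])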

4. RESULTS
To demonstrate the performance of the generative Chatbot model, this section presents the details of the datasets and the results in comparison with other methods. First, the implementation details are explained. Then, the two datasets used for the evaluation are briefly introduced, followed by the evaluation metrics. Finally, the results of the proposed architecture are compared with the state-of-the-art models.

4.1. Implementation details


Evaluations were carried out in Python on a machine with an Intel Core i5-12600K CPU, 128 GB of RAM, an NVIDIA GeForce RTX 3090 GPU, and the Microsoft Windows 11 operating system. The PyTorch library was used to implement the model. The implementation parameters are listed in Table 1.
Table 1. Details of the parameters used in the proposed architecture.
Parameter                           Value
Learning rate                       0.00005
Batch size                          64
Epoch numbers                       400
Processing                          GPU
Dropout                             0.5
Number of layers                    8
Dataset split ratio for test data   20%
Sentence max length                 30
Number of heads                     16
BERT feature size                   768

4.2. Datasets
Two datasets, the Cornell Movie Dialogs Corpus and Chit-Chat, are employed to evaluate the proposed model. The Cornell dataset contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts. It includes 220,579 conversational exchanges between 10,292 pairs of movie characters from 617 movies, with a total of 304,713 utterances. The Chit-Chat dataset encompasses 7,168 conversations comprising 258,145 utterances from 1,315 unique participants. This dataset comes from the Chit-Chat Challenge of the BYU Perception, Control, and Cognition Laboratory.
4.3. Evaluation metrics
The proposed model is evaluated using the BiLingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L), and F-measure metrics. BLEU was originally developed for machine translation evaluation. In this measure, the degree of overlap between the generated sentence and the ground truth sentence is computed based on n-grams. Unlike BLEU, which focuses on precision, ROUGE-L concentrates on recall when calculating the similarity of the generated sentence and the ground truth. The F-measure is a valuable metric for evaluating the performance of classification algorithms and can be defined as a compromise between recall and precision.
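As a worked illustration of these metrics (ours, not from the paper's evaluation scripts), the snippet below scores one candidate sentence against one reference: BLEU-4 via NLTK's sentence_bleu with smoothing, and a recall-oriented ROUGE-L computed directly from the longest common subsequence.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i am great thanks for asking me".split()
candidate = "i am great thanks for asking".split()

# BLEU-4: geometric mean of 1- to 4-gram precisions (smoothed for short text).
bleu4 = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)

def rouge_l_recall(ref, cand):
    # ROUGE-L: longest common subsequence length over the reference length.
    m, n = len(ref), len(cand)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            lcs[i + 1][j + 1] = (lcs[i][j] + 1 if ref[i] == cand[j]
                                 else max(lcs[i][j + 1], lcs[i + 1][j]))
    return lcs[m][n] / m

print(f"BLEU-4: {bleu4:.3f}  ROUGE-L: {rouge_l_recall(reference, candidate):.3f}")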

4.4. Experimental results


Here, the effectiveness of the proposed architecture is presented on the Cornell Movie Dialogs Corpus and Chit-Chat datasets using the BLEU4, ROUGE-L, and F-measure metrics (see Table 2). As the results in this table confirm, the proposed model performs better on the Chit-Chat dataset. This is because the Chit-Chat dataset is based on chat and conversation, while the Cornell dataset relies on movie dialogues.
Table 2. Performance of the proposed architecture in terms of different metrics.
Dataset BLEU4 ROUGE-L F-measure
Cornell 71.06 81.83 62.2
Chit-Chat 96.3 96.5 98.91

Several state-of-the-art approaches, including S2S+Attention (Shu et al., 2022), BERT (Shu et al., 2022), Retrieval-Augmented Generation (RAG) (Patrick Lewis, 2020), Open Domain Response Generation (OPRG) (Shu et al., 2022), BILSTM, and Deep Transfer learning-based Text Generation (DTGEN) (Rao K Yogeswara & Rao, 2022), are compared with the proposed model using the ROUGE-L metric on the Cornell dataset (Table 3). According to Table 3, the proposed model outperforms all of these models due to the richer features extracted by the transformer as well as the data distribution learning of the GAN.
Table 3. The comparison results of the proposed model with different methods on the Cornell dataset using the ROUGE-L metric.
Method          Approach                        ROUGE-L   Year
S2S+Attention   Seq2Seq                         0.337     -
BERT            BERT + LSTM                     0.399     -
RAG             Retrieval model + BERT          0.427     2020
OPRG            Retrieval model + Transformer   0.452     2022
BILSTM          Seq2Seq                         0.74      -
DTGEN           Transformer + Seq2Seq           0.815     2022
Proposed        GAN + Transformer               0.818     -

Table 4 shows the comparison results of the proposed model with MLE (Tuan & Lee, 2019), SeqGAN (L. Yu et al., 2017), and StepGAN (Tuan & Lee, 2019) on the Chit-Chat dataset using the BLEU-4 metric. The approaches used for comparison employ a GAN model for generating sequence data such as text. According to Table 4, the proposed model outperforms all approaches on the BLEU-4 metric in the Chit-Chat dataset due to using an extended hybrid of the transformer model and cWGAN methods. Compared to the state-of-the-art methods, the proposed model takes advantage of data distribution learning in answer generation to improve all evaluation metrics. According to the experimental results, the proposed model can generate more accurate and semantically relevant answers for Chatbot dialogue.
Table 4. The comparison results of the proposed model with different methods on the Chit-Chat dataset using the BLEU-4 metric.
Method     Approach            BLEU-4   Year
MLE        Seq2Seq             0.25     -
SeqGAN     GAN + MC            0.26     2017
StepGAN    GAN + QN            0.23     2019
Proposed   GAN + Transformer   0.96     -
To investigate the answers generated by the proposed Chatbot, various questions from different domains were asked of the proposed Chatbot. Table 5 shows the results of the Chatbot trained on the Cornell dataset in two areas: greeting and general questions, and general conversation. In addition, Table 6 shows the results of the proposed Chatbot trained on the Chit-Chat dataset in the same two areas.

Table 5. Some examples of the generated answers using the proposed Chatbot on the Cornell dataset.

Greeting and General question
User input                                                    | Real answer                                  | Chatbot answer
hello Mr. parker how are you?                                 | hello Jo thanks you                          | hello Ju thank you
I heard that was you good?                                    | well, it was nice seeing you                 | well, it was good seeing you
how much your goanna takes?                                   | I do not know how much do you want           | I do not know how much do you want
hey                                                           | hey what are you doing right now             | hi what are you doing right now
how many girlfriends did you have?                            | I do not know exactly                        | I do not know truly
feeling better?                                               | I just cannot believe it                     | I actually not believe it

General Conversation
User input                                                    | Real answer                                  | Chatbot answer
is this for like real?                                        | unfortunately, yes                           | respect yes
how do you know that?                                         | she told me, he was in jail                  | he told me, maybe they have in jail
I tried talking to her at the mine it did not work            | then try again                               | then go again
I think you are lying that has that is what kind of problem   | what are you saying                          | what are you saying
oh, I put a few games on for your daughter I hope you do not mind | of course, not                           | sure, course not
I am not playing a game now                                   | yes, you are playing word games              | yes, you are playing cook games
there is no other way?                                        | but there is another entrance to this place  | there is another answer to this place
I have changed my mind                                        | you cannot change your mind fully            | you should not change your mind fully
hey ben there were a couple of guys looking for you           | what did they look like                      | what did they look like

Table 6. Some examples of the generated answers using the proposed Chatbot on the Chit-Chat dataset.

Greeting and General question
User input                                 | Real answer                                                                   | Chatbot answer
can you hear me?                           | Hello there                                                                   | Hello there
how are you today?                         | I am great thanks for asking me                                               | I am great thanks for asking
can you talk with me?                      | sure, ask me a question                                                       | sure, ask me a question
what age are you?                          | I am a bot so I do not have an age                                            | I am a bot so I do not have an age
I am leaving now                           | goodbye                                                                       | goodbye

General Conversation
User input                                 | Real answer                                                                   | Chatbot answer
what kind of thing can you respond to?     | I am here to help answer your questions                                       | I am here to help answer your questions
how many sisters do you have?              | I do not have a family the same way humans do                                 | I do not have a family the same way humans do
do you have a gender identity?             | since I am digital, I do not actually have a gender                           | since I am digital I do not actually have a gender
do you think you are the most intelligent  | we think in very different ways but it has it is safe to say you are smarter  | we think in very different ways but it has it is safe to say you are smarter
5. DISCUSSION
The main challenge of most recent Chatbot approaches is their sequential nature, which leads to less accurate results. The proposed model resolves this challenge by using a hybrid of the cWGAN and transformer model. According to the experimental results, our model can generate more accurate and semantically relevant answers for Chatbot dialogue. While the proposed model benefits from answer distribution learning, the generated answers have little diversity. This is due to the nature of the datasets used in the training phase: both datasets lack multiple answers for each question. Here, we discuss the proposed model from three perspectives as follows:
• Analysis of generator pretraining: Before the adversarial training of the proposed model, the generator model is separately trained as a transformer model. Pretraining of the generator model has a direct impact on the overall performance of the proposed model, so it is important to determine how many epochs to use in the pretraining phase. In the best configuration, 200 epochs are used for training the transformer as a pretrained model of the generator. After that, the proposed model is trained with adversarial learning for 400 epochs. Based on Table 7, combining the generator pretraining with the adversarial learning improves the efficiency of the Chatbot in all metrics for the Cornell dataset.
Table 7. Performance of the proposed model with different configurations on Cornell dataset.
Type of model BLEU4 ROUGE-L F-measure
Only generator pretraining 67.72 78.9 53.16
Only Adversarial 68.99 80.12 57.36
Combine generator pretraining with Adversarial 71.06 81.83 62.02

• Analysis of loss function: The loss function plays a key role in the performance of deep models. Here, the loss function of generator pretraining is first described. Then, the loss functions of the generator and discriminator in the adversarial phase are analyzed. In generator pretraining, MLE is used as the loss function for the transformer model. As shown in Fig. 5 and Fig. 6, the training and validation losses of the generator pretraining model converged by epoch 200 for both the Cornell and Chit-Chat datasets.

Fig. 5. Generator pretraining loss in Cornell dataset


Fig. 6. Generator pretraining loss in Chit-Chat dataset.

In the adversarial training phase, the WGAN is used instead of the GAN model. The WGAN uses a new cost function based on the Wasserstein distance, which has smoother gradients and trains regardless of the design of the generator. The Wasserstein distance is calculated by Eq. (9):

W(P_real, P_fake) = sup_f ( E_{x~P_real}[f(x)] − E_{x~P_fake}[f(x)] )   (9)

where sup is the least upper bound, f is a Lipschitz function, and x denotes a real or fake answer. To calculate the Wasserstein distance, we just need to find such a Lipschitz function, and we can build a deep network to learn it. This network is similar to the discriminator, but it has no Sigmoid function and outputs a scalar score rather than a probability. This score can be interpreted as the realness of the input data, so the network is considered a critic. The Wasserstein loss functions can be summarized for the generator and the critic (discriminator) in the proposed model in Eq. (10) and Eq. (11), respectively:

G_Loss = −[average critic score on fake answers]   (10)

C_Loss = [average critic score on real answers] − [average critic score on fake answers]   (11)
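The alternating update implied by Eqs. (10)-(11) can be sketched as follows. This is our illustration, not the authors' training code: the generator and critic are placeholders, the sign convention puts both objectives in minimization form, and weight clipping (the original WGAN recipe) is assumed here as the way to keep the critic approximately 1-Lipschitz, since the paper only states the Lipschitz requirement.

import torch

def wgan_step(generator, critic, opt_g, opt_c, question, real_answer, clip=0.01):
    # Critic step, Eq. (11): widen the score gap between real and fake
    # answers; minimizing the negated gap maximizes it.
    fake_answer = generator(question).detach()
    loss_c = -(critic(question, real_answer).mean()
               - critic(question, fake_answer).mean())
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()
    # Weight clipping (assumed) keeps the critic approximately 1-Lipschitz.
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-clip, clip)

    # Generator step, Eq. (10): raise the average critic score on fakes.
    loss_g = -critic(question, generator(question)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_c.item(), loss_g.item()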

According to Fig. 7, in the Cornell dataset, the loss function of the generator is initially low due to the generator pretraining. After a few epochs, the loss increases and then smoothly decreases to below its initial value. This reflects the learning of the data distribution during adversarial training. Moreover, the discriminator has a low loss at the beginning. As the generator learns the data distribution and generates better fake data, the value of the discriminator loss smoothly increases. Eventually, both the generator and discriminator losses converge.
Fig. 7. Generator (LossG) and discriminator (LossD) loss in Cornell dataset.

According to Fig. 8, in the Chit-Chat dataset, the generator loss is initialized with a positive value close to zero, due to the generator pretraining. After several epochs, the loss increases and then slowly decreases to near-zero values. Furthermore, the discriminator initially has a negative loss, which gradually increases to near-zero values as it learns the data distribution. Finally, both the generator and discriminator losses converge to zero.

Fig. 8. Generator (LossG) and discriminator (LossD) loss in Chit-Chat dataset.


6. CONCLUSION AND FUTURE TREND
In this paper, we proposed a novel end-to-end model combining the cWGAN and the transformer model. The proposed model consists of two networks: a Generator and a Discriminator. The Generator is a full transformer model, and the Discriminator includes only the encoder part of a transformer model followed by a classifier. We evaluated the proposed model on two datasets using different evaluation metrics. The results confirmed the superiority of the proposed model compared to state-of-the-art approaches according to the BLEU4, ROUGE-L, and F-measure metrics. Relying on the WGAN capabilities as well as the transformer model, the proposed model generates accurate, semantically relevant, and human-like answers. Future work can add reinforcement learning to increase the semantic relation between questions and answers in various domains.

DECLARATIONS
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data Availability
The datasets analyzed during the current study are available on these web pages:
Cornell_Movie-Dialogs_Corpus: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Chit-Chat dataset: https://pypi.org/project/chitchat-dataset/

Declaration of competing interest


The authors certify that they have no conflict of interest.

REFERENCES

Adamopoulou, E., & Moussiades, L. (2020). Chatbots: History, technology, and applications. Machine Learning with
Applications, 2, 100006. doi:https://doi.org/10.1016/j.mlwa.2020.100006
Ayanouz, S., Abdelhakim, B. A., & Benhmed, M. (2020). A Smart Chatbot Architecture Based NLP and Machine Learning
for Health Care Assistance. Niss2020. doi: https://doi.org/10.1145/3386723.3387897
Brownlee, J. (2019). Generative Adversarial Networks with Python: Deep Learning Generative Models for Image Synthesis
and Image Translation: Machine Learning Mastery.
Colby, K. M. (1975). Artificial Paranoia: A Computer Simulation of Paranoid Processes. In. New York Elsevier Science
Inc.
Cuayáhuitl, H., Lee, D., Ryu, S., Cho, Y., Choi, S., Indurthi, S., Kim, J. (2019). Ensemble-Based Deep Reinforcement
Learning for Chatbots. Neurocomputing, 366. doi:https://doi.org/10.1016/j.neucom.2019.08.007
Dhyani, M., & Kumar, R. (2021). An intelligent Chatbot using deep learning with Bidirectional RNN and attention model.
Materials Today: Proceedings, 34, 817-824. doi:https://doi.org/10.1016/j.matpr.2020.05.450
Diao, S., Shen, X., Shum, K., Song, Y., & Zhang, T. (2021). TILGAN: Transformer-based Implicit Latent GAN for Diverse
and Coherent Text Generation. Paper presented at the Findings, Association for Computational Linguistics (ACL).
Lin, T.-H., Huang, Y.-H., & Putranto, A. (2022). Intelligent question and answer system for building information modeling
and artificial intelligence of things based on the bidirectional encoder representations from transformers model.
Automation in Construction, 142, 104483. doi:https://doi.org/10.1016/j.autcon.2022.104483
Lowe, R., Pow, N., Serban, I., & Pineau, J. (2015). The Ubuntu Dialogue Corpus: A Large Dataset for Research in
Unstructured Multi-Turn Dialogue Systems. Paper presented at the SIGDIAL Conference.
Lu, Z., & Li, H. (2013). A Deep Architecture for Matching Short Texts. Paper presented at the NIPS.
Masche, J., & Le, N.-T. (2017). A Review of Technologies for Conversational Systems. Paper presented at the International
Conference on Computer Science, Applied Mathematics and Applications.
Masum, A. K. M., Abujar, S., Akter, S., Ria, N. J., & Hossain, S. A. (2021). Transformer Based Bengali Chatbot Using
General Knowledge Dataset. Paper presented at the 2021 20th IEEE International Conference on Machine Learning
and Applications (ICMLA).
Miklosik, A., Evans, N., & Qureshi, A. M. A. (2021). The Use of Chatbots in Digital Business Transformation: A Systematic
Literature Review. IEEE Access, 9, 106530-106539. doi:https://doi.org/10.1109/ACCESS.2021.3100885
Mogaji, E., Balakrishnan, J., Nwoba, A. C., & Nguyen, N. P. (2021). Emerging-market consumers’ interactions with banking
Chatbots. Telematics and Informatics, 65, 101711. doi:https://doi.org/10.1016/j.tele.2021.101711
Okonkwo, C. W., & Ade-Ibijola, A. (2021). Chatbots applications in education: A systematic review. Computers and
Education: Artificial Intelligence, 2, 100033. doi:https://doi.org/10.1016/j.caeai.2021.100033
Palasundram, K., Sharef, N., Kasmiran, K., & Azman, A. (2021). SEQ2SEQ++: A Multitasking-Based Seq2seq Model to
Generate Meaningful and Relevant Answers. IEEE Access, PP, 1-1. doi:10.1109/ACCESS.2021.3133495
Patrick Lewis, et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural
Information Processing Systems, 33, 9459–9474.
Peng, B., Galley, M., He, P., Brockett, C., Liden, L., Nouri, E., Gao, J. (2022). GODEL: Large-Scale Pre-Training for Goal-
Directed Dialog.
Peng, Y., Fang, Y., Xie, Z., & Zhou, G. (2019). Topic-enhanced emotional conversation generation with attention
mechanism. Knowledge-Based Systems, 163, 429-437. doi:https://doi.org/10.1016/j.knosys.2018.09.006
Peng, Z., & Ma, X. (2019). A survey on construction and enhancement methods in service Chatbots design. CCF
Transactions on Pervasive Computing and Interaction, 1(3), 204-223. doi:https://doi.org/10.1007/s42486-019-
00012-3
Ramesh, K., Ravishankaran, S., Joshi, A., & Chandrasekaran, K. (2017). A Survey of Design Techniques for Conversational
Agents. Paper presented at the In Information, communication and computing technology, Singapore: Springer.
Rao K Yogeswara, & Rao, K. S. (2022). Modeling text generation with contextual feature representation and dimension
using deep transfer learning and BI-LSTM. Journal of Theoretical Applied Information Technology, 100(9).
Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., & Bengio, Y. (2017). A Hierarchical Latent
Variable Encoder-Decoder Model for Generating Dialogues. Aaai'17, 3295–3301.
Shao, T., Guo, Y., Chen, H., & Hao, Z. (2019). Transformer-Based Neural Network for Answer Selection in Question
Answering. IEEE Access, 7, 26146-26156. doi:https://doi.org/10.1109/ACCESS.2019.2900753
Shengjie, S., Liu, J., & Yang, Y. (2020). Multi-Layer Transformer Aggregation Encoder for Answer Generation. IEEE
Access, PP, 1-1. doi:10.1109/ACCESS.2020.2993875
Shu, C., Zhang, Z., Chen, Y., Xiao, J., Lau, J. H., Zhang, Q., & Lu, Z. (2022). Open Domain Response Generation Guided
by Retrieved Conversations. IEEE Access, 1-1. doi:https://doi.org/10.1109/ACCESS.2022.3225647
Tran, A. D., Pallant, J. I., & Johnson, L. W. (2021). Exploring the impact of Chatbots on consumer sentiment and
expectations in retail. Journal of Retailing and Consumer Services, 63, 102718.
doi:https://doi.org/10.1016/j.jretconser.2021.102718
Tsai, M.-H., Yang, C.-H., Chen, J., & Kang, S.-C. (2020). Four-Stage Framework for Implementing a Chatbot System in
Disaster Emergency Operation Data Management: A Flood Disaster Management Case Study. KSCE Journal of
Civil Engineering, 25. doi:https://doi.org/10.1007/s12205-020-2044-4
Tuan, Y.-L., & Lee, H.-y. (2019). Improving Conditional Sequence Generative Adversarial Networks by Stepwise
Evaluation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP, 1-1.
doi:https://doi.org/10.1109/TASLP.2019.2896437
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Polosukhin, I. (2017). Attention is All You
Need. Nips'17, 6000–6010.
Wallace, R. S. (2009). The Anatomy of A.L.I.C.E. In R. Epstein, G. Roberts, & G. Beber (Eds.), Parsing the Turing Test:
Philosophical and Methodological Issues in the Quest for the Thinking Computer (pp. 181-210). Dordrecht:
Springer Netherlands.
Wang, Y., Rong, W., Ouyang, Y., & Xiong, Z. (2019). Augmenting Dialogue Response Generation With Unstructured
Textual Knowledge. IEEE Access, 7, 1-1. doi:https://doi.org/10.1109/ACCESS.2019.2904603
Weizenbaum, J. (1966). ELIZA—a computer program for the study of natural language communication between man and
machine. Communications of the ACM, 9(1), 36–45. doi:https://doi.org/10.1145/365153.365168
Wu, Y., & Wang, J. (2020). Text Generation Service Model Based on Truth-Guided SeqGAN. IEEE Access, PP, 1-1.
doi:https://doi.org/10.1109/ACCESS.2020.2966291
Yan, R., Song, Y., & Wu, H. (2016). Learning to Respond with Deep Neural Networks for Retrieval-Based Human-
Computer Conversation System. Sigir '16, 55–64. doi:https://doi.org/10.1145/2911451.2911542
Yang, M., Tu, W., Qu, Q., Zhao, Z., Chen, X., & Zhu, J. (2018). Personalized response generation by Dual-learning based
domain adaptation. Neural Networks, 103, 72-82. doi:https://doi.org/10.1016/j.neunet.2018.03.009
Yu, L., Zhang, W., Wang, J., & Yu, Y. (2017). Seqgan: Sequence generative adversarial nets with policy gradient. Paper
presented at the Proceedings of the AAAI conference on artificial intelligence.
Yu, S., Chen, Y., & Zaidi, H. (2021). AVA: A Financial Service Chatbot Based on Deep Bidirectional Transformers.
Frontiers in Applied Mathematics and Statistics, 7, 604842. doi:https://doi.org/10.3389/fams.2021.604842
Z. L. Gu, H. L., and V. O. K. Li. (2016). Incorporating Copying Mechanism in Sequence-to-Sequence Learning. Paper
presented at the in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics.
Zhang, J., Tao, C., Xu, Z., Xie, Q., Chen, W., & Yan, R. (2019). EnsembleGAN: Adversarial Learning for Retrieval-
Generation Ensemble Model on Short-Text Conversation. Paper presented at the Sigir'19.
Zhang, L., Yang, Y., Zhou, J., Chen, C., & He, L. (2020). Retrieval-Polished Response Generation for Chatbot. IEEE Access,
PP, 1-1. doi:https://doi.org/10.1109/ACCESS.2020.3004152
Zhang, W.-N., Zhu, Q., Wang, Y., Zhao, Y., & Liu, T. (2019). Neural Personalized Response Generation as Domain
Adaptation. World Wide Web, 22(4), 1427–1446. doi:https://doi.org/10.1007/s11280-018-0598-6
Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett, C., Gao, X., Dolan, W. B. (2019). DIALOGPT : Large-Scale
Generative Pre-training for Conversational Response Generation. Paper presented at the Annual Meeting of the
Association for Computational Linguistics.
Zhou, X., Li, L., Dong, D., Liu, Y., Chen, Y., Zhao, W., Wu, H. (2018). Multi-Turn Response Selection for Chatbots with
Deep Attention Matching Network. Paper presented at the Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics.
Zhu, Q., Cui, L., Zhang, W., Wei, F., Chen, Y., & Liu, T. (2018). Retrieval-Enhanced Adversarial Training for Neural
Response Generation. Paper presented at the Annual Meeting of the Association for Computational Linguistics.
Zhu, Y., Nie, J.-Y., Zhou, K., Du, P., Jiang, H., & Dou, Z. (2021). Proactive Retrieval-Based Chatbots Based on Relevant
Knowledge and Goals. Sigir '21, 2000–2004. doi:https://doi.org/10.1145/3404835.3463011
