1 Introduction
If we have enough data, we can apply supervised learning algorithms that automati-
cally create FAQ datasets, as in [7]. After collecting many questions through chatbots
equipped with small datasets, we can apply such supervised learning algorithms to im-
prove the scalability of dataset creation.
The contributions of this study are summarized as follows:
1. To the best of our knowledge, we are the first to create a chatbot that enhances an
e-learning system used in practice at a Japanese university.
2. We propose a novel framework for creating FAQ datasets.
3. We evaluated a chatbot trained on a dataset created with the framework and obtained
a macro-average F1-score of over 81%.
2 Related Works
Many researchers have analyzed Q&A data. Most of these studies seek to improve
user experiences [4, 6] or classification results. As the objective of this study is to
support dataset creation for improving the accuracy of a chatbot, which is essentially
a multi-class classifier, we focus on comparing this study with studies that aim to
improve classification results.
Finding similar questions in order to exploit FAQ data is a popular way to improve the
accuracy of a Q&A classifier. One of the most popular approaches is to train language
or translation models by probability-based estimation or neural networks [2, 3, 5]. This
kind of approach is powerful; however, it assumes that a large amount of data is avail-
able for training the models. In contrast, we assume that only a small amount of Q&A
data is available, and it is therefore difficult to employ methods that estimate such
models. To support creating an FAQ dataset from a small dataset, we design our
framework as unsupervised learning using lexical analysis and an entropy-based method.
Supporting dataset creation is another line of work related to this study. Behúň et al.
propose an automatic annotation tool for collecting ground truth for a purely visual
dataset captured by Kinect [1]. Rodofile et al. design a modular dataset generation
framework for cyber-attacks [7]. These studies assume that large datasets are available
or easy to create; their targets thus differ from ours.
3 Data Collection
We first collected raw data from the logs of users of the e-learning system introduced
at Tokyo Metropolitan University, recording the questions they asked and the answers
provided by the system engineers who managed the e-learning system in practice. We
collected the data from April 1, 2015 to July 31, 2018. The dataset includes 200 Q&A
pairs in total.
4 Categorization
In this section, we introduce our categorization scheme for the collected raw Q&A
data, based on the features of the e-learning system. The objective is to organize the
answers; this is useful for analyzing which features users often have difficulties with
and for understanding which features we should focus on when preparing FAQ data.
From the collected data and a manual investigation thereof, we propose 11 categories,
as shown in Tab. 1.
New Question. Finally, the framework outputs a list of the top-k important words as a
ranking. The ranking function simply sorts the results of Eq. 2. The top-ranked words
may help us create new questions by combining or paraphrasing them.
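A minimal sketch of this output step is given below. The score function is only a
stand-in for Eq. 2 (not reproduced here), with plain corpus frequency as a placeholder
default; the example questions are hypothetical.

from collections import Counter

def top_k_words(questions, k=10, score_fn=None):
    """Return the top-k candidate words as a ranked list.

    score_fn stands in for the word-importance score of Eq. 2;
    corpus frequency is used only as a placeholder default.
    """
    counts = Counter(w for q in questions for w in q.lower().split())
    score_fn = score_fn or (lambda w: counts[w])
    return sorted(counts, key=score_fn, reverse=True)[:k]

# Hypothetical questions; the top-ranked words hint at phrasings for new FAQ entries.
qs = ["how do I upload a report file", "where can I download the lecture slides"]
print(top_k_words(qs, k=5))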
Answer Combining. Eq. 4 yields pairs of answers whose MI scores exceed a given
threshold. Showing pairs is sufficient because we can apply our framework incremen-
tally; in other words, even when three or more answers should be combined, we can
simply apply Eq. 4 to the dataset two or more times. In this simple way, we can combine
two or more similar answers into a single answer.
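The sketch below illustrates this incremental merging. The pairwise scoring function
is an assumed stand-in for the MI score of Eq. 4, and the toy word-overlap score and
answer texts are for illustration only.

from itertools import combinations

def merge_similar_answers(answers, score_fn, threshold):
    """Iteratively combine answer pairs whose pairwise score exceeds `threshold`.

    `answers` maps answer ids to texts; `score_fn` is a stand-in for the
    MI score of Eq. 4 (not reproduced here).
    """
    merged = dict(answers)
    changed = True
    while changed:                                  # repeated application, as described above
        changed = False
        for a, b in combinations(list(merged), 2):
            if score_fn(merged[a], merged[b]) > threshold:
                merged[f"{a}+{b}"] = merged.pop(a) + " " + merged.pop(b)
                changed = True
                break                               # rescan after every merge
    return merged

# Toy word-overlap score used only for illustration.
def overlap(x, y):
    wx, wy = set(x.split()), set(y.split())
    return len(wx & wy) / max(len(wx | wy), 1)

print(merge_similar_answers(
    {"A1": "reset your password here",
     "A2": "you can reset your password here",
     "A3": "upload the report file"},
    overlap, 0.5))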
6 Experimental Evaluation
6.1 Setup
Classification Algorithm. We used IBM Watson to implement the chatbot program.
Data Collection for Evaluation. Tab. 2 shows the statistics of the dataset used for
this evaluation. We used 79 answers and 44 questions to measure the accuracy of the
chatbot. Note that the 44 questions were not used to train the chatbot. Tab. 3 details how
many answers and questions were prepared for each category.
Comparisons. In this paper, we used only the classification algorithm (Watson),
as our framework is designed for dataset creation. To evaluate the effectiveness of the
framework, we used the following two datasets.
Table 3. Number of answers and questions prepared for each category in the test data.

                    C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  C11  Total
Num. of answers      2   3   2   2   2   4   2   1   2    1    2     23
Num. of questions    4   6   3   4   4   9   4   1   4    1    4     44
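As label-based measurements, we use the macro-average precision (maP), recall (maR),
and F1-score (maF). In their standard form (see, e.g., [8]), they are computed as

\mathrm{maP} = \frac{1}{|L|} \sum_{l \in L} \frac{TP_l}{TP_l + FP_l}, \qquad
\mathrm{maR} = \frac{1}{|L|} \sum_{l \in L} \frac{TP_l}{TP_l + FN_l}, \qquad
\mathrm{maF} = \frac{1}{|L|} \sum_{l \in L} \frac{2\, TP_l}{2\, TP_l + FP_l + FN_l}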
where TP, FP, and FN denote true positives, false positives, and false negatives, respec-
tively, and L is the set of labels defined in Tab. 1. Note here that the precision is defined
as the proportion of predicted labels that are truly relevant, while the recall is defined as
the proportion of truly relevant labels that are included in the predictions. The trade-off
between precision and recall is formalized by their harmonic mean, the F1-score. For
the label-based measurements, the higher these scores are, the better the performance
of the model is.
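For instance, when each test question is assigned exactly one answer label, these
label-based scores can be computed with scikit-learn as in the following sketch; the
label ids and arrays are illustrative, not taken from our data.

from sklearn.metrics import precision_recall_fscore_support

# Illustrative ground-truth and predicted answer labels for six test questions.
y_true = ["A2", "A8", "A9", "A11", "A12", "A14"]
y_pred = ["A2", "A8", "A11", "A11", "A12", "A20"]

maP, maR, maF, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"maP={maP:.2f}, maR={maR:.2f}, maF={maF:.2f}")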
Regarding the example-based loss functions, hamming loss (HL), ranking loss (RL),
and log loss (LL) are popular measurements. HL calculates the fraction of wrong labels
over the total number of labels. RL is the proportion of label pairs that are not correctly
ordered. Finally, LL calculates scores from probabilistic confidences; this metric can be
seen as the cross-entropy between the distribution of the true labels and the predictions.
Their formal definitions are given as follows:
HL = \frac{1}{NL} \sum_{i=1}^{N} \sum_{l=1}^{L} [[\, y_{i,l} \neq \hat{y}_{i,l} \,]]    (8)

RL = \frac{1}{N} \sum_{i=1}^{N} \sum_{y_j > y_k} \Big( [[\, \hat{y}_j < \hat{y}_k \,]] + \frac{1}{2} [[\, \hat{y}_j = \hat{y}_k \,]] \Big)    (9)

LL = - \sum_{i=1}^{L} y_i \log(p_i)    (10)
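These losses can be computed, for example, with scikit-learn's implementations; the
following sketch uses toy arrays (not our data) simply to illustrate the calls.

import numpy as np
from sklearn.metrics import hamming_loss, label_ranking_loss, log_loss

# Toy multi-label indicators for N=3 examples and L=4 labels (illustration only).
y_true  = np.array([[1, 0, 0, 1],
                    [0, 1, 0, 0],
                    [0, 0, 1, 0]])
y_pred  = np.array([[1, 0, 1, 1],
                    [0, 1, 0, 0],
                    [1, 0, 0, 0]])          # hard 0/1 predictions for HL
y_score = np.array([[0.8, 0.1, 0.6, 0.7],
                    [0.2, 0.9, 0.1, 0.3],
                    [0.7, 0.1, 0.4, 0.2]])  # probabilistic confidences for RL and LL

print("HL:", hamming_loss(y_true, y_pred))               # fraction of wrong labels
print("RL:", label_ranking_loss(y_true, y_score))        # mis-ordered label pairs
print("LL:", log_loss(y_true.ravel(), y_score.ravel()))  # cross-entropy, averaged by sklearn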
Table 4. Scores of the baseline and our approach. The abbreviated names of the measurements
are: macro-average precision (maP), macro-average recall (maR), macro-average F1-score
(maF), hamming loss (HL), ranking loss (RL), and log loss (LL).
Table 5. Number of answers and questions in the baseline and proposed datasets for each category.

                                        C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  C11  Total
Baseline           Num. of answers       9   9  15   3   3  11   5   1   8    3    3     70
                   Num. of questions    13  18  23   3  12  41  12   5  21    3    4    155
Proposed dataset   Num. of answers       9  13  17   3   3  11   5   2   8    3    5     79
                   Num. of questions    50  68  89  15  19  66  25  11  44   15   25    427
For the example-based loss functions, the smaller these scores are, the better the perfor-
mance of the model is.
Tab. 4 compares all measurements of our framework with those of the baseline. The
conclusion is that using our framework improves every measurement. In particular, the
macro-average precision improves by over 25% compared with the baseline. The main
reason is that we can increase the number of questions: as shown in Tab. 5, the proposed
dataset has more than twice as many questions as the baseline.
We then performed an error analysis. Fig. 2 shows confusion matrices of our approach.
The former shows which answers the chatbot outputs for the test questions, whereas the
latter shows the result of mapping the answers to their categories. In Fig. 2 (a), we use
indices for answers; for example, if we use a question whose correct answer is the second
one, we denote it by A2 in the figure. The indices run from 1 to 79, as our dataset has
79 answers, as shown in Tab. 5. From Fig. 2 (a), we can see that the chatbot sometimes
gives wrong answers to several questions. On the other hand, Fig. 2 (b) shows that the
chatbot makes wrong predictions for only two categories. For a better understanding
of these results, we measured the inner- and inter-category similarity using the Jaccard
index. This measure calculates similarity as the number of unique words shared by two
given sets, normalized by the size of their union. Its formal definition is given as follows:
\mathrm{Jaccard}(q_1, q_2) = \frac{| W_{q_1} \cap W_{q_2} |}{| W_{q_1} \cup W_{q_2} |}    (11)

where W_{q_1} denotes the set of words in a question q_1.

Fig. 2. (a) Answer-level confusion matrix of the proposed dataset. The x axis represents the
correct labels, and the y axis the labels predicted by the classifier. (b) Category-level confusion
matrix of the proposed dataset. The x axis represents the correct labels, and the y axis the labels
predicted by the classifier.

Table 6. Inner-category similarity (Jaccard index) of the baseline and proposed datasets for
each category.

                    C1    C2    C3    C4    C5    C6    C7    C8    C9    C10   C11
Baseline           0.19  0.17  0.13  0.32  0.25  0.10  0.20  0.27  0.12  0.30  0.23
Proposed dataset   0.11  0.09  0.07  0.06  0.22  0.08  0.12  0.07  0.06  0.06  0.03

Tab. 6 shows the inner-category similarity scores calculated with the Jaccard index;
relatively high scores dominate this table. In contrast, Fig. 3 (a) shows the inter-category
similarity scores calculated with the Jaccard index for every combination of two different
categories. Overall, these scores are lower than the inner-category similarities. These ob-
servations indicate that we should improve the quality of the question texts so that
questions within the same category can be distinguished from one another.
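As an illustration, Eq. 11 can be computed as in the following sketch. The whitespace
tokenization and the example questions are placeholders only; Japanese questions would
require a morphological analyzer to obtain the word sets.

def jaccard(q1: str, q2: str) -> float:
    """Jaccard index (Eq. 11) between the word sets of two questions."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

# Hypothetical questions from the same (Uploading) category.
print(jaccard("how do I upload my report", "where do I upload the report file"))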
In addition, from Fig. 2 (b), we can observe that several questions of C5 (Uploading)
and C11 (Basic Usage) are mis-predicted as answers of C1 (Documents) and C9
(Contact), respectively. Mispredictions of C5 questions as C1 are understandable, as
the two categories (C1 and C5) can share file-related words.

Fig. 3. (a) Inter-category similarity of the proposed dataset. (b) Inter-category similarity of the
baseline dataset.

Indeed, Fig. 3 (a) shows that
the Jaccard index score between these two categories is quite high. Next, to identify
why the chatbot wrongly showed an answer of C9 instead of C11, we manually analyzed
the questions of the two categories, C9 and C11. In our dataset, there is a question
about how to enable a function for sending e-mail between teachers and students. This
question is similar to those in C11, which collects questions about how to use the
e-learning system.
Finally, we compared these results for our proposed dataset with those for the baseline.
We show the inner-category similarities, the inter-category similarities, and the answer-
and category-level confusion matrices of the baseline in Tab. 6, Fig. 3 (b), and Fig. 4,
respectively. Looking at all similarity scores (Tab. 6 and Fig. 3 (b)), those of the baseline
are all higher than those of the proposed dataset. This means that our framework can
suggest various words that increase the diversity of the questions without decreasing the
accuracy of the chatbot, since the chatbot trained on our dataset outperforms the baseline.
7 Conclusions
In this paper, we introduced a novel framework for supporting chatbot dataset creation,
specifically for an e-learning system. This framework has two methods: suggesting new
words for new questions and aggregating answers that are semantically similar to each
other.
In the future, we plan to analyze a) which questions users tend to ask in each month.
In this paper, we assumed that all Q&A data occur independently of time. However,
there are some temporal questions: questions regarding registration for classes may
occur early in a semester, and questions about tests may occur late in a semester. This
temporal question analysis may improve the effectiveness of chatbots. Future work also
includes b) a qualitative evaluation. This paper focuses on quantitative evaluations;
however, analyzing what users feel and think about using chatbots is also important for
practical usage.
Fig. 4. (a) Answer-level confusion matrix of the baseline. The x axis represents the correct
labels, and the y axis the labels predicted by the classifier. (b) Category-level confusion matrix
of the baseline. The x axis represents the correct labels, and the y axis the labels predicted by
the classifier.
References
1. Behúň, K., Herout, A., Páldy, A.: Kinect-supported dataset creation for human pose estima-
tion. pp. 55–62. SCCG '14, ACM, New York, NY, USA (2014)
2. Jeon, J., Croft, W.B., Lee, J.H.: Finding similar questions in large question and answer
archives. pp. 84–90. CIKM '05, ACM, New York, NY, USA (2005)
3. Leveling, J.: Monolingual and crosslingual SMS-based FAQ retrieval. pp. 3:1–3:6. FIRE '12 &
'13, ACM, New York, NY, USA (2013)
4. Morris, M.R., Teevan, J., Panovich, K.: What do people ask their social networks, and why?
A survey study of status message Q&A behavior. pp. 1739–1748. CHI '10, ACM, New York,
NY, USA (2010)
5. Otsuka, A., Nishida, K., Bessho, K., Asano, H., Tomita, J.: Query expansion with neural
question-to-answer translation for FAQ-based question answering. pp. 1063–1068. WWW '18,
Republic and Canton of Geneva, Switzerland (2018)
6. Pinto, G., Torres, W., Castor, F.: A study on the most popular questions about concurrent
programming. pp. 39–46. PLATEAU 2015, ACM, New York, NY, USA (2015)
7. Rodofile, N.R., Radke, K., Foo, E.: Framework for SCADA cyber-attack dataset creation. pp.
69:1–69:10. ACSW '17, ACM, New York, NY, USA (2017)
8. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining Multi-label Data, pp. 667–685. Springer US,
Boston, MA (2010)