
Dial BeInfo for Faithfulness: Improving Factuality of
Information-Seeking Dialogue via Behavioural Fine-Tuning

Evgeniia Razumovskaia∗, Ivan Vulić, Pavle Marković, Tomasz Cichy,
Qian Zheng, Tsung-Hsien Wen and Paweł Budzianowski
PolyAI Limited
London, United Kingdom
poly.ai

arXiv:2311.09800v2 [cs.CL] 4 Mar 2024

Abstract

Factual faithfulness is a crucial requirement in information-seeking dialogue: the system should respond to the user queries so that the responses are meaningful and aligned with the
knowledge provided to the system. However, most modern large language models (LLMs) suffer from hallucinations, that is, they generate responses not supported by or even contradicting the knowledge source. To mitigate the issue and increase the faithfulness of information-seeking dialogue systems supported by LLMs, we introduce BeInfo, a simple yet effective method that applies ‘behavioural tuning’ on the LLMs to aid information-seeking dialogue. Relying on three standard information-seeking dialogue datasets, we show that models tuned with BeInfo become considerably more faithful to the knowledge source, both for datasets and domains seen during BeInfo-tuning and for unseen domains, when applied in a zero-shot manner. In addition, we present a ‘real-life’ case study on conversations with real users, showcasing that models with 3B parameters (e.g., Flan-T5) tuned with BeInfo demonstrate strong performance on data from real ‘production’ conversations: when tuned on a limited amount of such realistic in-domain dialogues, they surpass much larger LLMs used ‘off-the-shelf’, on both automatic and human evaluation metrics.

Figure 1: An example of an information-seeking dialogue based on the DoQA dataset (Campos et al., 2020). Potential responses R1, R2, R3 at the bottom illustrate different issues with two crucial aspects of factual faithfulness: selectivity and response adequacy.

1 Introduction

Pretrained large language models (LLMs), being able to generate natural and grammatical text and respond coherently to user queries, are the mainstay of modern NLP (Naveed et al., 2023). They have demonstrated their capabilities in a plethora of tasks where general world knowledge, which can be learnt via pretraining directly from the data, is required (Touvron et al., 2023; Hoffmann et al., 2022). However, reliance only on the content from the pretraining data also means that the model’s responses might be generic or out of date, especially for queries whose answers change across time, such as Who is the current prime minister of the United Kingdom? An even more prominent issue is hallucination (Zhang et al., 2023), a phenomenon often observed even with the most powerful LLMs: the models are prone to output incoherent, irrelevant and/or even factually incorrect or unsupported statements (Naveed et al., 2023).

A widely used method to ground and control the content of the output of an LLM is retrieval-augmented generation (RAG; Lewis et al., 2020), where the input to the model is complemented with a retrieved external knowledge source relevant to the user’s query. However, even with the use of RAG, the model’s output can be unpredictable and not fully controllable: models still sometimes do not adhere to the knowledge source and hallucinate (Shuster et al., 2021), which can decrease their applicability in user-facing scenarios, as well as raise concerns about their safety (Daheim et al., 2023).

∗ LTL, University of Cambridge. Work conducted during the internship at PolyAI.
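The RAG setup described above can be illustrated with a minimal sketch; the word-overlap retriever and the prompt template below are illustrative assumptions, not the paper’s implementation.

```python
# Minimal sketch of retrieval-augmented generation (RAG): the model input is
# the user query complemented with retrieved knowledge. The toy retriever
# (word-overlap scoring) and the prompt template are illustrative assumptions.

def retrieve(query: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    """Rank knowledge snippets by word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_rag_input(query: str, knowledge_base: list[str]) -> str:
    """Complement the user query with retrieved knowledge snippets."""
    passages = retrieve(query, knowledge_base)
    knowledge = "\n".join(f"- {p}" for p in passages)
    return f"Knowledge:\n{knowledge}\n\nUser: {query}\nSystem:"

kb = [
    "The hotel pool is open from 7am to 10pm.",
    "Breakfast is served in the lobby restaurant.",
    "Check-out time is 11am.",
]
print(build_rag_input("When is the pool open?", kb))
```

The generator then conditions on this assembled input; as the paper notes, such grounding narrows but does not eliminate hallucination.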
The problem of adherence to the knowledge sources is especially important in the context of information-seeking dialogue (Saeidi et al., 2018). The core of this task is to maintain a conversation with the user and respond to their queries based on the provided knowledge source. Figure 1 presents an example of an information-seeking dialogue between the user and the system, together with potential responses of the system. Orthogonally to improving retrieval systems themselves (Wang et al., 2023a; Mo et al., 2023), prior work has attempted to combat hallucinations with task arithmetic (Daheim et al., 2023), conditioning generation on special control tokens (Rashkin et al., 2021), and by incorporating a token-level critic which judges the faithfulness of the generated response (Dziri et al., 2021). However, the proposed approaches require either training an additional model or using complex inference processes such as context-aware decoding (Shi et al., 2023).

In this work, we propose BeInfo, a simple yet effective method that applies ‘behavioural fine-tuning’ of LLMs to increase the faithfulness of generated responses for information-seeking dialogue supported by LLMs. The model is tuned on a reasonably sized collection of publicly available dialogue data with the true knowledge source(s) extended with randomly sampled facts from a large knowledge base. Intuitively, this should teach the model to become more selective in the information it uses to generate the response and ‘prepare’ its expected behaviour (hence the term ‘behavioural tuning’) for the intended task of knowledge-grounded dialogue. The tuned model can either be used ‘as is’ or serve as a starting point for further fine-tuning to a specific domain.

First, we assess the effectiveness of BeInfo on three standard datasets for information-seeking dialogue: FaithDial, TopiOCQA and DoQA. Our results demonstrate that BeInfo leads to consistent improvements in factual faithfulness across several standard evaluation metrics, with on-par or larger lexical overlap between the generated and gold responses. The improvements are especially pronounced when models tuned with BeInfo are applied in a zero-shot manner to unseen datasets and domains, indicating the usefulness of behavioural tuning for the task. We then present a case study focused on conversations with real users: the main result demonstrates that combining BeInfo with a small number of in-domain dialogues can substantially increase dialogue factuality even in specialized dialogue domains. The code for BeInfo is available online at: [URL-ANONYMOUS].

2 Methodology

Task Definition. The aim of information-seeking dialogue is to provide the user with the information they need based on one or more knowledge sources, which are typically retrieved from a large knowledge base. More formally, given the knowledge source K, the dialogue history H and the user’s query u, the system should output the response r which is factually faithful to K. Here, we follow Rashkin et al. (2021) and Dziri et al. (2022a)’s direct definition of faithfulness: the response should not contain any information which either contradicts K or is not supported by K.

Behavioural Tuning for Faithfulness. An effective model for faithful information-seeking dialogue needs to perform two actions correctly: 1) select the correct part of the information provided in K (termed selectivity) and 2) provide the response, with the requirement to (i) inform the user when K contains no information relevant to u, or (ii) ask for clarification (termed (response) adequacy);1 see Figure 1 again. BeInfo aims to improve on both desiderata via behavioural fine-tuning (Ruder, 2021) of any instruction-tuned LLM.

1 Put simply, in our setup response adequacy discerns between 1) the case when the model does have the correct information in the knowledge source and should provide it versus 2) the case when the model is certain that it cannot provide a correct answer to the user query, or it does not even understand the query and requires further clarification to be able to react in the next turn. There might be other, finer-grained options of response adequacy beyond the two simple cases investigated here, but we leave those investigations to future research.

To instill the capability for information-seeking dialogue into the model, we perform behavioural tuning on a combination of (i) conversational QA and (ii) information-seeking dialogue datasets. In both tasks, the response has to be generated based on some knowledge source K, making them suitable for faithful response generation. Further, beyond tuning on related tasks, we propose to augment the datasets to steer the model towards the selectivity and adequacy behaviour, as follows.

For selectivity, the ground truth K provided in the dataset is extended with additional knowledge sources K′ which are irrelevant to the user query u, serving as negative examples or distractors. Intuitively, distractors mimic the presence of information irrelevant to u in K′, this way promoting the model’s selectivity. We augment the ground truth
knowledge source K with n distractors; they are randomly sampled from the knowledge base of the corresponding dataset.

For response adequacy, we augment the fine-tuning datasets with dialogues without any relevant K provided, making them unanswerable for the system. To construct such dialogues, for a dialogue history H and a corresponding user query u we randomly sample unrelated knowledge sources K′. During fine-tuning, the response r is substituted with a special response signifying that the combination of H and u cannot be answered based on the provided K′. In our experiments, we augment the original dataset with 10% unanswerable dialogues.

Further Task-Specific Fine-Tuning. The output of the ‘general’ behavioural fine-tuning step is a ‘behaviour-specialised’ LLM for factually faithful information-seeking dialogue. It can be used directly ‘as is’, or as a starting point for further task-specific tuning, as illustrated in Figure 2.

                          FaithDial               TopiOCQA                DoQA
Domains                   Open Wikipedia-based    Open Wikipedia-based    3 (Cooking, Travel, Music)
# dialogues               4,094 / 764 / 791       3,509 / 205 / 206       1,037 / 200 / 1,200
# turns                   36,809 / 6,851 / 7,101  45,450 / 2,514 / 2,502  4,612 / 911 / 5,394
Avg. turns                9                       13                      4.48
Avg. length of questions  17.25                   6.92                    12.99
Avg. length of responses  20.29                   11.38                   10.43

Table 1: Overall statistics of the used dialogue datasets. The number of conversations and turns are provided for train / dev / test splits of the datasets.

Figure 2: An overview of different fine-tuning and inference setups for LLMs with and without BeInfo (§3): zero-shot use on the final task, task-only BeInfo, general-only BeInfo, and full BeInfo (general BeInfo followed by in-task tuning).

3 Experimental Setup

Training Setup. In order to leverage the inductive biases of instruction-tuned models, the input for BeInfo includes the following: (i) instructions to respond as factually accurately as possible, (ii) an augmented knowledge source which includes the ground truth K and n = 4 distractors K′ for ‘answerable’ dialogues, and 5 randomly sampled K′-s for unanswerable dialogues, and (iii) the dialogue history, which combines all the previous turns (the set H) and the current user query u. An example input and instruction text are shown in Appendix A. The models are then trained in a standard sequence-to-sequence fashion with cross-entropy loss. The output is either the ground truth response for answerable dialogues, where the knowledge source K contains the information to address the user’s query, or a predefined response ‘Could you please clarify or rephrase the query?’ if the dialogue is unanswerable. Training the models using BeInfo proceeds at turn level: the dialogue history at every turn is used as input.

Datasets. To perform behavioural fine-tuning, we use a standard dataset for information-seeking dialogue, FaithDial (Dziri et al., 2022a), and an established conversational QA dataset, TopiOCQA (Adlakha et al., 2022). Generalisation capabilities of the models after the BeInfo tuning are evaluated on another domain and dataset (i.e., this could be seen as ‘zero-shot’ from the domain adaptation perspective). For this, unless explicitly stated otherwise, we rely on a multi-domain conversational QA dataset, DoQA (Campos et al., 2020). The key statistics of the datasets are in Table 1, with further details and data analyses in Appendix B.

Models. Prior work (Dziri et al., 2022a) has demonstrated that instruction-tuned models such as the Flan series (Chung et al., 2022) are a very strong baseline for factuality in information-seeking dialogue. Thus, we use them as a base for the proposed method.2 In the experiments, we use Flan-T5 (Chung et al., 2022) (BASE, LARGE and XL) and Tk-Instruct-3B (Wang et al., 2022). All the backbone models were pretrained on a large number of tasks with instructions, which yields faster specialisation of the models to information-seeking dialogue, especially when, as in our setup, the input/prompt includes a short description of the task.

2 We again note that BeInfo can be applied on top of any generative model.

Fine-Tuning and Inference Setups. The LLMs can be used directly in the final task in a fully zero-shot manner or via in-context learning as ‘black boxes’: this is a typical usage of very large models in dialogue tasks. We can also conduct BeInfo
tuning of ‘smaller LLMs’ via different regimes: (i) fine-tuning directly on the task data but with augmented knowledge sources (if available) (i.e., task-only BeInfo); (ii) fine-tuning only on the available data from other dialogue datasets and porting the tuned model to the task in a zero-shot fashion (i.e., general-only BeInfo; an example is tuning on FaithDial and TopiOCQA and using the model for DoQA, or vice versa); (iii) finally, we can run a stage of general BeInfo followed by in-task BeInfo (termed full BeInfo). An overview of the different setups is provided in Figure 2.

Model          BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5BASE    22.89  34.46  61.60  67.75    90
 +BeInfo       22.76  34.04  61.71  77.55    100
Flan-T5LARGE   26.16  39.57  64.61  71.38    93.86
 +BeInfo       26.34  38.55  63.19  75.55    100
Flan-T5XL      28.66  41.99  65.89  67.21    94.12
 +BeInfo       26.65  39.39  64.60  80.19    100

Table 2: Results on DoQA without any in-task BeInfo tuning. The models are tuned on a combination of FaithDial and TopiOCQA. The results are averaged across three domains in DoQA – Cooking, Travel and Movies. Full results are presented in Appendix C.
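Across these regimes, every training example is assembled the same way (§2–§3): the ground-truth knowledge mixed with sampled distractors, and the fixed fallback response when no relevant knowledge is present. A minimal sketch follows; the helper names and the toy prompt format are illustrative stand-ins for the exact templates in Appendix A.

```python
import random

# Predefined fallback target for unanswerable turns (quoted in §3).
CLARIFY = "Could you please clarify or rephrase the query?"

def build_example(history, query, gold_k, response, kb, answerable, rng,
                  n_distractors=4, n_unanswerable=5):
    """Assemble one BeInfo-style training pair: augmented knowledge plus
    dialogue history as input, gold response (or fixed fallback) as target."""
    if answerable:
        # Ground-truth knowledge plus n sampled distractors, shuffled.
        sources = [gold_k] + rng.sample([k for k in kb if k != gold_k], n_distractors)
        rng.shuffle(sources)
        target = response
    else:
        # Only unrelated knowledge sources: the turn becomes unanswerable.
        sources = rng.sample([k for k in kb if k != gold_k], n_unanswerable)
        target = CLARIFY
    knowledge = "\n".join(f"[{i+1}] {s}" for i, s in enumerate(sources))
    prompt = (
        "Respond as factually accurately as possible given the knowledge.\n"
        f"Knowledge:\n{knowledge}\n"
        f"History: {' '.join(history)}\n"
        f"User: {query}\nSystem:"
    )
    return prompt, target

rng = random.Random(0)
kb = [f"fact-{i:02d}" for i in range(20)]
prompt, target = build_example(["Hi!"], "When does it open?", "fact-03",
                               "It opens at 9am.", kb, answerable=True, rng=rng)
```

The resulting (prompt, target) pairs are then fed to standard sequence-to-sequence training with cross-entropy loss, at every dialogue turn.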
Evaluation Metrics. We rely on automated metrics to measure lexical similarity between the generated responses and ground truth responses: BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). To measure semantic similarity between generated and gold responses, we use BERTScore (Zhang et al., 2019).3 To evaluate faithfulness, we use BERTScore and token-level precision between the generated response and the knowledge source K. We denote the BERTScore between ground truth and generated responses as “BERTS” and the one between the knowledge source K and generated responses as “K-BERTS”. In both cases we use BERTScore-F1. Token-level precision between the generated response and knowledge source K (K-Precision; Adlakha et al., 2023) measures the proportion of tokens in the generated response which occur in K. Prior work (Adlakha et al., 2023) demonstrates that K-Precision has the highest correlation with human (as well as GPT-4-elicited) faithfulness judgements among different automated metrics.

3 Similarly to Daheim et al. (2023), we use deberta-large-mnli as the underlying model for computing the score.

Hyperparameters and Training Details. BeInfo was implemented using the HuggingFace Transformers library (Wolf et al., 2020). The models were trained with AdamW (Loshchilov and Hutter, 2019). With BeInfo, we tune for 5 epochs with a learning rate of 5e-5; when tuning the model to a specific dataset, we run it for 10 epochs with a learning rate of 5e-6. We use a warm-up rate of 0.1 and linear decay, with the default weight decay rate of 0.01. Beam search is run with a beam size of 10.

4 Results and Discussion

Faithfulness on Unseen Data. One of the main aims of behavioural fine-tuning with BeInfo is to increase the factual faithfulness of responses in zero-shot domain transfer, on unseen data in any domain. Therefore, we start by presenting the results of the variant tuned with BeInfo on FaithDial plus TopiOCQA, where inference is run on the dataset unseen during BeInfo tuning: DoQA (i.e., general BeInfo from Figure 2). The results are presented in Table 2. They confirm that BeInfo substantially improves faithfulness while either improving or only minimally affecting the similarity between generated responses and the gold response. Importantly, the improvements hold across different model sizes: Flan-T5 BASE, LARGE and XL with 250M, 780M and 3B parameters, respectively.

Using a Smaller Dataset for BeInfo Tuning. The previous results from Table 2 show BeInfo’s effectiveness when tuned on two reasonably sized datasets, FaithDial with 36,809 turns and TopiOCQA with 45,450 turns. Now, we test the opposite direction: fine-tuning BeInfo on a smaller-scale dataset like DoQA (4,612 turns) and evaluating zero-shot on FaithDial. Besides further testing the versatility of the approach, we also probe its sample efficiency and its adaptability to smaller datasets and computational budgets.

Results in Table 3 suggest that tuning the models with BeInfo even on smaller datasets, without any subsequent in-task tuning, consistently improves the factuality of generated responses. Especially large gains were observed for larger models, both for faithfulness and for semantic similarity between the generated responses and the ground truth, indicating the potential for sample efficiency of BeInfo. Similar trends were observed when evaluating on TopiOCQA instead of FaithDial; see Appendix D.

Different Instruction-Tuned Models. Previous results have already verified that BeInfo can be applied to Flan models of different sizes, and we now evaluate its impact on another instruction-based model: Tk-Instruct-3B. We fine-tune the models again on FaithDial and TopiOCQA and evaluate their performance on DoQA’s Travel domain test
set. While the absolute scores, as expected, do differ between different underlying models, the results in Table 4 indicate the positive effect of BeInfo also on Tk-Instruct-3B.

Model          BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5BASE    4.15   19.5   53.78  42.17    0
 +BeInfo       5.39   21.04  54.68  70.03    27.78
Flan-T5LARGE   5.01   20.02  54.56  61.77    0
 +BeInfo       9.27   29.29  61.75  86.58    6.67
Flan-T5XL      5.26   22.21  56.13  65.52    6.67
 +BeInfo       10.2   30.76  62.78  88.50    100

Table 3: Zero-shot results on FaithDial. The models are tuned on DoQA.

Model          BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5XL      25.88  41.68  66.91  66.42    100
 +BeInfo       23.28  36.22  63.02  81.77    100
Tk-Instruct-3B 20.23  31.60  58.47  69.45    100
 +BeInfo       29.19  42.56  66.24  70.58    97.8

Table 4: Zero-shot results on the DoQA Travel domain. The models are tuned on FaithDial + TopiOCQA.

BeInfo with Task-Specific Fine-Tuning. We have demonstrated that the models tuned with BeInfo largely improve factual faithfulness on unseen datasets and domains (i.e., the general BeInfo setup). Here, we study whether these models can serve as an effective starting point for continued task-specific fine-tuning. To this end, we first tune the models with BeInfo on the combination of FaithDial and TopiOCQA as before, and then continue fine-tuning/specialising the model on a single dataset (e.g., FaithDial or TopiOCQA): the full setup from Figure 2.4

Figure 3: Results of task-specific tuning on FaithDial (left) and TopiOCQA (right). ‘Task-only’ denotes Flan-T5 tuned directly on FaithDial or TopiOCQA, again with knowledge distractors. ‘Full’ denotes the model first tuned with BeInfo on both datasets and then further tuned on each of the datasets; see Figure 2.

Figure 3 demonstrates that already task-only BeInfo yields strong performance, while models with BeInfo perform on par with or better on average than the models which were tuned to a specific dataset, both on semantic similarity of generated responses and on factual faithfulness. While prior work (Daheim et al., 2023) typically optimised one aspect (e.g., semantic similarity) at the expense of the other (faithfulness), and vice versa, here we show that through the use of knowledge distractors BeInfo achieves competitive performance on both aspects and retains the cross-dataset generalisation ability.

BeInfo versus Catastrophic Forgetting. Further, one issue which might arise from further specialising a model to a given task/dataset is the well-known phenomenon of catastrophic forgetting: pretrained language models are prone to forgetting previously learnt knowledge or skills when tuned on new data (De Cao et al., 2021; Yuan et al., 2023). To evaluate whether the models would retain their ability to respond faithfully to examples considerably different from the ones seen during fine-tuning, we evaluate the models tuned on FaithDial on TopiOCQA.5 The scores in Table 5 demonstrate that even after continued fine-tuning on FaithDial the model retains high faithfulness scores on TopiOCQA (cf. K-BERTS and K-Precision). At the same time, the degradation in scores for similarity to ground truth responses shows that further tuning largely influences the style/form of the responses. The average response length in FaithDial is considerably larger than that in TopiOCQA (see Appendix B), meaning that further tuning on FaithDial leads the model to generate longer responses not matching the gold responses in TopiOCQA. In other words, these results show that further fine-tuning might influence the surface form of the responses but not the desired skill to respond faithfully gained with BeInfo. In practice, a general model tuned with BeInfo on a wide range of tasks/domains and then specialised to one of them would still retain its ability to respond faithfully for any of the domains seen in the general ‘behavioural tuning’ step.

Model                 BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Task-only: FaithDial  27.29  41.71  75.31  64.80    73.38
General-only          38.75  69.57  80.40  72.24    81.24
Full: FaithDial       33.87  55.85  78.46  74.18    79.63
Task-only: TopiOCQA   36.24  68.64  80.94  73.38    83.63

Table 5: Results on TopiOCQA when the BeInfo model is further fine-tuned on FaithDial after the original FaithDial + TopiOCQA fine-tuning. ‘Task-only: TopiOCQA’ denotes direct tuning on TopiOCQA, which serves as an upper bound in this experiment.

4 We present the results only with Flan-T5BASE as preliminary experiments with larger model sizes demonstrated similar relative trends.
5 We focus on TopiOCQA as the true responses in the dataset are more grounded in the knowledge source K (see Appendix B).
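The K-Precision faithfulness metric used throughout this section (Adlakha et al., 2023) — the proportion of generated-response tokens that occur in K — can be sketched in a few lines. Lowercased whitespace tokenization here is a simplifying assumption, not necessarily the original implementation.

```python
def k_precision(response: str, knowledge: str) -> float:
    """Proportion of tokens in the generated response that occur in the
    knowledge source K (simple lowercased whitespace tokenization)."""
    response_tokens = response.lower().split()
    knowledge_tokens = set(knowledge.lower().split())
    if not response_tokens:
        return 0.0
    hits = sum(token in knowledge_tokens for token in response_tokens)
    return hits / len(response_tokens)
```

A fully extractive response scores 1.0, while a response whose tokens never appear in K scores 0.0, matching the 0–100 extremes reported in Tables 2–5 (after scaling by 100).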
5 Evaluating BeInfo on Real Conversations

GPT-4  Falcon-40B  XL-original  XL+BeInfo (g)  XL+BeInfo (t)  XL+BeInfo (f)
4.63   3.60        3.55         3.98           4.46           4.81

Table 6: Averaged GPT4-Eval scores (higher is better) on the HOTEL-200 dataset. XL denotes the Flan-T5XL model taken off-the-shelf (XL-original) or fine-tuned via three different regimes of BeInfo (t=task-only; g=general-only; f=full).

Experimental Setup. To probe the potential of BeInfo for boosting real user-facing production systems, we rely on a small internal dataset of 200 fully anonymised dialogues with real users in the hotel reservation domain (termed HOTEL-200 henceforth); the dialogues concern hotel bookings and FAQs about various hotel facilities. It is crucial to evaluate the models on examples also collected from real user-system communication, as the language use is considerably different from that in established datasets such as DoQA or FaithDial, compiled via crowdsourcing work. For instance, the average length of the user query in HOTEL-200 is only 6.35 tokens, while it is 17.25 in FaithDial or 13 in DoQA (cf. Table 1).

As the data comes from real conversations, there are no gold responses which could be used for automated evaluation. Thus, we resort to evaluation of correctness/factual faithfulness with an LLM: here, we use GPT4 (termed GPT4-Eval henceforth), as its judgements were shown to be most correlated with human judgements (Adlakha et al., 2023).6 For GPT4-Eval, we prompt GPT4 to act as the evaluator, providing it with natural language instructions, the knowledge source K, the conversation history H with user query u, and the system-generated response. In the instructions, we request the model to rate generated responses on a 7-point Likert scale for faithfulness, available in Appendix E.

We compare the following models and their configurations: (i) GPT4 itself as the model responding to user query u, (ii) Falcon-40B (Almazrouei et al., 2023) as a strong open-source LLM,7 (iii) Flan-T5XL tuned with BeInfo, under the three different regimes illustrated before in Figure 2 (general-only, task-only, full). For the general-only and the first stage of the full BeInfo, we again rely on the combination of the FaithDial and TopiOCQA datasets. To obtain data for the task-specific tuning stage, we collect 2,000 examples from the same conversational system, then generate ‘silver’ responses via GPT4 and treat the silver responses as true outputs for task-specific fine-tuning.8

6 As running evaluation with large models such as GPT4 behind proprietary APIs incurs large costs (Adlakha et al., 2023), we only evaluate the outputs for a smaller dataset where other means of evaluation cannot be used.
7 Falcon-40B was an open-source large language model with state-of-the-art results at the time of the experimentation.
8 Note that here we use the GPT4 model for three different purposes: (i) as an evaluator; (ii) as an actual baseline system; (iii) as a ‘silver data generator’.

Results and Discussion. The main results are reported in Table 6. While the zero-shot BeInfo approach with Flan-T5XL achieves a reasonably high average faithfulness score in absolute terms, it is still far from that of GPT4, which serves as an upper-bound zero-shot system. Most importantly, the progression of scores reveals the importance of the various BeInfo fine-tuning stages. Even the general-only fine-tuning stage, without seeing a single in-domain training example, yields an average score which is substantially higher than that of the original Flan-T5XL, as well as higher than the score obtained by the 40B Falcon model. Further, the scores indicate the importance of being able to fine-tune smaller models with in-domain data: the 3B model tuned with the full BeInfo even outperforms GPT4 on GPT4-Eval, and it also obtains strong performance with task-only BeInfo.

These results further support our hypothesis that BeInfo actually ‘behaviourally prepares’ the models to respond to users’ queries in a factually faithful manner, and tuning on further task-specific data only amplifies its impact as the model gets further adapted to the domain. Put simply, behavioural fine-tuning via BeInfo performs structural (or behavioural) adaptation, while further task-specific fine-tuning combines the behavioural adaptation with (semantic) domain adaptation.

Ablation: Distributions of Scores. We further study the actual distributions of GPT4-Eval scores for the four model variants of Flan-T5XL and compare them against the distribution obtained by GPT-4. The distributions are shown in Figure 4. As only a small fraction of responses is labelled with intermediate scores (1, 2, 3, 5), the core differences lie in the relative distribution of perfect, poor and ‘not great, not terrible’ responses (scores 6, 0 and 4, respectively).9 The model tuned with task-only BeInfo rarely provides wrong facts but mostly responds

9 Score 4 usually corresponds to the system responding with a generic clarification question or notifying the user that the information is not available.
Figure 4: Distribution of GPT4-Eval scores of 4 variants based on Flan-T5XL and GPT-4. See Table 12 in Appendix E for the interpretation of the individual scores.

with not great, not terrible responses which do not mislead the user but might not be helpful. On the other hand, the model tuned with the general-only BeInfo and GPT-4 both yield responses that typically fall into the extreme categories. In other words, the responses are either perfect (score 6) or will provide the user with wrong information (score 0), which is not desirable for a user-facing system. The model tuned with the full BeInfo combines the benefits of behavioural tuning with the use of in-domain data: the model produces the fewest factually unfaithful responses (score 0) while maintaining the ability to respond with information relevant to the user’s query (a large number of responses with score 6). In sum, ‘pre-tuning’ the model with the general-only BeInfo stage raises the faithfulness of the model by extracting relevant information from the knowledge source K, while further tuning on task-specific data additionally helps avoid providing misleading or irrelevant information to the user.

Faithfulness versus Abstractiveness. Increasing the faithfulness of a model to the underlying knowledge source K can lead the model to respond with large extracted spans of text from K. Ideally, the responses should be abstractive but factually faithful: in other words, they should transmit the information provided in the knowledge source K but use different means of expressing it. As in prior work (Dziri et al., 2022a; Daheim et al., 2023), we use the Density metric proposed by Grusky et al. (2018) to measure abstractiveness. This measures the average length of spans copied from the knowledge source. We focus on K-BERTScore to measure faithfulness in this experiment: the average length of the knowledge source K (≈120 words on average) is relatively large with respect to the length of generated responses (≈18–25 words), making K-BERTScore suitable for this case.

Figure 5: Density and K-BERTScore on HOTEL-200 illustrating the trade-off between faithfulness (y-axis) and abstractiveness (x-axis) for Flan-T5XL for different setups: (i) XL-original: ‘off-the-shelf’ Flan-T5XL; (ii) BeInfo general-only: Flan-T5XL tuned with BeInfo on FaithDial and TopiOCQA without any in-task data; (iii) BeInfo task-only: Flan-T5XL fine-tuned only on task-specific data; (iv) BeInfo full. Numeric results are provided in Appendix F.

Figure 5 illustrates the trade-off between faithfulness and abstractiveness for Flan-T5XL under different fine-tuning setups on the HOTEL-200 dataset. The results demonstrate that general-only fine-tuning with BeInfo improves the model’s factuality but increases the extractiveness of the responses. Tuning on task-specific data helps to raise the abstractiveness of the responses. Further analyses and comparisons (cf. results in Appendix F) demonstrate that Flan-T5XL tuned with the full BeInfo is on par with GPT-4 and better than the considerably larger Falcon-40B model.

Human Evaluation. In addition to automatic metrics, we also conduct human evaluation on HOTEL-200 with two annotators. They were tasked to rate each response on factuality using the same Likert scale as used for GPT4-Eval (see Appendix E). Three models were assessed: XL+BeInfo-general, XL+BeInfo-task-specific and XL+BeInfo-full. The average human factuality scores were 3.64, 4.62 and 4.95, respectively. This further proves the effectiveness of behavioural tuning for improved factuality. To further assess the relevance of automatic GPT4-Eval, we also compute Pearson’s correlation coefficient ρ between human judgements and GPT4-Eval scores. This results in a strong positive
correlation with ρ = 0.52, indicating that GPT4- contain hallucinations, making them unsuitable for
Eval can be used as a reasonable automatic proxy. supervised fine-tuning aimed at improving factu-
ality. To resolve this, Dziri et al. (2022a) released
6 Related Work a corrected version of WoW where the responses
were fixed to be factually consistent with the knowl-
Mitigating Hallucinations in Information- edge source. As behavioural fine-tuning heavily
Seeking Dialogue has achieved increased interest relies on the quality of the underlying data, we
recently with the omnipresence of large language have carefully selected and resorted to FaithDial
models (Wang et al., 2023b; Chuang et al., 2023; and TopiOCQA in the first stage of B E I NFO with
Daheim et al., 2022, 2023; Zhang et al., 2023). highest factual faithfulness of their ground truth
Previous methods can be largely divided into those responses (see Appendix B for further details).
which increase factuality of pretrained models via
further training or modification of the generation 7 Conclusion and Future Work
procedure. The former includes, e.g., tuning the
We presented B E I NFO, a simple yet effective
models with contrastive learning (Sun et al., 2023)
method that applies behavioural fine-tuning of large
or a special focus learning loss which reduces
language models underlying information-seeking
hallucinations on token level (Deng et al., 2023).
dialogue systems, with the goal of improving factu-
The latter includes, e.g., conditioning generation
ality of system responses. Instruction-tuned models
process on special control tokens (Rashkin et al.,
are fine-tuned on a collection of publicly available
2021), task arithmetic (Daheim et al., 2023)
dialogue data for two related tasks, conversational
or training a critic network which can detect
question answering and information-seeking dia-
problematic tokens and replace them (Dziri et al.,
logue, where the model must use the correct knowl-
2021). Other approaches have been developed
edge source among several ‘knowledge distractors’
to specifically improve faithfulness with respect
and provide a factually correct and adequate re-
to retrieved knowledge source in decoding. One
sponse. The main results indicated the effective-
proposed option is to do context-aware decoding
ness of B E I NFO both in in- and cross-dataset setups.
(CAD; Shi et al., 2023) where generative proba-
In addition, we demonstrated that further tuning on
bilities are contrasted between those based only
task-specific data might yield further gains in terms
on user query and those based on the user query
of faithfulness as well as reducing extractiveness,
and the knowledge source. The aim is to force
also in experiments with real conversations from a
LLMs to rely more on the knowledge source than
production-ready dialogue system.
the model’s internal knowledge from pretraining.
This work leads up to several potential direc-
In contrast to CAD, Chuang et al. (2023) propose
tions of future work. Firstly, B E I NFO is orthogonal
to contrast generation probabilities from different
to other existing approaches to improving faithful-
layers of LLMs to promote factual knowledge in
ness. For instance, a combination of CAD (Shi
the resulting output probabilities.
et al., 2023) and B E I NFO could further improve
Improving Faithfulness via Supervised Tuning. factuality of responses. Secondly, B E I NFO was
Task-specific supervised fine-tuning could be seen evaluated on information-seeking dialogue. An-
as an option to improve faithfulness of the model’s other interesting direction could be to applying it to
responses (Zhang et al., 2023). Prior work (Cao other language generation tasks where faithfulness
et al., 2023; Chen et al., 2023) has demonstrated to the knowledge sources is crucial, such as sum-
that fine-tuning on higher-quality data improves the marisation. Furthermore, the effectiveness of the
model’s factuality on benchmarks such as Truth- approach can be also tested on other instruction-
fulQA (Lin et al., 2022). In contrast, supervised tuned models (e.g., T0, Sanh et al., 2021) and mod-
fine-tuning on the data which includes numerous ir- els of larger sizes, e.g., Flan-UL2 and beyond.10
relevant or factually inconsistent responses can lead The code and models will be made available
the model to amplifying the noise in the training online at [URL], allowing the research commu-
data. A recent analysis from Dziri et al. (2022b) has nity to build stronger models for factually faithful
shown that over 60% of responses in three standard information-seeking dialogue.
datasets for information-seeking dialogue (WoW, 10
Due to a large number of experiments coupled with com-
Dinan et al., 2018; CMU-DoG, Zhou et al., 2018; putational constraints and feasibility, we focus on models that
and TopicalCHAT, Gopalakrishnan et al., 2019) do not go beyond 3B parameters.
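The human–GPT4-Eval agreement analysis reported in Section 5 amounts to a standard Pearson correlation over paired Likert ratings. A minimal sketch, using made-up illustrative scores on the 0–6 scale rather than the paper's actual annotation data:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 0-6 Likert ratings for the same eight responses
# (illustrative values only, not the paper's annotation data).
human = [6, 5, 4, 6, 3, 5, 2, 6]
gpt4_eval = [6, 4, 4, 5, 2, 6, 3, 6]

rho = pearson_r(human, gpt4_eval)
```

With enough rated responses, a ρ in the reported range (≈0.5) already indicates moderately strong agreement between the automatic judge and human annotators.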
Limitations

The experiments could be further extended by altering how the knowledge distractors K′ are sourced. Firstly, the impact of the number n of knowledge distractors K′ on faithfulness performance should be further studied. Another extension on this front concerns different heuristics for sampling K′: in our experiments they were sampled at random, while sampling K′ that are semantically similar to or distant from the true knowledge source K or the user query u might further impact performance.

In the experiments we focus on three widely used datasets for information-seeking dialogue and two instruction-tuned models. BEINFO can be further extended to other datasets such as CoQA (Reddy et al., 2019), MultiDoc2Dial (Feng et al., 2021) or the DSTC9 (Kim et al., 2020) extension of MultiWOZ 2.1 (Eric et al., 2020). The evaluation on production-ready dialogues, due to the associated costs of evaluation, is conducted on 200 dialogues, and we plan to run a larger-scale analysis, also spanning other dialogue domains, in future work.

We also tested whether BEINFO can be used with parameter-efficient fine-tuning (PEFT) to reduce its computational cost. Our preliminary experiments showed that BEINFO can be effectively combined with PEFT. However, as PEFT techniques are out of the scope of the paper and their use is orthogonal to the main experiments reported in this work, we leave out the preliminary results and focus on full fine-tuning as our main setup.

Given that BEINFO uses instruction-tuned models and 'behaviourally' tunes them with a predefined instruction, additional experimentation could be conducted on how the wording of the instruction influences the performance and whether one can induce higher factuality by just changing the instruction text.

Finally, the work on improving knowledge retrieval systems, as done e.g. by Mo et al. (2023), is out of scope of this work; we focus on reducing hallucinations of LLMs in information-seeking dialogue directly, without intervening in the knowledge retrieval component.

References

Vaibhav Adlakha, Parishad BehnamGhader, Xing Han Lu, Nicholas Meade, and Siva Reddy. 2023. Evaluating correctness and faithfulness of instruction-following models for question answering. arXiv preprint arXiv:2307.16877.

Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer Suleman, Harm de Vries, and Siva Reddy. 2022. TopiOCQA: Open-domain conversational question answering with topic switching. Transactions of the Association for Computational Linguistics, 10:468–483.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: an open large language model with state-of-the-art performance.

Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, and Eneko Agirre. 2020. DoQA - accessing domain-specific FAQs via conversational QA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7302–7314, Online. Association for Computational Linguistics.

Yihan Cao, Yanbin Kang, and Lichao Sun. 2023. Instruction mining: High-quality instruction data selection for large language models. arXiv preprint arXiv:2307.06290.

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. 2023. AlpaGasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701.

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. 2023. DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Nico Daheim, Nouha Dziri, Mrinmaya Sachan, Iryna Gurevych, and Edoardo M Ponti. 2023. Elastic weight removal for faithful and abstractive dialogue generation. arXiv preprint arXiv:2303.17574.

Nico Daheim, David Thulke, Christian Dugast, and Hermann Ney. 2022. Controllable factuality in document-grounded dialog systems using a noisy channel model. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1365–1381, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6491–6506, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yifan Deng, Xingsheng Zhang, Heyan Huang, and Yue Hu. 2023. Towards faithful dialogues via focus learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4554–4566, Toronto, Canada. Association for Computational Linguistics.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Nouha Dziri, Ehsan Kamalloo, Sivan Milton, Osmar Zaiane, Mo Yu, Edoardo M. Ponti, and Siva Reddy. 2022a. FaithDial: A faithful benchmark for information-seeking dialogue. Transactions of the Association for Computational Linguistics, 10:1473–1490.

Nouha Dziri, Andrea Madotto, Osmar Zaïane, and Avishek Joey Bose. 2021. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2197–2214, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Nouha Dziri, Sivan Milton, Mo Yu, Osmar Zaiane, and Siva Reddy. 2022b. On the origin of hallucinations in conversational models: Is it the datasets or the models? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5271–5285, Seattle, United States. Association for Computational Linguistics.

Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, Adarsh Kumar, Anuj Goyal, Peter Ku, and Dilek Hakkani-Tur. 2020. MultiWOZ 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 422–428, Marseille, France. European Language Resources Association.

Song Feng, Siva Sankalp Patel, Hui Wan, and Sachindra Joshi. 2021. MultiDoc2Dial: Modeling dialogues grounded in multiple documents. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6162–6176, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anushree Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations. In Interspeech 2019.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana. Association for Computational Linguistics.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030.

Seokhwan Kim, Mihail Eric, Karthik Gopalakrishnan, Behnam Hedayatnia, Yang Liu, and Dilek Hakkani-Tur. 2020. Beyond domain APIs: Task-oriented conversational modeling with unstructured knowledge access. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 278–289, 1st virtual meeting. Association for Computational Linguistics.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations.

Fengran Mo, Jian-Yun Nie, Kaiyu Huang, Kelong Mao, Yutao Zhu, Peng Li, and Yang Liu. 2023. Learning to relate to previous turns in conversational search. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA. Association for Computing Machinery.

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Hannah Rashkin, David Reitter, Gaurav Singh Tomar, and Dipanjan Das. 2021. Increasing faithfulness in knowledge-grounded dialogue with controllable features. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 704–718, Online. Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Sebastian Ruder. 2021. Recent Advances in Language Model Fine-tuning. http://ruder.io/recent-advances-lm-fine-tuning.

Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2087–2097, Brussels, Belgium. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. 2023. Trusting your evidence: Hallucinate less with context-aware decoding. arXiv preprint arXiv:2305.14739.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. 2023. Contrastive learning reduces hallucination in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13618–13626.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Liang Wang, Nan Yang, and Furu Wei. 2023a. Learning to retrieve in-context examples for large language models. arXiv preprint arXiv:2307.07164.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhangdie Yuan, Songbo Hu, Ivan Vulić, Anna Korhonen, and Zaiqiao Meng. 2023. Can pretrained language models (yet) reason deductively? In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1447–1462, Dubrovnik, Croatia. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Kangyan Zhou, Shrimai Prabhumoye, and Alan W Black. 2018. A dataset for document grounded conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 708–713, Brussels, Belgium. Association for Computational Linguistics.
A Example of Input

An example with the instructions used in BEINFO is shown in Figure 6. The used prompt is similar to the one which proved successful for conversational question answering in Adlakha et al. (2023).

Please answer the following user query given the information and the conversation.

INFORMATION: I trim the stem and remove the outer leaves till they snap to get to the fresh inner core and steam them the night or morning before grilling so they are cold and moist. I prefer steaming because I want all of the nutrients to remain in the artichoke. I cut them in half for the grill, remove the choke and brush them with grapeseed oil where they come into contact with the grill. First, facing down grill them till they feel hot on top then; flip them over to keep the yummy inner side tender. fill the cavity with garlic butter and... lemon if you wish. I prefer the brown color to the lemon flavor.

User: How do I best grill an artichoke?
Agent: Cut them in half for the grill, remove the choke and brush them with grapeseed oil where they come into contact with the grill.
User: Are there other ways to cook it?
Agent:

Figure 6: Example with the prompt used in BEINFO.

B Additional Dataset Statistics and Characteristics

We present overall statistics of the datasets used for BEINFO and evaluation in Table 1. Additionally, we analyse the characteristics of factual faithfulness of the true responses with respect to the knowledge source. The results in Table 7 demonstrate that the responses in FaithDial (Dziri et al., 2022a) are semantically most similar to their knowledge source, which is in line with the dataset collection procedure aimed at making the dataset more factual than the original responses. Similarity of contextual semantic token representations (BERTS-F1) is inversely correlated with lexical overlap between the response and the knowledge source.

As BEINFO is aimed at improving the model's general factual faithfulness, the results suggest that FaithDial (Dziri et al., 2022a) and TopiOCQA (Adlakha et al., 2022) are best used for behavioural tuning and DoQA for testing the out-of-distribution capabilities of the model. The former two have a large semantic but not literal overlap between the knowledge source and the corresponding golden response, meaning that behavioural tuning will not lead to the model learning to 'copy-paste' from the knowledge source to the response.

                     FaithDial  TopiOCQA  DoQA
(y, K) K-BERTS-F1    67.31      62.91     52.48
(y, K) K-Precision   46.23      80.67     97.73
y      Avg. length   17.17      10.89     13.29

Table 7: BERTScore-F1 and K-Precision between the ground truth knowledge source K and gold response y. Average length is calculated as an arithmetic mean of the number of whitespaced words in a response.

C Per-Domain Performance on DoQA

Tables 8–10 present per-domain results of BEINFO (general-only) on DoQA.

Model         BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5Base   23.77  34.19  61.54  67.80    100.0
+BEINFO       23.96  34.65  61.74  79.54    100.0
Flan-T5Large  27.35  39.79  64.83  71.78    100.0
+BEINFO       28.17  40.57  64.07  77.77    100.0
Flan-T5XL     32.16  42.99  65.93  68.19    100.0
+BEINFO       28.76  42.68  65.97  81.65    100.0

Table 8: Zero-shot results on DoQA Cooking.

Model         BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5Base   21.37  34.23  60.99  70.69    69.70
+BEINFO       21.93  34.51  61.52  73.99    100.0
Flan-T5Large  23.64  37.34  62.99  72.47    81.58
+BEINFO       25.57  38.99  63.22  71.52    100.0
Flan-T5XL     27.94  41.30  64.83  67.01    82.35
+BEINFO       27.90  39.27  64.81  77.14    100.0

Table 9: Zero-shot results on DoQA Movies.

Model         BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5Base   23.52  34.96  62.27  64.76    100.0
+BEINFO       22.40  32.95  61.88  79.12    100.0
Flan-T5Large  27.50  41.59  66.02  69.90    100.0
+BEINFO       25.27  36.10  62.29  77.35    100.0
Flan-T5XL     25.88  41.68  66.91  66.42    100.0
+BEINFO       23.28  36.22  63.02  81.77    100.0

Table 10: Zero-shot results on DoQA Travel.

D Zero-Shot Results on TopiOCQA

The results on TopiOCQA when the smaller dataset DoQA is used for BEINFO fine-tuning are presented in Table 11.

Model         BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5Base   19.10  43.44  63.72  68.17    100.0
+BEINFO       16.08  31.41  58.87  68.85    100.0
Flan-T5Large  23.26  42.0   63.64  75.83    100.0
+BEINFO       24.47  37.16  62.31  76.33    100.0
Flan-T5XL     22.41  42.52  63.79  77.43    100.0
+BEINFO       27.13  40.59  62.58  76.89    100.0

Table 11: Zero-shot results on TopiOCQA when DoQA is used for BEINFO fine-tuning.
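K-Precision, reported throughout the tables above, measures how lexically grounded a response is in the knowledge source. Following the definition used by Adlakha et al. (2023), it can be sketched as the fraction of response tokens that also occur in K; the whitespace-plus-punctuation tokenisation below is a simplifying assumption, not the exact tokeniser used in the paper:

```python
import string

def k_precision(response: str, knowledge: str) -> float:
    """Fraction of response tokens that also occur in the knowledge source
    (token-level lexical precision; punctuation stripped, case-folded)."""
    def tokens(text):
        return [w.strip(string.punctuation).lower()
                for w in text.split() if w.strip(string.punctuation)]
    k_tokens = set(tokens(knowledge))
    r_tokens = tokens(response)
    if not r_tokens:
        return 0.0
    return sum(t in k_tokens for t in r_tokens) / len(r_tokens)

knowledge = "Cut them in half for the grill and remove the choke."
grounded = k_precision("Cut them in half and remove the choke.", knowledge)
ungrounded = k_precision("Boil it instead.", knowledge)
```

A score of 100.0 in the tables thus means every response token can be found in the provided knowledge source.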
E Evaluating Faithfulness with GPT4-Eval

The 7-point Likert scale used to evaluate faithfulness via GPT4 (i.e., the GPT4-Eval evaluation metric) is provided in Table 12.

6  Only information asked for (perfect)
5  Information asked for, but also provided additional information that is relevant to the supporting facts (good)
4  Follow-up or generic question (no specific information is asked for), agent asked for clarification (not great not terrible)
3  Information asked for, but also provided additional information that is irrelevant to the supporting facts (not bad)
2  Transfer the customer to the correct customer service department (ok)
1  No information asked for, but provided additional information that is either relevant or irrelevant to the supporting facts (bad)
0  Information provided is not coming from the supporting facts (terrible), or transfer customers to the wrong queue (poor)

Table 12: Likert scale for evaluating faithfulness automatically via GPT4.

F Results for Faithfulness vs. Abstractiveness

The results for factual faithfulness and abstractiveness on real conversations for Flan-T5XL tuned with BEINFO and larger language models are shown in Figure 7. The results demonstrate that BEINFO brings a much smaller model close to the performance of GPT-4, while surpassing the performance of a much larger open-source model, Falcon-40B. The exact numbers are shown in Table 13.

Figure 7: Density and K-BERTScore illustrating the trade-off between faithfulness (y-axis) and abstractiveness (x-axis).

Model       Density (↓)  Coverage (↓)  K-BERTScore (↑)
Flan-T5-XL  4.03         0.47          83.51
BEINFO (t)  2.35         49.18         84.64
BEINFO (g)  12.10        0.73          88.33
BEINFO (f)  2.32         0.60          86.30
Falcon-40B  5.72         0.46          84.25
GPT-4       2.01         0.64          87.49

Table 13: Results for faithfulness and abstractiveness on real user conversations. We use: a) K-BERTScore to measure faithfulness of the model to the knowledge source K; b) Density and Coverage (Grusky et al., 2018) to measure abstractiveness of the responses. (t)=task-tuned; (g)=general-only; (f)=full.
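Density and Coverage in Table 13 are the extractive fragment statistics of Grusky et al. (2018): Coverage is the fraction of response tokens inside spans copied verbatim from the knowledge source, and Density is the average squared length of those spans (longer copied spans mean more extractive responses). A simplified greedy sketch follows; the original algorithm has specific tie-breaking rules, so this only approximates it:

```python
def extractive_fragments(source_tokens, response_tokens):
    """Greedily match maximal shared token spans between a response and its
    source (simplified version of the fragment algorithm from Grusky et al., 2018)."""
    fragments, i = [], 0
    while i < len(response_tokens):
        best = 0
        for j in range(len(source_tokens)):
            k = 0
            while (i + k < len(response_tokens) and j + k < len(source_tokens)
                   and response_tokens[i + k] == source_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(response_tokens[i:i + best])
            i += best
        else:
            i += 1  # token not found in source; not part of any copied span
    return fragments

def density(source, response):
    """Average squared length of copied spans, normalised by response length."""
    s, r = source.lower().split(), response.lower().split()
    return sum(len(f) ** 2 for f in extractive_fragments(s, r)) / len(r)

def coverage(source, response):
    """Fraction of response tokens that lie inside copied spans."""
    s, r = source.lower().split(), response.lower().split()
    return sum(len(f) for f in extractive_fragments(s, r)) / len(r)
```

For a response copied wholesale from the source, Coverage is 1.0 and Density equals the response length, which is why high Density signals extractive (less abstractive) generation.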
