BeInfo

Abstract

Factual faithfulness is a crucial requirement in information-seeking dialogue: the system …
… directly 'as is', or as a starting point for further task-specific tuning, as illustrated in Figure 2.

Figure 2: An overview of different fine-tuning and inference setups for LLMs with and without BeInfo (§3). (Panel labels in the figure: Input, LLM, General BeInfo, Full BeInfo.)

3 Experimental Setup

Training Setup. In order to leverage inductive biases of instruction-tuned models, the input for BeInfo includes the following: (i) instructions to respond as factually accurately as possible; (ii) an augmented knowledge source, which includes the ground truth K and n = 4 distractors K′ for 'answerable' dialogues, or 5 randomly sampled K′-s for unanswerable dialogues; and (iii) the dialogue history, which combines all the previous turns (the set H) and the current user query u. An example input and instruction text are shown in Appendix A. The models are then trained in a standard sequence-to-sequence fashion with cross-entropy loss. The output is either the ground truth response for answerable dialogues, where the knowledge source K contains the information to address the user's query, or a predefined response 'Could you please clarify or rephrase the query?' if the dialogue is unanswerable. Training the models with BeInfo proceeds at the turn level: the dialogue history at every turn is used as input.
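To make the input construction concrete, below is a minimal Python sketch of how a single BeInfo training example could be assembled. The function name, the field layout, and the 'CONVERSATION:'/'USER:' markers are our illustrative assumptions; only the instruction text, the distractor counts, and the fallback response come from the paper.

```python
import random

# Instruction text as shown in Appendix A.
INSTRUCTION = ("Please answer the following user query given "
               "the information and the conversation.")
# Predefined response for unanswerable dialogues.
FALLBACK = "Could you please clarify or rephrase the query?"

def build_example(history, query, gold_response,
                  gold_knowledge, distractors, answerable):
    """Compose (input, target) for one dialogue turn.

    history:        previous turns (the set H), as strings
    query:          current user query u
    gold_knowledge: ground-truth passage K (ignored if unanswerable)
    distractors:    candidate distractor passages K'
    """
    if answerable:
        # Ground truth K plus n = 4 distractors K'.
        passages = [gold_knowledge] + distractors[:4]
    else:
        # 5 randomly sampled K'-s for unanswerable dialogues.
        passages = distractors[:5]
    random.shuffle(passages)  # avoid leaking the gold passage's position

    source = "\n".join([INSTRUCTION,
                        "INFORMATION: " + " ".join(passages),
                        "CONVERSATION: " + " ".join(history),
                        "USER: " + query])
    # Target: the gold response if K answers the query, else the fallback.
    target = gold_response if answerable else FALLBACK
    return source, target
```

Shuffling the gold passage among the distractors (our assumption, not stated in the paper) would prevent the model from relying on positional cues instead of the passage content.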
Datasets. To perform behavioural fine-tuning, we use a standard dataset for information-seeking dialogue, FaithDial (Dziri et al., 2022a), and an established conversational QA dataset, TopiOCQA (Adlakha et al., 2022). Generalisation capabilities of the models after BeInfo tuning are evaluated on another domain and dataset (i.e., this could be seen as 'zero-shot' from the domain adaptation perspective). For this, unless explicitly stated otherwise, we rely on a multi-domain conversational QA dataset, DoQA (Campos et al., 2020). The key statistics of the datasets are in Table 1, with further details and data analyses in Appendix B.

Models. Prior work (Dziri et al., 2022a) has demonstrated that instruction-tuned models such as the Flan series (Chung et al., 2022) are a very strong baseline for factuality in information-seeking dialogue. Thus, we use them as a base for the proposed method.² In the experiments, we use Flan-T5 (Chung et al., 2022) (BASE, LARGE and XL) and Tk-Instruct-3B (Wang et al., 2022). All the backbone models were pretrained on a large number of tasks with instructions, which yields faster specialisation of the models to information-seeking dialogue, especially when, as in our setup, the input/prompt includes a short description of the task.

² We again note that BeInfo can be applied on top of any generative model.
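For reference, the backbones can be loaded through the Transformers library (Wolf et al., 2020). The checkpoint identifiers below are the public Hugging Face names that we assume correspond to the models listed above, not identifiers confirmed by the paper.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Public checkpoints assumed to match the backbones used in the paper.
CHECKPOINTS = ["google/flan-t5-base", "google/flan-t5-large",
               "google/flan-t5-xl", "allenai/tk-instruct-3b-def"]

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS[0])
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINTS[0])
```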
Fine-Tuning and Inference Setups. The LLMs can be used directly in the final task in a fully zero-shot manner or via in-context learning as 'black boxes': this is a typical usage of very large models in dialogue tasks. We can also conduct BeInfo tuning of 'smaller LLMs' via different regimes: (i) fine-tuning directly on the task data but with augmented knowledge sources (if available), i.e., task-only BeInfo; (ii) fine-tuning only on the available data from other dialogue datasets and …
References

Song Feng, Siva Sankalp Patel, Hui Wan, and Sachindra Joshi. 2021. MultiDoc2Dial: Modeling dialogues grounded in multiple documents. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6162–6176, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anushree Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations. In Interspeech 2019.

Fengran Mo, Jian-Yun Nie, Kaiyu Huang, Kelong Mao, Yutao Zhu, Peng Li, and Yang Liu. 2023. Learning to relate to previous turns in conversational search. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA. Association for Computing Machinery.

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Hannah Rashkin, David Reitter, Gaurav Singh Tomar, and Dipanjan Das. 2021. Increasing faithfulness in knowledge-grounded dialogue with controllable features. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 704–718, Online. Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Sebastian Ruder. 2021. Recent advances in language model fine-tuning. http://ruder.io/recent-advances-lm-fine-tuning.

Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2087–2097, Brussels, Belgium. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. 2023. Trusting your evidence: Hallucinate less with context-aware decoding. arXiv preprint arXiv:2305.14739.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. 2023. Contrastive learning reduces hallucination in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13618–13626.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Liang Wang, Nan Yang, and Furu Wei. 2023a. Learning to retrieve in-context examples for large language models. arXiv preprint arXiv:2307.07164.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhangdie Yuan, Songbo Hu, Ivan Vulić, Anna Korhonen, and Zaiqiao Meng. 2023. Can pretrained language models (yet) reason deductively? In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1447–1462, Dubrovnik, Croatia. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Kangyan Zhou, Shrimai Prabhumoye, and Alan W. Black. 2018. A dataset for document grounded conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 708–713, Brussels, Belgium. Association for Computational Linguistics.
A Example Input and Instruction Text

Please answer the following user query given the information and the conversation.

INFORMATION: I trim the stem and remove the outer leaves till they snap to get to the fresh inner core and steam them the night or morning before grilling so they are cold and moist. I prefer steaming because I want all of the nutrients to remain in the artichoke. I cut them in half for the grill, remove the choke and brush them with grapeseed oil where they come into contact with the grill. First, facing down grill them till they feel hot on top then; flip them over to keep the yummy inner side tender. fill the cavity with garlic butter and... lemon if you wish. I prefer the brown color to the lemon flavor.

                     FaithDial  TopiOCQA  DoQA
K-BERTS-F1 (y, K)    67.31      62.91     52.48
K-Precision (y, K)   46.23      80.67     97.73
Avg. length (y)      17.17      10.89     13.29

Table 7: BERTScore-F1 and K-Precision between the ground truth knowledge source K and gold response y. Average length is calculated as the arithmetic mean of the number of whitespaced words in a response.
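For clarity, here is a sketch of how the two surface statistics in Table 7 can be computed. It follows our reading of the caption (K-Precision as the fraction of response tokens that also occur in the knowledge source K, with simple whitespace tokenisation as an assumption), not a verified reimplementation of the authors' evaluation scripts.

```python
def k_precision(response: str, knowledge: str) -> float:
    """Fraction of response tokens that also appear in the knowledge K."""
    resp_tokens = response.lower().split()
    know_tokens = set(knowledge.lower().split())
    if not resp_tokens:
        return 0.0
    return sum(tok in know_tokens for tok in resp_tokens) / len(resp_tokens)

def avg_length(responses: list[str]) -> float:
    """Arithmetic mean of the number of whitespaced words per response."""
    return sum(len(r.split()) for r in responses) / len(responses)
```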
B Dataset Statistics and Analyses

We present overall statistics of the datasets used for BeInfo and evaluation in Table 1. Additionally, we analyse the factual faithfulness characteristics of the true responses with respect to the knowledge source. The results in Table 7 demonstrate that the responses in FaithDial (Dziri et al., 2022a) are semantically most similar to their knowledge source, which is in line with the dataset collection procedure aimed at making the dataset more factual than the original responses. The similarity of contextual semantic token representations (BERTS-F1) is inversely correlated with the lexical overlap between the response and the knowledge source. As BeInfo is aimed at improving the model's general factual faithfulness, the results suggest that FaithDial (Dziri et al., 2022a) and TopiOCQA (Adlakha et al., 2022) are best used for behavioural tuning and DoQA for testing the out-of-distribution capabilities of the model. The former two have a large semantic but not literal overlap between the knowledge source and the corresponding golden response, meaning that behavioural tuning will not lead to the model learning to 'copy-paste' from the knowledge source into the response.

C Zero-Shot Results on DoQA

Model          BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5 BASE   22.89  34.46  61.60  67.75    90
+BeInfo        22.76  34.04  61.71  77.55    100
Flan-T5 LARGE  26.16  39.57  64.61  71.38    93.86
+BeInfo        26.34  38.55  63.19  75.55    100
Flan-T5 XL     28.66  41.99  65.89  67.21    94.12
+BeInfo        26.65  39.39  64.60  80.19    100

Table 9: Zero-shot results on DoQA Movies.

Model          BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5 BASE   23.52  34.96  62.27  64.76    100.0
+BeInfo        22.40  32.95  61.88  79.12    100.0
Flan-T5 LARGE  27.50  41.59  66.02  69.90    100.0
+BeInfo        25.27  36.10  62.29  77.35    100.0
Flan-T5 XL     25.88  41.68  66.91  66.42    100.0
+BeInfo        23.28  36.22  63.02  81.77    100.0

Table 10: Zero-shot results on DoQA Travel.

D Zero-Shot Results on TopiOCQA

The results on TopiOCQA when the smaller dataset DoQA is used for BeInfo fine-tuning are presented in Table 11.

Model          BLEU   ROUGE  BERTS  K-BERTS  K-Precision
Flan-T5 BASE   19.10  43.44  63.72  68.17    100.0
+BeInfo        16.08  31.41  58.87  68.85    100.0
Flan-T5 LARGE  23.26  42.00  63.64  75.83    100.0
+BeInfo        24.47  37.16  62.31  76.33    100.0
Flan-T5 XL     22.41  42.52  63.79  77.43    100.0
+BeInfo        27.13  40.59  62.58  76.89    100.0

Table 11: Zero-shot results on TopiOCQA when DoQA is used for BeInfo fine-tuning.
6  Only information asked for (perfect)
5  Information asked for, but also provided additional information that is relevant to the supporting facts (good)
4  Follow-up or generic question (no specific information is asked for), agent asked for clarification (not great not terrible)
3  Information asked for, but also provided additional information that is irrelevant to the supporting facts (not bad)
2  Transfer the customer to the correct customer service department (ok)
1  No information asked for, but provided additional information that is either relevant or irrelevant to the supporting facts (bad)
0  Information provided is not coming from the supporting facts (terrible), or transfer customers to the wrong queue (poor)