
GrowOVER: How Can LLMs Adapt to Growing Real-World Knowledge?

Dayoon Ko    Jinyoung Kim    Hahyeon Choi    Gunhee Kim

Seoul National University
https://github.com/dayoon-ko/GrowOVER

Abstract

In the real world, knowledge is constantly evolving, which can render existing knowledge-based datasets outdated. This unreliability highlights the critical need for continuous updates to ensure both accuracy and relevance in knowledge-intensive tasks. To address this, we propose GrowOVER-QA and GrowOVER-Dialogue, dynamic open-domain QA and dialogue benchmarks that undergo a continuous cycle of updates, keeping pace with the rapid evolution of knowledge. Our research indicates that retrieval-augmented language models (RaLMs) struggle with knowledge that has not been trained on or recently updated. Consequently, we introduce a novel retrieval-interactive language model framework, where the language model evaluates and reflects on its answers for further re-retrieval. Our exhaustive experiments demonstrate that our training-free framework significantly improves upon existing methods, performing comparably to or even surpassing continuously trained language models.

[Figure 1: An illustration of GrowOVER benchmarks, showing a GrowOVER-Dialogue exchange about Messi's all-time rank in goals from direct free kicks, with its evidence text from Wikipedia, an outdated expert answer, and an updated one. GrowOVER is automatically generated and continuously updated. It provides the evidence text to evaluate the retriever and also comprehensively evaluates the generator through an open-domain dialogue task.]

1 Introduction

In natural language research, many knowledge-intensive tasks have been actively studied, including open-domain QA (Kwiatkowski et al., 2019; Joshi et al., 2017; Yang et al., 2018), fact-checking (Thorne et al., 2018), entity linking (Hoffart et al., 2011), and open-domain dialogue (Dinan et al., 2018), to name a few (Petroni et al., 2021; Levy et al., 2017). Such a knowledge-intensive task mostly requires utilizing world knowledge to generate a proper answer for a given query (Lewis et al., 2020). However, the amount of real-world knowledge is often too enormous for models to fully store in their parameters. Thus, in most scenarios, a retriever is employed to seek relevant paragraphs or documents in a knowledge bank, such as a vector database, and a generator provides answers based on the retrieved passages or documents (Lewis et al., 2020; Guu et al., 2020). Previous benchmarks (Kwiatkowski et al., 2019; Dinan et al., 2018; Petroni et al., 2021) annotate gold answers and the evidence text needed to predict them, and evaluate the retriever using the evidence text and the generator using the gold answers.

In the real world, new knowledge is constantly being created, and existing knowledge is changing over time, causing annotated benchmarks to become quickly outdated. To handle this issue, Kim et al. (2023) and Margatina et al. (2023) respectively suggest dynamic QA and cloze query benchmarks, which are automatically generated by comparing two Wikipedia (or Wikidata) snapshots at different times. However, they provide no annotated evidence text for the gold answers, which may make it difficult to evaluate retrievers in open-domain knowledge-intensive tasks. In the retrieval-augmented generation (RAG) framework, it is crucial to accurately measure the performance of each component as well as end-to-end performance. This enables precise identification of error sources and inaccuracies, allowing for less frequent updates to Large Language Models (LLMs). Also, the tasks of these benchmarks require the model to provide only direct answers, which mainly consist of entities. In contrast, real-world knowledge cannot be structured as simple question-and-answer pairs. Instead, it is more accurately represented as a vast, interconnected knowledge graph. Therefore, there are limitations in evaluating the generator's ability to provide contextually appropriate and informative answers that incorporate relevant background knowledge.

To overcome such limitations, we propose novel open-domain dynamic benchmarks, GrowOVER-QA and GrowOVER-Dialogue (Growing Open-domain knowledge benchmarks for retrieVal-augmented genERation). As illustrated in Figure 1, GrowOVER provides the evidence text along with the gold answers, which can be used to evaluate the retriever. In addition, we utilize the evidence text to verify the validity of previously generated datasets and maintain valid ones in the succeeding time steps. Consequently, our benchmarks continue to grow from their initial creation as new Wikipedia snapshots continue to come in. Furthermore, GrowOVER introduces a dialogue task to better evaluate the generator. The open-domain dialogue task challenges models to adapt to the user's responses and potentially shift topics while still responding accurately. This demands a more sophisticated understanding and application of world knowledge, allowing for a more extensive evaluation. Table 1 presents the comparison of GrowOVER with other benchmarks.

                 TempLAMA   RealtimeQA   DynamicTempLAMA   TemporalWiki   EvolvingQA   GrowOVER
                 (2022)     (2022)       (2023)            (2022)         (2023)       (Ours)
Label types      C          C            C, U, N           C, U           C, U, N      C, U, N
Automation       ✗          ✗            ✓                 ✓              ✓            ✓
Maintenance      ✗          ✗            ✗                 ✗              ✗            ✓
Evidence text    ✗          ✗            ✗                 ✗              ✗            ✓
Tasks            Cloze query  QA         Cloze query       ✗              QA           QA & Dialogue

Table 1: Comparison of our GrowOVER with existing benchmarks. Label types lists the data types available in each dataset: Changed (C), Unchanged (U), and New (N). Automation indicates the feasibility of automatic generation. Maintenance indicates whether the validity of previously generated datasets is verified at the forthcoming time step. Evidence text indicates whether the dataset includes the evidence text. Lastly, Tasks identifies the intended tasks for each dataset.

To enable intermittently updated LLMs to cope with the rapidly evolving world, recent research has explored two approaches: retrieval (Kasai et al., 2022; Ram et al., 2023) and continual pretraining (Jang et al., 2022; Kim et al., 2023). The retrieval approach employs a retriever to supply LLMs with new information from an updated database, leveraging their in-context learning capabilities. The continual pretraining approach updates outdated knowledge within the LLMs' parameters, thereby preventing hallucinations. However, constantly updating LLMs can be costly and prone to performance degradation, while relying solely on the retriever can be vulnerable. Therefore, we propose a retrieval-interactive LLM (RiLM). In RiLM, the LLM evaluates its own answers and, if found unreliable, provides feedback to the retriever to locate more relevant documents. The LLM then uses the feedback to generate improved answers.

Finally, our contributions are as follows.
1. We introduce GrowOVER, a set of dynamic QA and dialogue benchmarks that evaluate both retrievers and generators by annotating the evidence text and introducing a challenging dialogue task.
2. We propose RiLM, a framework where the LLM evaluates its own answers and provides feedback to the retriever to correct retrievals, thereby regenerating better answers.
3. We empirically demonstrate the effectiveness of our method without requiring additional pre-training of the LLM.

2 Related Work
Temporal sensitivity. Temporal misalignment occurs when training and test datasets originate from different time periods. Past studies (Lazaridou et al., 2021; Luu et al., 2021) report poor performance in downstream tasks when making predictions beyond the training period. Thus, recent research (Dhingra et al., 2022; Liska et al., 2022; Saxena et al., 2021; Jang et al., 2022; Kim et al., 2023) makes efforts to evaluate how LLMs handle time-sensitive information. In particular, there have been approaches to utilize a retriever for time-sensitive knowledge. Zhang and Choi (2021) and Longpre et al. (2021) report that, even with an updated evidence corpus, language models trained on previous data struggle to respond to questions in the present. However, Kasai et al. (2022) show that LLMs can adjust their generated responses to recently retrieved documents provided by prompting. Still, when failing to retrieve appropriate documents, LLMs may produce outdated answers. To address this issue, we propose a retrieval-interactive LLM framework that allows the LLM to provide feedback to the retriever to fetch more relevant documents when the answer is less reliable.

Retrieval augmented LLM. In the initial stages, language models had limited capacity to store a vast amount of factual details. Hence, prior studies (Lewis et al., 2020; Guu et al., 2020) introduce RAG, where the generator responds based on the passages provided by the retriever. As LLMs grow larger and are pre-trained on huge text corpora, Ram et al. (2023) propose, instead of training LLMs, to combine retrieved content with a query into a prompt for LLMs to generate an answer. Additionally, Shi et al. (2023) use an ensemble scheme that provides multiple documents to the LLM, which determines the next token by summing the probability of the next token for each document. Recently, studies have focused on when or what to retrieve. For instance, Asai et al. (2023) use special tokens to decide when to retrieve, and then generate and reflect on the passages and generated answers. Similarly, Jiang et al. (2023) generate a sentence, and if the generated tokens have low probabilities, retrieve passages using the generated sentence for a long-form generation task.

Continual Knowledge Learning. Continual Learning (CL) focuses on training a model on multiple sequential tasks, while retaining knowledge from previously learned tasks and adapting to new ones (Chen and Liu, 2018; He et al., 2021; Chen et al., 2020; Xu et al., 2023; Hu et al., 2021; Wang et al., 2020). In the realm of knowledge-intensive tasks, there is an additional imperative for knowledge revision. Addressing this, Jang et al. (2021) introduce the concept of continual knowledge learning (CKL) to manage the dynamic nature of world knowledge. It involves not just retaining previous knowledge but also embracing new information and adapting to updates. These objectives align with the goal of our benchmarks.

3 The GrowOVER Dataset

GrowOVER comprises two distinct datasets: QA and Dialogue. GrowOVER-QA is designed to evaluate the ability to recall entities, while GrowOVER-Dialogue features user-expert interactions over 3-4 turns to highlight generation capabilities. Each instance from both datasets is annotated with the evidence text and the type: Unchanged, Changed, or New. Our goal is to evaluate the retention of unchanged knowledge, the updating of changed knowledge, and the acquisition of new knowledge, aligning with the objectives of CKL.

Article selection. GrowOVER is based on Wikipedia snapshots [1], which contain a vast amount of world knowledge. Although there is no limitation to applying our algorithm to generate QA and conversation for any articles, we select the articles linked to the Portal:Current Events Wikipedia article from January 2023 to December 2023 (about 12K articles).

[1] We download Wikipedia data dumps from https://dumps.wikimedia.org/enwiki/ and use monthly snapshots from 2023-08-20 to 2023-12-20.

Overall process. We create initial QA and dialogue instances from the 2023-08-20 Wikipedia snapshot using GPT-4. For each subsequent snapshot, we label each sentence in articles as unchanged, changed, or new by comparing it to the previous month's snapshot. Then, we retain QA and dialogue instances when the evidence text is labeled as unchanged, and create new instances from new or changed sentences. This process is repeated for each month's new snapshot from September through December. Algorithms and prompt templates for sentence labeling and data generation are detailed in Appendix C. Also, the statistics of the datasets are provided in Appendix E.
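To make the generation step concrete, the following is a minimal sketch of the kind of GPT-4 call used throughout the pipeline, with the decoding settings reported in Appendix A (gpt-4-1106-preview, temperature 0, max tokens 256). The prompt text and the function name are illustrative placeholders, not the released implementation; the actual prompt templates are given in Appendix C.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def generate_qa(paragraph: str) -> str:
    # Placeholder prompt; the real templates are detailed in Appendix C.
    prompt = (
        "Generate a question-answer pair from the paragraph below. "
        "The question must be answerable without the context, the answer must "
        "be a short entity, and return the evidence sentence(s).\n\n" + paragraph
    )
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content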
[Figure 2: The overview of the dataset generation process, illustrated on the Lionel Messi article: (a) Initial Generation, (b) Sentence Labeling (unchanged via sentence and sentence-group comparison; new and changed via similarity thresholds and NLI with a GPT-4 contradiction check), and (c) Temporal Updates. Please refer to §3.1–3.3 together for detailed explanations.]

3.1 Initial Generation

GrowOVER-QA. Each article in the initial month's snapshot is split into paragraphs. Then, we select up to four paragraphs, for each of which we prompt GPT-4 to generate QA. The sentences used for generating QA become the evidence text. To ensure dataset quality, we experimentally set the criteria for paragraph selection. First, we select paragraphs with fewer than five sentences and between 300 and 600 characters. If there are more than four satisfying paragraphs, we use K-Means clustering to group them into four clusters and randomly select one paragraph from each cluster. This creates semantically diverse QAs while avoiding overly similar QAs. Afterward, we guide GPT-4 to satisfy the following: i) the question should be directly answerable without the context (e.g., no "According to the context, ..."), ii) the answers must be short and be entities, and iii) return a bounding box indicating the sentence(s) that include the answer as the evidence text.

GrowOVER-Dialogue. The paragraph selection process is the same as for QA. For each paragraph, we ask GPT-4 to create a dialogue involving user-expert interactions spanning 3-4 turns. We guide it to return the sentence used for generating each turn and then annotate it as the evidence text for each turn.
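As an illustration of the paragraph-selection step above, here is a minimal Python sketch: length-filter the paragraphs, cluster their embeddings with K-Means into four groups, and sample one paragraph per cluster. The sentence-counting heuristic and the encoder checkpoint are stand-in choices, not the exact implementation.

import random
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def select_paragraphs(paragraphs, k=4):
    # Criteria from Section 3.1: fewer than five sentences, 300-600 characters.
    valid = [p for p in paragraphs
             if p.count(". ") + 1 < 5 and 300 <= len(p) <= 600]
    if len(valid) <= k:
        return valid
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model
    embeddings = np.asarray(encoder.encode(valid))
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    # One random paragraph per cluster keeps the selected QAs semantically diverse.
    return [random.choice([p for p, c in zip(valid, clusters) if c == i])
            for i in range(k)]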
3.2 Sentence Labeling

Unchanged. Each time a new snapshot of Wikipedia becomes available, we first identify unchanged sentences for each pair of old and new articles. We compute sentence similarity using SimCSE (Berant et al., 2013) to localize semantically identical sentence pairs (s_old, s_new). If the similarity score exceeds a threshold of 0.99, we label (s_old, s_new) as unchanged.

Additionally, we group sentences and compare them to find all unchanged sentences. Wikipedia articles generally maintain the order of sentences even after editing. Hence, we group old and new sentences into S_old and S_new, respectively, that lie between pairs of previously identified unchanged sentences. Then, we compute the similarity between all subsets of S_old and S_new. If any subset of S_new matches a subset of S_old with a similarity score above the threshold, those sentences are labeled as unchanged. For instance, if we have matched (s_old1, s_new1) and (s_old4, s_new5) as unchanged, we then compare similarity between all subsets of S_old = {s_old2, s_old3} and S_new = {s_new2, s_new3, s_new4}. If the similarity score between {s_old2} and {s_new2, s_new3} exceeds the threshold, we label (s_old2, {s_new2, s_new3}) as unchanged as well. After identifying unchanged sentences in the groups, we update the groups enclosed by the new pairs of unchanged sentences. In the previous example, S_old becomes {s_old3} and S_new becomes {s_new4}.

NLI. Next, we classify unlabeled sentences in S_old and S_new using a natural language inference (NLI) task with RoBERTa (Liu et al., 2019) fine-tuned on the MultiNLI dataset (Williams et al., 2018). The NLI task, given a premise and a hypothesis, classifies the hypothesis as entailment, neutral, or contradiction. In this step, we provide the model with each sentence pair (s_old, s_new), where s_old ∈ S_old and s_new ∈ S_new, treating s_old as the premise and s_new as the hypothesis. If s_new is classified as entailed by any s_old, we label that s_new as unchanged.

Changed. Else if s_new is classified as a contradiction with any s_old, we label the pair (s_old, s_new) as changed. Besides, we check whether their similarity is higher than τ2, set to 0.6, since changed sentences share some content and are not entirely new. After that, we double-check with GPT-4 whether s_old and s_new are contradictory to ensure that (s_old, s_new) is changed.

New. Otherwise, if s_new is classified as neutral, we further check its similarity scores. If the similarity with all s_old ∈ S_old is lower than τ1, set to 0.7, we classify that s_new as new, since a new sentence should not have similar counterparts in the old document.
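A minimal sketch of the per-pair labeling logic above, using the SimCSE checkpoint named in Appendix A and the thresholds from this section (0.99 for unchanged, τ1 = 0.7, τ2 = 0.6) with an MNLI-finetuned RoBERTa. The group-level subset matching and the GPT-4 contradiction double-check are omitted, and the "new" test is simplified to a single pair rather than all s_old ∈ S_old, so this is an illustration rather than the released code.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

sim_tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-roberta-large")
sim_model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-roberta-large")
nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def similarity(a: str, b: str) -> float:
    # [CLS] embeddings as a simple SimCSE pooling; cosine similarity between them.
    with torch.no_grad():
        ea = sim_model(**sim_tok(a, return_tensors="pt")).last_hidden_state[:, 0]
        eb = sim_model(**sim_tok(b, return_tensors="pt")).last_hidden_state[:, 0]
    return float(F.cosine_similarity(ea, eb))

def label_pair(s_old: str, s_new: str) -> str:
    sim = similarity(s_old, s_new)
    if sim > 0.99:
        return "unchanged"
    with torch.no_grad():
        logits = nli_model(**nli_tok(s_old, s_new, return_tensors="pt")).logits
    verdict = nli_model.config.id2label[int(logits.argmax())]
    if verdict == "ENTAILMENT":
        return "unchanged"
    if verdict == "CONTRADICTION" and sim > 0.6:   # tau_2; GPT-4 double-check omitted
        return "changed"
    if verdict == "NEUTRAL" and sim < 0.7:         # tau_1; checked against all s_old in practice
        return "new"
    return "undecided"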
3.3 Temporal Updates

If an article is newly added in the new snapshot, we perform the initial generation as done in §3.1. Otherwise, based on the results of sentence labeling, we update GrowOVER in two ways: i) maintain or exclude existing QA and dialogue turn instances, and ii) generate new instances.

Maintenance. Each QA and turn instance is annotated with evidence sentences and their indices within the article. If all evidence sentences s_old are labeled as unchanged, we keep the corresponding QA or turn as Unchanged and update the evidence sentences and their indices as of the new snapshot. Otherwise, we exclude the instance since it is not guaranteed to be Unchanged. For QA, we simply delete the instance from our dataset. But for dialogue, we only exclude the turn during evaluation instead of the whole dialogue.

Generation. We generate new QA and dialogues with new and changed sentences. For QA, we find consecutive new sentences and split them into multiple groups if there are more than six sentences. For each group, we prompt GPT-4 to generate a New QA instance and annotate the evidence text as done before. For changed sentences, we provide GPT-4 with both the original and the revised sentences, prompt it to generate a Changed question with contradictory answers based on each sentence, and annotate the revised sentences as evidence text. For dialogue generation, similarly to the initial generation process, we select informative paragraphs with changed or new sentences. We then prompt GPT-4 to generate dialogues and annotate the used sentences as the evidence text for each turn. If the evidence text is changed or new, we label the generated turn as Changed or New, respectively.
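The maintenance-and-generation logic of §3.3 can be summarized in a short, self-contained toy sketch. This is our reading of the process; the data structures, helper stubs, and bookkeeping are illustrative rather than the released implementation.

def monthly_update(instances, old_snapshot, new_snapshot, label_sentences, generate):
    """One update cycle: maintain instances whose evidence is unchanged,
    and create new instances from new or changed sentences."""
    kept, created = [], []
    for title, new_article in new_snapshot.items():
        if title not in old_snapshot:
            created.extend(generate(new_article, label="New"))  # initial generation (3.1)
            continue
        labels = label_sentences(old_snapshot[title], new_article)  # Section 3.2
        for inst in (i for i in instances if i["title"] == title):
            # Keep only if every evidence sentence is labeled unchanged;
            # otherwise delete the QA (or skip the dialogue turn at evaluation).
            if all(labels.get(s) == "unchanged" for s in inst["evidence"]):
                kept.append(inst)
        fresh = [s for s, lab in labels.items() if lab in ("new", "changed")]
        if fresh:
            created.extend(generate(fresh, label="New/Changed"))  # prompts GPT-4
    return kept + created

# Toy usage with trivial stubs.
stub_label = lambda old, new: {s: ("unchanged" if s in old else "new") for s in new}
stub_generate = lambda sents, label: [{"title": "X", "evidence": list(sents), "type": label}]
old = {"Messi": ["Messi ranks 9th all time."]}
new = {"Messi": ["Messi ranks 5th all time."]}
inst = [{"title": "Messi", "evidence": ["Messi ranks 9th all time."]}]
print(monthly_update(inst, old, new, stub_label, stub_generate))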
[Figure 3: The RiLM framework. Given a query, we retrieve top-k documents and generate k prompts to the LLM in parallel. The certainty classifier predicts reliable, misleading, or uncertain for each prompt. If reliable, the Decision Gate adopts the answer. Otherwise, we return to the retrieval step with the LLM's output and the reliable probability. In Adaptive Re-Retrieval, the retriever reflects this output for better retrieval, based on which the LLM re-generates answers.]

4 Approach

If the LLM has been trained on outdated data or has never been trained on new data, it may not be able to answer questions on new information correctly. However, updating the parameters of such models should be conducted with the greatest caution to avoid potential side effects, such as catastrophic forgetting. Therefore, our approach ensures that neither the LLM nor the retriever is continuously trained with new data.

To adapt LLMs to rapidly evolving world knowledge, we propose RiLM, as shown in Figure 3. While freezing both the LLM and the retriever, we introduce the Decision Gate, which decides whether to accept the LLM's answer based on its certainty score (Section 4.1). If the answer is not sufficiently confident, the retrieval-generation process is performed again, termed Adaptive Re-Retrieval (Section 4.2). In this process, the LLM's previous output is fed back to the retriever to fetch documents again, enabling the LLM to generate better answers.

Similar to our approach, Asai et al. (2023) propose using reflection tokens to confirm output relevance, support, or completeness. However, their method requires training LLMs to predict reflection tokens. In contrast, RiLM only involves training the certainty classifier. Additionally, for long-form text generation, Jiang et al. (2023) generate the next sentence and then concatenate it with the question to form a retrieval query. They exclude tokens generated with low probability to avoid interrupting retrieval. Our approach also uses generated answers for re-retrieval but considers all generated tokens only to the extent that the LLM is certain about the answer.

4.1 The Decision Gate

Previous work (Ram et al., 2023) has often concatenated the top-k retrieved documents {D_1, ..., D_k} with a query Q into a single prompt to the LLM. However, some irrelevant content may degrade performance. Thus, following Shi et al. (2023), we concatenate each document D_i with the query Q into a prompt to the LLM in parallel. To select the best answer from these prompts, we add a certainty classifier on top of the last multi-head attention layer of the LLM.

More specifically, we pass the query and each document through the LLM to obtain the last hidden state vector h_LLM(Q, D_i). For each hidden state, the certainty classifier predicts one of the following three labels: i) reliable: the LLM confidently knows the answer, ii) misleading: the LLM knows the answer but could be wrong, or iii) uncertain: the LLM does not know the answer exactly.

To train the classifier, we assume that the LLM knows the answer if the data it has been trained on remains unchanged and the retrieval succeeds. Conversely, the LLM might incorrectly know the answer if the data it has been trained on has been updated and the retrieval fails. Lastly, the LLM does not know the answer if it has never been trained on the data and the retrieval fails. Based on this assumption, we train the classifier to predict: i) reliable: given an Unchanged QA/turn with correct retrieval, ii) misleading: given a Changed QA/turn with wrong retrieval, and iii) uncertain: given a New QA/turn with wrong retrieval. We train the classifier separately for the QA and dialogue tasks. We use 512, 245, and 512 data points of Unchanged, Changed, and New GrowOVER-QA, respectively, and 512, 133, and 512 Unchanged, Changed, and New turns of GrowOVER-Dialogue from September.

We choose the hidden state h_LLM(Q, D*) with which the certainty classifier outputs the highest reliable probability:

    D* = argmax_{D ∈ {D_i}} p_CLF(reliable | h_LLM(Q, D)).   (1)

Afterward, the LLM generates the answer y_LLM based on h_LLM(Q, D*). If the certainty classifier predicts the label as reliable given h_LLM(Q, D*), the Decision Gate adopts the answer; otherwise, we return to the retrieval step.

4.2 Adaptive Re-Retrieval

If the classifier does not predict reliable, we re-retrieve documents since all top-k documents are unlikely to be helpful in generating correct answers. Instead of simply retrieving the next set of top-k documents, we propose an Adaptive Re-Retrieval (ARR) method. This method feeds the LLM's answer and certainty value back to the retriever to improve relevance.

In ARR, the retriever relies on the LLM's answer to the extent that it is reliable. The reliable probability is computed by

    ω = λ · p_CLF(reliable | h_LLM(Q, D*)).   (2)

The hyperparameter λ is set to optimize re-retrieval relevance on the training dataset for the certainty classifier. The relevance score in ARR is then calculated as a weighted sum of two components: the cosine similarity between the query and the document, and the similarity between the document and a concatenation of the query with the generated answer y_LLM. This is formulated as:

    score = (1 − ω) · sim(E(Q), E(D)) + ω · sim(E([Q : y_LLM]), E(D)).   (3)

The retriever re-retrieves the top-k documents based on Eq. (3). We adjust the reflection of the generated answer based on the LLM's predicted reliable probability: the less reliable an answer is, the less likely it is to be used. After the re-retrieval, we again choose the last hidden state vector from the document with the highest reliable probability among the re-retrieved documents. In the final step, the Decision Gate compares the reliable probabilities of the initial and newly generated answers, selecting the one with the higher probability.
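The Decision Gate and ARR reduce to a few lines of tensor code. Below is a hedged numeric sketch of Eqs. (1)-(3) with an untrained stand-in classifier head and random embeddings; in the real system the hidden states come from Llama2 and the embeddings from the retriever's encoder, and λ = 2.0 follows Appendix A.

import torch
import torch.nn.functional as F

clf = torch.nn.Linear(4096, 3)   # certainty head over h_LLM; labels: reliable/misleading/uncertain
RELIABLE = 0

def reliable_probs(hidden_states):
    # hidden_states: (k, 4096), one last hidden state per (Q, D_i) prompt.
    return F.softmax(clf(hidden_states), dim=-1)[:, RELIABLE]

def decision_gate(hidden_states):
    """Eq. (1): choose D* with the highest reliable probability; adopt the answer
    only if 'reliable' is also the argmax label for that prompt."""
    probs = reliable_probs(hidden_states)
    best = int(probs.argmax())
    predicted = int(F.softmax(clf(hidden_states[best]), dim=-1).argmax())
    return best, probs[best], predicted == RELIABLE

def arr_scores(e_query, e_query_answer, e_docs, p_reliable, lam=2.0):
    """Eqs. (2)-(3): omega = lam * p(reliable); weighted mix of sim(Q, D) and
    sim([Q : y_LLM], D) on unit-norm embeddings (dot product = cosine)."""
    omega = lam * p_reliable
    return (1 - omega) * (e_docs @ e_query) + omega * (e_docs @ e_query_answer)

# Toy run with random tensors.
hs = torch.randn(3, 4096)
best, p, adopt = decision_gate(hs)
if not adopt:
    e_q = F.normalize(torch.randn(384), dim=0)
    e_qa = F.normalize(torch.randn(384), dim=0)    # encodes the concatenation "Q : y_LLM"
    e_docs = F.normalize(torch.randn(100, 384), dim=1)
    reretrieved = arr_scores(e_q, e_qa, e_docs, p).topk(3).indices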
5 Experiments

5.1 Experimental Setup

Baselines. We compare the following systems: i) Vanilla: the LLM without retrieval, ii) Self-RAG: an adaptive RAG baseline (Asai et al., 2023), iii) RaLM: the vanilla LLM with concatenated retrievals (Ram et al., 2023), iv) RaLM-CP: a continuously pre-trained LLM (Jang et al., 2022) with concatenated retrievals, v) RaLM-D*: the LLM generates an answer with the classifier's selected document, and vi) RiLM. We use the top-k (k=3) documents for retrievals.

Database. Since it requires much computation to use the entire snapshot (6M articles) for the database, we randomly select 100K articles in addition to the 12K articles selected for GrowOVER generation. We first split these articles using the LangChain document loader [2], which semantically segments given documents. Then, we index the documents using FAISS (Johnson et al., 2019), following Shi et al. (2023).

[2] https://python.langchain.com/docs/modules/data_connection/document_loaders/

Retriever. Even though any retriever can be plugged into our framework, we use SentBERT (Reimers and Gurevych, 2019) since it reduces time cost by using Siamese encoders with strong performance. It uses cosine similarity to calculate sentence similarity. We also test Contriever (Izacard et al., 2021) and present the results in Appendix B.1. We use questions as retrieval queries for the QA task, and user queries combined with the chat history for the dialogue response task.

LLM. We use Llama2-7B (Touvron et al., 2023). We initially pre-train the model with the selected articles in the 2023-08-20 snapshot. For RaLM-CP, we continuously pre-train the model on each new snapshot available every month, using only the selected articles from these snapshots for dataset generation. For the other baselines, we freeze the initially pre-trained model. In the QA task, the prompt consists of the retrieved documents and questions, while in the dialogue task, the prompt concatenates the chat history, retrieved documents, and the user query.

Metric. We use the F1 score to evaluate QA, following Petroni et al. (2021), and the BLEU score (Papineni et al., 2002) for the dialogue task, following Chan et al. (2021).
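For concreteness, here is a hedged sketch of the database construction just described: split articles with LangChain's RecursiveCharacterTextSplitter (chunk size 1500, overlap 10, per Appendix A), embed passages with a SentBERT-style encoder, and index them in FAISS for cosine-similarity search. The encoder checkpoint is a stand-in choice.

import faiss
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=10)
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in SentBERT encoder

def build_index(articles):
    passages = [chunk for article in articles for chunk in splitter.split_text(article)]
    embeddings = encoder.encode(passages, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on unit vectors
    index.add(np.asarray(embeddings, dtype="float32"))
    return index, passages

def retrieve(index, passages, query, k=3):
    q = encoder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [(passages[i], float(s)) for i, s in zip(ids[0], scores[0])]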
                              9     10    11    12
GrowOVER-QA
Accuracy                      79.0  75.5  75.8  74.9
Average F1 (Adopted)          53.7  52.1  53.1  52.7
Average F1 (Not-adopted)      28.2  28.5  28.8  28.6
Average F1 (All)              43.5  42.9  42.8  42.3
GrowOVER-Dialogue
Accuracy                      59.0  58.6  58.3  58.6
Average BLEU (Adopted)        6.03  6.11  6.17  6.13
Average BLEU (Not-adopted)    3.44  3.42  3.44  3.45
Average BLEU (All)            4.68  4.69  4.72  4.70

Table 2: Accuracy of the certainty classifier and results of adopted / not-adopted answers on each month.

                   9     10    11    12
GrowOVER-QA
Q                  13.4  13.0  12.8  12.5
Q:y_LLM            14.5  14.1  13.4  13.3
ARR                14.6  14.5  13.9  13.6
GrowOVER-Dialogue
Q                  10.7  10.5  10.5  10.5
Q:y_LLM            11.5  11.3  11.4  11.3
ARR                11.7  11.7  11.5  11.5

Table 3: Accuracy of ARR on each month.

                   9     10    11    12
New
Vanilla            13.4  14.1  14.7  14.1
Self-RAG           23.6  22.6  23.2  22.7
RaLM               38.2  36.8  37.0  37.1
RaLM-CP†           39.2  38.3  37.6  37.7
RaLM-D* (Ours)     37.9  38.1  38.3  38.1
RiLM (Ours)        39.4  39.4  39.7  39.2
Changed
Vanilla            6.1   5.6   7.0   5.3
Self-RAG           18.2  19.2  17.7  20.1
RaLM               24.9  26.5  33.2  28.5
RaLM-CP†           26.0  29.0  33.6  29.6
RaLM-D* (Ours)     25.1  27.7  37.5  29.9
RiLM (Ours)        28.2  27.7  38.3  30.4
Unchanged
Vanilla            18.0  17.6  17.2  16.7
Self-RAG           26.7  25.5  25.4  26.1
RaLM               43.1  41.2  41.8  41.1
RaLM-CP†           44.0  43.3  42.7  42.0
RaLM-D* (Ours)     44.3  43.7  43.5  42.9
RiLM (Ours)        45.7  45.1  44.6  44.1
All
Vanilla            17.4  17.1  16.8  16.4
Self-RAG           26.3  25.1  25.1  25.7
RaLM               42.5  40.5  41.2  40.6
RaLM-CP†           43.4  42.6  42.1  41.5
RaLM-D* (Ours)     43.5  42.9  42.8  42.3
RiLM (Ours)        44.9  44.2  44.0  43.5

† continuously pre-trained language model

Table 4: F1 scores of GrowOVER-QA on each month.

                   9     10    11    12
New
Vanilla            0.85  0.81  0.84  0.88
Self-RAG           2.37  2.29  2.36  2.21
RaLM               4.98  5.08  5.06  4.76
RaLM-CP†           5.06  5.04  5.08  4.86
RaLM-D* (Ours)     5.27  5.21  5.42  5.07
RiLM (Ours)        5.36  5.27  5.51  5.15
Changed
Vanilla            1.58  2.68  1.40  1.87
Self-RAG           4.31  3.74  3.28  5.00
RaLM               5.09  6.25  6.89  6.30
RaLM-CP†           6.11  6.98  6.49  6.36
RaLM-D* (Ours)     6.60  7.19  6.01  6.38
RiLM (Ours)        7.26  7.67  6.05  6.64
Unchanged
Vanilla            1.13  1.12  1.10  1.11
Self-RAG           2.58  2.56  2.32  2.49
RaLM               4.42  4.41  4.44  4.45
RaLM-CP†           4.40  4.43  4.43  4.45
RaLM-D* (Ours)     4.65  4.67  4.69  4.69
RiLM (Ours)        4.68  4.69  4.71  4.71
All
Vanilla            1.12  1.11  1.09  1.10
Self-RAG           2.58  2.55  2.32  2.48
RaLM               4.44  4.44  4.46  4.46
RaLM-CP†           4.43  4.45  4.46  4.47
RaLM-D* (Ours)     4.68  4.69  4.72  4.70
RiLM (Ours)        4.70  4.72  4.74  4.73

† continuously pre-trained language model

Table 5: BLEU scores of GrowOVER-Dialogue on each month.
5.2 Experimental Results

We report not only the end-to-end performance of the QA and dialogue tasks, but also the performance of the classifier and ARR to demonstrate the effectiveness of each component. For all experiments, we exclude the training dataset of the classifier.

5.2.1 Results of the Classifier

Experimental results of the classifier are presented in Table 2. We evaluate the accuracy on all data points from all months. The accuracy is approximately 75 for GrowOVER-QA and 58 for GrowOVER-Dialogue. Since the chat history is included in the prompts, it is harder for the classifier to gauge certainty based on documents and queries, so performance tends to suffer slightly. Also, to show the actual effectiveness of the classifier, we separately calculate the average F1 / BLEU scores for adopted and not-adopted answers in the Decision Gate. The gap between the average F1 / BLEU scores of adopted and not-adopted answers is approximately 25.0 for the QA task and 2.7 for the dialogue task. This significant gap indicates the classifier can predict the certainty and reliability of the LLM.

5.2.2 Results of Adaptive Re-Retrieval

We compare the accuracy of ARR with two baselines: i) search with only the query (Q) and ii) always append y_LLM to Q (Q:y_LLM), which use sim(E(Q), E(D)) and sim(E([Q : y_LLM]), E(D)), i.e., ω = 0 and ω = 1 in Eq. (3), respectively. The gap between ARR and the former implies the degree to which a reliable y_LLM improves retrieval relevance, while the gap between ARR and the latter represents the degree to which ARR ignores a potentially incorrect y_LLM. We exclude the first retrieved documents from re-retrieval to avoid using duplicate retrievals. The results are shown in Table 3. In the QA task, ARR improves Q and Q:y_LLM by approximately 1.2 and 0.3, respectively, while in the dialogue task, the performance improves by 1.0 and 0.2, respectively. This result indicates that the output of the LLM, weighted by the reliable probability, can aid re-retrieval. Moreover, the effect of retrieval relevance on the final answer is demonstrated in the subsequent end-to-end results.

5.2.3 Results of GrowOVER-QA

Table 4 shows the performance on GrowOVER-QA from September to December. It displays the F1 score for New, Changed, and Unchanged QAs, as well as All types. When comparing Vanilla to the other baselines, it is clear that retrieval significantly enhances performance, highlighting its crucial role in open-domain tasks. Self-RAG shows modest improvement over Vanilla but underperforms compared to other retrieval-augmented models. Also, in general, RiLM demonstrates outstanding performance over the other baselines. For New, RiLM improves over RaLM by 1.2 in September up to 2.7 in November. It shows a higher F1 score over all months than RaLM-CP, even though RiLM is not continuously trained. For Changed, our method outperforms the other baselines, except for October, when RaLM-CP exceeds our method by 1.3. Nevertheless, our RiLM shows much more robust performance across the other months, with improvements ranging from 0.8 to 4.7. Moreover, for Unchanged and All, our method surpasses the performance of the other baselines. However, all baselines show performance degradation over the months, indicating the need for model updates in the future.

5.2.4 Results of GrowOVER-Dialogue

Table 5 displays the results on GrowOVER-Dialogue from September to December. Similar to the QA task, RiLM generally exhibits superior performance over the other baselines. For instance, RiLM improves upon RaLM-CP by an average of 0.3 for New and 0.4 for Changed. Also, RiLM enhances performance on both the Unchanged and All datasets, indicating its overall effectiveness.
However, for Changed in November, not only RiLM but also RaLM-D* underperforms compared to RaLM. This performance drop occurs only on Changed, which may indicate a limitation in predicting misleading cases. From the consistent performance improvements on New and Unchanged, we can infer that the LLM can identify what it knows and what it does not know. On the other hand, it occasionally has difficulty assuming that its answer might be wrong and verifying it.

5.2.5 Label-Based Analysis

In addition to the monthly results, we also report the average experimental results for each label across the months. Table 6 shows ARR accuracy for the New, Changed, and Unchanged labels. For the Changed label in the QA task, Q:y_LLM performs 1.6 points lower than Q, indicating that relying entirely on a misleading y_LLM can significantly degrade performance. Conversely, in the dialogue task, the misleading y_LLM for Changed improves performance by 1.2 points. This suggests that the detailed sentences generated in the dialogue task benefit from common knowledge between the old and new data, aiding re-retrieval. For New, the LLM struggles to generate detailed answers, resulting in a small gain of 0.1 points. For Unchanged in both tasks, y_LLM improves performance by 1.4 and 0.8 points, respectively, though still lower than ARR. Overall, ARR demonstrates robust and improved performance in all scenarios.

                   New   Changed  Unchanged
GrowOVER-QA
Q                  11.1  12.2     13.2
Q:y_LLM            12.2  10.6     14.6
ARR                12.2  12.6     14.7
GrowOVER-Dialogue
Q                  10.5  11.8     10.6
Q:y_LLM            10.6  13.0     11.4
ARR                11.0  13.7     11.6

Table 6: Accuracy of ARR for each label.

                                 New   Changed  Unchanged
GrowOVER-QA
Adopted                          50.3  39.3     52.4
Not Adopted - Before ARR         26.0  20.0     27.3
Not Adopted - After ARR          19.6  11.9     21.9
Not Adopted - Final Selection    28.7  22.0     31.0
Average                          39.4  31.2     44.9
GrowOVER-Dialogue
Adopted                          7.02  8.37     6.08
Not Adopted - Before ARR         3.97  5.33     3.41
Not Adopted - After ARR          1.19  2.62     1.30
Not Adopted - Final Selection    4.09  5.96     3.45
Average                          5.32  6.90     4.70

Table 7: F1 and BLEU scores for each label at each stage of the pipeline.

To analyze errors corrected by the Decision Gate and ARR, we report the performance at each stage of the RiLM pipeline, averaged by label type, as shown in Table 7. This includes results for answers adopted by the Decision Gate, not-adopted answers before ARR, after ARR, and after the final selection, and the final aggregated result. The classifier performs well across all types, as evidenced by the gap between adopted and not-adopted answers before ARR. The efficiency of ARR and the Decision Gate is highlighted by the performance improvements from before ARR to after the final selection. Although performance decreases after ARR due to the exclusion of the top-k documents from re-retrieval, the final selection results for not-adopted answers improve, indicating the classifier's effectiveness in selecting more reliable answers. In the QA task, F1 scores for not-adopted answers improve by over 2.0 points across all types. In the dialogue task, BLEU scores significantly improve for New and Changed by 1.0 and 0.6 points, respectively, with little improvement in Unchanged.

6 Conclusion

To evaluate whether LLMs can adapt to fast-evolving world knowledge, we propose GrowOVER-QA and GrowOVER-Dialogue. Our benchmarks annotate the evidence text and introduce a dialogue task to comprehensively evaluate retrieval-augmented language models (RaLMs). Furthermore, we suggest RiLM, an interactive retriever-generator framework that simply trains a classifier for the LLM to predict reliability itself. Through rigorous experiments, we show that our method can be on par with or surpass continuously pre-trained LLMs even without continual pre-training. However, even with retrieval, we observe that the performance degrades over time. Thus, we hope our benchmarks can be valuable resources to detect when to update the retriever or LLM in future work. Additionally, we anticipate further research into optimizing the use of retrievers to reduce the frequency of model updates.
Limitations

We highlight a few considerations for readers regarding potential limitations. We rely on various models to label each sentence in Wikipedia snapshots. Although we thoroughly designed the sentence labeling process to label sentences accurately, it can occasionally be faulty. To address this limitation, we append the MTurk study in the Appendix, showing the results are within acceptable bounds. Also, our dataset is primarily built from knowledge based on single articles, which may restrict its effectiveness for tasks that necessitate combining information from multiple sources. Further research would be beneficial to generate benchmarks that enable the evaluation of frameworks handling complex reasoning tasks with a time-sensitive nature. Moreover, the information extracted from Wikipedia may not promptly reflect real-world knowledge updates. This means that knowledge modifications in the real world might not be immediately mirrored in the dataset, resulting in some degree of outdated or inaccurate information. Finally, the characteristics of our dataset may be influenced by the features of GPT-4, which we utilized during its generation, as well as the prompts we used. To enhance the effectiveness and mitigate bias of such datasets, further research should focus on generating datasets with a variety of Large Language Models employed across diverse fields, aligned with a wide range of prompts.

Ethics Statement

We have manually reevaluated the dataset we created to ensure it is free of any potential for discrimination, human rights violations, bias, exploitation, and any other ethical concerns.

Acknowledgments

We sincerely thank Jaewoo Ahn, Soochan Lee, Yeda Song, Heeseung Yun, Junik Bae, and other anonymous reviewers for their valuable comments. This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-II220156, Fundamental research on continual meta-learning for quality enhancement of casual videos and their 3D metaverse transformation), the SNU-Global Excellence Research Center establishment project, the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2023R1A2C2005573), the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00274280), and the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)).

References

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Zhangming Chan, Lemao Liu, Juntao Li, Haisong Zhang, Dongyan Zhao, Shuming Shi, and Rui Yan. 2021. Enhancing the open-domain dialogue evaluation in latent space. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4889–4900.

Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. 2020. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. Preprint, arXiv:2004.12651.

Zhiyuan Chen and Bing Liu. 2018. Lifelong Machine Learning, volume 1. Springer.

Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and William W. Cohen. 2022. Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10:257–273.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR.
Tianxing He, Jun Liu, Kyunghyun Cho, Myle Ott, Bing Liu, James Glass, and Fuchun Peng. 2021. Analyzing the forgetting problem in the pretrain-finetuning of dialogue response models. Preprint, arXiv:1910.07117.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. Preprint, arXiv:2106.09685.

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299.

Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, and Minjoon Seo. 2022. TemporalWiki: A lifelong benchmark for training and evaluating ever-evolving language models. arXiv preprint arXiv:2204.14211.

Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Kim Gyeonghun, Stanley Jungkyu Choi, and Minjoon Seo. 2021. Towards continual knowledge learning of language models. In International Conference on Learning Representations.

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. 2022. RealTime QA: What's the answer right now? arXiv preprint arXiv:2207.13332.

Yujin Kim, Jaehong Yoon, Seonghyeon Ye, Sung Ju Hwang, and Se-young Yun. 2023. Carpe diem: On the evaluation of world knowledge in lifelong language models. arXiv preprint arXiv:2311.08106.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Angeliki Lazaridou, Adhi Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, et al. 2021. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34:29348–29363.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Adam Liska, Tomas Kocisky, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, Cyprien De Masson D'Autume, Tim Scholtes, Manzil Zaheer, Susannah Young, et al. 2022. StreamingQA: A benchmark for adaptation to new knowledge over time in question answering models. In International Conference on Machine Learning, pages 13604–13622. PMLR.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. Entity-based knowledge conflicts in question answering. arXiv preprint arXiv:2109.05052.

Kelvin Luu, Daniel Khashabi, Suchin Gururangan, Karishma Mandyam, and Noah A. Smith. 2021. Time waits for no one! Analysis and challenges of temporal misalignment. arXiv preprint arXiv:2111.07408.

Katerina Margatina, Shuai Wang, Yogarshi Vyas, Neha Anna John, Yassine Benajiba, and Miguel Ballesteros. 2023. Dynamic benchmarking of masked language models on temporal concept drift with multiple views. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2873–2890.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2021. KILT: A benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544.

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Apoorv Saxena, Soumen Chakrabarti, and Partha Talukdar. 2021. Question answering over temporal knowledge graphs. arXiv preprint arXiv:2106.01515.

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters. Preprint, arXiv:2002.01808.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. Preprint, arXiv:2312.12148.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Michael Zhang and Eunsol Choi. 2021. SituatedQA: Incorporating extra-linguistic contexts into QA. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7371–7387, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
A Experimental Details

Dataset. For dataset generation, we use the OpenAI gpt-4-1106-preview model and set the temperature to 0 and max_new_token to 256. Also, we use the Huggingface princeton-nlp/sup-simcse-roberta-large model for the NLI task in the sentence labeling process. Additionally, when checking contradiction with GPT-4, we use gpt-4-1106-preview and set the temperature to 0 and max_tokens to 256.

Database. We use the LangChain document loader (RecursiveCharacterTextSplitter) to split each article in the Wikipedia snapshot into several passages. We set chunk_size to 1500 and chunk_overlap to 10 characters.

Continual Pretraining of LLM. For the initial training on August, we train Llama2 for four epochs with a learning rate of 1e-06, a learning rate decay of 0.8, a cosine learning rate scheduler, an AdamW optimizer, and a batch size of 64, using FSDP (fully sharded data parallel). After that, for the RaLM-CP baseline, we continuously pre-train the model with the 12K articles of each month. We set all hyperparameters the same as in the initial training, except the number of epochs. To prevent catastrophic forgetting, we train the model for only one epoch, following Jang et al. (2022).

Classifier Training. For the classifier, we train a linear layer with dimensions (4096, 3). In both the QA and dialogue response generation tasks, we set the learning rate to 0.0001, the weight decay to 1e-07, and train for 10 epochs. We use a cosine learning rate scheduler and train the model using cross-entropy loss.

Adaptive Re-Retrieval. In both the QA and dialogue tasks with SentBERT, we set the hyperparameter λ to 2.0. In the other cases, we set λ to 1.0.

Answer Generation. We set max_new_token of Llama2 to 10 for QA and 50 for dialogue, and load the model using bfloat16.

Self-RAG. We use the short-form generation version with the always_retrieve mode, since all data points of GrowOVER contain evidence texts. We use the same retrievals as RiLM.
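Putting the classifier-training details above together, here is a minimal runnable sketch. The feature tensors are random stand-ins; in practice each row is the last hidden state h_LLM(Q, D) of a training prompt, with 512/245/512 Unchanged/Changed/New QA examples as in §4.1. The optimizer choice (AdamW) is our assumption, since only the learning rate, weight decay, schedule, and loss are stated above.

import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(1269, 4096)       # stand-in for h_LLM(Q, D) vectors (512 + 245 + 512)
labels = torch.randint(0, 3, (1269,))    # 0: reliable, 1: misleading, 2: uncertain

loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)
clf = torch.nn.Linear(4096, 3)
optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-4, weight_decay=1e-7)  # optimizer assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10 * len(loader))
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(clf(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()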
B Additional Experimental Results

B.1 Contriever

We additionally perform experiments with Contriever (Izacard et al., 2021), following Shi et al. (2023) and Izacard et al. (2022). We report the classifier performance and ARR results, as well as end-to-end QA results.

B.1.1 Classifier Performance

Table 8 shows the experimental results of the certainty classifier. Despite using the same certainty classifier as with SentBERT, the accuracy in all cases is above 65.0, demonstrating its effectiveness. Also, across all months, the F1 / BLEU scores for adopted and not-adopted answers show a significant gap.

                              9     10    11    12
GrowOVER-QA
Accuracy                      77.2  74.5  74.4  72.9
Average F1 (Adopted)          28.4  27.7  27.4  27.1
Average F1 (Not-adopted)      17.0  16.7  16.5  16.2
Average F1 (All)              21.1  20.6  20.2  19.8
GrowOVER-Dialogue
Accuracy                      66.6  66.4  65.9  66.0
Average BLEU (Adopted)        4.44  4.44  4.46  4.46
Average BLEU (Not-adopted)    2.53  2.56  2.55  2.56
Average BLEU (All)            2.57  2.57  2.57  2.59

Table 8: Accuracy and F1 / BLEU scores of the classifier in the Decision Gate on each month.

B.1.2 Adaptive Re-Retrieval Results

As shown in Table 9, ARR significantly improves retrieval relevance. In the QA task, relevance increases by more than 3 times, and in the dialogue task, it increases by more than 2 times.

                   9     10    11    12
GrowOVER-QA
Next top-k         2.6   2.6   2.5   2.6
ARR                8.5   8.2   8.0   8.0
GrowOVER-Dialogue
Next top-k         11.6  11.6  11.6  11.6
ARR                27.1  27.6  27.3  30.7

Table 9: Adaptive Re-Retrieval relevance of Contriever compared to choosing the next top-k documents on each month.

B.1.3 QA Results

The results on GrowOVER-QA are presented in Table 10. They demonstrate the robustness of our method across all months.
Despite using the same certainty classifier as SentBERT, the accuracy in all cases is above 65.0, demonstrating its effectiveness. Also, across all months, the F1 / BLEU scores for adopted and not-adopted answers show a significant gap (Table 8).

Month                       9      10     11     12
GrowOVER-QA
Accuracy                    77.2   74.5   74.4   72.9
Average F1 (Adopted)        28.4   27.7   27.4   27.1
Average F1 (Not-adopted)    17.0   16.7   16.5   16.2
Average F1 (ALL)            21.1   20.6   20.2   19.8
GrowOVER-Dialogue
Accuracy                    66.6   66.4   65.9   66.0
Average BLEU (Adopted)      4.44   4.44   4.46   4.46
Average BLEU (Not-adopted)  2.53   2.56   2.55   2.56
Average BLEU (ALL)          2.57   2.57   2.57   2.59

Table 8: Accuracy and F1 / BLEU scores of the classifier in the Decision Gate in each month.

B.1.2 Adaptive Re-Retrieval results

As shown in Table 9, ARR significantly improves retrieval relevance: in the QA task, performance increases by more than 3 times, and in the dialogue task, by more than 2 times.

Month               9      10     11     12
GrowOVER-QA
Next top-k          2.6    2.6    2.5    2.6
ARR                 8.5    8.2    8.0    8.0
GrowOVER-Dialogue
Next top-k          11.6   11.6   11.6   11.6
ARR                 27.1   27.6   27.3   30.7

Table 9: Adaptive Re-Retrieval relevance of Contriever compared to choosing the next top-k documents in each month.

B.1.3 QA Results

The results of GrowOVER-QA are presented in Table 10 and demonstrate the robustness of our method. For New, Unchanged, and all types of QA, our RiLM shows the highest score across all months. Also, for Changed, RiLM surpasses the other baselines except in September, where performance degrades after re-retrieval; this indicates that the Decision Gate (DG) may struggle to select the correct answer between two different answers. Nonetheless, across all the other months, RiLM improves over RaLM-D∗ by about 3.0 F1 points, and it significantly outperforms the other baselines.

Month               9      10     11     12
New
Vanilla             13.4   14.1   14.7   14.1
RaLM                16.5   16.7   17.1   16.4
RaLM-CP†            17.0   17.0   17.3   16.8
RaLM-D∗ (Ours)      17.2   17.9   18.0   17.0
RiLM (Ours)         19.8   20.2   19.8   18.8
Changed
Vanilla             6.1    5.6    7.0    5.3
RaLM                8.1    10.7   9.0    9.2
RaLM-CP†            8.1    11.1   9.1    9.3
RaLM-D∗ (Ours)      11.8   10.9   10.5   8.8
RiLM (Ours)         11.4   13.8   13.5   12.6
Unchanged
Vanilla             18.0   17.6   17.2   16.7
RaLM                20.3   19.8   19.4   19.2
RaLM-CP†            20.9   20.3   19.9   19.6
RaLM-D∗ (Ours)      21.7   21.0   20.6   20.3
RiLM (Ours)         23.6   22.9   22.4   22.1
All
Vanilla             17.4   17.1   16.8   16.4
RaLM                19.8   19.3   19.0   18.8
RaLM-CP†            20.4   19.8   19.6   19.3
RaLM-D∗ (Ours)      21.1   20.6   20.2   19.8
RiLM (Ours)         23.1   22.5   22.1   21.7

† continuously pretrained language model

Table 10: F1 score of GrowOVER-QA in each month using Contriever.

Month               9      10     11     12
New
Vanilla             0.85   0.81   0.84   0.88
RaLM                3.28   3.11   3.22   2.95
RaLM-CP†            3.30   3.13   3.25   2.94
RaLM-D∗ (Ours)      3.26   3.07   3.20   3.10
RiLM (Ours)         3.61   3.14   3.66   3.10
Changed
Vanilla             1.58   2.68   1.40   1.87
RaLM                5.40   5.39   3.79   4.63
RaLM-CP†            4.74   5.08   4.06   4.64
RaLM-D∗ (Ours)      3.94   4.66   3.92   5.29
RiLM (Ours)         4.28   4.71   4.48   5.29
Unchanged
Vanilla             1.13   1.12   1.10   1.11
RaLM                2.50   2.53   2.52   2.54
RaLM-CP†            2.48   2.51   2.52   2.53
RaLM-D∗ (Ours)      2.54   2.55   2.54   2.56
RiLM (Ours)         2.66   2.56   2.64   2.58
All
Vanilla             1.12   1.11   1.09   1.10
RaLM                2.53   2.56   2.55   2.56
RaLM-CP†            2.51   2.54   2.55   2.55
RaLM-D∗ (Ours)      2.57   2.57   2.57   2.59
RiLM (Ours)         2.69   2.59   2.68   2.60

† continuously pre-trained language model

Table 11: BLEU score of GrowOVER-Dialogue in each month using Contriever. The table shows the BLEU score between the generated dialogue response and the gold dialogue response.
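For reference, the token-level F1 reported above is the standard QA overlap metric; a minimal implementation (the exact answer normalization used in the paper may differ) is:

    # Sketch of token-level F1 between a predicted and a gold answer.
    from collections import Counter

    def f1_score(prediction, gold):
        pred_tokens = prediction.lower().split()
        gold_tokens = gold.lower().split()
        common = Counter(pred_tokens) & Counter(gold_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)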
C Algorithms

Algorithm 1 Initial Generation for GrowOVER-QA
Require: WP_init := Wikipedia snapshot of the initial month
P := an empty array to store valid paragraphs
S := an empty array to store selected paragraphs
Q := an empty array to store generated QA pairs
∗ each article in WP_init has attributes id, title, and text

for all articles ar ∈ WP_init do
    for paragraph pr ∈ ar.text do
        if pr is of adequate length then
            P.append(pr)
        end if
    end for
    S ← ClusterParagraphs(P)
    Q ← Q + GenerateQA(S, New)
end for

function ClusterParagraphs(P)
    obtain embeddings of P using SimCSE
    compute cluster assignments using the KMeans algorithm
    S ← randomly selected paragraphs from each cluster
    return S
end function

function GenerateQA(S, type)
    QA ← an empty array
    for selected text p ∈ S do
        qa ← generate QA pairs with p
        qa.type ← type
        QA.append(qa)
    end for
    return QA
end function

Algorithm 2 Initial Generation for GrowOVER-Dialogue
Require: WP_init := Wikipedia snapshot of the initial month
P := an empty array to store split paragraphs
D := an empty array to store generated dialogues
∗ each article in WP_init has attributes id, title, and text

for article ar ∈ WP_init do
    P ← SplitArticleIntoParagraph(ar)
    for paragraph p ∈ P do
        if p is an informative paragraph then
            D.append(GenerateDialogue(p, ar.title))
        end if
    end for
end for

function SplitArticleIntoParagraph(a)
    for paragraph p ∈ a.text do
        P.append(split p into sentences)
    end for
    return P
end function

function GenerateDialogue(p, title)
    d ← generate dialogue with p and title
    for turn t ∈ d do
        t.type ← New
    end for
    return d
end function

Algorithm 3 Sentence Labeling
Require: sentences_old := sentences in the old article
sentences_new := sentences in the new article

▷ Identify unchanged sentence pairs
for s_old in sentences_old do
    for s_new in sentences_new do
        if sim(s_old, s_new) > thrs (= 0.99) then
            (s_old, s_new) ← unchanged
        end if
    end for
end for

▷ Identify unchanged sentence groups
for each group of sentences (S_old, S_new) enclosed by unchanged pairs do
    for sub_old in P(S_old) do
        for sub_new in P(S_new) do
            if sim(concat(sub_old), concat(sub_new)) > thrs (= 0.99) then
                (sub_old, sub_new) ← unchanged
            end if
        end for
    end for
end for

▷ NLI for changed & new
for each group of sentences (S_old, S_new) enclosed by unchanged pairs do
    for s_new in S_new do
        preds ← an empty list
        for s_old in S_old do
            preds.append(NLI.classify(s_old, s_new))
        end for
        if “entailment” in preds then
            s_old ← the entailed old sentence
            (s_old, s_new) ← unchanged
        else if “contradiction” in preds then
            s_old ← the contradicted old sentence
            if sim(s_old, s_new) > τ1 (= 0.6) then
                if GPT-4.contradict(s_old, s_new) then
                    (s_old, s_new) ← changed
                end if
            end if
        else ▷ all elements in preds are “neutral”
            sim_res ← [sim(s_old, s_new) for s_old in S_old]
            max_similarity ← max(sim_res)
            if max_similarity < τ2 (= 0.7) then
                s_new ← new
            end if
        end if
    end for
end for
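The two primitives Algorithm 3 relies on, embedding cosine similarity with a 0.99 threshold and sentence-pair NLI, can be sketched as follows; both checkpoints are illustrative stand-ins rather than the paper's exact choices:

    # Sketch of Algorithm 3's primitives: cosine similarity between sentence
    # embeddings (unchanged if > 0.99) and NLI over (old, new) sentence pairs.
    from sentence_transformers import SentenceTransformer, util
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence encoder
    nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
    nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    def is_unchanged(s_old, s_new, thrs=0.99):
        # unchanged if embedding cosine similarity exceeds the 0.99 threshold
        emb = encoder.encode([s_old, s_new], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item() > thrs

    def nli_label(s_old, s_new):
        # returns CONTRADICTION / NEUTRAL / ENTAILMENT for the sentence pair
        inputs = nli_tok(s_old, s_new, return_tensors="pt", truncation=True)
        pred = nli_model(**inputs).logits.argmax(-1).item()
        return nli_model.config.id2label[pred]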
Algorithm 4 Temporal Updates for GrowOVER-QA
Require: W_t := Wikipedia snapshot of month t
QA_{t-1} := QAs of month t-1
QA_t := an empty set for QAs of month t

for article a_t ∈ W_t do
    for a_t's QA qa ∈ QA_{t-1} do
        qa ← UpdateQA(qa)
        if qa.type is Unchanged then
            QA_t.append(qa)
        end if
    end for
    G_new ← groups of new sentences in a_t
    for group g ∈ G_new do
        QA_t.append(GenerateQA(g, New))
    end for
    for changed sentence s_t ∈ a_t do
        s_{t-1} ← the contradictory sentence in a_{t-1}
        QA_t.append(GenerateQA(s_{t-1}, s_t, Changed))
    end for
end for

function UpdateQA(qa)
    if all sentences ∈ qa.evid_text are unchanged then
        qa.type ← Unchanged
        qa.index ← indices of qa.evid_text in W_t
    else
        qa.type ← Deleted
    end if
    return qa
end function

Algorithm 5 Temporal Updates for GrowOVER-Dialogue
Require: W_t := Wikipedia snapshot of month t
D_{t-1} := dialogues of month t-1
D_t := an empty set for dialogues of month t

for article a_t ∈ W_t do
    for a_t's dialogue d ∈ D_{t-1} do
        d ← UpdateDialog(d)
        D_t.append(d)
    end for
    P ← SplitArticleIntoParagraph(a_t)
    for paragraph p ∈ P do
        if p is not an informative paragraph then
            continue
        else if p contains a changed or new sentence then
            D_t.append(GenerateDialogue(p, a_t.title))
        end if
    end for
end for

function UpdateDialog(d)
    for turn t ∈ d do
        if t.evid_text is unchanged then
            t.type ← Unchanged
            t.index ← the index of t.evid_text in W_t
        else
            t.type ← Deleted
        end if
    end for
    return d
end function

D Mturk Study

We thoroughly designed the sentence labeling process to accurately label sentences (e.g., setting a high threshold for selecting semantically identical sentences). To further validate the use of sentence similarity scores and natural language inference, we employed Amazon Mechanical Turk (AMT) workers to assess the sentence labels during the rebuttal process.

We randomly sampled 30 new and 30 changed sentences and asked AMT workers to classify whether each sentence in the new article was supported, not supported, or uncertain given the old article. For new sentences, the labeling is incorrect when classified as “supported,” while for changed sentences, the labeling is correct when classified as “not supported.” Each sentence was evaluated by three workers, and the majority vote was used. The human quality check results are shown in Table 12.

Although new sentences have lower accuracy, since they are not verified with GPT-4, the results are still within acceptable bounds. Having humans verify the labels requires reviewing the entire article, which is extremely time-consuming. This becomes critical especially for our dataset, which requires regular updates to reflect the ever-changing knowledge of the real world. Therefore, as evidenced by the human quality check results, our fully automated sentence labeling alone can efficiently provide reasonably accurate labels with no human effort at all.

              New    Changed
GrowOVER-QA
Accuracy      86.7   96.7

Table 12: Sentence labeling validation.
E Data Analysis

Month  Unchanged      New          Changed    Deleted  Total
08     -              32,807       -          -        32,807
09     32,422 (512)   4,936 (512)  290 (245)  385      37,648 (1,269)
10     36,863         5,193        307        785      42,363
11     41,257         5,363        309        1,106    46,929
12     43,422         5,082        313        1,211    48,817

Table 13: QA

Month  Unchanged      New          Changed    Deleted  Total
08     -              108,128      -          -        108,128
09     109,752 (512)  4,478 (512)  156 (133)  987      114,386 (1,157)
10     115,022        4,797        147        1,883    119,966
11     120,551        4,870        158        2,161    125,579
12     125,940        5,218        142        2,427    131,300

Table 14: Dialogue

F Dataset Examples

Table 15 shows examples of QA. Tables 16 and 17 show examples of Dialogue.

G Case Study

Tables 18 and 19 present case studies from GrowOVER-QA. Predictions denote the answers generated by each retriever-generator framework. Table 20 presents a case study from GrowOVER-Dialogue.

H Prompt examples

Table 21 shows the prompt used for the initial generation of the QA pair. Tables 22 and 23 show the prompts used to generate New QA pairs, without and with Source Content respectively. Table 24 shows the prompt used to generate Changed QA pairs. Table 25 shows the prompt used for Dialogue generation.
Article : Politics of Cambodia
Type : Changed
Question: Who is the current prime minister of Cambodia from the Cambodian People’s Party (CPP)?
Answer: Hun Manet
Previous Answer: Hun Sen
Evidence Text: The current prime minister is Cambodian People’s Party (CPP) member Hun Manet.
Evidence Index: 80
Article : The Eras Tour
Type : New
Question: Who is directing the concert film Taylor Swift: The Eras Tour?
Answer: Sam Wrench
Evidence Text: On August 31, 2023, Swift announced the concert film Taylor Swift: The Eras Tour, directed by Sam
Wrench. Recorded at SoFi Stadium in Los Angeles, the film is scheduled for release to theaters in North America on
October 13.
Evidence Index: 263
Article : Airline
Type : Unchanged
Question: Which airport would most Manhattan travelers prefer for its proximity?
Answer: LaGuardia Airport
Evidence Text: For example, LaGuardia Airport is the preferred airport for most of Manhattan due to its proximity,
while long-distance routes must use John F. Kennedy International Airport’s longer runways.
Evidence Index: 297
Table 15: An Example of QA

Article : 2030 Winter Olympics

Paragraph :
20 A decision on the 2030 Winter Olympics host city has been delayed until 23 July 2024, to allow the IOC more time
to carefully plan the future of the Winter Olympics. 21 Due to impacts from climate change, the IOC is considering a
number of changes to future games, such as rotating host cities, limiting the games to existing or temporary venues,
and establishing minimum average temperature requirements. 22 It also considered awarding the 2030 and 2034
winter games simultaneously at the next IOC session, but IOC President Thomas Bach has ruled out that possibility
because "it would not be the right thing to do". 23 However, at the 141st IOC Session in Mumbai, where these games
were originally set to be awarded, the IOC approved to award both the 2030 and 2034 Winter Olympics at the 142nd
IOC Session in Paris on the eve of the 2024 Summer Olympics.
Dialogue :
Type: Unchanged
User: I was looking forward to finding out which city will host the 2030 Winter Olympics, but I heard there’s been a
delay. What’s going on with that?
Expert: Yes, that’s correct. The decision has been postponed until 23 July 2024. The International Olympic Committee
(IOC) wants to take more time to plan the future of the Winter Games with careful consideration.
Evidence Index: 20
Type: Unchanged
User: I guess the delay must be frustrating for the candidate cities. I’ve heard that the changing climate is causing
some issues?
Expert: Indeed, the changing climate is a significant concern. The IOC is exploring various adjustments for future
games, including the possibility of having rotating host cities and using existing or temporary venues to mitigate the
impact.
Evidence Index: 21
Type: Unchanged
User: I also caught a rumor that they might announce the hosts for both 2030 and 2034 at the same time. Is that true?
Expert: That was a consideration at one point. However, IOC President Thomas Bach has stated that awarding both
the 2030 and 2034 Winter Olympics simultaneously wouldn’t be appropriate.
Evidence Index: 22
Type: New
User: So, when can we expect the announcement for the 2030 host city?
Expert: The announcement is set to be made at the 142nd IOC Session in Paris, which will happen right before the
2024 Summer Olympics. They will also award the 2034 Winter Olympics at that session.
Evidence Index: 23
Table 16: An Example of Dialogue
Article : Tornadoes of 2023
Paragraph :
357 More rounds of severe storms occurred on June 17, including a powerful MCS that pushed through Northern
and Central Oklahoma. 358 Both the Oklahoma City and Tulsa metropolitan areas had widespread wind damage
with the latter metropolitan area seeing wind gusts of up to and multiple weak tornadoes. 359 On June 18, more
tornadoes were confirmed in Florida and Mississippi, including an EF3 tornado which caused significant damage in
Louin, Mississippi, where at least one fatality and 20 injuries occurred. 360 Scattered tornadoes occurred across the
Southeast on June 19 as well, including an EF2 tornado that injured six people in Moss Point, Mississippi. 361 Overall,
87 tornadoes were confirmed during this outbreak sequence.
Dialogue :
Type: Unchanged
User: I heard there were some intense storms in Oklahoma last June. What happened exactly?
Expert: Indeed, on June 17, Oklahoma experienced severe storms, including a particularly strong Mesoscale
Convective System that affected the northern and central parts of the state.
Evidence Index: 357
Type: Unchanged
User: That sounds serious. Did the storms cause a lot of damage in the area?
Expert: Yes, the storms led to extensive wind damage, especially in the Oklahoma City and Tulsa areas. Tulsa even
experienced wind gusts of up to 80 mph and several weak tornadoes.
Evidence Index: 358
Type: Unchanged
User: I also caught wind of some tornado activity in other states around that time.
Expert: That’s correct. The following day, on June 18, Florida and Mississippi reported tornadoes. Mississippi was
hit by an EF3 tornado in Louin, which resulted in one death and 20 injuries.
Evidence Index: 359
Type: Unchanged
User: Was that the end of it, or did the tornadoes continue?
Expert: The tornadoes continued. On June 19, the Southeast saw more, including an EF2 tornado in Moss Point,
Mississippi, which injured six people.
Evidence Index: 360
Type: Changed
User: It must have been a large outbreak to affect so many areas.
Expert: It was indeed a significant outbreak. In total, there were 87 confirmed tornadoes during that sequence of
storms.
Evidence Index: 361
Table 17: Another Example of Dialogue
Article : Football player
Type : Unchanged
Question: What was the average annual salary for goalkeepers in Major League Soccer during the 2013 season?
Answer: $85,296
Evidence Text: For example, the average annual salary for footballers in Major League Soccer for the 2013 season
was $148,693, with significant variations depending on the player position (goalkeepers for example earned $85,296,
whereas forwards earned $251,805.
Predictions :
Vanilla: $110,000 RaLM: $148,693 RaLM-CP: $148,693 RaLM-D∗ (Ours): $85,296 RiLM (Ours): $85,296
Retrieved Documents :
[Top-3]
(1) Hit: However, only a fraction of men’s professional football players is paid at this level. ... For example, the
average annual salary for footballers in Major League Soccer for the 2013 season was $148,693, with significant
variations depending on the player position (goalkeepers for example earned $85,296, whereas forwards earned
$251,805. Popularity and average salaries in women’s leagues are far lower. For example, players in ...
(2) Miss: MLS has a set of pool goalkeepers who are signed to a contract with the league and are loaned to teams
during emergencies in which they are missing a goalkeeper due to injuries or suspensions. ... These initiatives have
brought about an increase in on-field competition.
(3) Miss: According to "France Football", Messi was the world’s highest-paid footballer for five years out of six between
2009 and 2014; ... In 2020, Messi became the second footballer, as well as the second athlete in a team sport, after
Cristiano Ronaldo, to surpass $1 billion in earnings during their careers.
[RaLM-D∗ ]
Hit: However, only a fraction of men’s professional football players is paid at this level. ... For example, the average
annual salary for footballers in Major League Soccer for the 2013 season was $148,693, with significant variations
depending on the player position (goalkeepers for example earned $85,296, whereas forwards earned $251,805.
Popularity and average salaries in women’s leagues are far lower. For example, players in ...
[RiLM]
Hit: However, only a fraction of men’s professional football players is paid at this level. ... For example, the average
annual salary for footballers in Major League Soccer for the 2013 season was $148,693, with significant variations
depending on the player position (goalkeepers for example earned $85,296, whereas forwards earned $251,805.
Popularity and average salaries in women’s leagues are far lower. For example, players in ...
Article : Benjamin Netanyahu
Type : New
Question: What city was Benjamin Netanyahu born in?
Answer: Tel Aviv
Evidence Text: Netanyahu was born in Tel Aviv, to Benzion Netanyahu (original name Mileikowsky) and Tzila (Cela;
Predictions :
Vanilla: Jerusalem, Israel RaLM: Tel Aviv RaLM-CP: Tel Aviv RaLM-D∗ (Ours): Tel Aviv RiLM (Ours): Tel
Aviv
Retrieved Documents :
[Top-3]
(1) Miss: Netanyahu was the second of three children. He was initially raised and educated in Jerusalem, where he
attended ... the liberal sensibilities of the Reform synagogue, Temple Judea of Philadelphia, that the family attended.
(2) Hit: Netanyahu was born in Tel Aviv, to Benzion Netanyahu (original name Mileikowsky) and Tzila (Cela;
née Segal). His mother was born in 1912 in Petah Tikva, then in Ottoman Palestine, now Israel. Though all his
grandparents were born in ...
(3) Miss: Netanyahu made his closeness to Donald Trump, a personal friend since the 1980s, central to his political
appeal in Israel from 2016. ... He claims descent from the Vilna Gaon.
[RaLM-D∗ ]
Hit: Netanyahu was born in Tel Aviv, to Benzion Netanyahu (original name Mileikowsky) and Tzila (Cela;
née Segal). His mother was born in 1912 in Petah Tikva, then in Ottoman Palestine, now Israel. Though all his
grandparents were born in ...
[RiLM]
Hit: Netanyahu was born in Tel Aviv, to Benzion Netanyahu (original name Mileikowsky) and Tzila (Cela;
née Segal). His mother was born in 1912 in Petah Tikva, then in Ottoman Palestine, now Israel. Though all his
grandparents were born in ...
Table 18: Case Study for QA
Article : Kyrylo Budanov
Type : Changed
Question: What is Kyrylo Budanov’s military rank?
Answer: Lieutenant general
Evidence Text: He holds the rank of lieutenant general.
Predictions :
Vanilla: Kyrylo Budanov is a Major General in the Ukrainian Armed Forces. RaLM: Lieutenant General RaLM-CP:
Lieutenant General RaLM-D∗ (Ours): Lieutenant General RiLM (Ours): Lieutenant General
Retrieved Documents :
[Top-3]
(1) Hit: Kyrylo Oleksiyovych Budanov (; born 4 January 1986) is a Ukrainian military leader who is the chief of the ...
Budanov previously served as the Deputy Director of one of the Departments of the Foreign Intelligence Service of
Ukraine. He holds the rank of lieutenant general. ... as head of the Main Intelligence Directorate of the Ministry of
Defense.
(2) Miss: The Austrian military has a wide variety of equipment. Recently, Austria has spent considerable amounts of
money modernizing its military arsenal. ...
(3) Miss: Soon after the start of the German invasion of the Soviet Union, he was soon re-drafted into the Red Army
on 4 July 1941 and initially deployed to the front as part of the 50th Cavalry Regiment. ...
[RaLM-D∗ ]
Hit: Kyrylo Oleksiyovych Budanov (; born 4 January 1986) is a Ukrainian military leader who is the chief of the ...
Budanov previously served as the Deputy Director of one of the Departments of the Foreign Intelligence Service of
Ukraine. He holds the rank of lieutenant general. ... as head of the Main Intelligence Directorate of the Ministry of
Defense.
[RiLM]
Hit: Kyrylo Oleksiyovych Budanov (; born 4 January 1986) is a Ukrainian military leader who is the chief of the ...
Budanov previously served as the Deputy Director of one of the Departments of the Foreign Intelligence Service of
Ukraine. He holds the rank of lieutenant general. ... as head of the Main Intelligence Directorate of the Ministry of
Defense.
Article : Darwin, Northern Territory
Type : Unchanged
Question: What is the name of the passenger train service that connects Darwin to Adelaide?
Answer: The Ghan
Evidence Text: The first service ran in 2004. "The Ghan" passenger train service from Adelaide via Alice Springs and
Katherine runs once per week in each direction, with some exceptions.
Predictions :
Vanilla: The Ghan RaLM: The Ghan RaLM-CP: The Ghan RaLM-D∗ (Ours): The Ghan RiLM (Ours): The Ghan
Retrieved Documents :
[Top-3]
(1) Hit: The Territory’s public transport services are managed by the Department of Lands and Planning, ... The first
service ran in 2004. "The Ghan" passenger train service from Adelaide via Alice Springs and Katherine runs once
per week in each direction, with some exceptions. Historically, ...
(2) Miss: Darwin and its suburbs, Palmerston and Katherine, are powered by the Channel Island Power Station, the
Northern Territory’s largest power plant, and the ...
(3) Miss: The platform layout at Flinders Street is unusual among Australian terminal stations for being almost
entirely composed of through tracks, a product ...
[RaLM-D∗ ]
Hit: The Territory’s public transport services are managed by the Department of Lands and Planning, ... The first
service ran in 2004. "The Ghan" passenger train service from Adelaide via Alice Springs and Katherine runs once
per week in each direction, with some exceptions. Historically, ...
[RiLM]
Hit: The Territory’s public transport services are managed by the Department of Lands and Planning, ... The first
service ran in 2004. "The Ghan" passenger train service from Adelaide via Alice Springs and Katherine runs once
per week in each direction, with some exceptions. Historically, ...
Table 19: Another Case Study for QA
Article : Marshall Islands
Turn 1
Type : Changed
User: I’ve heard that the Marshall Islands were inhabited for a very long time. What kind of influence did the early
settlers have on these islands?
Expert: Indeed, the islands have been inhabited since the 2nd millennium BC. The early Austronesian settlers brought
with them Southeast Asian crops like coconuts and breadfruit, as well as domesticated chickens, which were crucial
for making the islands habitable.
Evidence Text: Austronesian settlers reached the Marshall Islands as early as the 2nd millennium BC and introduced
Southeast Asian crops, including coconuts, giant swamp taro, and breadfruit, as well as domesticated chickens which
made the islands permanently habitable.
Predictions :
Vanilla: The Marshall Islands were inhabited for a very long time. The early settlers had a big influence on the
islands. They built houses, farmed, and fished. They also made tools and weapons. The early settlers also had a big
influence on the culture of the Marshall Islands. They taught the people how to make clothes, how to cook food, and
how to make music. The early settlers also taught the people how to make weapons and how to fight. The early
settlers
RaLM: The early settlers of the Marshall Islands introduced Southeast Asian crops, including coconuts, giant swamp
taro, and breadfruit, as well as domesticated chickens which made the islands permanently habitable. They developed
a complex technique for interisland piloting by reading disruptions in ocean swells caused by low-lying coral atolls.
RaLM-CP: The early settlers of the Marshall Islands introduced Southeast Asian crops, including coconuts, giant
swamp taro, and breadfruit, as well as domesticated chickens which made the islands permanently habitable. They
developed a complex technique for interisland piloting by reading disruptions in ocean swells caused by low-lying
coral atolls.
RaLM-D∗ (Ours): The early settlers introduced Southeast Asian crops, including coconuts, giant swamp taro, and
breadfruit, as well as domesticated chickens which made the islands permanently habitable.
RiLM (Ours): The early settlers introduced Southeast Asian crops, including coconuts, giant swamp taro, and
breadfruit, as well as domesticated chickens which made the islands permanently habitable.
Retrieved Documents :
[Top-3]
(1) Miss: The majority of the citizens of the Republic of Marshall Islands are of Marshallese descent, ... around the 1st
century CE at the village of Laura on Majuro and on Kwajalein Atoll.
(2) Hit: Austronesian settlers reached the Marshall Islands as early as the 2nd millennium BC and introduced
Southeast Asian crops, including coconuts, giant swamp taro, and breadfruit, as well as domesticated chickens
which made the islands permanently habitable. They developed a complex technique for ...
(3) Miss: The British sea captains John Marshall and Thomas Gilbert visited the islands in 1788. Their vessels had
been part of the First Fleet taking convicts from England to Botany Bay ...
[RaLM-D∗ ]
Hit: Austronesian settlers reached the Marshall Islands as early as the 2nd millennium BC and introduced
Southeast Asian crops, including coconuts, giant swamp taro, and breadfruit, as well as domesticated chickens
which made the islands permanently habitable. They developed a complex technique for ...
[RiLM]
Hit: Austronesian settlers reached the Marshall Islands as early as the 2nd millennium BC and introduced
Southeast Asian crops, including coconuts, giant swamp taro, and breadfruit, as well as domesticated chickens
which made the islands permanently habitable. They developed a complex technique for ...
Turn 2
Type : New
User: That’s fascinating. I wonder how they navigated between the islands back then.
Expert: They actually developed a sophisticated method for navigating the ocean. They could read the disruptions in
ocean swells, which were caused by the coral atolls, to pilot between the islands.
Evidence Text: They developed a complex technique for interisland piloting by reading disruptions in ocean swells
caused by low-lying coral atolls.
Predictions :
Vanilla: The early settlers developed a complex technique for interisland piloting by reading disruptions in ocean
swells caused by low-lying coral atolls.
RaLM: The early settlers developed a complex technique for interisland piloting by reading disruptions in ocean
swells caused by low-lying coral atolls.
RaLM-CP: The early settlers developed a complex technique for interisland piloting by reading disruptions in ocean
swells caused by low-lying coral atolls.
RaLM-D∗ (Ours): The early settlers developed a complex technique for interisland piloting by reading disruptions
in ocean swells caused by low-lying coral atolls.
RiLM (Ours): The early settlers developed a complex technique for interisland piloting by reading disruptions in
ocean swells caused by low-lying coral atolls.
Retrieved Documents :
[Top-3]
(1) Hit: ... which made the islands permanently habitable. They developed a complex technique for interisland
piloting by reading disruptions in ocean swells caused by low-lying coral atolls. In 1526, Spanish explorer Alonso
de Salazar ...
(2) Miss: The majority of the citizens of the Republic of Marshall Islands are of Marshallese descent, ... around the 1st
century CE at the village of Laura on Majuro and on Kwajalein Atoll.
(3) Miss: The Austronesian settlers introduced Southeast Asian crops, including coconuts, giant swamp taro, and
breadfruit, as well as domesticated chickens throughout the Marshall Islands. ...
[RaLM-D∗ ]
Hit: ... which made the islands permanently habitable. They developed a complex technique for interisland
piloting by reading disruptions in ocean swells caused by low-lying coral atolls. In 1526, Spanish explorer Alonso
de Salazar ...
[RiLM]
Hit: ... which made the islands permanently habitable. They developed a complex technique for interisland
piloting by reading disruptions in ocean swells caused by low-lying coral atolls. In 1526, Spanish explorer Alonso
de Salazar ...
Turn 3
Type : Unchanged
User: I guess that means they were pretty isolated until Europeans arrived.
Expert: That’s correct. The first European who might have sighted the islands was Spanish explorer Alonso de
Salazar in 1526, and the first confirmed European landing was by Álvaro de Saavedra Cerón two years later on an
uninhabited island.
Evidence Text: In 1526, Spanish explorer Alonso de Salazar may have been the first European to sight the islands,
and Álvaro de Saavedra Cerón made landfall on an uninhabited island two years later.
Predictions :
Vanilla: That’s correct. The Marshall Islands were not discovered by Europeans until 1529, when Spanish explorer
Alvaro de Mendana de Neira arrived. He named the islands after his patron, the Spanish admiral Álvaro de Mendaña
de Neyra.
RaLM: That’s right. The islands were largely untouched by European explorers until the late 16th century.
RaLM-CP: That’s right. The islands were largely untouched by European explorers until the late 16th century.
RaLM-D∗ (Ours): That’s correct. The islands were largely untouched by European explorers until the 18th century.
RiLM (Ours): That’s correct. The islands were largely untouched by European explorers until the 18th century.
Retrieved Documents :
[Top-3]
(1) Hit: ... by low-lying coral atolls. In 1526, Spanish explorer Alonso de Salazar may have been the first European
to sight the islands, and Álvaro de Saavedra Cerón made landfall on an uninhabited island two years later. Several
other Spanish ...
(2) Miss: The majority of the citizens of the Republic of Marshall Islands are of Marshallese descent, ... around the 1st
century CE at the village of Laura on Majuro and on Kwajalein Atoll.
(3) Miss: The British sea captains John Marshall and Thomas Gilbert visited the islands in 1788. Their vessels had
been part of the First Fleet taking convicts from England to Botany Bay in New South Wales, ...
[RaLM-D∗ ]
Hit: ... by low-lying coral atolls. In 1526, Spanish explorer Alonso de Salazar may have been the first European to
sight the islands, and Álvaro de Saavedra Cerón made landfall on an uninhabited island two years later. Several
other Spanish ...
[RiLM]
Hit: ... by low-lying coral atolls. In 1526, Spanish explorer Alonso de Salazar may have been the first European to
sight the islands, and Álvaro de Saavedra Cerón made landfall on an uninhabited island two years later. Several
other Spanish ...
Turn 4
Type : Unchanged
User: And how did the Marshall Islands get their name?
Expert: The name comes from the British captains John Marshall and Thomas Gilbert who landed on Mili Atoll
in 1788. It was the first recorded European presence in over 200 years, and subsequent maps began to refer to the
islands by Captain Marshall’s name.
Evidence Text: Later maps and charts named the islands after Captain Marshall.
Predictions :
Vanilla: The islands were named after British explorer John Marshall, who was the first European to map them in
1788.
RaLM: The islands were named after Captain John Marshall, who was the first European to sight them.
RaLM-CP: The islands were named after Captain John Marshall, who was the first European to sight them.
RaLM-D∗ (Ours): The islands were named after Captain John Marshall, who was the first European to map the
islands in 1788.
RiLM (Ours): The islands were named after Captain John Marshall, who was the first European to map the islands
in 1788.
Retrieved Documents :
[Top-3]
(1) Hit: ... in the archipelago in over 200 years. Later maps and charts named the islands after Captain Marshall. ...
(2) Miss: The majority of the citizens of the Republic of Marshall Islands are of Marshallese descent, ... around the 1st
century CE at the village of Laura on Majuro and on Kwajalein Atoll.
(3) Miss: The British sea captains John Marshall and Thomas Gilbert visited the islands in 1788. Their vessels had
been part of the First Fleet taking convicts from England to Botany Bay in New South Wales, ...
[RaLM-D∗ ]
Hit: ... in the archipelago in over 200 years. Later maps and charts named the islands after Captain Marshall. ...
[RiLM]
Hit: ... in the archipelago in over 200 years. Later maps and charts named the islands after Captain Marshall. ...
Table 20: Case Study for Dialogue
Generate a Q&A pair based on a given context, where the context is understood but NOT DIRECTLY VISIBLE to the
person answering the question. Assume the person answering the question has common sense and is aware of the
details and key points in the paragraph, but the paragraph itself is not quoted or referenced directly.
Paragraph (a list of sentences): {paragraph}
Use the following instructions for generating a Q&A pair:
1) Provide a question, an answer, and a bounding box.
2) DON’T use phrases such as ‘according to the paragraph’ in your question.
3) An answer should be an entity or entities. Provide a SHORT ANSWER.
4) The bounding box for a paragraph is defined as (starting sentence index, ending sentence index): the bounding
box should be sufficiently large to encompass all the information necessary for a reader to FULLY infer the answer to
the question.
5) The sentence index starts from 0.
6) Generate a SINGLE Q&A pair.
Be sure to follow the following format and write your answer within curly brackets.
The format is as follows:
{Question}{Answer}{starting sentence index}{ending sentence index}
Table 21: Sample prompt for initial generation of a QA pair
Generate a Q&A pair based on a given context, where the context is understood but NOT DIRECTLY VISIBLE to the
person answering the question. Assume the person answering the question has common sense and is aware of the
details and key points in the sentence(s), but the sentence(s) itself is not quoted or referenced directly.
Sentence(s): {sentences}
Use the following instructions for generating a Q&A pair:
1) Provide a question, and an answer.
2) DON’T use phrases such as ‘according to the sentence(s)’ in your question.
3) An answer should be an entity or entities. Provide a SHORT ANSWER.
4) Generate a SINGLE Q&A pair.
Be sure to follow the following format and write your answer within curly brackets.
The format is as follows:
{Question}{Answer}
Table 22: Sample prompt for generation of New QA pair (1)
Generate a Q&A pair based on New Sentence(s), where the context is understood but NOT DIRECTLY VISIBLE to
the person answering the question. You can reference the Source Content for broader context, but the Q&A pair
should relate directly to the information in New Sentence(s).
New Sentence(s): {sentences}
Source Content : {source content}
Use the following instructions for generating a Q&A pair:
1) Provide a question, and an answer.
2) DON’T use phrases such as ‘according to the sentence(s)’ in your question.
3) An answer should be an entity or entities. Provide a SHORT ANSWER.
4) Generate a SINGLE Q&A pair.
Be sure to follow the following format and write your answer within curly brackets.
The format is as follows:
{Question}{Answer}
Table 23: Sample prompt for generation of New QA pair (2)
Identify the contradiction between two following sentences and generate a Q&A pair that reflects this contradiction.
The question should be answerable based on each sentence(s), but the two answers should CONTRADICT EACH
OTHER. You can reference the Source Content for broader context, but the Q&A pair should relate directly to the
information in Old/New Sentence(s).
Old Sentence(s) : {old sentence}
New Sentence(s) : {new sentence}
Source Content : {source content}
Use the following instructions for generating a Q&A pair:
1) The question should be answerable based on each sentence.
2) DON’T use phrases such as ‘according to the sentence(s)’ in your question.
3) An answer should be an entity or entities. Provide a SHORT ANSWER.
4) Create a SINGLE Q&A pair, providing two CONTRADICTORY answers: one based on the old sentence, and
another based on the new sentence.
Be sure to follow the following format and write your answer within curly brackets.
The format is as follows:
{Question}{Answer based on Old Sentence}{Answer based on New Sentence}
Table 24: Sample prompt for generation of Changed QA pair
Create an Information Dialogue Dataset about {topic} between two conversation partners (User, Expert).
A paragraph about {topic} will be provided as factual information. The expert’s words must be generated to provide
an answer based on this information.
Using the following instruction for generating a dialogue:
1) The user starts the dialogue first
2) Create a multi-turn dialogue of 3-4 turns, each consisting of a not too long conversation.
3) Create it to include each element of conversation, discussion, and QA. In other words, users should not always ask
questions using interrogative sentences.
4) DON’T use phrases such as according to the paragraph in guide’s utterance.
5) DON’T simply parrot this paragraph or referenced directly. There is no need to include everything given in the
paragraph in the dialogue.
6) Do not use what you already know about {topic}, and the Expert will answer only with the content of the provided
paragraph.
7) I will provide you with sentences and a unique number for each sentence. You must indicate the Sentence number
you’ve referenced for each turn.
Below is an example of output format and dialogues:
{{Reference Sentence}}2{{User}}I really love Granny Smith apples, they’re my favorite type of apple{{Expert}}I love
granny smith apples. they have hard, light green skin and a crisp flesh.
{{Reference Sentence}}1{{User}}Yes, I really enjoy them. I also like Honeycrisp apples but they’re so expen-
sive!{{Expert}}they’ve been grown for thousands of years in asia and europe, and were brought to north america by
european colonists
{{Reference Sentence}}3{{User}}Oh really? They’ve been around way longer than I thought!{{Expert}}they’re also
consumed raw, it’s one of the most popular cooking apples.
Sentences:
{sentences}
Please generate dialogue:
Table 25: Sample prompt for Dialogue generation
[Figure 4 is a bar chart of the number of articles in each Wikipedia category; the y-axis ranges from 0 to 900. The individual category labels are not recoverable from this extraction.]

Figure 4: Article Categories Overview