A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Current Large Language Models (LLMs) are not only limited to some maximum context length, but also
are not able to robustly consume long inputs. To address these limitations, we propose ReadAgent,
an LLM agent system that increases effective context length up to 20× in our experiments. Inspired
by how humans interactively read long documents, we implement ReadAgent as a simple prompting
system that uses the advanced language capabilities of LLMs to (1) decide what content to store together
in a memory episode, (2) compress those memory episodes into short episodic memories called gist
memories, and (3) take actions to look up passages in the original text if ReadAgent needs to remind
itself of relevant details to complete a task. We evaluate ReadAgent against baselines using retrieval
methods, using the original long contexts, and using the gist memories. These evaluations are performed
on three long-document reading comprehension tasks: QuALITY, NarrativeQA, and QMSum. ReadAgent
outperforms the baselines on all three tasks while extending the effective context window by 3–20×.
We posit that an underlying reason for this gap is inherent in the differences in reading approaches. Typically, we use LLMs to consume the exact given content word-by-word, and the process is relatively passive. On the other hand, humans read and reason over long text differently. First, the exact information tends to be forgotten quickly, whereas the fuzzier gist information, i.e., the substance irrespective of exact words, from past readings lasts much longer [34, 31, 33]¹. Second, human reading is an interactive process. When we need to remind ourselves of relevant details in order to complete a task, such as answering a question, we look them up in the original text.

Figure 1 | ReadAgent workflow.

We think that using the fuzzy gist memory to capture global context and attending to local details together enables humans to reason over very long context efficiently, in terms of how much information to process at once, and is also important for comprehension. For example, if we were to infer the intention of a fictional character's specific action described on a page in a novel, besides focusing on the surrounding pages, we likely also need to understand the overall story and

¹ Fuzzy-trace theory [34] posits that people form two types of memory representations about a past event – verbatim and gist memories. Gist memories, often episodic, are fuzzy memories of past events, whereas verbatim memories contain details of past events. People prefer to reason with gists rather than with verbatim memories [32].
a tree of different levels of summaries to search for task-related information. However, the hierarchical summary structure makes it difficult to reason over related but distant information at the same granularity (see Appendix F for more discussion).

Memory Gisting   For each page, we prompt the LLM to shorten the exact content into a gist, or summary, as follows.

Example Gisting Prompt
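Purely as an illustration of this gisting pass (a minimal sketch, not the authors' prompt or code), the loop over pages might look like the following, where llm is a hypothetical text-completion call and pages is the list of page texts produced by episode pagination:

```python
from typing import Callable, List

def gist_pages(pages: List[str], llm: Callable[[str], str]) -> List[str]:
    """Compress each page independently into a short gist (one LLM call per page)."""
    gists = []
    for i, page in enumerate(pages):
        # Hypothetical shortening instruction; the paper's actual gisting prompt is not shown here.
        prompt = ("Please shorten the following passage. "
                  "Just give me a shortened version, without explanation.\n\n" + page)
        gists.append(f"<Page {i}>\n" + llm(prompt))  # tag each gist with its page number
    return gists
```

Tagging each gist with its page number lets the later look-up prompts refer to pages by index.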
Example Parallel Lookup Prompt (ReadAgent-P)

The following text is what you remember from reading an article and a multiple choice question related to it.
You may read 1 to 5 page(s) of the article again to refresh your memory to prepare yourself for the question.
Please respond with which page(s) you would like to read.
For example, if you only need to read Page 8, respond with “I want to look up Page [8] to ...”; if you would like to read Page 7 and 12, respond with “I want to look up Page [7, 12] to ...”; if you would like to read Page 2, 3, 7, 15 and 18, respond with “I want to look up Page [2, 3, 7, 15, 18] to ...”.
DO NOT select more pages if you don’t need to.
You don’t need to answer the question yet.
Text:
{GIST MEMORY}
Question:
{QUESTION}

The selected raw pages replace the gist(s) at the corresponding positions in memory, preserving the overall narrative flow. Then we prompt the LLM again with the task and the updated memory and ask it to solve the task.
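As a minimal sketch of this step (an illustration under assumptions, not the paper's implementation), the bracketed page list in the model's reply can be parsed with a regular expression and the chosen raw pages spliced in place of their gists while keeping the original page order:

```python
import re
from typing import List

def parse_lookup_reply(reply: str) -> List[int]:
    """Extract page numbers from a reply like 'I want to look up Page [2, 3, 7] to ...'."""
    match = re.search(r"\[([\d,\s]+)\]", reply)
    if match is None:
        return []
    return [int(tok) for tok in match.group(1).split(",") if tok.strip()]

def expand_memory(gists: List[str], pages: List[str], selected: List[int]) -> str:
    """Replace the selected gists with their raw pages, preserving narrative order."""
    chosen = set(selected)
    return "\n".join(pages[i] if i in chosen else gists[i] for i in range(len(gists)))
```

The expanded memory is then placed into a final question-answering prompt together with the task.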
ReadAgent-S   We also study the sequential look-up strategy, where the model requests one page at a time, up to some maximum number of pages. In sequential look-up, the model gets to see the previously expanded pages before deciding which page to expand. This gives the model access to more information than parallel look-up, so we might expect it to perform better in some situations. However, the larger number of interactions with the model increases the computational cost, so sequential look-up should only be used on tasks where it provides clear benefits.
Example Sequential Lookup Prompt (ReadAgent-S)

The following text is what you remember from reading a meeting transcript, followed by a question about the transcript. You may read multiple pages of the transcript again to refresh your memory and prepare to answer the question. Each page that you re-read can significantly improve your chance of answering the question correctly.
Please specify a SINGLE page you would like to read again or say "STOP". To read a page again, respond with “Page $PAGE_NUM”, replacing $PAGE_NUM with the target page number. You can only specify a SINGLE page in your response at this time. To stop, simply say “STOP”.
DO NOT answer the question in your response.
Text:
{GISTS WITH IN-LINE EXPANDED PAGES}
Pages re-read already (DO NOT ask to read them again):
{LIST OF PAGE NUMBERS ALREADY READ}
Question:
{QUESTION}
Specify a SINGLE page to read again, or say STOP:
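A minimal sketch of the sequential look-up loop implied by this prompt (assuming a hypothetical llm call; not the authors' code): each iteration re-renders the memory with the already-expanded pages in-line, asks for a single additional page, and stops on “STOP”, on a repeated or invalid page, or once the look-up budget is exhausted.

```python
import re
from typing import Callable, List

def sequential_lookup(gists: List[str], pages: List[str], question: str,
                      llm: Callable[[str], str], max_lookups: int = 6) -> str:
    """Expand one page per step until the model says STOP or the look-up cap is reached."""
    expanded: List[int] = []

    def render_memory() -> str:
        return "\n".join(pages[i] if i in expanded else gists[i] for i in range(len(gists)))

    for _ in range(max_lookups):
        # Abbreviated version of the sequential look-up prompt shown above.
        prompt = (f"Text:\n{render_memory()}\n"
                  f"Pages re-read already (DO NOT ask to read them again): {expanded}\n"
                  f"Question:\n{question}\n"
                  "Specify a SINGLE page to read again, or say STOP:")
        reply = llm(prompt)
        if "STOP" in reply.upper():
            break
        match = re.search(r"Page\s*(\d+)", reply)
        if match is None or int(match.group(1)) in expanded or int(match.group(1)) >= len(pages):
            break
        expanded.append(int(match.group(1)))
    return render_memory()
```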
3.3. Computational Overhead and Scalability

Episode pagination, memory gisting and interactive look-ups require iterative inference, which is a possible computational overhead. However, as we show in the following, the overhead is bounded linearly by a small factor, making our approach scale well with input length.

Pagination: In theory, an LLM could read a document and directly provide the pagination in a single pass, so the minimum number of words the LLM must process is the length of the document. Our pagination algorithm splits the document into chunks of at most max_words, and then guarantees that at least min_words are consumed at each step. Thus, the ratio max_words/min_words gives an upper bound on how many times the word length of the document the LLM must process using our algorithm. Gisting: Memory gisting is one additional pass of the raw input words, since each page is gisted independently. Retrieval: Parallel look-ups are conditioned on gists instead of the full text, and thus will be much shorter than one pass of the raw input words. Each step of a sequential look-up is similar to parallel look-ups, and the overall cost is capped by the maximum number of look-ups allowed. Response: Finally, answering is also similar to parallel look-ups. There is additional overhead from the prompt templates, of course.

For example, in our QMSum ReadAgent-P 6 page experiments, max_words/min_words ≈ 2, the gist memory is less than 0.2× the original context, and the retrieved pages increase that to 0.3×, so the LLM processes ∼3.5× the original words.
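Spelling out that estimate, the total input processed is roughly

max_words/min_words (pagination, ≈ 2) + 1 (gisting) + < 0.2 (parallel look-up over gists) + ≈ 0.3 (answering over gists plus retrieved pages) ≈ 3.5

passes over the original words.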
3.4. ReadAgent Variants

In Appendix E, we discuss variants of ReadAgent that can be useful in different problem settings, including when the target task is known prior to reading the long document. In Appendix D, we describe adapting ReadAgent to work in the web navigation setting.

4. Experiments

We evaluate ReadAgent’s long-document reading comprehension ability on three long-context question-answering challenges: QuALITY [27], NarrativeQA [21] and QMSum [50]. Although ReadAgent does not require any model training, we develop the proposed method on the training sets and test on the validation, test and/or development sets to avoid any risk of overfitting system hyperparameters.

In this work, we primarily use the instruction-tuned PaLM 2-L [2] for our experiments and evaluation. The context length of PaLM 2-L is 8K tokens. Details of the model can be found in Anil et al. [2]. Additionally, we provide GPT-3.5² results in Appendix B, and experimental results on the web navigation setting in Appendix D.

One important performance measure of the techniques considered here is the compression rate (CR). We define this as CR ≡ 100 × (1 − word-count(in-context text) / word-count(full-context text)) at the final query.
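For instance, taking the QuALITY averages reported in Section 4.3.1 as a rough illustration (gist memories of about 650 words for documents averaging 4,122 words),

CR ≈ 100 × (1 − 650 / 4,122) ≈ 84.2%,

which is consistent with the 84.24% gist compression rate reported for QuALITY below.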
4.1. LLM Raters

NarrativeQA and QMSum both have one or more free-form reference responses. They are typically evaluated using syntactic matching metrics such as ROUGE [24] F-Measure. We additionally evaluate these datasets using an automatic LLM Rater as an alternative to human evaluation, similar to Peng et al. [29], Chiang et al. [9], Zheng et al. [49], Chiang and Lee [8].

In our implementation, we prompt the LLM to look at the question or instruction and compare the model’s answer to the reference answer. The “Strict LLM Rater Prompt” shown below is for judging whether there is an exact match, and the “Permissive LLM Rater Prompt” is for judging whether there is an exact match or a partial match. We apply both prompts to all model responses. If either rater decides there is an exact match, we count it as an exact match. If the strict rater is negative but the permissive rater detects a partial match, we count it as a partial match. Otherwise, it’s not a match. In the case that there are multiple reference answers, the response is compared against each reference answer in turn, and the highest rating is returned.

Based on these raters, we define two different scores: LLM-Rating-1 (LR-1) is a strict evaluation score, where we count the percentage of exact matches over all examples; LLM-Rating-2 (LR-2) is permissive, where we count the percentage of exact and partial matches.
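As a minimal sketch of how these two raters might be combined into LR-1 and LR-2 (hypothetical helper outputs; not the authors' evaluation code):

```python
from typing import List

RANK = {"none": 0, "partial": 1, "exact": 2}

def combine_raters(strict_answer: str, permissive_answer: str) -> str:
    """Combine a strict rating (YES/NO) with a permissive rating (Yes / Yes, partially / No)."""
    strict = strict_answer.strip().lower()
    permissive = permissive_answer.strip().lower()
    permissive_exact = permissive.startswith("yes") and not permissive.startswith("yes, partially")
    if strict.startswith("yes") or permissive_exact:
        return "exact"       # either rater judged an exact match
    if permissive.startswith("yes, partially"):
        return "partial"     # strict was negative, but a partial match was detected
    return "none"

def best_over_references(ratings: List[str]) -> str:
    """With multiple reference answers, keep the highest rating."""
    return max(ratings, key=RANK.get)

def llm_ratings(final_ratings: List[str]) -> tuple:
    """LR-1 = % exact matches; LR-2 = % exact or partial matches."""
    n = len(final_ratings)
    lr1 = 100.0 * sum(r == "exact" for r in final_ratings) / n
    lr2 = 100.0 * sum(r != "none" for r in final_ratings) / n
    return lr1, lr2
```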
Strict LLM Rater Prompt

After reading some text, John was given the following question about the text:
{QUESTION TEXT}
John’s answer to the question was:
{MODEL RESPONSE TEXT}
The ground truth answer was:
{REFERENCE RESPONSE TEXT}
Does John’s answer agree with the ground truth answer? Please answer YES or NO.

Permissive LLM Rater Prompt

After reading some text, John was given the following question about the text:
{QUESTION TEXT}
John’s answer to the question was:
{MODEL RESPONSE TEXT}
The ground truth answer was:
{REFERENCE RESPONSE TEXT}
Does John’s answer agree with the ground truth answer? Please answer “Yes”, “Yes, partially”, or “No”. If John’s response has any overlap with the ground truth answer, answer “Yes, partially”. If John’s response contains the ground truth answer, answer “Yes”. If John’s response is more specific than the ground truth answer, answer “Yes”.

4.2. Baseline Methods

Retrieval-Augmented Generation (RAG)   As discussed in Section 2, RAG [23] is a popular approach to extend access to a large amount of text beyond what can fit in the LLM context window. In this paper we compare ReadAgent to RAG baselines using conventional retrieval methods to find relevant “pages” in a long text, where we reuse the pages generated by ReadAgent. We consider two relevance methods: Okapi BM25 [35] and neural retrieval based on the Gemini API embedding model (models/embedding-001)³. The neural retrieval relevance score is defined as the dot product between the question embedding vector and each page (or gist memory embedding vector in the case of NarrativeQA, see Section 4.3.2). For reading comprehension tasks, the pages are ranked by relevance to each question, and we prompt the LLM to look at the top-k pages as context for answering the question. In most retrieval settings, the database of documents is quite large, which makes the retrieval task more challenging. In our setting, ReadAgent and retrieval methods all use a per-document database, rather than per-dataset. For example, in QuALITY, there are hundreds of articles, each with multiple questions. The database for retrieval in each question is only the extracted pages from the corresponding article (typically less than 20 pages), rather than the thousands of pages from the entire dataset.
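As an illustrative sketch of this neural retrieval ranking (assuming a hypothetical embed function that returns an embedding vector for a text; not the exact evaluation code):

```python
from typing import Callable, List, Sequence

def top_k_pages(question: str, pages: List[str],
                embed: Callable[[str], Sequence[float]], k: int = 4) -> List[int]:
    """Rank pages by dot-product relevance to the question and return the top-k page indices."""
    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    q = embed(question)
    # For NarrativeQA, gist embeddings would stand in for page embeddings (see Section 4.3.2),
    # but the retrieved context is still the corresponding raw pages.
    scores = [dot(q, embed(page)) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:k]
```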
Full or Truncated Text Content   The maximum length of QuALITY dev articles is ∼6,000 words, which can fit into the PaLM 2-L context window. This allows us to evaluate ReadAgent against directly using the full long document for long-context reading comprehension. The maximum length of QMSum is over 26,000 words. Consequently, we choose to truncate the text to close to the context window limit (6,000 words for PaLM 2-L experiments) to ensure that the truncated text fits in the LLM’s context, though this would generally be a weaker baseline. Finally, since the average length of NarrativeQA documents significantly exceeds the context window, it is less meaningful to perform the truncated-context comparison.

² http://openai.com/api/
³ https://ai.google.dev/models/gemini

Gist Memory   We can also attempt to solve the given task by reasoning directly over the gist memory. Doing so helps us understand not only the importance of interactive look-up but also how using the LLM-compressed information alone compares to the full content and retrieval baselines.

4.3. Long-Context Reading Comprehension

4.3.1. QuALITY

QuALITY [27] is a four-way multiple choice question answering challenge with text data from several different sources. QuALITY is evaluated using accuracy, with 25% corresponding to chance performance.

The dev set has an average length of 4,122 words and a maximum of 5,967. The gist memory has an average length of 650 words and a maximum of 1,264. Figure 2 shows the word statistics for the original text and the gists. The compression rate of the gists is 84.24%. See Appendix G for QuALITY pagination hyperparameters.

Figure 2 | Histogram of QuALITY word counts for the original text and the gists.

Table 1 shows the experimental results on QuALITY, where ReadAgent (Look up 1-5 pages) gives the best results with a compression rate of 66.97% (meaning that ∼3× as many tokens can fit in the context window after gisting). The performance increases as we increase the maximum number of pages allowed for look-up, up to 5 pages. At 6 pages, we see that the performance starts to degrade slightly, indicating that allowing 6 pages of context may be increasing the rate of distracting information.

Method CR (# LU) Accuracy
BM25 Retrieval
Top-1 89.96% (1) 70.55% ± 0.07
Top-2 80.25% (2) 78.38% ± 0.10
Top-3 70.98% (3) 81.59% ± 0.17
Top-4 61.90% (4) 84.28% ± 0.15
Neural Retrieval with Gemini API
Top-1 90.72% (1) 70.98% ± 0.06
Top-2 81.88% (2) 79.56% ± 0.10
Top-3 73.13% (3) 83.11% ± 0.11
Top-4 64.50% (4) 84.98% ± 0.06
Full Raw Content 0% 85.83% ± 0.19
GistMem 84.24% 77.95% ± 0.08
ReadAgent-P
Look up 1 pg 76.63% (1.0) 83.80% ± 0.17
Look up 1-2 pgs 72.17% (1.6) 84.95% ± 0.07
Look up 1-3 pgs 69.23% (2.0) 85.46% ± 0.07
Look up 1-4 pgs 67.72% (2.2) 86.31% ± 0.18
Look up 1-5 pgs 66.26% (2.4) 86.63% ± 0.10
Look up 1-6 pgs 64.63% (2.7) 86.40% ± 0.07
ReadAgent-S 1-6 pgs 60.27% (3.3) 86.88% ± 0.06

Table 1 | QuALITY results on the dev set of 230 docs and 2086 questions using PaLM 2-L. CR is the compression rate. # LU is the number of lookups. We report means and standard deviations across 3 runs. We omit standard deviations for CR and # LU for presentation purposes; they were all inconsequential.

Notably, ReadAgent outperforms using the full original text, which could have been an upper bound on the performance – every other method reduces the amount of text the LLM considers before generating its response. However, this is not a surprising result. Prior work shows that current LLMs are not able to effectively use the full long context window [25], potentially due to training data sparsity, and distracting information can also reduce performance [37, 42].

4.3.2. NarrativeQA

NarrativeQA [21] has the longest context length on average among the three reading comprehension datasets we choose. The dataset is divided into books (Gutenberg) and movie scripts. The Gutenberg test set has 70,619 words on average, and the maximum is 343,910 words; the movie scripts test set has 29,963 words on average, and the maximum is 63,957 words. As the reference answers are free-form, we evaluate based on ROUGE [24] and the LLM Ratings (Section 4.1). The original main texts are replaced with the HTML-stripped version from SCROLLS [36].

Because of the length of NarrativeQA articles, in order to fit the gists into the context window, we significantly expand the page size, resulting in stronger compression (Section 3.1). For example, the Gutenberg gists from the test set have 2,217 words on average and the maximum is 6,471 words, whereas the movie script gists have 2,155 words on average and the maximum is 4,511 words. Figures 4 and 5 (appendix) show the word statistics for the original text and the gists in Gutenberg and movie scripts respectively. The compression rate of the gists is 96.80% for Gutenberg texts and 91.98% for movie scripts. See Appendices G and H for NarrativeQA pagination hyperparameters and more details.
Gutenberg Validation (58 docs & 1743 questions) Gutenberg Test (177 docs & 5207 questions)
Method CR (# LU) LR-1 LR-2 R-1 R-2 R-L CR (# LU) LR-1 LR-2 R-1 R-2 R-L
BM25 Retrieval
Top-1 97.63% (1) 39.01% 50.14% 0.166 0.061 0.156 97.42% (1) 43.5% 55.33% 0.176 0.065 0.165
Top-2 95.24% (2) 49.34% 60.76% 0.203 0.079 0.191 94.80% (2) 51.70% 64.53% 0.206 0.082 0.194
Top-3 93.34% (3) 52.73% 63.68% 0.208 0.080 0.195 93.02% (3) 52.97% 66.03% 0.210 0.083 0.197
Top-4 92.47% (4) 53.59% 64.26% 0.211 0.082 0.197 92.27% (4) 53.60% 66.16% 0.210 0.084 0.197
Neural Retrieval with Gemini API
Top-1 98.19% (1) 34.25% 46.53% 0.146 0.051 0.134 98.14% (1) 36.47% 47.8% 0.150 0.054 0.140
Top-2 96.30% (2) 44.69% 54.96% 0.180 0.069 0.167 96.15% (2) 44.48% 56.17% 0.182 0.070 0.170
Top-3 94.62% (3) 46.24% 57.31% 0.191 0.077 0.178 94.42% (3) 48.97% 60.73% 0.195 0.076 0.183
Top-4 93.45% (4) 48.59% 59.21% 0.196 0.079 0.184 93.25% (4) 50.62% 62.05% 0.203 0.080 0.191
GistMem 96.89% 55.31% 68.22% 0.233 0.091 0.218 96.80% 55.79% 71.19% 0.231 0.092 0.217
ReadAgent-P
Look up 1 pg 95.15% (0.94) 58.92% 71.89% 0.244 0.101 0.230 94.84% (0.93) 59.98% 73.23% 0.240 0.098 0.226
Look up 1-2 pgs 94.79% (1.23) 59.84% 72.29% 0.239 0.098 0.224 94.36% (1.34) 59.19% 72.65% 0.231 0.091 0.218
Look up 1-3 pgs 94.39% (1.50) 59.84% 71.89% 0.240 0.098 0.226 94.03% (1.61) 59.63% 72.84% 0.230 0.093 0.217
ReadAgent-S 1-2 pgs 94.35% (1.38) 57.89% 71.14% 0.239 0.097 0.225 93.86% (1.46) 60.48% 72.48% 0.232 0.095 0.219
ReadAgent-S 1-3 pgs 94.08% (1.57) 58.52% 71.49% 0.242 0.098 0.229 93.67% (1.57) 60.55% 72.79% 0.231 0.095 0.219
Movie Validation (57 docs & 1699 questions) Movie Test (172 docs & 5139 questions)
BM25 Retrieval
Top-1 97.07% (1) 32.67% 42.61% 0.156 0.058 0.144 96.61% (1) 33.64% 43.34% 0.154 0.054 0.143
Top-2 94.12% (2) 39.97% 50.21% 0.187 0.070 0.174 93.81% (2) 42.50% 53.05% 0.191 0.072 0.178
Top-3 91.18% (3) 43.61% 53.91% 0.198 0.077 0.185 91.00% (3) 46.97% 57.52% 0.207 0.080 0.193
Top-4 88.24% (4) 46.85% 57.62% 0.210 0.084 0.198 88.19% (4) 50.18% 60.13% 0.217 0.085 0.202
Neural Retrieval with Gemini API
Top-1 97.07% (1) 32.02% 41.44% 0.153 0.053 0.142 96.67% (1) 37.24% 46.22% 0.130 0.043 0.118
Top-2 94.19% (2) 43.20% 51.38% 0.160 0.057 0.148 93.90% (2) 46.49% 54.60% 0.164 0.061 0.151
Top-3 91.29% (3) 47.56% 56.21% 0.176 0.064 0.163 91.14% (3) 50.69% 58.92% 0.186 0.071 0.172
Top-4 88.38% (4) 49.09% 59.33% 0.193 0.075 0.180 88.36% (4) 52.13% 59.41% 0.184 0.072 0.171
GistMem 92.09% 52.56% 64.39% 0.242 0.103 0.227 91.98% 54.68% 64.00% 0.248 0.105 0.234
ReadAgent-P
Look up 1 pg 89.20% (0.99) 53.38% 65.57% 0.247 0.106 0.233 89.22% (0.98) 57.68% 68.01% 0.274 0.116 0.260
Look up 1-2 pgs 87.68% (1.52) 54.62% 65.63% 0.238 0.098 0.223 88.10% (1.39) 58.24% 68.81% 0.270 0.115 0.255
Look up 1-3 pgs 86.57% (1.91) 54.91% 65.86% 0.241 0.099 0.225 86.73% (1.89) 58.82% 69.12% 0.272 0.116 0.257
ReadAgent-S 1-2 pgs 86.36% (1.98) 59.33% 68.28% 0.203 0.082 0.188 85.92% (1.98) 63.33% 72.06% 0.214 0.086 0.199
ReadAgent-S 1-3 pgs 83.56% (2.95) 59.45% 68.81% 0.210 0.087 0.195 83.18% (2.95) 64.53% 73.06% 0.217 0.090 0.202
Table 2 | NarrativeQA results (PaLM 2-L). CR is the compression rate. # LU is the number of lookups. R-1, R-2,
and R-L are ROUGE F-Measures. LR-1, and LR-2 are LLM-Ratings.
For the neural retrieval models, we use the gist memory embedding vectors rather than the page embedding vectors because the Gemini API embedding model is limited to 10,000 characters (or less than 2,000 tokens, in expectation), which is too short for embedding full pages in our NarrativeQA experiments. However, using those embedding vectors, we then return the original pages to the LLM context as normal, and use those pages as described in Section 4.2.

Because the Gutenberg texts and the movie scripts have significantly different distributions, we present the results separately. The results are shown in Table 2. ReadAgent again outperforms all the baselines across all subsets of NarrativeQA.

In Tables 3 and 11, we see that performance improves as the compression rate decreases, so techniques that look up more pages tend to do better than techniques that look up fewer pages. We also see that ReadAgent-S substantially outperforms ReadAgent-P (and all baselines). This performance improvement comes at a cost of up to six times as many requests in the retrieval phase. Since other datasets don’t have such a strong performance improvement, we suspect that QMSum is in some sense a more challenging dataset, requiring the model to actively search through the gisted transcript to locate relevant information. This hypothesis seems reasonable, as meeting transcripts are much less structured than the documents, books, and movies found in QuALITY and NarrativeQA.
Method CR (# LU) LLM Rating-1 LLM Rating-2 ROUGE-1 ROUGE-2 ROUGE-L Resp. Length
BM25 Retrieval
Top-1 95.69% (1.00) 32.48% ± 1.65 63.85% ± 1.51 27.53 ± 0.23 7.00 ± 0.14 18.45 ± 0.16 48.62 ± 0.28
Top-2 91.48% (2.00) 29.41% ± 0.60 71.57% ± 1.48 28.85 ± 0.17 7.59 ± 0.08 19.34 ± 0.14 52.39 ± 0.49
Top-3 86.93% (3.00) 34.80% ± 1.14 79.53% ± 0.35 30.69 ± 0.17 8.40 ± 0.11 20.64 ± 0.13 53.59 ± 0.35
Top-4 82.55% (4.00) 35.66% ± 0.30 81.13% ± 0.35 31.10 ± 0.10 8.53 ± 0.06 20.36 ± 0.11 54.96 ± 0.42
Top-5 78.13% (5.00) 39.09% ± 0.92 84.44% ± 0.46 31.16 ± 0.14 8.52 ± 0.08 20.69 ± 0.03 54.52 ± 0.13
Top-6 73.97% (6.00) 37.87% ± 0.90 83.70% ± 0.87 31.06 ± 0.04 8.38 ± 0.06 20.43 ± 0.08 56.18 ± 0.44
Neural Retrieval with Gemini API
Top-1 95.99% (1.00) 34.80% ± 1.39 68.87% ± 0.62 27.86 ± 0.12 7.12 ± 0.04 18.76 ± 0.09 49.46 ± 0.23
Top-2 92.02% (2.00) 40.32% ± 0.92 81.50% ± 0.46 30.17 ± 0.08 8.03 ± 0.03 19.80 ± 0.08 55.48 ± 0.27
Top-3 87.93% (3.00) 40.93% ± 1.35 85.17% ± 1.25 31.36 ± 0.12 8.67 ± 0.10 20.68 ± 0.10 56.71 ± 0.27
Top-4 83.71% (4.00) 40.56% ± 0.62 84.31% ± 0.87 31.52 ± 0.11 8.59 ± 0.10 20.40 ± 0.10 56.47 ± 0.71
Top-5 79.47% (5.00) 40.20% ± 0.76 86.76% ± 0.60 31.32 ± 0.11 8.49 ± 0.11 20.49 ± 0.07 56.73 ± 0.91
Top-6 75.44% (6.00) 40.81% ± 0.52 87.01% ± 0.35 31.92 ± 0.02 8.73 ± 0.09 20.82 ± 0.05 58.39 ± 0.31
Truncated Raw Content
First 6k words 32.59% (0.00) 14.71% ± 0.79 52.45% ± 0.69 25.42 ± 0.05 4.98 ± 0.09 16.58 ± 0.10 58.42 ± 0.11
Last 6k words 32.38% (0.00) 10.42% ± 0.62 35.66% ± 2.46 20.69 ± 0.19 3.44 ± 0.10 14.13 ± 0.08 44.23 ± 0.11
GistMem 83.13% (0.00) 40.20% ± 0.96 89.83% ± 0.76 31.00 ± 0.09 7.99 ± 0.04 20.15 ± 0.08 65.75 ± 0.20
ReadAgent-P
Look up 1 pg 80.00% (0.98) 40.56% ± 0.46 89.46% ± 1.48 31.26 ± 0.09 8.22 ± 0.15 20.29 ± 0.07 63.78 ± 1.13
Look up 1-2 pgs 77.38% (1.71) 39.71% ± 1.87 89.71% ± 0.60 31.11 ± 0.04 8.01 ± 0.15 20.21 ± 0.04 64.73 ± 1.02
Look up 1-3 pgs 75.07% (2.53) 38.36% ± 1.21 89.71% ± 0.60 31.50 ± 0.29 8.15 ± 0.15 20.45 ± 0.24 63.91 ± 1.58
Look up 1-4 pgs 73.48% (3.08) 39.95% ± 1.51 90.56% ± 0.35 31.34 ± 0.05 8.08 ± 0.18 20.26 ± 0.07 63.40 ± 0.79
Look up 1-5 pgs 72.29% (3.50) 37.99% ± 0.96 87.75% ± 0.46 31.16 ± 0.10 8.06 ± 0.05 20.35 ± 0.12 65.22 ± 1.40
Look up 1-6 pgs 70.90% (3.97) 39.09% ± 2.04 88.24% ± 0.60 31.50 ± 0.30 8.05 ± 0.13 20.26 ± 0.13 66.70 ± 0.62
ReadAgent-S 1-6 pgs 70.34% (3.55) 46.57% ± 0.87 91.54% ± 0.30 32.90 ± 0.17 8.87 ± 0.23 21.15 ± 0.14 68.87 ± 0.60
Table 3 | QMSum validation results (PaLM 2-L) means and standard deviations across 3 runs. 35 articles and
272 questions. CR is the compression rate. # LU is the number of lookups. Resp. Length is the length in words
of the model’s final response.
the length of the texts increases (corresponding to the compression rates decreasing), the response lengths increase as well. Longer response lengths result in lower ROUGE precision values, which pushes down the F-Measures. Consequently, for the ROUGE scores to increase as text length increases, the improvement to recall must be more substantial than the reduction to precision. This happens to some extent, but the effect size is small. Furthermore, including gists in the text substantially increases the response length, as is the case for GistMem and all the ReadAgent approaches. This increase is in spite of the fact that all models use the same question-answering prompt, so there is no prompt difference to cause the increased response lengths. This makes it much more challenging for GistMem and ReadAgent to outperform the retrieval methods in ROUGE score. Nevertheless, ReadAgent-S manages to have the highest ROUGE scores as well as the highest LLM ratings. Because of these issues with ROUGE, we consider the LLM ratings to be more informative for comparisons between these runs. However, the LLM ratings do not make it easy to compare with results using a different LLM to rate, such as GPT, and they also do not allow for easy comparisons with other works. The same observation applies to the NarrativeQA results above.

4.4. Ablation Study and Analysis

We provide additional ablation studies in Appendix A.

Retrieval Quality   In Table 4, we compare using GistMem with neural retrieval to look up one page with using ReadAgent to look up one page. This is equivalent to replacing ReadAgent’s prompt-based retrieval with neural retrieval. ReadAgent’s retrieval performs better here.

Method Accuracy
GistMem + Neural Retrieval Top-1 82.74%
ReadAgent-P (Look up 1 pg) 83.80%

Table 4 | ReadAgent vs. GistMem with neural retrieval.

5. Conclusion

We have presented ReadAgent, a simple interactive prompting system to mitigate the context length and context use limitations of current LLMs. ReadAgent outperforms other strong zero-shot (i.e., not trained or finetuned on the training set) baselines across standard performance metrics of accuracy or ROUGE scores. These results demonstrate that LLMs are capable of generating compressed textual representations of long contexts that are useful for tasks that humans think are important, even without knowing those tasks ahead of time. That is, the LLM can generate broadly useful gist memories even before knowing what questions are going to be asked about the text that is being gisted. The results also demonstrate that LLMs are capable of reasoning interactively over such compressed representations, using them to decide what information needs to be retrieved to most effectively perform a known task. This method can increase the effective context length by up to 20× while outperforming conventional retrieval techniques.
However, this approach does not give infinite context lengths, nor does it guarantee good performance when the gist memory itself is extremely long. Future work will need to address these fundamental limitations in LLMs.

Acknowledgements

The authors thank Sergey Ioffe, Rif A. Saurous, Yujin Tang, Sergio Guadarrama, Daliang Li, Felix Yu, and Rob Fergus for valuable feedback and discussion.

References

[7] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia. LongLoRA: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307, 2023.

[8] C.-H. Chiang and H.-y. Lee. Can large language models be an alternative to human evaluations? In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.870.

[9] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.

[10] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2Web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023.

[11] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston. Wizard of Wikipedia: Knowledge-powered conversational agents, 2019.

[16] C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, and S. Wang. LM-Infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023.

[17] P. He, X. Liu, J. Gao, and W. Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.

[18] G. Izacard and E. Grave. Leveraging passage retrieval with generative models for open domain question answering, 2021.

[19] H. Jin, X. Han, J. Yang, Z. Jiang, Z. Liu, C.-Y. Chang, H. Chen, and X. Hu. LLM maybe LongLM: Self-extend LLM context window without tuning. arXiv preprint arXiv:2401.01325, 2024.
[20] G. Kim, P. Baldi, and S. McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023.

[21] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.

[22] J. Lanchantin, S. Toshniwal, J. Weston, A. Szlam, and S. Sukhbaatar. Learning to reason and memorize with self-notes. arXiv preprint arXiv:2305.00833, 2023.

[23] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

[24] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop: Text Summarization Branches Out 2004, page 10, 2004.

[25] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.

[26] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

[27] R. Y. Pang, A. Parrish, N. Joshi, N. Nangia, J. Phang, A. Chen, V. Padmakumar, J. Ma, J. Thompson, H. He, et al. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5336–5358, 2022.

[28] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

[29] B. Peng, C. Li, P. He, M. Galley, and J. Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023.

[30] O. Press, N. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022.

[31] V. Reyna and C. Brainerd. Fuzzy-trace theory: Some foundational issues. Learning and Individual Differences, 7(2):145–162, 1995.

[32] V. F. Reyna. A theory of medical decision making and health: fuzzy trace theory. Medical Decision Making, 28(6):850–865, 2008.

[33] V. F. Reyna. A new intuitionism: Meaning, memory, and development in fuzzy-trace theory. Judgment and Decision Making, 7(3):332–359, 2012.

[34] V. F. Reyna and C. J. Brainerd. Fuzzy-trace theory: An interim synthesis. Learning and Individual Differences, 7(1):1–75, 1995.

[35] S. Robertson, H. Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.

[36] U. Shaham, E. Segal, M. Ivgi, A. Efrat, O. Yoran, A. Haviv, A. Gupta, W. Xiong, M. Geva, J. Berant, et al. SCROLLS: Standardized comparison over long language sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 12007–12021, 2022.

[37] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, and D. Zhou. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR, 2023.

[38] T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang. World of Bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, 2017.

[39] S. Sun, Y. Liu, S. Wang, C. Zhu, and M. Iyyer. PEARL: Prompting large language models to plan and execute actions over long documents. arXiv preprint arXiv:2305.14564, 2023.

[40] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6), December 2022. ISSN 0360-0300. doi: 10.1145/3530811. URL https://doi.org/10.1145/3530811.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[42] J. Weston and S. Sukhbaatar. System 2 Attention (is something you might need too). arXiv preprint arXiv:2311.11829, 2023.

[43] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, and P. Christiano. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862, 2021.

[50] M. Zhong et al. QMSum: A new benchmark for query-based multi-domain meeting summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5905–5921, 2021.

[51] W. Zhong, L. Guo, Q. Gao, and Y. Wang. MemoryBank: Enhancing large language models with long-term memory. arXiv preprint arXiv:2305.10250, 2023.
Table 5 | ReadAgent accuracy on QuALITY with episode pagination based on LLM (PaLM 2-L) vs. uniform
length pagination.
The compression trade-off Table 6 presents the empirical results of compression rate increasing as page
size increases. As the compression rate decreases, the gists are more useful for answering questions directly.
However, for ReadAgent with look-ups, when the compression rate gets too low or too high, accuracy suffers.
Table 6 | Compression rate increases as the maximum number of words allowed per page increases on QuALITY.
Our default setting of min/max words is 280/600. In the other three experiments, we scale min words
proportionally with max words.
Table 7 | QuALITY results on the dev set of 230 docs and 2086 questions using GPT-3.5-turbo. CR is the
compression rate. # LU is the number of lookups. We report 1 run for each experiment for cost considerations.
C. Case Study
In this section, we analyze reading comprehension examples to demonstrate where the ability to simultaneously
think over long-range global context and focus on local information is important. We selected the short story
“off course” by Mack Reynolds4 because it is extremely short (2,712 words) and it is only broken into 8 pages,
yet even so, neural retrieval using 4 pages gets three questions wrong that ReadAgent correctly answers. For
this story, ReadAgent answers 6 of 8 questions correctly. Neural retrieval answers 3 of 8 correctly, and doesn’t
get either question correct that ReadAgent misses. Note that in all three examples, ReadAgent only chooses to
select two pages, even though it is also permitted to select up to 4. This flexibility is another advantage that
ReadAgent has over standard retrieval systems.
⟨P0⟩ Patrolmen Dermott and Casey encounter Dameri Tass, an alien who has landed on Earth. Dameri attempts to
communicate with them using a device that translates his thoughts into English.
⟨P1⟩ The alien Dameri Tass used a helmet to learn English from Tim Casey, an Irish patrolman. He then became
fascinated by a horse and wanted to use the helmet on the animal. Patrolman Dermott felt like he was in a shaggy
dog story.
⟨P2⟩ A helicopter arrived, interrupting the horse’s inspection. Two Army officers exited and ordered a police cordon
around the spacecraft. The alien spoke, surprising the general. More police and military personnel arrived.
⟨P3⟩ Dameri Tass, an alien visitor, was whisked away to Washington and held incommunicado for several days. His
arrival caused a global furor. Officials worried about the potential impact of his message on society. Eventually, the
UN demanded that he be allowed to speak before the Assembly. The White House agreed and a date was set.
⟨P4⟩ The world eagerly awaited a message from space. Dameri Tass, an envoy from a super-civilization, was
expected to guide the world. Most people were ready to be guided, but some were not. The U.N. Secretary-General
was nervous about introducing the envoy, as they knew very little about him. He had been asleep for most of
his time on Earth and had only recently woken up. He spent his time playing with a dog, cat, and mouse. The
Secretary-General was worried about what the envoy would say.
⟨P5⟩ Dameri Tass, an alien, is brought to Earth and mistaken for an envoy from another planet. He reveals he is
just a collector for a zoo.
⟨P6⟩ Dameri Tass, an alien, mistakenly landed on Earth. He addressed a large crowd, criticizing their weapons,
wars, and lack of a planet-wide government. He then left, refusing to take any Earth creatures with him, but
expressing interest in horses.
⟨P7⟩ The others watched as the first visitor from space hurriedly left Earth.
4 Available at http://aleph.gutenberg.org/3/0/0/3/30035//30035-h//30035-h.htm.
For the question above, ReadAgent looked up pages 5 and 6. Neural retrieval looked up pages 3, 4, 5, and 6.
Pages 4 and 5 both make prominent mention of animals, and Page 5 explicitly mentions that the alien is a
collector for a zoo, so answer (C) seems reasonable based on the information on those pages. However, Pages
5 and 6, together with the global context from the gist memory, make it clear that (D) is the correct answer.
Since neural retrieval provided both of those pages, the lack of the global context combined with the additional
distractor pages led the LLM astray.
Incorrect retrieval The same story provides two examples of the consequences of incorrect retrieval, and
the benefits of the gist memory. For the question above, ReadAgent looked up pages 3 and 4. Neural retrieval
looked up pages 0, 1, 3, and 6. The correct answer is clearly stated on Page 4, and also clearly stated in the
gist of Page 4. If the LLM had access to either of those, it should have been able to answer correctly. Instead, it
was undoubtedly confused by Pages 0 and 1, where the alien learns an accent from one of the police officers in
the initial encounter.
For the question above, ReadAgent looked up pages 0 and 1. Neural retrieval looked up pages 0, 3, 4, and
6. The critical information was in Page 1, although Page 0 was also relevant. The remaining pages were only
relevant in that they demonstrated that (B) was incorrect. Again, the gist memory was sufficient to answer
the question correctly, in addition to providing clear signal about what pages are relevant to the question. But
neural retrieval’s selection of Page 0 without Page 1 made (C) seem plausible, as Page 0 discusses a device that
the alien was clearly trying to use for communication.
D.1. Implementation
Pagination For HTML, we leverage the explicit HTML DOM tree structure, decomposing the HTML into
snippets with elements at a target depth and their descendants. We test the depth from 5 to 7 and choose the
best. We use these snippets as the “pages” instead of asking the LLM to paginate.
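A minimal sketch of this decomposition (using BeautifulSoup as an assumed parser; the paper does not name a specific tool), collecting each element at the target depth together with all of its descendants as one snippet:

```python
from bs4 import BeautifulSoup  # assumed HTML parser; not specified by the paper
from bs4.element import Tag
from typing import List

def html_snippets(html: str, target_depth: int = 5) -> List[str]:
    """Split an HTML document into snippets rooted at elements of a target DOM depth."""
    soup = BeautifulSoup(html, "html.parser")
    snippets: List[str] = []

    def collect(node: Tag, depth: int) -> None:
        if depth == target_depth:
            snippets.append(str(node))  # keep the element and all of its descendants together
            return
        for child in node.children:
            if isinstance(child, Tag):
                collect(child, depth + 1)

    collect(soup, 0)
    return snippets
```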
Memory Gisting Similar to ReadAgent for reading comprehension, we prompt the LLM to summarize snippets
into gists zero-shot, and subsequently concatenate the gists. We contextualize the gists with snippet index
number in a python dictionary-format (e.g. {“index”: ..., “content”: ...}).
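For example (an illustrative sketch of the formatting described above, with a hypothetical llm summarization call), the contextualized gists might be assembled as follows:

```python
import json
from typing import Callable, List

def contextualize_gists(snippets: List[str], llm: Callable[[str], str]) -> str:
    """Gist each HTML snippet and tag it with its snippet index in a dictionary-like format."""
    gists = []
    for i, snippet in enumerate(snippets):
        # Hypothetical zero-shot summarization prompt; the paper's exact prompt is not shown.
        summary = llm("Summarize the following HTML snippet briefly:\n" + snippet)
        gists.append({"index": i, "content": summary})
    return "\n".join(json.dumps(g) for g in gists)
```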
Interactive Look-up In the interactive look-up step, the LLM looks at a given task instruction, previous
action history, and the gists to decide which original HTML snippets it wants to look up. We experimented
with parallel look-up (ReadAgent-P) in the web navigation setting for faster experiments. Finally, to predict
next-step actions, the LLM reads the retrieved snippets again and predicts the target element id to interact
with, the type of action operation (click, type or select), and the input value (if any).
D.2. Mind2Web

We evaluate ReadAgent for web navigation on the Mind2Web [10] dataset, a real-world planning and web action prediction benchmark consisting of 2K instructions and episodes collected from 137 websites. The agent’s task is to predict the next-step action (click, type or select) given the HTML, the task instruction, and the previous action history. Mind2Web has three test set splits: cross-task (252 tasks from 69 websites), cross-website (177 tasks from 10 websites), and cross-domain (912 tasks from 73 websites), which were originally designed to test different types of generalization. However, since our approach is zero-shot without training, these splits do not serve their original purposes.
Baselines   MindAct from the Mind2Web paper [10] first uses a DeBERTa-base [17] model trained for task-relevant element retrieval to get the top 50 relevant elements. Instead of directly predicting the target element id (part of an action), it formulates this task as iterative multi-choice question-answering with target element ids sampled from the top 50 and uses the LLM to solve it, for performance reasons (see Deng et al. [10] for details). The same LLM also predicts the type of action and an optional value. MindAct (GPT-4) results are the state of the art. We additionally generate MindAct results with PaLM 2-L as a reference.

Following the reading comprehension experiments (Section 4), we also compare with using the full raw HTML, retrieval with BM25, neural retrieval with the Gemini API embedding model (models/embedding-001), and using the gists without look-up, which, like ReadAgent, are not trained for web navigation tasks. We ask the LLM to directly predict the target element id, as this is a simpler and more tractable implementation in our setting.
Method | Cross-Task (CR, Ele. Acc, Op. F1, Step SR, SR) | Cross-Website (CR, Ele. Acc, Op. F1, Step SR, SR) | Cross-Domain (CR, Ele. Acc, Op. F1, Step SR, SR)

No training (PaLM 2-L)
+Raw HTML 0.0 22.1 76.7 19.2 1.2 | 0.0 22.2 72.3 18.2 1.7 | 0.0 23.6 75.6 20.9 1.0
+BM25 Retrieval (Top-1) 43.7 16.3 61.7 14.2 0.4 | 49.7 17.8 60.8 15.2 0.0 | 51.6 17.3 60.4 15.9 0.0
+BM25 Retrieval (Top-5) 19.5 25.9 70.4 22.4 2.0 | 17.6 29.5 71.8 23.1 1.7 | 19.2 27.6 71.1 24.4 1.0
+Neural Retrieval (Top-1) 74.4 14.6 55.5 11.7 0.4 | 87.9 18.0 55.8 14.0 0.0 | 82.8 16.4 60.3 14.2 0.1
+Neural Retrieval (Top-5) 32.4 26.4 71.9 22.6 0.8 | 37.2 26.7 69.1 22.3 2.8 | 38.1 30.0 72.5 26.9 1.2
GistMem 84.4 11.7 43.1 9.5 0.0 | 82.5 11.7 43.6 8.4 0.0 | 83.0 13.4 49.6 11.7 0.5
ReadAgent-P: Lookup 1 snippet 55.1 31.1 70.1 26.8 2.0 | 54.1 34.5 74.1 28.2 2.3 | 55.2 36.1 75.6 33.0 2.0
ReadAgent-P: Lookup 1-5 snippets 35.9 33.7 72.5 29.2 2.8 | 35.6 37.4 75.1 31.1 3.4 | 48.2 37.2 76.3 33.4 2.3
Δ (Raw − ReadAgent) – +11.6 -4.2 +10.0 +1.6 | – +15.2 +2.8 +12.9 +1.7 | – +13.6 +0.7 +12.5 +1.3
Δ (MindAct − ReadAgent) – +3.9 +10.6 +4.8 +1.6 | – +8.6 +15.5 +9.5 +2.8 | – +7.3 +15.9 +8.9 +1.0

Table 9 | Web navigation performance on Mind2Web [10]. ∗ marks models that are trained supervisedly for the web domain. GistMem and ReadAgent results are all also based on PaLM 2-L. We evaluate the performance in element accuracy (Ele. Acc), operation F1 (Op. F1), step success rate (Step SR), and episode success rate (SR). We also measure the compression rate (CR). The best performance across all the baselines is bolded, and the best across the approaches using PaLM 2-L is underlined. ReadAgent achieves consistently better performance than using raw HTML inputs (PaLM 2-L), retrieval methods, and MindAct (PaLM 2-L) with a trained Rank LM for HTML snippet retrieval.
D.3. Results
As shown in Table 9, ReadAgent achieves strong performance compared to the baselines. In particular, the results are even better than MindAct (PaLM 2-L), which uses the supervised Rank LM, despite ReadAgent not using models trained on the web navigation domain. Prior work shows that state-of-the-art LLMs alone are generally still weaker than the approaches using models specifically trained for the web navigation domain [12].

Figure 3 shows that gisting effectively reduces the number of input tokens. Most of the input gists require less than 8K tokens. For example, 97.4% of gisted inputs in the cross-website split fit into the 8K context length, while only 51.5% of raw HTML inputs can fit in the context window. Inputs are truncated where they exceed the context length limit, which can significantly impact performance.
The results in Figure 3 and Table 9 indicate that even using the gist memory and ReadAgent retrieval causes
truncation on many web pages. This is because the retrieved snippets are quite large, causing the compression
rate to drop substantially. In spite of those issues, the ReadAgent results give real gains over using the full
context. This indicates that even the truncated gists and retrieved pages are more informative than the truncated
raw HTML when using an LLM with a small context length.
Figure 3 | (Left) Histogram of raw HTML and gist tokens in the Mind2Web cross-website split. Most of the
input gists require fewer than 8K tokens. (Right) Statistics of token counts of raw HTML and gists.
E. ReadAgent Variants
E.1. Unconditional and Conditional ReadAgent
When working with a long text, it is possible that the user will know ahead of time what task is to be solved. In
that case, conceivably the gisting step could include the task description in the prompt. In so doing, it is easy to
imagine that the LLM could do a better job of compressing out information that is irrelevant to the task, thereby
improving efficiency and reducing distraction. This approach would be Conditional ReadAgent. However, more
generally, the task may not be known while preparing the gists, or it may be known that the gists need to be
used for multiple different tasks, such as answering many questions about the text. Thus, by excluding the
task in the gisting step, the LLM may produce more broadly useful gists, at the cost of reduced compression
and increased distracting information. This setting would be Unconditional ReadAgent. We only explore the
unconditional setting in this work, but we note that the conditional setting may be preferred in some situations.
16
A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
text segments, and the higher levels are summaries of summaries. Given a task, it traverses the tree from the
root to search for task-related information. We think there are a few reasons to prefer the ReadAgent approach
over MemWalker.
First, the reliability is a concern. Having LLMs traverse summary tree may not be a reliable process. In our
best-effort re-implementation of MemWalker with PaLM 2-L, it unsatisfyingly achieves 66.73% on QuALITY.
To put that into perspective, using full raw content is 85.83%, ReadAgent-P (look up 1-5 pages) is 86.63%,
ReadAgent-S (look up 1-6 pages) is 86.88%, and using BM25 Top-1 is 70.55%. Part of the performance
difference is caused by a high search failure rate. 11.7% of the searches failed to finish after sufficient retries.
This failure rate of our implementation is in a similar range to what the authors reported: 91.4% successes and
8.6% failures5 . In contrast, the failure rate of ReadAgent is mostly 0%.
Second, the hierarchical summary structure makes it difficult to reason over related but distant information at
the same granularity. There isn’t much detail preserved at the top levels of the hierarchy. For example, if the
two most important text pieces are at the beginning and the end of a very long text, the essential information
could be in the first and last leaf. As the agent traverses down to the first leaf, it could be difficult to go back up
to the root and down to the last leaf.
The motivations of the two approaches are also different. MemWalker interacts with a summary tree and
reasons over traversal trajectories, whereas ReadAgent interacts directly with documents and reasons over gist
memories.
G. Pagination Hyperparameters
Pagination Details As described in Section 3.1, max_words and min_words are two episode pagination
hyperparameters. Table 10 gives their values for each of the experiments in Section 4.
Dataset | max_words | min_words
QuALITY | 600 | 280
QMSum | 600 | 280
NarrativeQA Gutenberg | 3000 | 500
NarrativeQA movie scripts | 1000 | 600

Table 10 | Episode pagination hyperparameters (max_words and min_words) used for each dataset.
Figure 4 | Histogram of NarrativeQA (Gutenberg) test set word counts for the original text and the gists.
Figure 5 | Histogram of NarrativeQA (movie) test set word counts for the original text and the gists.

5 https://openreview.net/forum?id=H5XZLeXWPS

Context Length Control   As the NarrativeQA Gutenberg texts can be very long, the corresponding gists can sometimes exceed the context length. For those exceptionally long texts, we ask the LLM to go through the pages and think whether it makes sense to merge pages iteratively with the following prompt, and then re-gist the new set of pages. In so doing, we are able to increase the average page size and thus the compression rate (Figure 6).
Given Page 1 and Page 2, please tell me whether Page 2 starts a new chapter/section/book that is different from
what’s in Page 1.
Please answer with yes, no, or not sure.
Page 1:
{PREVIOUS PAGE TEXT}
Page 2:
{CURRENT PAGE TEXT}
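A minimal sketch of that merging pass (assuming a hypothetical starts_new_section helper that wraps the yes/no prompt above; the word cap below is an assumed safeguard, not a value from the paper):

```python
from typing import Callable, List

def merge_pages(pages: List[str],
                starts_new_section: Callable[[str, str], bool],
                max_merged_words: int = 3000) -> List[str]:
    """Iteratively merge adjacent pages unless the next page starts a new chapter/section/book."""
    merged: List[str] = []
    for page in pages:
        if (merged
                and not starts_new_section(merged[-1], page)
                and len((merged[-1] + " " + page).split()) <= max_merged_words):
            merged[-1] = merged[-1] + "\n" + page  # merge into the previous page
        else:
            merged.append(page)
    return merged
```

The merged pages would then be re-gisted as described above.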
Figure 6 | Histogram of NarrativeQA (Gutenberg) test set gists before and after page merging on the exceptionally
long texts.
The gists and pages can both be long for NarrativeQA. Thus, in the interactive look-up step of ReadAgent-P,
we prevent the retrieved pages from exceeding the context length by asking the model to sort the pages by
importance with the prompt below and iteratively detecting whether adding any pages could go beyond the
context window. For ReadAgent-S, we do a similar check to decide whether to early-stop the sequential look-up.
The following text is what you remember from reading an article and a question related to it.
You may read 1, 2 or 3 page(s) of the article again to refresh your memory to prepare yourself for the question.
Please respond with which page(s) you would like to read in the order of importance, beginning with the most
important page number.
For example, if you only need to read Page 8, respond with “I want to look up Page [8] to ...”.
If you would like to read Page 12 and 7, respond with “I want to look up Page [12, 7] to ...”.
If you would like to read Page 15, 2 and 3, respond with “I want to look up Page [15, 2, 3] to ...”.
DO NOT select more pages if you don’t need to.
You don’t need to answer the question yet.
Text:
{GIST MEMORY}
Question:
{QUESTION}
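A minimal sketch of the budget check described above the prompt (an illustration under assumptions; a word budget stands in for the token-based context limit, and this is not the paper's code): pages are added in the model's stated order of importance, and the loop stops before any page that would push the expanded memory past the budget.

```python
from typing import List

def pack_pages_by_importance(gists: List[str], pages: List[str],
                             ranked: List[int], word_budget: int) -> List[int]:
    """Keep pages, in importance order, only while the expanded memory stays within the budget."""
    def memory_words(selected: List[int]) -> int:
        chosen = set(selected)
        return sum(len((pages[i] if i in chosen else gists[i]).split())
                   for i in range(len(gists)))

    kept: List[int] = []
    for page_id in ranked:
        if memory_words(kept + [page_id]) > word_budget:
            break
        kept.append(page_id)
    return kept
```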
Figure 7 | Histogram of QMSum word counts for the original transcripts and the gisted transcripts. The gisted
transcripts are all less than 5,000 words, allowing them to entirely fit into the context windows of PaLM 2-L.
Table 11 shows the same results as Table 3, but on the QMSum test set.
Method CR (# LU) LLM Rating-1 LLM Rating-2 ROUGE-1 ROUGE-2 ROUGE-L Resp. Length
BM25 Retrieval
Top-1 95.61% (1.00) 24.67% ± 0.44 66.90% ± 0.87 28.81 ± 0.13 8.14 ± 0.15 19.62 ± 0.18 48.15 ± 0.18
Top-2 91.32% (2.00) 31.79% ± 1.31 79.95% ± 0.67 30.89 ± 0.13 9.14 ± 0.05 20.67 ± 0.09 53.91 ± 0.64
Top-3 87.25% (3.00) 33.45% ± 0.00 83.63% ± 1.05 31.39 ± 0.23 9.11 ± 0.05 21.03 ± 0.03 55.15 ± 0.61
Top-4 82.86% (4.00) 37.72% ± 1.05 86.12% ± 0.50 31.71 ± 0.09 9.35 ± 0.13 21.26 ± 0.12 58.21 ± 0.37
Top-5 78.79% (5.00) 39.38% ± 1.02 86.60% ± 0.44 32.66 ± 0.04 9.98 ± 0.10 21.86 ± 0.05 59.20 ± 1.05
Top-6 74.62% (6.00) 40.45% ± 0.89 90.98% ± 0.34 32.56 ± 0.03 9.78 ± 0.03 21.64 ± 0.09 60.40 ± 1.28
Neural Retrieval with Gemini API
Top-1 95.80% (1.00) 27.05% ± 0.50 67.97% ± 1.74 28.71 ± 0.12 7.98 ± 0.04 19.59 ± 0.04 49.76 ± 0.78
Top-2 91.62% (2.00) 35.35% ± 0.44 80.07% ± 0.00 31.65 ± 0.18 9.59 ± 0.11 21.29 ± 0.11 56.19 ± 0.76
Top-3 87.39% (3.00) 35.71% ± 1.37 88.49% ± 0.34 32.33 ± 0.17 9.84 ± 0.07 21.54 ± 0.13 59.19 ± 0.96
Top-4 83.28% (4.00) 39.62% ± 0.17 90.15% ± 0.34 32.31 ± 0.21 9.69 ± 0.15 21.65 ± 0.15 59.86 ± 0.11
Top-5 79.33% (5.00) 44.01% ± 0.84 91.22% ± 0.34 32.33 ± 0.24 9.84 ± 0.21 21.67 ± 0.19 61.53 ± 0.35
Top-6 75.35% (6.00) 44.60% ± 0.89 92.65% ± 0.17 32.55 ± 0.08 9.75 ± 0.21 21.39 ± 0.13 61.29 ± 0.46
Truncated Raw Content
First 6k words 31.51% (0.00) 13.17% ± 1.05 47.81% ± 5.90 24.15 ± 1.42 4.89 ± 0.57 16.27 ± 0.96 61.43 ± 3.53
Last 6k words 33.80% (0.00) 13.76% ± 0.84 43.42% ± 0.00 22.90 ± 0.10 4.35 ± 0.04 15.69 ± 0.03 52.47 ± 0.39
GistMem 82.81% (0.00) 44.96% ± 0.44 91.93% ± 0.73 31.20 ± 0.17 9.02 ± 0.09 20.60 ± 0.14 65.84 ± 0.87
ReadAgent-P
Look up 1 pg 79.37% (0.98) 44.84% ± 0.00 92.29% ± 0.34 31.46 ± 0.12 9.09 ± 0.11 20.63 ± 0.05 66.74 ± 0.74
Look up 1-2 pgs 77.00% (1.72) 43.42% ± 1.01 92.88% ± 1.05 31.77 ± 0.16 9.11 ± 0.12 20.70 ± 0.08 65.55 ± 0.28
Look up 1-3 pgs 74.85% (2.46) 44.37% ± 1.21 91.22% ± 0.44 31.89 ± 0.06 8.98 ± 0.13 20.70 ± 0.09 66.06 ± 1.63
Look up 1-4 pgs 73.26% (3.02) 44.13% ± 0.50 90.51% ± 0.44 31.87 ± 0.07 9.12 ± 0.06 20.77 ± 0.01 66.44 ± 0.74
Look up 1-5 pgs 72.01% (3.44) 43.42% ± 1.45 91.22% ± 0.60 31.80 ± 0.16 9.03 ± 0.07 20.64 ± 0.03 66.48 ± 0.39
Look up 1-6 pgs 70.65% (3.89) 42.70% ± 1.54 90.51% ± 0.73 31.74 ± 0.09 8.90 ± 0.09 20.66 ± 0.16 66.24 ± 1.14
ReadAgent-S 1-6 pgs 70.75% (3.42) 49.58% ± 0.44 93.83% ± 0.34 32.88 ± 0.15 9.98 ± 0.06 21.50 ± 0.04 67.86 ± 0.11
Table 11 | QMSum test results (PaLM 2-L) means and standard deviations across 3 runs. 35 articles and 281
questions. Bold methods are this work. Bold values are the best; bold italics are ties for best. CR is the
compression rate. # LU is the number of lookups. Resp. Length is the length in words of the model’s final
response. We omit standard deviations for CR and # LU for presentation purposes; they were all inconsequential.
J. Author Contributions
Kuang-Huei Lee developed the initial working prototype, the method and the experiments on QuALITY and
NarrativeQA, was a main writer of the manuscript, and led the project overall.
Xinyun Chen developed the method, the LLM rater, and experiments on NarrativeQA, and significantly
contributed to manuscript writing.
Hiroki Furuta developed the web navigation experiments, and significantly contributed to manuscript writing.
John Canny contributed in the initial conceptualization, advised the project, and helped with manuscript
editing.
Ian Fischer co-proposed the core idea, developed the method and experiments on QMSum, and was a main
writer of the manuscript.