Undermind Whitepaper
January 5, 2024
Abstract
We outline an end-to-end solution for searching academic literature. This system, Undermind,
uses language models as a reasoning engine and classifier at key steps within a structured search
process. We benchmark Undermind’s performance compared to Google Scholar, showing drastic
improvements including a 10× higher concentration of truly relevant results within the top hits.
Undermind misses virtually no highly relevant works found by Google Scholar, and in addition
returns 10× the total number of relevant results for the median user-generated query.
[Figure 1: for a typical query, only 1 in 10 top search hits is highly relevant, compared with ∼8 highly relevant and ∼17 closely related papers identified per Undermind search.]
Undermind achieves these goals nearly perfectly. It currently accesses the scientific literature
database ArXiv,1 searching within the full texts of 2.3 million papers. It uses a language model
(GPT-4) as a reasoning engine at key steps in a structured exploration process. Its search algorithm
mimics that of a human, adapting and following citation trails to uncover important papers and
reflecting on progress so far to decide next steps. Ultimately, Undermind delivers a precise set of final
results exactly relevant to the user’s complex search topic, explaining each result in detail. The quality
of this report far exceeds that of existing search engines (see Fig. 1 and Fig. 2).
1. Basic search: We identify promising candidate papers using a custom algorithm that combines
semantic vector embeddings, citations, and language model reasoning.
2. Relevance classification: Given your search query, a high quality language model (GPT-4)
accurately classifies each candidate paper based on its full text into 3 categories: highly relevant,
closely related (meaning relevant, but slightly off-topic), or ignorable; a minimal sketch of this
step follows the list below. See Appendix 3.2 for classification accuracy statistics.2
3. Adaptation and exploration: The algorithm adapts and searches again based on the relevant
content it has discovered. This adaptation, which mimics a human’s discovery process, makes it
possible to uncover every relevant result.
4. Estimating comprehensiveness: Undermind tracks how frequently it discovers relevant pa-
pers during each search. Undermind initially finds many relevant results, but over time dimin-
ishing returns set in, empirically leading to “discovery curves” which are exponential in form (see
Fig. 1(b)). Modeling this process allows us to determine when Undermind has found nearly all
the relevant works.
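To make the relevance-classification step (step 2 above) concrete, here is a minimal sketch of how a language model could be asked to assign one of the three categories to a candidate paper. The prompt wording, model settings, truncation length, and the classify_paper helper are illustrative assumptions, not Undermind’s actual implementation; the whitepaper specifies only that GPT-4 reads each candidate’s full text and classifies it as highly relevant, closely related, or ignorable.

# Minimal sketch of the relevance-classification step; prompt wording, model
# settings, and helper names are assumptions, not Undermind's implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = ["highly relevant", "closely related", "not relevant"]

def classify_paper(query: str, context: str, full_text: str) -> str:
    """Ask GPT-4 to place one candidate paper into one of three relevance categories."""
    prompt = (
        f"A researcher is searching for: {query}\n"
        f"Additional context from the researcher: {context}\n\n"
        f"Full text of a candidate paper:\n{full_text[:20000]}\n\n"  # truncate to fit the context window
        "Classify this paper with respect to the researcher's request as exactly one of: "
        "'highly relevant', 'closely related' (relevant but slightly off-topic), or 'not relevant'. "
        "Reply with the category only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    # Fall back to the most conservative label if the reply is not an exact category.
    return next((c for c in CATEGORIES if c in answer), "not relevant")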
1. 10× more relevant results on Undermind vs. the first 5 pages of Google Scholar. In
many cases Google Scholar finds 0 results, while Undermind finds 10-20. Even for searches where
Google Scholar returns a few relevant papers, Undermind still returns significantly more. The
full distribution of relative performance is shown in Fig. 2.
1 https://www.arxiv.org/
2 With accuracy ∼98%, Undermind never classifies a highly relevant paper as irrelevant, or an irrelevant paper as
highly relevant.
3 Undermind’s classifier was used to identify relevant ArXiv papers in Google Scholar’s top 50 results, which typically contain ∼ 30 ArXiv papers (see Appendix 3.3).
Figure 2: Undermind finds far more relevant papers than Google Scholar. Data for 300
user-generated queries. Blue line: The number of relevant results found by converged Undermind
searches (or, if not converged, the estimated total findable by Undermind with modest extension).
Red line: The number of relevant results found in Google Scholar’s top 5 pages for the same queries.
The queries are ordered by the number of relevant papers found by Google Scholar, and then further
by the number Undermind found. For many queries, Undermind finds 10s of papers while Google
Scholar finds nothing (percentile ∼ 0.15). For searches with many Google Scholar results, Undermind
still finds 3-5× more results (percentile ∼0.9).
search goals (see Appendix 3.1 for examples of very complex searches submitted by scientists).
In contrast, for many user requests, Google Scholar completely fails to return relevant results.
This is likely because it is impossible to translate many complex, real world needs and requests
into efficient keyword searches.
2. Knowing how much prior work has been done on a topic. Because of the predictable
exponential form of Undermind’s discovery process, we can estimate how many relevant works
exist on a given topic after initially exploring the database. This gives the user an immediate
snapshot of how novel their search topic is, a capability strictly absent from conventional keyword
search.
3. Confirming nothing exists on a topic. Because Undermind’s searches are nearly exhaustive, if
Undermind provides no relevant results, one can be reasonably certain nothing exists on the
topic. In contrast, if one uses Google Scholar and finds no results, it’s impossible to know
whether nothing exists, or whether keyword searching with Google Scholar has simply failed (see
Fig. 2, left side).
3 Appendix
3.1 Distribution of real user searches on Undermind
User-submitted requests to Undermind vary in complexity and difficulty. However, for each search, the
discovery rate of relevant papers follows an exponential form, and saturates after Undermind has found
most relevant results, as shown in Fig. 1(b). The variation in the complexity of searches submitted
by users causes the time constant as well as the total number of relevant papers found to vary widely
between searches.6
To convey this variation, in Fig. 3 we show the predicted number of relevant papers and convergence
rate for the user searches submitted to Undermind and analyzed in this report. A median search
has 24 relevant papers and converges with a time constant of 80 papers evaluated, meaning that a
typical Undermind report evaluating 150 papers would immediately find ∼ 85% of all relevant results.
Extending this search to read 150 additional papers (300 total) would find ∼ 98% of all relevant results.
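As a quick check of these numbers, the discovered fraction under the exponential model can be computed directly. A minimal Python sketch (the function name is ours):

import math

def discovered_fraction(n_evaluated: int, tau: float) -> float:
    """Fraction of all relevant papers found after evaluating n papers,
    under the exponential discovery model f = 1 - exp(-n / tau)."""
    return 1.0 - math.exp(-n_evaluated / tau)

tau = 80  # median time constant (papers evaluated), from Fig. 3(b)
print(discovered_fraction(150, tau))  # ~0.85 for a typical 150-paper report
print(discovered_fraction(300, tau))  # ~0.98 after extending to 300 papers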
To clarify the range of complexities for user queries, we provide a few examples (modified slightly
for privacy):
Topic: Experiments that use tapered optical fibers to couple light into a microfabri-
cated waveguide in the visible spectrum
Additional context: Tapered optical fibers take the mode from the fiber core to largely
being evanescent and can be used to couple into other waveguides with high efficiency. I
am curious about how these tapered fibers are mechanically attached when this method
is used. I care most about results which use light in the visible spectrum, so between
400 nm and 800 nm wavelength.
These complex searches involve many concepts: for the example above, relevant papers must contain
experimental (not theoretical) results, use tapered optical fibers, discuss optical coupling into a
microfabricated waveguide, and use visible-spectrum light. In addition, the user clarifies they are
most interested in learning about mechanical attachment methods. This is a very difficult, if not
impossible, goal to convey to a keyword search engine, though such goals can be achieved by
Undermind.
6 Someone asking for “any quantum experiment” would find thousands of papers, with a very long time constant for
exponential saturation, while someone asking for a very specific topic might find only 1 or 0 papers, with a short time
constant.
Figure 3: Statistics of Undermind user searches. Histograms of the exponential amplitude (a) and
time constants (b) for the best fits to the discovery curves (as in Fig. 1(b)) of ∼ 300 user searches. (a)
The amplitude Undermind predicts for each search is the total number of papers Undermind expects
to find if the search is extended to fully converge. (b) The time constant τ of the exponential describes
how quickly this exponential discovery process approaches convergence. The discovered fraction of
total papers f after evaluating n papers is modelled as f = 1 − e^(−n/τ). The majority (∼ 63%) of
relevant papers are discovered after τ papers are evaluated. Typical Undermind searches evaluate 150
papers.
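The amplitudes and time constants summarized in Fig. 3 come from fitting an exponential to each search’s cumulative discovery curve. Below is a minimal sketch of such a fit; the use of scipy.optimize.curve_fit and the example data are our own illustrative assumptions, not Undermind’s actual fitting code.

import numpy as np
from scipy.optimize import curve_fit

def discovery_curve(n, amplitude, tau):
    """Cumulative number of relevant papers found after evaluating n papers."""
    return amplitude * (1.0 - np.exp(-n / tau))

# Hypothetical discovery curve: papers evaluated vs. relevant papers found so far.
n_evaluated = np.array([10, 30, 60, 90, 120, 150])
n_relevant = np.array([3, 8, 13, 17, 19, 21])

(amplitude, tau), _ = curve_fit(discovery_curve, n_evaluated, n_relevant, p0=(20, 50))
print(f"predicted total relevant papers: {amplitude:.1f}, time constant: {tau:.1f} papers")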
3.3 Methodology for generating keyword phrases for Google Scholar comparison
When formulating their queries for Undermind, scientists were told to phrase their request to capture
the entirety of their search goals and conditions. As a result, many of their queries are verbose and
complex, and un-optimized for keyword search (see Appendix 3.1 for examples).
In order to translate these verbose queries into a format usable by Google Scholar for our comparison,
we needed to mirror the process a human takes to break down their complex search task into bite-sized
keyword searches.

Table 1: Undermind’s classification of papers compared with human judgment (number of papers).

                              Human judgment
  Undermind classification    Highly relevant   Closely related   Not relevant
  Highly relevant                   85                17                0
  Closely related                   25                72                8
  Not relevant                       2                 9              214

To automate this process, we prompted GPT-4 to create 5 keyword search
phrases from each Undermind query (prompt details below). We then gathered the top 10 papers
found by each of these keyword searches on Google Scholar (50 total papers) to compare to the papers
Undermind retrieves and analyzes.
Generating keyword search phrases. Here is an example of how GPT-4 was used to generate the
keyword search phrases for a user search:
[Figure 4 plot: “Reading efficiency on Google Scholar”, showing the fraction of papers in the top 10 that are relevant versus the percentile rank of the search.]
Figure 4: Benchmarking the efficiency of reading Google Scholar’s top results. This plot
shows the fraction of relevant papers found within the top 10 ArXiv papers returned by Google Scholar
for each of the ∼ 300 user-generated searches (ordered by percentile on the x-axis). For most searches,
the relevant results are sparse.
System message: You are a thoughtful, expert scientist, and you are knowledgeable about
carefully crafting a search phrase to find useful papers in a search engine.
User message: I am trying to help a colleague find papers about this topic: ‘{topic}’.
In addition, here is some background information they provided: ‘{additional context}’. I
want you to help me generate 5 unique keyword searches for Google Scholar that will find
papers on this topic. Do not use boolean operators. Make sure not to repeat searches without
changing the keywords. Make some searches broad and some narrow, some very short, and
some slightly longer.
Generated keyword search phrases for this example:
1. evolution of language model architectures
2. historical review transformer language models
3. large language models development milestones
4. language model architecture advancements 2023
5. comparative study large language model architectures
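As a sketch of how this prompting can be automated, the snippet below sends the system and user messages above to GPT-4 and parses the numbered list it returns. The API parameters and list-parsing details are illustrative assumptions; only the prompt text comes from the methodology described here.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_keyword_searches(topic: str, additional_context: str) -> list[str]:
    """Ask GPT-4 for 5 Google Scholar keyword search phrases for one user query."""
    system = ("You are a thoughtful, expert scientist, and you are knowledgeable about "
              "carefully crafting a search phrase to find useful papers in a search engine.")
    user = (f"I am trying to help a colleague find papers about this topic: '{topic}'. "
            f"In addition, here is some background information they provided: '{additional_context}'. "
            "I want you to help me generate 5 unique keyword searches for Google Scholar that will find "
            "papers on this topic. Do not use boolean operators. Make sure not to repeat searches without "
            "changing the keywords. Make some searches broad and some narrow, some very short, and "
            "some slightly longer.")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    lines = response.choices[0].message.content.splitlines()
    # Keep lines that look like numbered entries, e.g. "1. evolution of language model architectures".
    return [line.split(".", 1)[-1].strip() for line in lines if line.strip()[:1].isdigit()][:5]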
For each of these keyword search phrases, we gathered the top 10 results from Google Scholar (top
page). We then found the ArXiv papers in these results (typically ∼ 30 papers out of 50 gathered).
We ordered these ArXiv papers in a round-robin fashion (top paper from one search, then the top from
the next, and so on). We refer to the first 10 ArXiv papers discovered as the effective “first page” of
Google Scholar, and when we quote the “top 5 pages of Google Scholar”, we are referring to all ArXiv
papers found in the entire 50 results.7 We believe this set of ArXiv papers from the “top 5 pages” is
a reasonable approximation of the set of papers a human could parse with significant manual effort.
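A minimal sketch of this round-robin ordering is shown below; the function and variable names are our own, and the per-search result lists are assumed to already be filtered to ArXiv papers.

from itertools import zip_longest

def round_robin_order(results_per_search: list[list[str]], limit: int = 50) -> list[str]:
    """Interleave ranked result lists: the top paper of each search, then the second
    paper of each search, and so on, dropping duplicates after their first slot."""
    ordered, seen = [], set()
    for rank_slice in zip_longest(*results_per_search):
        for paper in rank_slice:
            if paper is not None and paper not in seen:
                seen.add(paper)
                ordered.append(paper)
    return ordered[:limit]

# The first 10 entries of the interleaved list form the effective "first page" of Google Scholar.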
3.4 Measuring sparsity of relevant works within Google Scholar’s top results
We evaluated the first 10 ArXiv papers found by Google Scholar using Undermind’s high quality
classification system to determine if each paper was relevant to the user’s original request. Fig. 4
shows the fraction of these top 10 results which were actually relevant to a user’s search, across the
full set of Undermind searches.
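The quantity plotted in Fig. 4 is simply the per-search fraction of relevant papers among the first 10 ArXiv results, sorted so that the x-axis becomes a percentile rank. A minimal sketch follows; treating both the highly relevant and closely related labels as relevant is our assumption here.

def top10_relevance_fractions(classified_top10: list[list[str]]) -> list[float]:
    """For each search, the fraction of its first 10 Google Scholar ArXiv results
    classified as relevant, sorted to give the percentile ordering of Fig. 4."""
    relevant = {"highly relevant", "closely related"}  # assumption: both labels count as relevant
    fractions = [sum(label in relevant for label in labels) / len(labels)
                 for labels in classified_top10]
    return sorted(fractions)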
When reading through the top few Google Scholar results, often more than 90% of results are
completely irrelevant (right side of graph). Note that relevant papers do exist for most of these
searches (see Fig. 2 for the predicted number of relevant papers for most searches). Google Scholar
simply finds very few of these relevant papers.
7 In 4 out of the ∼ 300 searches, we found fewer than 10 ArXiv IDs in the top 5 pages of Google Scholar (these searches
only had 7, 8, 9, and 9 ArXiv papers). For simplicity, we treat these searches as if we had found 10 ArXiv papers to
analyze and classified these few additional papers as irrelevant.
Method 2: ensembling search methods. Because evaluating every paper is prohibitively expensive,
a different approach is usually taken. Instead, one samples many complementary search methods
which are somewhat uncorrelated. Their combined results are assumed to exhaustively gather all
relevant papers. One can then compare the retrieved papers of any specific search method to the set of all
papers found by all the methods. The advantage of this approach is that one only needs to evaluate a
small fraction of all papers in the database to find all truly relevant results.
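This pooled comparison reduces to a simple set computation: take the union of relevant papers found by all methods, then measure what fraction of that union each individual method retrieved. A minimal sketch with hypothetical inputs:

def pooled_recall(relevant_by_method: dict[str, set[str]]) -> dict[str, float]:
    """Recall of each search method against the pooled set of relevant papers found
    by all methods combined (assumed to approximate the full set of relevant papers)."""
    pooled = set().union(*relevant_by_method.values())
    if not pooled:
        return {name: 1.0 for name in relevant_by_method}
    return {name: len(found) / len(pooled) for name, found in relevant_by_method.items()}

# Hypothetical example with ArXiv IDs:
print(pooled_recall({
    "undermind": {"2301.00001", "2301.00002", "2301.00003"},
    "google_scholar_top_50": {"2301.00002"},
}))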
• Within the 5.25 papers unencountered by Undermind, only 0.03 highly relevant papers are found
on average (Fig. 5(c), right side).
• Within the 4.75 papers that were already encountered by Undermind (Fig. 5(d), right side), 1.21
are highly relevant (Fig. 5(f), right side).
8 These are searches where Undermind predicts it will not discover more relevant papers with further reading, because
its discovery curve has already converged.
To estimate the exhaustiveness of Undermind using equation (1), we take the ratio:
3.6 Measuring the total number of relevant papers in the top 50 Google Scholar results
To save on compute costs, instead of running the relevance classifier over all the ArXiv papers found
in the top 50 results of every Google Scholar search, we can obtain a close estimate of the number
of relevant hits in the top 50 Google Scholar results using the data in Fig. 5.
For a converged Undermind search, we established in Appendix 3.5 that Google Scholar finds
virtually no relevant papers Undermind misses. Therefore, one can use the set of Undermind-discovered
relevant papers as the ground truth, and simply check how many of those same papers appear in the
top 50 results of Google Scholar.
For non-converged searches, we can still easily estimate the number of expected relevant papers in
Google Scholar’s top 50 results using the data in Fig. 5. To do so, we first estimate the fraction of
total papers a given search has found so far, which places the search at a given position on the x-axis
of Fig. 5. At that x position, we next estimate the ratio
(Total relevant papers in Google Scholar top 10) / (Relevant papers found by Undermind in Google Scholar top 10)     (3)
by comparing the best fit data in Fig. 5(b-c) to Fig. 5(e-f). Finally, we count the number of relevant
papers that the non-converged Undermind search has found in the top 50 results, and correct this
upwards to account for the undiscovered papers. This correction factor is in the range of 1× to 2.5×.
Where necessary, the data shown in Fig. 2 have this correction already applied.
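A minimal sketch of this correction is below. The function names and the example numbers are ours; the ratio itself is read off the Fig. 5 best-fit lines at the search’s convergence fraction.

import math

def convergence_fraction(n_evaluated: int, tau: float) -> float:
    """Estimated fraction of relevant papers found so far, f = 1 - exp(-n / tau)."""
    return 1.0 - math.exp(-n_evaluated / tau)

def corrected_relevant_in_gs_top_50(found_in_top_50: int, correction_ratio: float) -> float:
    """Scale the count of Undermind-discovered relevant papers seen in Google Scholar's
    top 50 upward to account for relevant papers Undermind has not yet discovered.
    The ratio comes from eq. (3) via the Fig. 5 fits and lies between 1 and 2.5."""
    return found_in_top_50 * correction_ratio

# Hypothetical non-converged search: f ~ 0.7, and the Fig. 5 fits give a ratio of ~1.4.
f = convergence_fraction(n_evaluated=100, tau=83)
print(f, corrected_relevant_in_gs_top_50(found_in_top_50=4, correction_ratio=1.4))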
9 Sources of uncertainty include: error on the best fit lines in Fig. 5, misclassification errors of Undermind in Table 1
and Table 2, and the uncertainty from a finite sample size of ∼ 300 highly relevant papers.
10 As an outline, misclassification of irrelevant papers as closely related occurs at a ∼ 4% rate in Table 1. Assuming
this 4% error also holds for Google Scholar sampled papers (not necessarily justified), this implies the ∼ 5 irrelevant
papers in the set of unencountered papers would produce ∼ 0.2 falsely identified closely related papers, a large fraction
of the observed 0.28 closely related papers in Fig. 5(b), right side.
Figure 5: Statistics of papers in the top 10 Google Scholar results. These plots show the
number of papers in the top 10 papers returned by Google Scholar which were not yet encountered
and classified by Undermind (a), and how many of those were closely related (b) or highly relevant
(c) after evaluating them with the language model classifier. These are shown as a function of the
convergence fraction f = 1 − e^(−n/τ) of each Undermind search, which is Undermind’s best estimate of
the fraction of relevant papers it has found so far (f is described further in Fig. 3). Red lines show
moving averages of 20 datapoints, and black lines are best fit lines to the entire dataset. (d-f) shows
the same corresponding data for the papers that were already encountered by Undermind in the top
10 Google Scholar results. (a-c) show that converged searches (far right of each graph) have on average
∼ 5 papers in the top 10 which Undermind has not yet encountered and evaluated. However, virtually
no new highly relevant papers are discovered when those papers are evaluated. See Appendix 3.5 for
further details and interpretation.