
Harnessing Retrieval-Augmented Generation (RAG) for Uncovering Knowledge Gaps


Joan Figuerola Hurtado
Independent Researcher
[email protected]

This is a preprint. It is not peer reviewed yet.

Abstract
We present a methodology for uncovering knowledge gaps on the internet using the Retrieval-Augmented Generation (RAG) model. By simulating user search behaviour, the RAG system identifies and addresses gaps in information retrieval systems. The study demonstrates the effectiveness of the RAG system in generating relevant suggestions with a consistent accuracy of 93%. The methodology can be applied in fields such as scientific discovery, educational enhancement, research development, market analysis, search engine optimization, and content development. The results highlight the value of identifying and understanding knowledge gaps to guide future endeavours.

1 Introduction
The growing number of users dissatisfied with the relevance of commercial search engine results is surprising, given the unprecedented access to vast information and sophisticated search technologies [1, 2].

In this paper, we employ the Retrieval-Augmented Generation (RAG) model to simulate user search behaviour, aiming to identify and address knowledge gaps on the Internet. We posit that uncovering and bridging these gaps is crucial for enhancing the efficacy of information retrieval systems.

2 Related Work
Yom-Tov et al. [14] present an algorithm for estimating query difficulty. The estimate is based on the agreement between the top results of the full query and the top results of its sub-queries; difficult queries thereby reveal gaps in a content library. Their approach requires training an estimator on a small dataset. We argue that there are now simpler LLM prompting techniques that do not require training a custom model and that generalise better across multiple domains.

A Large Language Model (LLM) [4] generates text-based responses, while RAG [3] is an AI framework used to enhance the quality of LLM-generated responses by grounding them in external sources of knowledge. Combined, these technologies provide accurate, up-to-date information and improve the generative process of language models.

3 Methodology
To identify knowledge gaps, we simulate user interactions with search engines in a structured process. We begin with a query and methodically review each search result until an answer is found. If the top 10 results do not yield an answer, we generate up to four alternative queries, retrieve up to two documents per query, and iterate through the search process again.

Figure 1: Iteration loop to find knowledge gaps
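
The loop in Figure 1 can be sketched roughly as follows. This is a minimal illustration rather than the implementation in our repository: `search`, `extract_answer`, and `generate_alternative_queries` are hypothetical helpers standing in for the web index and the LLM reasoning engine, the limits mirror the figures above (top 10 results, up to four alternative queries, up to two documents each), and only a single round of alternative queries is shown.

```python
# Minimal sketch of the search-simulation loop in Figure 1.
# `search`, `extract_answer`, and `generate_alternative_queries` are
# hypothetical stand-ins for the web index and the LLM reasoning engine.

def simulate_search(query, search, extract_answer, generate_alternative_queries,
                    max_results=10, max_alt_queries=4, docs_per_alt_query=2):
    """Return (answer, sources); answer is None when no answer can be found."""
    sources = []

    # Review each of the top results for the original query until one answers it.
    for doc in search(query, limit=max_results):
        sources.append(doc)
        answer = extract_answer(query, doc)
        if answer is not None:
            return answer, sources

    # No answer in the top results: widen the search with alternative queries,
    # retrieving a couple of documents for each of them.
    for alt_query in generate_alternative_queries(query, n=max_alt_queries):
        for doc in search(alt_query, limit=docs_per_alt_query):
            sources.append(doc)
            answer = extract_answer(alt_query, doc)
            if answer is not None:
                return answer, sources

    # Still nothing: treat the query as hitting a knowledge gap.
    return None, sources
```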



Our approach utilises AskPandi [12], a Retrieval-Augmented Generation (RAG) system, to mimic user behaviour. AskPandi integrates Bing's web index for data retrieval and GPT as a reasoning engine. After finding an answer, we capitalise on the in-context capabilities [5, 6, 7] of LLMs to generate a series of relevant follow-up questions. This process is guided by the premise that a well-generalised [8] LLM should provide useful recommendations based on the initial question and answer. The prompt we use is:

'Based on the answer '{}' and the question '{}', what are some potential short follow-up questions?'

This methodology diverges from traditional recommender systems [9], which filter through existing content. In contrast, our system focuses on generating the most relevant content regardless of whether it already exists, marking a shift from extractive to generative approaches. The process is then iterated, with each cycle going deeper into the query's topic and thus increasing the difficulty of finding relevant information. We consider a knowledge gap to have emerged when the LLM can no longer generate an answer.

To terminate the process, we incorporate a mechanism that identifies stop words in answers. We explored two methods: letting the model naturally produce a stop word, or directing the model to generate one in cases of uncertainty [10].

This process not only helps identify knowledge gaps but also deepens our understanding of the potential of generative AI to facilitate more relevant information retrieval systems.
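
As an illustration, the follow-up step and the termination check might be wired together as in the sketch below. The prompt string is the one quoted above; `ask_llm` is a hypothetical wrapper around the reasoning engine, and the stop marker is an assumption rather than the exact token used in our runs.

```python
# Rough sketch of follow-up generation and the stop-word termination check.
# `ask_llm` is a hypothetical wrapper around the reasoning engine, and
# STOP_MARKER is an assumed uncertainty token, not the exact one we used.

FOLLOW_UP_PROMPT = (
    "Based on the answer '{}' and the question '{}', "
    "what are some potential short follow-up questions?"
)
STOP_MARKER = "i don't know"

def next_questions(question, answer, ask_llm):
    """Return follow-up questions, or [] when the answer signals a knowledge gap."""
    if STOP_MARKER in answer.lower():
        return []  # the model produced the stop word: terminate this branch
    reply = ask_llm(FOLLOW_UP_PROMPT.format(answer, question))
    # Assume the model lists one short question per line.
    return [line.strip("-• ").strip() for line in reply.splitlines() if line.strip()]
```
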
4 Experiments
We build a dataset of 500 search queries classified into 25 categories, taking the parent categories from Google Trends as of 2023 [11]. Given that Google Trends derives its data from Google search queries, we hypothesise that it provides a representative sample of general online search behaviour. All 500 search queries can be found in our GitHub repository [13].

1. Arts & Entertainment
2. Autos & Vehicles
3. Beauty & Fitness
4. Books & Literature
5. Business & Industrial
6. Computers & Electronics
7. Finance
8. Food & Drinks
9. Games
10. Health
11. Hobbies & Leisure
12. Home & Garden
13. Internet & Telecom
14. Jobs & Education
15. Law & Government
16. News
17. Online Communities
18. People & Society
19. Pets & Animals
20. Property
21. Reference
22. Science
23. Shopping
24. Sports
25. Travel

For each category, we generate 20 queries grouped by complexity: easy and difficult. To determine the complexity of each query, we use the following criteria:

Length of Query
● Easy: Short queries, usually 1-3 words.
● Difficult: Very long queries or full sentences, more than 6 words.
Specificity of Query
● Easy: General or broad queries.
● Difficult: Highly specific, niche, or detailed queries.
Use of Jargon or Technical Terms
● Easy: Common language, no specialised terms.
● Difficult: Heavy use of technical terms, jargon, or acronyms.
Ambiguity or Clarity of Query
● Easy: Clear and straightforward, with likely one main interpretation.
● Difficult: Ambiguous, requiring context or additional information to interpret.
Search Intent
● Easy: General information seeking or popular topics.
● Difficult: In-depth research, controversial topics, or highly detailed queries.
Knowledge Level Required
● Easy: Suitable for a general audience, no special knowledge needed.
● Difficult: Requires in-depth knowledge or expertise in the field.
Query Format
● Easy: Basic questions or keyword searches.
● Difficult: Complex questions, hypotheticals, or queries requiring multi-step thinking.
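
Several of these criteria are judgement calls, but the more mechanical ones can be approximated lexically. The sketch below is only a rough illustration of that idea, assuming a hypothetical hand-maintained jargon wordlist; it does not attempt to capture specificity, ambiguity, intent, or required knowledge level.

```python
# Rough lexical approximation of the mechanically checkable criteria
# (query length, jargon, query format). JARGON is a hypothetical,
# hand-maintained wordlist; the entries below are examples only.

JARGON = {"rag", "amortisation", "tokenisation", "cagr"}

def rough_complexity(query: str) -> str:
    words = query.lower().split()
    long_query = len(words) > 6                                    # length criterion
    uses_jargon = any(w.strip("?.,'") in JARGON for w in words)    # jargon criterion
    full_question = query.strip().endswith("?") and long_query     # query format
    return "difficult" if (long_query or uses_jargon or full_question) else "easy"
```
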
For each search simulation, we measured the following metrics:
● Accuracy: the percentage of queries answered correctly by the RAG system; answers were reviewed manually.
● Topic Depth: the number of iterations until the LLM system stopped answering the question.
● Average number of sources used per search simulation.
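
For concreteness, the sketch below shows one way these metrics could be aggregated from per-simulation records; the `Simulation` structure is a hypothetical illustration, not the exact log format used in our repository [13].

```python
# Minimal sketch of metric aggregation over simulation records.
# `Simulation` is a hypothetical record type, not the repository's actual format.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Simulation:
    answered_correctly: bool   # manually reviewed
    iterations: int            # topic depth reached before the model stopped answering
    sources_used: int          # documents consulted during the simulation

def aggregate_metrics(runs: List[Simulation]) -> Dict[str, float]:
    n = len(runs)
    return {
        "accuracy": sum(r.answered_correctly for r in runs) / n,
        "avg_topic_depth": sum(r.iterations for r in runs) / n,
        "avg_sources": sum(r.sources_used for r in runs) / n,
    }
```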



5 Analysis
We carried out search simulations for 60 keywords, generating 323 answers across 655 sources. We found that using more than 60 of the 500 keywords in the dataset did not make a significant difference. All the search simulations can be found in our GitHub repository [13]. The results demonstrate the effectiveness of using a RAG system to simulate user search behaviour and generate relevant suggestions.

With a consistent accuracy of 93% for both simple and complex keywords, the RAG system proved to be a reliable tool for information retrieval. The study also found that finding sources becomes slightly more challenging for specific topics, as indicated by the average number of sources needed per keyword difficulty: 10.9 sources for easy queries versus 11.23 for difficult ones. No significant differences were observed in accuracy or source quantity across categories, likely due to the broad and balanced nature of the selected categories.

Additionally, we found that, on average, a knowledge gap is encountered at the fifth level of topic depth. This suggests that the internet may have limitations in providing in-depth information on certain subjects. Our methodology effectively highlights these knowledge gaps, offering a straightforward way to identify them across a range of topics.

6 Applications
Recommending nonexistent content is a powerful tool for revealing knowledge gaps. This approach has a wide range of applications, including:

1. Scientific Discovery: It can pinpoint unexplored areas in research, highlighting topics that have yet to be investigated.
2. Educational Enhancement: By identifying missing elements in learning materials, it helps create more comprehensive educational resources.
3. Research Development: It can uncover untapped research opportunities, guiding scholars and scientists towards novel inquiries.
4. Market Analysis: In the business realm, it can reveal product gaps in a catalogue, offering insights for new product development.
5. Search Engine Optimization: It improves search recommendations by identifying content that users might be looking for but that is not currently available online.
6. Content Development: It aids in recognizing content gaps within a content library, assisting content creators in filling these voids.

Each of these applications demonstrates the value of identifying and understanding what is missing, thereby guiding future endeavours in various fields.

7 Conclusion
We have demonstrated a methodology for identifying knowledge gaps in content libraries. For future work, there is potential to expand this research by exploring alternative search simulation methods. In particular, agents are a promising avenue: with their broader bandwidth for search engine usage and content processing, they offer capabilities surpassing those of human users. Future research could also extend the evaluation to additional answer engines, enabling a more comprehensive benchmarking of the estimation methodology outlined in [14].

It is worth noting that we do not have direct access to a web index, which limits how rigorous our evaluation can be. Future work could assess the system's ability to predict whether a query is a missing content query (MCQ) [14] given gold-standard labels, perhaps using a TREC-style test collection and removing the relevant documents from the collection for some queries.

REFERENCES
[1] Dmitri Brereton. 2022. Google Search Is Dying. Published on February 15, 2022. [Online]. Available: https://dkb.io/post/google-search-is-dying
[2] Edwin Chen. 2022. Is Google Search Deteriorating? Measuring Google's Search Quality in 2022. Published on January 10, 2022. [Online]. Available: https://www.surgehq.ai/blog/is-google-search-deteriorating-measuring-search-quality-in-2022
[3] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 [cs.CL].
[4] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI. [Online]. Available: https://api.semanticscholar.org/CorpusID:160025533
[5] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. CoRR, abs/2201.11903. [Online]. Available: https://arxiv.org/abs/2201.11903
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. NeurIPS.
[7] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL].
[8] Kenji Kawaguchi, Yoshua Bengio, and Leslie Kaelbling. 2022. Generalisation in Deep Learning. In Mathematical Aspects of Deep Learning, Philipp Grohs and Gitta Kutyniok, Eds. Cambridge University Press, Cambridge, 112–148. DOI: https://doi.org/10.1017/9781009025096.003
[9] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2022. Recommender Systems: Techniques, Applications, and Challenges. In Recommender Systems Handbook, F. Ricci, L. Rokach, and B. Shapira, Eds. Springer, New York, NY. https://doi.org/10.1007/978-1-0716-2197-4_1
[10] Anthropic Team. Let Claude Say "I Don't Know" to Prevent Hallucinations. Anthropic. Accessed in 2023. [Online]. Available: https://docs.anthropic.com/claude/docs/let-claude-say-i-dont-know
[11] Google Trends Team. Google Trends. Google. Accessed in 2023. [Online]. Available: https://trends.google.com/trends/
[12] AskPandi Team. AskPandi - Ask Me Anything. AskPandi. Accessed in 2023. [Online]. Available: https://askpandi.com
[13] llm_knowledge_gap_finder. GitHub repository. [Online]. Available: https://github.com/webeng/llm_knowledge_gap_finder
[14] Elad Yom-Tov et al. 2005. Learning to Estimate Query Difficulty: Including Applications to Missing Content Detection and Distributed Information Retrieval. In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005).

