Multi-Meta-RAG: Improving RAG for Multi-Hop Queries Using Database Filtering with LLM-Extracted Metadata
1 Introduction
Large Language Models (LLMs) have shown remarkable language understanding
and generation abilities [10,13]. However, they face two main challenges: static
knowledge [8] and generative hallucination [5]. Retrieval-augmented generation
(RAG) [6] is an established process for answering user questions over entire datasets.
RAG also helps mitigate generative hallucination and provides the LLM with new
information on which it was not trained [11]. Real-world RAG pipelines often need
to retrieve evidence from multiple documents simultaneously, a procedure known
as multi-hop querying. Nevertheless, existing RAG applications struggle to answer
multi-hop queries, which require retrieval and reasoning over multiple pieces of
evidence [12]. In this paper, we present Multi-Meta-RAG: an improved
RAG using a database filtering approach with LLM-extracted metadata that
significantly improves the results on the MultiHop-RAG benchmark.
2 Related Work
MultiHop-RAG [12] is a novel benchmarking dataset focused on multi-hop
queries, including a knowledge base, questions, ground-truth responses, and
supporting evidence. The news articles were selected from September 26, 2023,
to December 26, 2023, extending beyond the knowledge cutoffs of ChatGPT
and GPT-4. A trained language model extracted factual or opinion sentences
from each news article. These factual sentences act as evidence for multi-hop
queries. The selection method keeps articles whose evidence shares keywords
with other articles, enabling the creation of multi-hop queries with answers
drawn from multiple sources. Given the original evidence and its context,
GPT-4 was used to rephrase the evidence into so-called claims. Afterward, a
bridge entity or topic is used to generate the multi-hop queries.
For example, "Did Engadget report a discount on the 13.6-inch MacBook Air
before The Verge reported a discount on Samsung Galaxy Buds 2?" is a typi-
cal query from the MultiHop-RAG dataset. Answering it requires evidence from
Engadget and The Verge to formulate an answer. It also requires the LLM to
infer the temporal ordering of events. In addition to the temporal query above,
MultiHop-RAG contains inference, comparison, and null (no correct answer)
queries.
Fig. 1. A naive RAG implementation for MultiHop-RAG queries. RAG selects chunks
from articles that the example query does not ask about, which leads the LLM to
give a wrong response.
3 Multi-Meta-RAG
have a source filter extracted, while a publishing date filter was extracted for
15.57% of queries; for comparison, 22.81% of the queries in the MultiHop-RAG
dataset are temporal.
For the example query above, the LLM extracts the following metadata filter:

{
  "source": {
    "$in": ["Engadget", "The Verge"]
  }
}

[Figure: the Multi-Meta-RAG pipeline. News article chunks are vectorized and
stored together with their metadata (source: Engadget, The Verge, BBC, CNN,
etc.) in a Neo4j [9] vector database; at query time, the query embedding and the
extracted metadata filter together select only the matching chunks (here, from
Engadget and The Verge), which form the prompt context.]
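To make the filter-and-retrieve stage concrete, here is a minimal, self-contained
sketch of metadata-filtered cosine-similarity retrieval over an in-memory chunk
store. It stands in for the Neo4j vector database used in the actual pipeline;
the chunk representation and function names are our own illustrative choices.

import numpy as np

def matches(metadata: dict, flt: dict) -> bool:
    """Check chunk metadata against a Mongo-style filter such as
    {'source': {'$in': ['Engadget', 'The Verge']}}."""
    return all(metadata.get(field) in cond.get("$in", [])
               for field, cond in flt.items())

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb: np.ndarray, chunks: list, flt: dict, k: int = 20) -> list:
    """Top-k chunks by cosine similarity, restricted to chunks whose
    metadata passes the LLM-extracted filter."""
    candidates = [c for c in chunks if matches(c["metadata"], flt)]
    candidates.sort(key=lambda c: cosine(query_emb, c["embedding"]), reverse=True)
    return candidates[:k]

# Each chunk is a dict such as:
# {"text": "...", "metadata": {"source": "Engadget"}, "embedding": np.ndarray}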
Candidate chunks are ranked by cosine similarity with the query embedding, and
we filter the chunks with the LLM-extracted metadata in the same stage. As in
MultiHop-RAG, we use a reranker module (bge-reranker-large [15]) to examine
retrieval performance. After retrieving 20 candidate chunks using the embedding
model and the metadata filter, we select the top-K chunks using the reranker.
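The rerank step can be sketched as follows, loading bge-reranker-large as a
cross-encoder via sentence-transformers (one of several ways to run this model;
the rerank helper and chunk layout follow the sketch above and are illustrative):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, chunks: list, top_k: int = 10) -> list:
    """Score (query, chunk text) pairs with the cross-encoder, keep the top_k."""
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]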
4 Results
Table 2. Chunk retrieval experiment results. Top-10 chunks are selected with bge-
reranker-large after the top-20 chunks are found via similarity search and database
metadata filtering. A chunk size of 256 and a chunk overlap of 32 are used. We
evaluate both Baseline RAG and Multi-Meta-RAG using the evaluation script
provided in the MultiHop-RAG repository.

                                         Accuracy
LLM                     Ground-truth [12]  Baseline RAG [12]  Multi-Meta-RAG (ours)
GPT4 (gpt-4-0613)             0.89               0.56                0.606
PaLM (text-bison@001)         0.74               0.47                0.608
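The chunking configuration from the caption can be reproduced with a standard
recursive text splitter; a minimal sketch assuming LangChain [2] (the exact
splitter used in the original pipeline is not specified here), with the article
text and source metadata as placeholder values:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=32)

article_text = "..."  # full body of one news article
docs = splitter.create_documents(
    texts=[article_text],
    metadatas=[{"source": "Engadget"}],  # stored with each chunk for filtering
)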
Table 4. Generation accuracy of LLMs with MultiHop-RAG per question type (top-6
chunks with voyage-02 [14]).

                          Accuracy
Question Type    GPT4 (gpt-4-0613)    PaLM (text-bison@001)
Table 4 shows the detailed evaluation results for the different question types
for GPT-4 and Google PaLM. Both models achieve remarkable scores, exceeding 0.9
on inference queries. Google PaLM performs significantly better than GPT-4 on
comparison and temporal queries. However, PaLM struggles with null questions,
whereas GPT-4 achieves a near-perfect score on them. These results suggest that
combining both models for different query types could be a valid strategy to
further increase overall accuracy.
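A hypothetical sketch of that routing strategy, dispatching each query to
whichever model handled its type best in Table 4 (the keyword classifier below
is a toy stand-in for a real question-type classifier):

BEST_MODEL = {
    "inference": "gpt-4-0613",       # both models exceed 0.9 here
    "comparison": "text-bison@001",  # PaLM was stronger (Table 4)
    "temporal": "text-bison@001",    # PaLM was stronger (Table 4)
    "null": "gpt-4-0613",            # GPT-4 is near-perfect on null queries
}

def classify_question_type(query: str) -> str:
    """A toy keyword heuristic standing in for a real classifier."""
    q = query.lower()
    if "before" in q or "after" in q:
        return "temporal"
    if "compare" in q or "both" in q:
        return "comparison"
    return "inference"

def route(query: str) -> str:
    """Pick the generating model for a query based on its question type."""
    return BEST_MODEL.get(classify_question_type(query), "gpt-4-0613")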
5 Conclusion
5.1 Limitations
The proposed solution still has some limitations. First, extracting metadata
requires a set of queries from a particular domain and question format, as well
as additional inference time. Second, it requires manually creating a prompt
template to extract the metadata from the query. Third, while the improved
results are encouraging, they still fall considerably short of the results
achieved by feeding the LLM precise ground-truth facts.
5.2 Future Work
Future work includes trying more generic prompt templates for metadata
extraction using multi-hop datasets from different domains. In addition,
alternative LLMs, such as Llama 3.1 [13], could be tested on datasets with more
recent knowledge cutoff dates.
Appendix
Given the question, extract the metadata to filter the database about article
sources. Avoid stopwords.
The sources can only be from the list: [’Yardbarker’, ’The Guardian’, ’Revyuh
Media’, ’The Independent - Sports’, ’Wired’, ’Sport Grill’, ’Hacker News’, ’Iot
Business News’, ’Insidesport’, ’Sporting News’, ’Seeking Alpha’, ’The Age’, ’CB-
SSports.com’, ’The Sydney Morning Herald’, ’FOX News - Health’, ’Science
News For Students’, ’Polygon’, ’The Independent - Life and Style’, ’FOX News -
Entertainment’, ’The Verge’, ’Business Line’, ’The New York Times’, ’The Roar |
Sports Writers Blog’, ’Sportskeeda’, ’BBC News - Entertainment & Arts’, ’Busi-
ness World’, ’BBC News - Technology’, ’Essentially Sports’, ’Mashable’, ’Ad-
vanced Science News’, ’TechCrunch’, ’Financial Times’, ’Music Business World-
wide’, ’The Independent - Travel’, ’FOX News - Lifestyle’, ’TalkSport’, ’Yahoo
News’, ’Scitechdaily | Science Space And Technology News 2017’, ’Globes En-
glish | Israel Business Arena’, ’Wide World Of Sports’, ’Rivals’, ’Fortune’, ’Zee
Business’, ’Business Today | Latest Stock Market And Economy News India’,
’Sky Sports’, ’Cnbc | World Business News Leader’, ’Eos: Earth And Space Sci-
ence News’, ’Live Science: The Most Interesting Articles’, ’Engadget’]
Examples to follow:
Question: Who is the individual associated with the cryptocurrency industry
facing a criminal trial on fraud and conspiracy charges, as reported by both The
Verge and TechCrunch, and is accused by prosecutors of committing fraud for
personal gain?
Answer: {’source’: {’$in’: [’The Verge’, ’TechCrunch’]}}
Question: After the TechCrunch report on October 7, 2023, concerning Dave
Clark’s comments on Flexport, and the subsequent TechCrunch article on Octo-
ber 30, 2023, regarding Ryan Petersen’s actions at Flexport, was there a change
in the nature of the events reported?
Answer: {’source’: {’$in’: [’TechCrunch’]}, ’published_at’: {’$in’: [’October 7,
2023’, ’October 30, 2023’]}}
Question: Which company, known for its dominance in the e-reader space and
for offering exclusive invite-only deals during sales events, faced a stock decline
due to an antitrust lawsuit reported by ’The Sydney Morning Herald’ and dis-
cussed by sellers in a ’Cnbc | World Business News Leader’ article?
Answer: {’source’: {’$in’: [’The Sydney Morning Herald’, ’Cnbc | World Business
News Leader’]}}
If you detect multiple queries, return the answer for the first. Now it is your
turn:
Question: <query>
Answer:
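Because the template's example answers are Python-style dict literals, the
model's reply can be parsed directly. A usage sketch, reusing the illustrative
extract_metadata_filter helper from Section 3:

filter_dict = extract_metadata_filter(
    "Did Engadget report a discount on the 13.6-inch MacBook Air before "
    "The Verge reported a discount on Samsung Galaxy Buds 2?",
    prompt_template,  # the template above, with <query> as the placeholder
)
# Expected result: {'source': {'$in': ['Engadget', 'The Verge']}}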
References
1. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A.,
Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C.,
Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are
few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H.
(eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 1877–1901.
Curran Associates, Inc. (2020)
2. Chase, H.: LangChain (Oct 2022), https://github.com/langchain-ai/langchain
3. Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Larson,
J.: From local to global: A Graph RAG approach to query-focused summarization
(2024), https://arxiv.org/abs/2404.16130
4. Ho, X., Duong Nguyen, A.K., Sugawara, S., Aizawa, A.: Construct-
ing a multi-hop QA dataset for comprehensive evaluation of reason-
ing steps. In: Scott, D., Bel, N., Zong, C. (eds.) Proceedings of the
28th International Conference on Computational Linguistics. pp. 6609–
6625. International Committee on Computational Linguistics, Barcelona,
Spain (Online) (Dec 2020). https://doi.org/10.18653/v1/2020.coling-main.580,
https://aclanthology.org/2020.coling-main.580
5. Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W.,
Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models:
Principles, taxonomy, challenges, and open questions (2023)
6. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H.,
Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented
generation for knowledge-intensive NLP tasks. In: Larochelle, H., Ranzato, M., Had-
sell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing
Systems. vol. 33, pp. 9459–9474. Curran Associates, Inc. (2020)
7. Liu, J.: LlamaIndex (Nov 2022). https://doi.org/10.5281/zenodo.1234,
https://github.com/jerryjliu/llama_index
8. Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R.,
Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., Grave, E., LeCun, Y.,
Scialom, T.: Augmented language models: a survey (2023)
9. Neo4j, Inc.: Neo4j graph database, https://neo4j.com/product/neo4j-graph-database
10. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang,
C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L.,
Simens, M., Askell, A., Welinder, P., Christiano, P.F., Leike, J., Lowe, R.: Training
language models to follow instructions with human feedback. In: Koyejo, S., Mo-
hamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural
Information Processing Systems. vol. 35, pp. 27730–27744. Curran Associates, Inc.
(2022)
11. Shuster, K., Poff, S., Chen, M., Kiela, D., Weston, J.: Retrieval augmenta-
tion reduces hallucination in conversation. In: Moens, M., Huang, X., Spe-
cia, L., Yih, S.W. (eds.) Findings of the Association for Computational Lin-
guistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Repub-
lic, 16-20 November, 2021. pp. 3784–3803. Association for Computational
Linguistics (2021). https://doi.org/10.18653/v1/2021.findings-emnlp.320
12. Tang, Y., Yang, Y.: MultiHop-RAG: Benchmarking retrieval-augmented generation
for multi-hop queries (2024)
13. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T.,
Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave,
E., Lample, G.: LLaMA: Open and efficient foundation language models (2023)
14. Voyage AI: Voyage AI cutting-edge embedding and rerankers,
https://www.voyageai.com
15. Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., Nie, J.Y.: C-Pack: Packaged
resources to advance general Chinese embedding (2024)
16. Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., Manning,
C.D.: HotpotQA: A dataset for diverse, explainable multi-hop question answer-
ing. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) Proceedings of
the 2018 Conference on Empirical Methods in Natural Language Processing. pp.
2369–2380. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov
2018). https://doi.org/10.18653/v1/D18-1259, https://aclanthology.org/D18-1259