
arXiv:2406.13213v2 [cs.CL] 19 Aug 2024

Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata

Mykhailo Poliakov[0009-0006-5263-762X] and Nadiya Shvai[0000-0001-8194-6196]

National University of Kyiv-Mohyla Academy
{mykhailo.poliakov, n.shvay}@ukma.edu.ua

Abstract. Retrieval-augmented generation (RAG) enables retrieval of relevant information from an external knowledge source and allows large language models (LLMs) to answer queries over previously unseen document collections. However, it has been demonstrated that traditional RAG applications perform poorly in answering multi-hop questions, which require retrieving and reasoning over multiple pieces of supporting evidence. We introduce a new method called Multi-Meta-RAG, which uses database filtering with LLM-extracted metadata to improve the RAG selection of relevant documents from the various sources applicable to the question. While database filtering is specific to a set of questions from a particular domain and format, we found that Multi-Meta-RAG greatly improves the results on the MultiHop-RAG benchmark. The code is available on GitHub.

Keywords: large language models · retrieval augmented generation · multi-hop question answering

1 Introduction

Large language models (LLMs) have shown remarkable language understanding and generation abilities [10,13]. However, they face two main challenges: static knowledge [8] and generative hallucination [5]. Retrieval-augmented generation (RAG) [6] is an established approach for answering user questions over entire datasets. RAG also helps mitigate generative hallucination and provides the LLM with new information on which it was not trained [11]. Real-world RAG pipelines often need to retrieve evidence from multiple documents simultaneously, a procedure known as multi-hop querying. Nevertheless, existing RAG applications struggle to answer multi-hop queries, which require retrieval and reasoning over numerous pieces of evidence [12]. In this paper, we present Multi-Meta-RAG: an improved RAG that uses a database filtering approach with LLM-extracted metadata and significantly improves the results on the MultiHop-RAG benchmark.

2 Related works

MultiHop-RAG [12] is a novel benchmarking dataset focused on multi-hop queries, including a knowledge base, questions, ground-truth responses, and supporting evidence. The news articles were selected from September 26, 2023, to December 26, 2023, extending beyond the knowledge cutoffs of ChatGPT (gpt-3.5-turbo-0613) and GPT-4 (gpt-4-0613). A trained language model extracted factual or opinion sentences from each news article; these factual sentences act as evidence for multi-hop queries. The selection method keeps articles whose evidence shares keywords with other articles, enabling the creation of multi-hop queries with answers drawn from numerous sources. Given the original evidence and its context, GPT-4 was used to rephrase the evidence into what are referred to as claims. Afterward, a bridge entity or topic is used to generate multi-hop queries.
For example, "Did Engadget report a discount on the 13.6-inch MacBook Air before The Verge reported a discount on Samsung Galaxy Buds 2?" is a typical query from the MultiHop-RAG dataset. Answering it requires evidence from both Engadget and The Verge, and it also requires the LLM to figure out the temporal ordering of events. In addition to the temporal query above, MultiHop-RAG contains inference, comparison, and null (no correct answer) queries.

Fig. 1. A naive RAG implementation for MultiHop-RAG queries. RAG selects chunks from articles not asked about in the example query, which leads to the LLM giving a wrong response.

In a typical RAG application, we use an external corpus that comprises multiple documents and serves as the knowledge base. Each document within this corpus is segmented into chunks. These chunks are then converted into vector representations using an embedding model and stored in a vector database. Given a user query, RAG typically retrieves the top-K chunks that best match the query. The retrieved chunks, combined with the query, are submitted to an LLM to generate a final response.
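To make the pipeline above concrete, the following minimal sketch (not the authors' implementation) indexes a few hypothetical chunks with a stand-in embedding function and retrieves the top-K chunks by cosine similarity; a real system would use a trained embedding model such as bge-large-en-v1.5 or voyage-02 and a vector database instead.

import numpy as np
from collections import Counter

def embed(text: str, dim: int = 512) -> np.ndarray:
    """Stand-in hashed bag-of-words embedding; a real RAG app would call a
    trained embedding model (e.g. bge-large-en-v1.5 or voyage-02)."""
    vec = np.zeros(dim)
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % dim] += count
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Hypothetical knowledge base: article chunks with their source metadata.
chunks = [
    {"text": "Engadget: the 13.6-inch MacBook Air is discounted ...", "source": "Engadget"},
    {"text": "The Verge: Samsung Galaxy Buds 2 drop to a new low price ...", "source": "The Verge"},
    {"text": "BBC: unrelated technology coverage ...", "source": "BBC"},
]
index = [(embed(c["text"]), c) for c in chunks]

def retrieve(query: str, k: int = 2):
    """Naive retrieval: top-K chunks by cosine similarity, no metadata filter."""
    q = embed(query)
    scored = sorted(index, key=lambda item: float(item[0] @ q), reverse=True)
    return [c for _, c in scored[:k]]

query = ("Did Engadget report a discount on the 13.6-inch MacBook Air before "
         "The Verge reported a discount on Samsung Galaxy Buds 2?")
for chunk in retrieve(query):
    print(chunk["source"], "->", chunk["text"][:60])

Because nothing in this naive retrieval step knows which sources the query mentions, chunks from unrelated outlets can crowd out the relevant ones, which is exactly the failure mode described next.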
For the MultiHop-RAG benchmark, the scraped articles act as the knowledge base for the RAG application under test. The problem is that a naive RAG application fails to recognize that a query asks for information from specific sources. The top-K chunks that RAG retrieves often contain information from sources other than those mentioned in the query, and may even miss the relevant sources entirely, leading to a wrong response, as depicted in Figure 1.
Several popular benchmarks, such as HotpotQA [16] and 2WikiMultiHopQA [4], can be used for QA over multiple document sources. However, these datasets primarily focus on estimating LLM reasoning skills and do not emphasize retrieving evidence from the knowledge base. Another problem is that they are based on Wikipedia, which means LLMs are already trained on the same data.
Alternative solutions for multi-hop queries include graph-based approaches like Graph RAG [3]. While Graph RAG is evaluated on the MultiHop-RAG dataset, the dataset is used purely as a knowledge base for an independent question set, and another LLM assesses the responses on custom metrics such as comprehensiveness, diversity, empowerment, and directness instead of simple accuracy.

3 Multi-Meta-RAG

3.1 Extraction of Relevant Query Metadata with the LLM

Each question in the MultiHop-RAG [12] benchmark follows a typical structure: every query requests information from one or more news sources, and some temporal queries additionally require news articles from a particular date. We can extract a query filter via a helper LLM by constructing a few-shot prompt [1] with examples of extracted article sources and publishing dates as a filter. The prompt template is provided in Appendix 5.2. We only run metadata extraction with ChatGPT (gpt-3.5-turbo-1106) because this additional RAG pipeline step must be quick and cheap; we found that it takes 0.7 seconds on average per query.
Two query metadata filter fields are extracted: article source and publication date. The complete filter is a dictionary combining the two fields. Samples of extracted metadata filters can be found in Table 1. The primary filtering operator is $in, the only operator provided in the examples of the few-shot prompt template. The LLM also correctly chooses the $nin operator for a tiny fraction of queries without an example. While the LLM only used $in and $nin for article sources, it sometimes chooses other operators such as $lt or $gt for the publication date in a fraction of temporal queries. Because the number of such queries is small, we decided to only use date filters with the $in and $nin operators and the most frequent date format (strftime format %B %-d, %Y) for easier matching in the database.
Table 1. Examples of extracted metadata filters using a few-shot prompt with corresponding queries. Note the correct usage of the $nin operator for the last query.

Query: Does the TechCrunch article report on new hiring at Starz, while the Engadget article discusses layoffs within the entire video game industry?
Extracted filter: {"source": {"$in": ["TechCrunch", "Engadget"]}}

Query: Did The Guardian's report on December 12, 2023, contradict the Sporting News report regarding the performance and future outlook of Manchester United?
Extracted filter: {"published_at": {"$in": ["December 12, 2023"]}, "source": {"$in": ["The Guardian", "Sporting News"]}}

Query: Who is the individual facing a criminal trial on seven counts of fraud and conspiracy, previously likened to a financial icon but not by TechCrunch, and is accused by the prosecution of committing fraud for wealth, power, and influence?
Extracted filter: {"source": {"$nin": ["TechCrunch"]}}
All queries have a source filter extracted, while a publication-date filter was extracted for 15.57% of queries, even though 22.81% of the queries in the MultiHop-RAG dataset are temporal.
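A minimal sketch of this extraction step is shown below. It assumes the OpenAI chat completions API and an abridged version of the few-shot prompt from Appendix 5.2; it is not the authors' exact implementation.

import ast
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Abridged version of the few-shot prompt from Appendix 5.2.
FEW_SHOT_PROMPT = """Given the question, extract the metadata to filter the database about article sources. Avoid stopwords.

Examples to follow:
Question: Who is the individual associated with the cryptocurrency industry facing a criminal trial on fraud and conspiracy charges, as reported by both The Verge and TechCrunch, and is accused by prosecutors of committing fraud for personal gain?
Answer: {{'source': {{'$in': ['The Verge', 'TechCrunch']}}}}

If you detect multiple queries, return the answer for the first. Now it is your turn:
Question: {query}
Answer:"""

def extract_metadata_filter(query: str) -> dict:
    """Ask a cheap helper LLM to produce a metadata filter for the query."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        temperature=0,
        messages=[{"role": "user", "content": FEW_SHOT_PROMPT.format(query=query)}],
    )
    text = response.choices[0].message.content.strip()
    # The few-shot examples use Python-style dict literals, so parse accordingly.
    return ast.literal_eval(text)

query = ("Did Engadget report a discount on the 13.6-inch MacBook Air before "
         "The Verge reported a discount on Samsung Galaxy Buds 2?")
print(extract_metadata_filter(query))
# Expected shape: {'source': {'$in': ['Engadget', 'The Verge']}}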

3.2 Improved Chunk Selection using Metadata Filtering

The extracted metadata can be used to enhance a RAG application (Figure 2). We split the articles in the MultiHop-RAG [12] knowledge base into chunks of 256 tokens each using the LlamaIndex [7] sentence splitter, as in the original MultiHop-RAG implementation. We also picked a chunk overlap of 32, finding that a smaller chunk overlap leads to a better variety of unique chunks in the top-K selection than the original implementation, which used the LlamaIndex default of 200. We selected the LangChain [2] Neo4j [9] vector store as the vector database because its index implementation recently (April 2024) started to support metadata filtering. We then convert the chunks into embeddings using an embedding model and save them into the vector database with the article metadata stored as node properties.
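The indexing stage might look roughly like the sketch below. It is not the released code: the import paths (recent llama-index and langchain-community versions are assumed), the Neo4j connection details, and the article records are assumptions for illustration.

from llama_index.core.node_parser import SentenceSplitter  # assumes llama-index >= 0.10
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Neo4jVector

# Hypothetical article records; the real knowledge base is the MultiHop-RAG news corpus.
articles = [
    {"text": "...full Engadget article text...", "source": "Engadget", "published_at": "October 27, 2023"},
    {"text": "...full The Verge article text...", "source": "The Verge", "published_at": "November 3, 2023"},
]

# Chunking as described above: 256-token chunks with an overlap of 32.
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=32)
docs = [
    Document(page_content=chunk, metadata={"source": a["source"], "published_at": a["published_at"]})
    for a in articles
    for chunk in splitter.split_text(a["text"])
]

# Embed the chunks and store them in Neo4j with the metadata as node properties.
embedding = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
store = Neo4jVector.from_documents(
    docs,
    embedding,
    url="bolt://localhost:7687",  # assumed local Neo4j instance
    username="neo4j",
    password="password",
    index_name="multihop_rag",
)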
Fig. 2. Multi-Meta-RAG: an improved RAG with database filtering using metadata. Metadata is extracted via a secondary LLM. With filtering, we can ensure the top-K chunks always come from relevant sources, with a better chance of getting a correct overall response.

In the retrieval stage, we transform the query using the same embedding model and retrieve the top-K most relevant chunks with the highest cosine similarity to the query embedding, filtering the chunks with the LLM-extracted metadata in the same stage. Similarly to MultiHop-RAG, we use a Reranker module (bge-reranker-large [15]) to examine retrieval performance: after retrieving 20 candidate chunks using the embedding model and metadata filter, we select the top-K chunks with the Reranker.
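Putting the retrieval stage together, a hedged sketch (reusing store and extract_metadata_filter from the earlier sketches) could look as follows; the filter argument to similarity_search and the FlagEmbedding reranker usage are assumptions about the library APIs rather than the authors' exact code.

from FlagEmbedding import FlagReranker

# Reuses `store` and `extract_metadata_filter` from the sketches above.
reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

def retrieve_with_filter(query: str, top_k: int = 6):
    """Filtered similarity search followed by reranking, as described above."""
    metadata_filter = extract_metadata_filter(query)
    # Assumes Neo4jVector.similarity_search accepts a metadata `filter`
    # argument (metadata filtering support was added in 2024).
    candidates = store.similarity_search(query, k=20, filter=metadata_filter)
    scores = reranker.compute_score([[query, doc.page_content] for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

query = ("Did Engadget report a discount on the 13.6-inch MacBook Air before "
         "The Verge reported a discount on Samsung Galaxy Buds 2?")
for doc in retrieve_with_filter(query):
    print(doc.metadata["source"], "->", doc.page_content[:60])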

4 Results

4.1 Chunk Retrieval Experiment

We selected the two best-performing embedding models from the original MultiHop-RAG experiment, bge-large-en-v1.5 [15] and voyage-02 [14], to test chunk retrieval performance with metadata filtering. The retrieved list of chunks is compared with the ground-truth evidence associated with each query, excluding the null queries, as they lack corresponding evidence. For evaluation, we assume the top-K chunks are retrieved and use metrics such as Mean Average Precision at K (MAP@K), Mean Reciprocal Rank at K (MRR@K), and Hit Rate at K (Hit@K). MAP@K measures the average precision of the top-K retrieval across all queries. MRR@K calculates the average reciprocal rank of the first relevant chunk within the top-K retrieved set for each query. Hit@K measures the proportion of evidence that appears in the top-K retrieved set. The experiment (Table 2) showed considerable improvement for both embeddings on all core metrics: MRR@10, MAP@10, Hits@10, and Hits@4. Most notably, for voyage-02, Hits@4 improved by 17.2%. This improvement matters for practical RAG systems, where the top-K retrieved should be as low as possible to respect context window limits and cost.
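For reference, the sketch below shows one plausible way to compute such retrieval metrics over per-query binary relevance judgements; the evaluation script in the MultiHop-RAG repository is the authoritative implementation, so the exact definitions here (in particular the query-level hit rate) are assumptions.

from typing import List

def mrr_at_k(relevant: List[List[bool]], k: int) -> float:
    """Mean reciprocal rank of the first relevant chunk in each top-K list."""
    total = 0.0
    for flags in relevant:
        for rank, hit in enumerate(flags[:k], start=1):
            if hit:
                total += 1.0 / rank
                break
    return total / len(relevant)

def map_at_k(relevant: List[List[bool]], k: int) -> float:
    """Mean average precision of the top-K retrieval across all queries."""
    total = 0.0
    for flags in relevant:
        hits, precision_sum = 0, 0.0
        for rank, hit in enumerate(flags[:k], start=1):
            if hit:
                hits += 1
                precision_sum += hits / rank
        total += precision_sum / hits if hits else 0.0
    return total / len(relevant)

def hit_at_k(relevant: List[List[bool]], k: int) -> float:
    """Fraction of queries whose top-K list contains at least one relevant chunk."""
    return sum(any(flags[:k]) for flags in relevant) / len(relevant)

# Each inner list marks whether the chunk at that rank matches gold evidence.
relevance = [[False, True, False, True], [True, False, False, False]]
print(mrr_at_k(relevance, 4), map_at_k(relevance, 4), hit_at_k(relevance, 4))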

Table 2. Chunk retrieval experiment results. Top-10 chunks are selected with bge-reranker-large after the top-20 chunks are found via similarity search and database metadata filtering. A chunk size of 256 and a chunk overlap of 32 are used. We evaluate both Baseline RAG and Multi-Meta-RAG using the evaluation script provided in the MultiHop-RAG repository.

Baseline RAG [12]
Embedding                     | MRR@10 | MAP@10 | Hits@10 | Hits@4
bge-large-en-v1.5 (evaluated) | 0.6029 | 0.2687 | 0.7490  | 0.6661
voyage-02 (evaluated)         | 0.6016 | 0.2619 | 0.7419  | 0.6630

Multi-Meta-RAG (ours)
Embedding                     | MRR@10 | MAP@10 | Hits@10 | Hits@4
bge-large-en-v1.5             | 0.6574 | 0.3293 | 0.8909  | 0.7672
voyage-02                     | 0.6748 | 0.3388 | 0.9042  | 0.7920

4.2 LLM Response Generation Experiment

Table 3. Overall generation accuracy of LLMs with MultiHop-RAG (top-6 chunks with voyage-02)

LLM                   | Ground-truth [12] | Baseline RAG [12] | Multi-Meta-RAG (ours)
GPT-4 (gpt-4-0613)    | 0.89              | 0.56              | 0.606
PaLM (text-bison@001) | 0.74              | 0.47              | 0.608

As with the embeddings, we picked the two LLMs that performed best on ground-truth chunks in the initial MultiHop-RAG experiments, GPT-4 and Google PaLM. We achieved substantial improvements in accuracy (Table 3) for both models compared to the baseline RAG implementation: Google PaLM accuracy improved from 0.47 to 0.608 (a 29.4% relative increase), and GPT-4 accuracy improved from 0.56 to 0.606 (an 8.2% relative increase). Accuracy is calculated by checking whether any word in the LLM-generated response is present in the correct gold answer for each question.
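A minimal sketch of that word-overlap accuracy check, as one interpretation of the description above rather than the benchmark's exact script, is:

import re

def is_correct(response: str, gold_answer: str) -> bool:
    """True if any word of the model response also appears in the gold answer."""
    response_words = set(re.findall(r"[a-z0-9]+", response.lower()))
    gold_words = set(re.findall(r"[a-z0-9]+", gold_answer.lower()))
    return bool(response_words & gold_words)

responses = ["Yes, it did.", "Insufficient information."]
gold_answers = ["Yes", "No"]
accuracy = sum(is_correct(r, g) for r, g in zip(responses, gold_answers)) / len(gold_answers)
print(accuracy)  # 0.5 on this toy example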

Table 4. Generation accuracy of LLMs with MultiHop-RAG per question type (top-6 chunks with voyage-02)

Question Type | GPT-4 (gpt-4-0613) | PaLM (text-bison@001)
Inference     | 0.951              | 0.9203
Comparison    | 0.382              | 0.5397
Temporal      | 0.256              | 0.4545
Null          | 0.9867             | 0.2492

Table 4 shows the detailed evaluation results per question type for GPT-4 and Google PaLM. Both models achieved remarkable scores exceeding 0.9 on inference queries. Google PaLM performs significantly better than GPT-4 on comparison and temporal queries. However, PaLM struggles with null questions, whereas GPT-4 achieves a near-perfect score. These results suggest that combining both models for different query types could be a valid strategy to further increase overall accuracy.

5 Conclusion

This paper introduces Multi-Meta-RAG, a method for improving RAG for multi-hop queries using database filtering with LLM-extracted metadata. Multi-Meta-RAG considerably improves results in both the chunk retrieval and LLM generation experiments while being relatively straightforward and explainable compared to alternative solutions like Graph RAG [3].

5.1 Limitations

The proposed solution still has some limitations. Firstly, extracting metadata requires a set of queries from a particular domain and question format, as well as additional inference time. Secondly, it requires the manual creation of a prompt template to extract the metadata from the query. Thirdly, while the improved results are encouraging, they still fall considerably below the results achieved by feeding the LLM precise ground-truth facts.
5.2 Future work

Future work includes trying more generic prompt templates for metadata extraction with multi-hop datasets from other domains. In addition, testing alternative LLMs, such as Llama 3.1 [13], on datasets with more recent cutoff dates is worthwhile.

Acknowledgments. This research was partially funded by the OpenAI Researcher Access Program (Application 0000005294).

Appendix

Metadata Extraction Prompt Template

Given the question, extract the metadata to filter the database about article
sources. Avoid stopwords.

The sources can only be from the list: ['Yardbarker', 'The Guardian', 'Revyuh Media', 'The Independent - Sports', 'Wired', 'Sport Grill', 'Hacker News', 'Iot Business News', 'Insidesport', 'Sporting News', 'Seeking Alpha', 'The Age', 'CBSSports.com', 'The Sydney Morning Herald', 'FOX News - Health', 'Science News For Students', 'Polygon', 'The Independent - Life and Style', 'FOX News - Entertainment', 'The Verge', 'Business Line', 'The New York Times', 'The Roar | Sports Writers Blog', 'Sportskeeda', 'BBC News - Entertainment & Arts', 'Business World', 'BBC News - Technology', 'Essentially Sports', 'Mashable', 'Advanced Science News', 'TechCrunch', 'Financial Times', 'Music Business Worldwide', 'The Independent - Travel', 'FOX News - Lifestyle', 'TalkSport', 'Yahoo News', 'Scitechdaily | Science Space And Technology News 2017', 'Globes English | Israel Business Arena', 'Wide World Of Sports', 'Rivals', 'Fortune', 'Zee Business', 'Business Today | Latest Stock Market And Economy News India', 'Sky Sports', 'Cnbc | World Business News Leader', 'Eos: Earth And Space Science News', 'Live Science: The Most Interesting Articles', 'Engadget']

Examples to follow:
Question: Who is the individual associated with the cryptocurrency industry
facing a criminal trial on fraud and conspiracy charges, as reported by both The
Verge and TechCrunch, and is accused by prosecutors of committing fraud for
personal gain?
Answer: {’source’: {’$in’: [’The Verge’, ’TechCrunch’]}}
Question: After the TechCrunch report on October 7, 2023, concerning Dave Clark's comments on Flexport, and the subsequent TechCrunch article on October 30, 2023, regarding Ryan Petersen's actions at Flexport, was there a change in the nature of the events reported?
Answer: {'source': {'$in': ['TechCrunch']}, 'published_at': {'$in': ['October 7, 2023', 'October 30, 2023']}}
Question: Which company, known for its dominance in the e-reader space and for offering exclusive invite-only deals during sales events, faced a stock decline due to an antitrust lawsuit reported by 'The Sydney Morning Herald' and discussed by sellers in a 'Cnbc | World Business News Leader' article?
Answer: {'source': {'$in': ['The Sydney Morning Herald', 'Cnbc | World Business News Leader']}}

If you detect multiple queries, return the answer for the first. Now it is your
turn:
Question: <query>
Answer:

References

1. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020)
2. Chase, H.: LangChain (Oct 2022), https://github.com/langchain-ai/langchain
3. Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Larson, J.: From local to global: A graph RAG approach to query-focused summarization (2024), https://arxiv.org/abs/2404.16130
4. Ho, X., Duong Nguyen, A.K., Sugawara, S., Aizawa, A.: Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In: Scott, D., Bel, N., Zong, C. (eds.) Proceedings of the 28th International Conference on Computational Linguistics. pp. 6609–6625. International Committee on Computational Linguistics, Barcelona, Spain (Online) (Dec 2020). https://doi.org/10.18653/v1/2020.coling-main.580, https://aclanthology.org/2020.coling-main.580
5. Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions (2023)
6. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9459–9474. Curran Associates, Inc. (2020)
7. Liu, J.: LlamaIndex (Nov 2022). https://doi.org/10.5281/zenodo.1234, https://github.com/jerryjliu/llama_index
8. Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., Grave, E., LeCun, Y., Scialom, T.: Augmented language models: a survey (2023)
9. Neo4j, Inc.: Neo4j graph database, https://neo4j.com/product/neo4j-graph-database
10. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P.F., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 27730–27744. Curran Associates, Inc. (2022)
11. Shuster, K., Poff, S., Chen, M., Kiela, D., Weston, J.: Retrieval augmentation reduces hallucination in conversation. In: Moens, M., Huang, X., Specia, L., Yih, S.W. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021. pp. 3784–3803. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.findings-emnlp.320
12. Tang, Y., Yang, Y.: MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries (2024)
13. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and efficient foundation language models (2023)
14. Voyage AI: Voyage AI cutting-edge embedding and rerankers, https://www.voyageai.com
15. Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., Nie, J.Y.: C-Pack: Packaged resources to advance general Chinese embedding (2024)
16. Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., Manning, C.D.: HotpotQA: A dataset for diverse, explainable multi-hop question answering. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 2369–2380. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov 2018). https://doi.org/10.18653/v1/D18-1259, https://aclanthology.org/D18-1259
