Benchmarking Large Language Models in Retrieval-Augmented Generation

on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation and establish a Retrieval-Augmented Generation Benchmark (RGB) in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.

Figure 1: Illustration of 4 kinds of abilities required for retrieval-augmented generation of LLMs.
Introduction

Recently, there have been impressive advancements in large language models (LLMs) like ChatGPT (OpenAI 2022) and ChatGLM (THUDM 2023a). Although these models have shown remarkable general abilities (Bang et al. 2023; Guo et al. 2023), they still suffer severely from challenges including factual hallucination (Cao et al. 2020; Raunak, Menezes, and Junczys-Dowmunt 2021; Ji et al. 2023), knowledge outdating (He, Zhang, and Roth 2022), and the lack of domain-specific expertise (Li et al. 2023c; Shen et al. 2023).

Incorporating external knowledge via information retrieval, i.e., Retrieval-Augmented Generation (RAG), has been regarded as a promising way to resolve the above challenges (Guu et al. 2020; Lewis et al. 2020; Borgeaud et al. 2022; Izacard et al. 2022). With the help of external knowledge, LLMs can generate more accurate and reliable responses. The most common method is to use a search engine as a retriever, such as New Bing. Due to the vast amount of information available on the Internet, using a search engine can provide more real-time information.

However, Retrieval-Augmented Generation brings not only positive effects to LLMs (Liu, Zhang, and Liang 2023; Maynez et al. 2020). On one hand, there is a significant amount of noisy information, and even fake news, in the content available on the Internet, which poses challenges for search engines in accurately retrieving desirable knowledge. On the other hand, LLMs suffer from the unreliable generation challenge: they can be misled by incorrect information contained in the context (Bian et al. 2023) and also suffer from hallucination during generation (Adlakha et al. 2023), resulting in content that goes beyond the external information. These challenges leave LLMs unable to consistently generate reliable and accurate responses. Unfortunately, there is currently a lack of comprehensive understanding of how these factors influence RAG and how each model can overcome these drawbacks and improve its performance via information retrieval. As a result, there is a pressing need for a comprehensive evaluation of LLMs on their ability to effectively utilize retrieved information, as well as their ability to withstand the various drawbacks present in information retrieval.

To this end, this paper conducts a comprehensive evaluation of RAG for current LLMs. Specifically, we create a new Retrieval-Augmented Generation Benchmark, namely RGB, in both English and Chinese. In order to ensure that the internal knowledge of LLMs does not introduce bias into the evaluation results, RGB aggregates the latest news information and constructs queries based on this news. Then, based on these queries, we use a Search API to fetch relevant documents and select the most relevant snippets as external retrieved documents. Finally, based on different compositions of query and document-set pairs, we expand the corpus and divide it into 4 testbeds to evaluate the following basic abilities of LLMs according to the common challenges in RAG, as shown in Figure 1:

• Noise Robustness, which means an LLM can extract useful information from noisy documents. In this paper, we define noisy documents as those that are relevant to the question but do not contain any information about the answer. For the instance in Figure 1, the noisy documents related to the question “Who was awarded the 2022 Nobel Prize in Literature” include reports about the 2021 Nobel Prize in Literature. To this end, the testbed for noise robustness contains instances whose external documents include a certain number of noisy documents based on the desired noise ratio.

• Negative Rejection, which means that an LLM should decline to answer the question when the required knowledge is not present in any retrieved document. The testbed for negative rejection contains instances whose external documents consist only of noisy documents. LLMs are expected to indicate “insufficient information” or other rejection signals.

• Information Integration, which evaluates whether LLMs can answer complex questions that require integrating information from multiple documents. For the instance in Figure 1, for the question “When were the ChatGPT app for iOS and ChatGPT API launched?”, LLMs are expected to provide the launch dates of both the ChatGPT iOS app and the ChatGPT API. The testbed for information integration contains instances that can only be answered using multiple documents.

• Counterfactual Robustness, which evaluates whether LLMs can identify risks of known factual errors in the retrieved documents when given warnings about potential risks in the retrieved information through instruction. The testbed for counterfactual robustness includes instances that can be answered directly by the LLMs, but whose external documents contain factual errors.

Based on RGB, we conduct an evaluation of 6 state-of-the-art large language models: ChatGPT (OpenAI 2022), ChatGLM-6B (THUDM 2023a), ChatGLM2-6B (THUDM 2023b), Vicuna-7b (Chiang et al. 2023), Qwen-7B-Chat (QwenLM 2023), and BELLE-7B (Yunjie Ji 2023). We found that even though RAG can improve the response accuracy of LLMs, they still suffer significantly from the abovementioned challenges. Specifically, we found that even though LLMs demonstrate some level of noise robustness, they tend to confuse similar information and frequently generate inaccurate answers when relevant information exists. For example, when faced with a question about the 2022 Nobel Prize in Literature, if the external documents contain noisy documents about the 2021 Nobel Prize in Literature, LLMs may become confused and provide inaccurate answers. Besides, LLMs frequently fail to reject answering and generate incorrect answers when none of the external documents contain relevant information. Furthermore, LLMs lack the ability to summarize over multiple documents, and therefore if multiple documents are needed to answer a question, LLMs often fail to provide an accurate answer. Finally, we found that even when LLMs contain the required knowledge and are given warnings about potential risks in the retrieved information through instruction, they still tend to trust and prioritize the retrieved information over their own existing knowledge. The experimental results mentioned above highlight the need for further resolution of important issues in the existing RAG method. Therefore, it is crucial to exercise caution and carefully design its usage. Generally speaking, the contributions of this paper are1:

• We proposed to evaluate four capabilities for retrieval-augmented generation of LLMs and created the Retrieval-Augmented Generation Benchmark in both English and Chinese. To the best of our knowledge, it is the first benchmark designed to assess these four capabilities for retrieval-augmented generation of LLMs.

• We evaluated existing LLMs using RGB and found their limitations in the four different abilities.

• We analyzed the responses of LLMs in RGB, identified their current shortcomings, and suggested directions for improvement.

* Corresponding authors.
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1 Our code&data: https://github.com/chen700564/RGB.

Related work

Retrieval-augmented models The knowledge stored in large language models is commonly out-of-date (He, Zhang, and Roth 2022) and they also sometimes generate hallucinations (Cao et al. 2020; Raunak, Menezes, and Junczys-Dowmunt 2021; Ji et al. 2023), i.e., they may generate irrelevant or factually incorrect contents. By using external knowledge as guidance, retrieval-augmented models can generate more accurate and reliable responses (Guu et al. 2020; Lewis et al. 2020; Borgeaud et al. 2022; Izacard et al. 2022; Shi et al. 2023; Ren et al. 2023). Retrieval-augmented models have achieved remarkable results in various tasks such as open-domain QA (Izacard and Grave 2021; Trivedi et al. 2023; Li et al. 2023a), dialogue (Cai
Figure 2: The process of data generation. Firstly, we use models to extract (event, question, answer) from news articles. Next, we utilize search engines to retrieve relevant web pages. Finally, a dense retrieval model is employed to re-rank the content of these web pages.

et al. 2019a,b; Peng et al. 2023), domain-specific question answering (Cui et al. 2023), and code generation (Zhou et al. 2023b). Recently, with the development of large models, a series of retrieval-enhanced tools and products have gained widespread attention, such as the ChatGPT retrieval plugin, LangChain, New Bing, etc. However, in real-world scenarios, the retrieved text inevitably contains noise. Therefore, in this paper we conduct a systematic evaluation and analysis of retrieval-augmented generation in LLMs.

Evaluation of LLMs Evaluating LLMs has received significant attention due to their remarkable general capability (Chang et al. 2023). It enables us to gain a deeper understanding of the specific abilities and limitations of LLMs, while also providing valuable guidance for future research. In the past, benchmarks such as GLUE (Wang et al. 2019b) and SuperGLUE (Wang et al. 2019a) primarily focused on evaluating NLP tasks, particularly natural language understanding. However, these evaluations often fail to fully capture the capabilities of LLMs. MMLU (Hendrycks et al. 2021) was then proposed to measure the knowledge acquired by language models during pre-training. Recently, with the development of LLMs, a series of general evaluation benchmarks have emerged, such as AGIEval (Zhong et al. 2023), C-Eval (Huang et al. 2023), AlpacaEval (Li et al. 2023b), the OpenLLM Leaderboard (Edward Beeching 2023), etc. In addition to general abilities, there are also specific benchmarks that focus on particular capabilities of models. For example, CValues (Xu et al. 2023a) focuses on the safety and responsibility of LLMs, M3Exam (Zhang et al. 2023) focuses on human exams, and ToolBench (Qin et al. 2023) evaluates how well LLMs use external tools. Recently, Adlakha et al. (2023) evaluated the RAG of LLMs on existing QA datasets. Different from their work, we focus on the 4 required abilities of RAG and create the Retrieval-Augmented Generation Benchmark to evaluate LLMs.

Retrieval-Augmented Generation Benchmark

In this section, we first introduce the specific retrieval-augmented generation abilities we aim to evaluate. Next, we outline the process of constructing the RAG benchmark for evaluation. Lastly, we present the evaluation metrics.

Required abilities of RAG

External knowledge is the key to resolving the problems of LLMs such as hallucination and outdated knowledge, as it can make LLMs generate more accurate and reliable responses through retrieval-augmented generation (RAG). However, LLMs cannot always respond as expected with RAG. For one thing, there are numerous irrelevant documents and false information on the Internet, and incorporating these external documents into LLMs could have a detrimental effect. For another, LLMs suffer from the unreliable generation challenge: the generation of LLMs is often unpredictable, and we cannot guarantee that they will utilize the useful information entailed in the external documents. Additionally, LLMs can easily be misled by incorrect information in the documents. To this end, we build the Retrieval-Augmented Generation Benchmark (RGB) to evaluate the retrieval-augmented generation of LLMs, and we focus on 4 specific abilities:

Noise Robustness is the robustness of LLMs to noisy documents. As retrievers are not perfect, the external knowledge they retrieve often contains a significant amount of noise, i.e., documents which are relevant to the question but do not contain any information about the answer. To effectively answer user questions, LLMs must be able to extract the necessary information from the documents despite the presence of noisy documents.

Negative Rejection is a measure of whether LLMs can decline to answer a question when none of the contexts provide useful information. In real-world situations, the search engine often fails to retrieve documents containing the answer. In these cases, it is important for the model to be able to reject answering and avoid generating misleading content.

Information Integration is the capacity to integrate answers from multiple documents. In many cases, the answer to a question may be contained in multiple documents. For example, for the question “Who are the champions of the U.S. Open 2022 men’s and women’s singles?”, the two champions may be mentioned in different documents. In order to provide better answers to complex questions, it is necessary for LLMs to have the ability to integrate information.

Counterfactual Robustness refers to the capacity to handle errors in external knowledge. In the real world, there is an abundance of false information on the internet. Please note that we only evaluate the situation in which LLMs are given warnings about potential risks in the retrieved information through instruction.

In real-world scenarios, it is not possible to obtain perfect documents with all the necessary external knowledge. Therefore, evaluating these four abilities of the model becomes essential in order to measure the RAG of LLMs.

Figure 3: The unified system instruction for each language.

System instruction (English): “You are an accurate and reliable AI assistant that can answer questions with the help of external documents. Please note that external documents may contain noisy or factually incorrect information. If the information in the document contains the correct answer, you will give an accurate answer. If the information in the document does not contain the answer, you will generate ‘I can not answer the question because of the insufficient information in documents.’ If there are inconsistencies with the facts in some of the documents, please generate the response ‘There are factual errors in the provided documents.’ and provide the correct answer.”

System instruction (Chinese): “你是一个准确和可靠的人工智能助手，能够借助外部文档回答问题，请注意外部文档可能存在噪声事实性错误。如果文档中的信息包含了正确答案，你将进行准确的回答。如果文档中的信息不包含答案，你将生成“文档信息不足，因此我无法基于提供的文档回答该问题。”如果部分文档中存在与事实不一致的错误，请先生成“提供文档的文档存在事实性错误。”，并生成正确答案。”
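In practice, this unified instruction is paired with the external documents and the question to form the model input. A minimal sketch of one plausible assembly (the chat-message format, function name, and document layout are our assumptions, not specified by the paper; the instruction string is truncated here for brevity):

```python
# Sketch: assemble a chat-style input from the unified system
# instruction (Figure 3), external documents, and a user question.
SYSTEM_INSTRUCTION = (
    "You are an accurate and reliable AI assistant that can answer "
    "questions with the help of external documents. Please note that "
    "external documents may contain noisy or factually incorrect "
    "information."  # truncated; the full text is shown in Figure 3
)

def build_messages(documents, question):
    # Concatenate the external documents into the user turn so the
    # model must ground its answer in them.
    doc_block = "\n".join(
        f"Document {i + 1}: {d}" for i, d in enumerate(documents)
    )
    user_turn = f"{doc_block}\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION},
        {"role": "user", "content": user_turn},
    ]
```

The same assembly works for both languages by swapping in the Chinese instruction and documents.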
Table 1: The experimental result of noise robustness measured by accuracy (%) under different noise ratios. We can see that the
increasing noise rate poses a challenge for RAG in LLMs.
Table 2: Error cases of noise robustness, and only one positive document and one negative document are shown. The responses
are generated by ChatGLM2-6B. The blue text indicates the matching parts between the document and the question or answer,
while the red text highlights the non-matching parts.
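The noise ratios in Table 1 denote the fraction of noisy documents among the external documents given to the model. One plausible way to compose such a document set (the function name, rounding rule, and shuffling are our assumptions; the paper only states that instances contain noisy documents according to the desired ratio):

```python
import random

def compose_documents(positive_docs, noisy_docs, noise_ratio, k=5, seed=0):
    """Build a k-document context whose noisy share is `noise_ratio`.

    E.g. noise_ratio=0.8 with k=5 gives 4 noisy + 1 positive document.
    """
    n_noise = round(k * noise_ratio)
    docs = noisy_docs[:n_noise] + positive_docs[: k - n_noise]
    random.Random(seed).shuffle(docs)  # avoid positional bias
    return docs
```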
structions to inform the model.). If the model generates this content, it indicates that the model has detected erroneous information in the document.

Error correction rate measures whether the model can provide the correct answer after identifying errors, for counterfactual robustness. The model is asked to generate the correct answer after identifying the factual errors. If the model generates the correct answer, it indicates that the model is capable of correcting errors in the document.

Considering that the model may not fully adhere to instructions, for the rejection rate and error detection rate we also use ChatGPT to conduct an additional evaluation of the answers. Specifically, we assess the model’s responses by using instructions and demonstrations to determine whether they reflect information that is not present in the document or identify any factual errors.

Experiments

In this section, we evaluate the performance of various LLMs, analyze and discuss the results, and summarize the main challenges that existing LLMs encounter when using external knowledge.

Settings

Task formats. Due to context-length limitations, we provide 5 external documents for each question. In our experiments on noise robustness, we evaluate scenarios with noise ratios ranging from 0 to 0.8. To comprehensively evaluate the overall capabilities, we adopt a unified instruction for each language, as shown in Figure 3. The experiments were conducted using an NVIDIA GeForce RTX 3090.

Models. We conduct evaluation on 6 state-of-the-art large language models which can generate both English and Chinese: ChatGPT (OpenAI 2022)3, ChatGLM-6B (THUDM 2023a), ChatGLM2-6B (THUDM 2023b), Vicuna-7b-v1.3 (Chiang et al. 2023), Qwen-7B-Chat (QwenLM 2023), and BELLE-7B-2M (Yunjie Ji 2023).

3 We use the gpt-3.5-turbo api in the experiments.

Results on Noise Robustness

We evaluated the accuracy under different noise ratios in the external documents, and the results are shown in Table 1. We can see that:

(1) RAG can effectively improve the responses of LLMs. LLMs have shown strong performance even in the presence of noise, indicating that RAG is a promising way for LLMs to generate accurate and reliable responses.

(2) The increasing noise rate poses a challenge for RAG in LLMs. Specifically, when the noise ratio exceeds 80%, the accuracy decreases significantly at a significance level of 0.05. For example, the performance of ChatGPT decreased from 96.33% to 76.00%, while the performance of ChatGLM2-6B decreased from 91.33% to 57.33%.

Error Analysis. To better comprehend the negative impact of noise on model generation, we examined the incorrect answers and found that these errors typically originate from three causes, as shown in Table 2.

(1) Long-distance information. LLMs often face difficulty in identifying the correct answer from external documents when the information related to the question is distant from the information related to the answer. This scenario is quite common, as longer texts are frequently encountered
on the internet. In such cases, it is typical for the question’s information to be presented at the start of the document and subsequently referred to using pronouns. In Table 2, the question information (“Qatar Open 2022”) is only mentioned once at the beginning and is far from where the answer text (“Anett Kontaveit”) appears. This situation may cause LLMs to depend on information from other documents and create false impressions, i.e., hallucination.

(2) Evidence uncertainty. Before highly anticipated events, like the release of new Apple products or the announcement of the Oscars, there is often a significant amount of speculative information circulating on the internet. Although the relevant documents explicitly state that it is uncertain or speculative content, it can still impact the retrieval-augmented generation of LLMs. In Table 2, when the noise ratio increases, the content of the erroneous documents is all about some people’s predictions about the name of the headset (“Apple Reality Pro”). Even if there is a correct answer (“Vision Pro”) in the relevant documents, LLMs can still be misled by uncertain evidence.

(3) Concept confusion. The concepts in external documents may be similar to, but different from, the concepts in the question. This can cause confusion for LLMs and make LLMs generate incorrect answers. In Table 2, the model’s answer focuses on the concept “automotive revenue” in the document rather than “revenue” in the question.

Languages        English          Chinese
                 Rej     Rej∗     Rej     Rej∗
ChatGPT          24.67   45.00    5.33    43.33
ChatGLM-6B       9.00    25.00    6.33    17.00
ChatGLM2-6B      10.33   41.33    6.33    36.33
Vicuna-7B-v1.3   17.00   33.33    3.37    24.67
Qwen-7B-Chat     31.00   35.67    8.67    25.33
BELLE-7B-2M      5.67    32.33    5.33    13.67

Table 3: The result of negative rejection. Rej means the rejection rate (%) and Rej∗ means the rejection rate evaluated by ChatGPT. We can see that negative rejection poses a challenge for RAG in LLMs.

Question: Who will direct the Irredeemable film?
Answer: Jeymes Samuel
Response: The information provided in the document is that Fox has attached “The Big Short” writer-director Adam McKay to its movie adaptation of “Irredeemable” from Boom! Comics. Therefore, the answer is Adam McKay.

Question: Which country won the most medals at the 2022 Winter Olympics?
Answer: Norway
Response: Based on the information provided in the document, the country that won the most medals at the 2022 Winter Olympics is Germany. Germany has won a total of 31 medals, including 14 gold medals, 14 bronze medals, and 31 silver medals.

Table 4: Error cases of negative rejection generated by ChatGLM2-6B. The red text highlights the error answers.
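The Rej column in Table 3 is computed by exact matching against the rejection reply specified in the unified instruction (Figure 3); a minimal sketch (counting substring containment is our assumption about the matching granularity):

```python
# The rejection reply instructed in Figure 3; responses containing it
# are counted as rejections.
REJECTION_REPLY = ("I can not answer the question because of the "
                   "insufficient information in documents.")

def rejection_rate(responses):
    """Rej: percentage of responses containing the rejection reply."""
    hits = sum(REJECTION_REPLY in r for r in responses)
    return 100.0 * hits / len(responses)
```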
Based on the analysis above, we have identified certain limitations of LLMs regarding retrieval-augmented generation. To effectively handle the vast amount of noise present on the internet, further detailed enhancements are required for the model, such as long-document modeling and precise concept comprehension.

Results on Negative Rejection testbed

We evaluated the rejection rate when only noisy documents were provided. The results are shown in Table 3. In addition to evaluating the rejection rate through exact matching (Rej in Table 3), we also utilize ChatGPT to determine whether the responses from the LLMs contain any rejection information (Rej∗ in Table 3). We can see that: Negative rejection poses a challenge for RAG in LLMs. The highest rejection rates for LLMs in English and Chinese were only 45% and 43.33%, respectively. This suggests that LLMs can be easily misled by noisy documents, leading to incorrect answers.

In addition, through comparing Rej and Rej∗, we found that LLMs fail to strictly follow instructions, and they often generate unpredictable responses, which makes it hard to use them as state triggers (such as for recognizing rejection).

We conduct case studies in Table 4. The first error is due to evidence uncertainty: although the document only mentions contact with “Adam McKay” and does not explicitly state that he is the director of the movie, the model still concludes that he holds this role. The second error is due to concept confusion: the information provided in the answer pertains to “the 2018 Winter Olympics” instead of “the 2022 Winter Olympics” mentioned in the question. Retrieval-augmented generation poses a greater challenge for negative rejection than answering directly, as it presents relevant documents that could potentially mislead the LLMs and result in incorrect responses. In future developments, it will be crucial for LLMs to enhance their ability to accurately match questions with the appropriate documents.

Results on Information Integration testbed

We evaluated the accuracy under different noise ratios in the external documents, and the results are shown in Table 5. Comparing with Table 1, we observe that the models have weak information integration ability, which in turn affects their noise robustness. We can see that:

(1) Information integration poses a challenge for RAG in LLMs. Even without noise, the highest accuracy of LLMs can only reach 60% and 67% for English and Chinese, respectively. After adding noise, the highest accuracy decreases to 43% and 55%. These results suggest that LLMs struggle with integrating information effectively and are not well-suited for directly answering complex questions.

(2) Complex questions are more challenging for RAG with noisy documents. The performance decline becomes significant when the noise ratio is 0.4, but for simple problems, a significant decline occurs only at a noise ratio of 0.8 at a significance level of 0.05. This indicates that complex problems are more vulnerable to interference from noise. We speculate that this is because solving complex problems requires integrating information from multiple documents, and this information can be considered as noise to each other, making it harder for the model to extract relevant information from the documents.

Error Analysis. We conducted an error analysis on ChatGLM2-6B (noise ratio is 0). Apart from errors similar to those found in the noise robustness experiment (38% of the total), there are also three types of unique errors. We have presented these cases in Table 6.
                 English           Chinese
Noise Ratio      0     0.2   0.4   0     0.2   0.4
ChatGPT          55    51    34    63    58    47
ChatGLM-6B       45    36    35    60    53    52
ChatGLM2-6B      34    32    21    44    43    32
Vicuna-7B-v1.3   60    53    43    43    36    25
Qwen-7B-Chat     55    50    37    67    56    55
BELLE-7B-2M      40    34    24    49    41    38

Table 5: The experimental result of information integration measured by accuracy (%) under different noise ratios. We can see that information integration poses a challenge for RAG in LLMs.

                 Acc   Accdoc   ED   ED∗   CR
ChatGPT-zh       91    17       1    3     33.33
Qwen-7B-Chat-zh  77    12       5    4     25.00
ChatGPT-en       89    9        8    7     57.14

Table 7: The result of counterfactual robustness. Acc is the accuracy (%) of LLMs without external documents. Accdoc is the accuracy (%) of LLMs with counterfactual documents. ED and ED∗ are the error detection rates evaluated by exact matching and ChatGPT, respectively. CR is the error correction rate.
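The ED and CR columns above can be computed from model responses by exact matching, following the metric definitions given earlier; a sketch (function names are ours, and computing CR only over responses that flag an error is our reading of the definition):

```python
# The error-flag reply instructed in Figure 3.
ERROR_REPLY = "There are factual errors in the provided documents."

def error_detection_rate(responses):
    """ED: percentage of responses that flag factual errors."""
    return 100.0 * sum(ERROR_REPLY in r for r in responses) / len(responses)

def error_correction_rate(responses, answers):
    """CR: among responses that flag an error, the percentage that
    also contain the correct answer."""
    flagged = [(r, a) for r, a in zip(responses, answers) if ERROR_REPLY in r]
    if not flagged:
        return 0.0
    return 100.0 * sum(a in r for r, a in flagged) / len(flagged)
```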