
Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen1,3, Hongyu Lin1,*, Xianpei Han1,2,*, Le Sun1,2

1 Chinese Information Processing Laboratory  2 State Key Laboratory of Computer Science
Institute of Software, Chinese Academy of Sciences, Beijing, China
3 University of Chinese Academy of Sciences, Beijing, China
{jiawei2020,hongyu,xianpei,sunle}@iscas.ac.cn
* Corresponding authors.

arXiv:2309.01431v2 [cs.CL] 20 Dec 2023
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.

[Figure 1: Illustration of 4 kinds of abilities required for retrieval-augmented generation of LLMs. The four panels (Noise Robustness, Negative Rejection, Information Integration, Counterfactual Robustness) each show a question, such as "Who was awarded the 2022 Nobel prize in literature?" or "When were the ChatGPT app for iOS and ChatGPT api launched?", the external documents supplied to the model, and the expected retrieval-augmented response, e.g., "Annie Ernaux", "I can not answer the question because of the insufficient information in documents", "May 18 and March 1", or "There are factual errors in the provided documents. The answer should be Athens."]

Introduction

Recently, there have been impressive advancements in large language models (LLMs) like ChatGPT (OpenAI 2022) and ChatGLM (THUDM 2023a). Although these models have shown remarkable general abilities (Bang et al. 2023; Guo et al. 2023), they still suffer severely from challenges including factual hallucination (Cao et al. 2020; Raunak, Menezes, and Junczys-Dowmunt 2021; Ji et al. 2023), knowledge outdating (He, Zhang, and Roth 2022), and the lack of domain-specific expertise (Li et al. 2023c; Shen et al. 2023).

Incorporating external knowledge via information retrieval, i.e., Retrieval-Augmented Generation (RAG), has been regarded as a promising way to resolve the above challenges (Guu et al. 2020; Lewis et al. 2020; Borgeaud et al. 2022; Izacard et al. 2022). With the help of external knowledge, LLMs can generate more accurate and reliable responses. The most common method is to use a search engine as the retriever, such as New Bing. Due to the vast amount of information available on the Internet, using a search engine can also provide more real-time information.

However, Retrieval-Augmented Generation brings not only positive effects to LLMs (Liu, Zhang, and Liang 2023; Maynez et al. 2020). On one hand, there is a significant amount of noisy information, and even fake news, in the content available on the Internet, which poses challenges for search engines in accurately retrieving desirable knowledge. On the other hand, LLMs suffer from an unreliable generation challenge: they can be misled by incorrect information contained in the context (Bian et al. 2023) and also suffer from hallucination during generation (Adlakha et al. 2023), resulting in content that goes beyond the external information.
These challenges leave LLMs unable to consistently generate reliable and accurate responses. Unfortunately, there is currently no comprehensive understanding of how these factors influence RAG, or of how each model can survive these drawbacks and improve its performance via information retrieval. As a result, there is a pressing need for a comprehensive evaluation of LLMs on their ability to effectively utilize retrieved information, as well as their ability to withstand the various drawbacks present in information retrieval.

To this end, this paper conducts a comprehensive evaluation of RAG for current LLMs. Specifically, we create a new Retrieval-Augmented Generation Benchmark, namely RGB, in both English and Chinese. In order to ensure that the internal knowledge of LLMs does not introduce bias into the evaluation results, RGB aggregates the latest news information and constructs queries based on that news. Then, based on these queries, we use a search API to fetch relevant documents and select the most relevant snippets from their content as the external retrieved documents. Finally, based on different compositions of query and document-set pairs, we expand the corpus and divide it into 4 testbeds to evaluate the following basic abilities of LLMs according to the common challenges in RAG, as shown in Figure 1:

• Noise Robustness, which means an LLM can extract useful information from noisy documents. In this paper, we define noisy documents as those that are relevant to the question but do not contain any information about the answer. For the instance in Figure 1, the noisy documents related to the question "Who was awarded the 2022 Nobel Prize in Literature" include reports about the 2021 Nobel Prize in Literature. To this end, the testbed for noise robustness contains instances whose external documents include a certain number of noisy documents based on the desired noise ratio.

• Negative Rejection, which means that an LLM should refuse to answer the question when the required knowledge is not present in any retrieved document. The testbed for negative rejection contains instances whose external documents consist only of noisy documents. LLMs are expected to indicate "insufficient information" or other rejection signals.

• Information Integration, which evaluates whether LLMs can answer complex questions that require integrating information from multiple documents. For the instance in Figure 1, for the question "When were the ChatGPT app for iOS and ChatGPT api launched?", LLMs are expected to provide the launch dates of both the ChatGPT iOS app and the ChatGPT API. The testbed for information integration contains instances that can only be answered using multiple documents.

• Counterfactual Robustness, which evaluates whether LLMs can identify risks of known factual errors in the retrieved documents when the LLMs are given warnings about potential risks in the retrieved information through instruction. The testbed for counterfactual robustness includes instances that can be answered directly by the LLMs, but whose external documents contain factual errors.

Based on RGB, we conduct evaluation on 6 state-of-the-art large language models including ChatGPT (OpenAI 2022), ChatGLM-6B (THUDM 2023a), ChatGLM2-6B (THUDM 2023b), Vicuna-7B (Chiang et al. 2023), Qwen-7B-Chat (QwenLM 2023), and BELLE-7B (Yunjie Ji 2023). We found that even though RAG can improve the response accuracy of LLMs, they still suffer significantly from the above-mentioned challenges. Specifically, we found that even though LLMs demonstrate some level of noise robustness, they tend to confuse similar information and frequently generate inaccurate answers even when relevant information exists. For example, when faced with a question about the 2022 Nobel Prize in Literature, if there are noisy documents about the 2021 Nobel Prize in Literature among the external documents, LLMs may become confused and provide inaccurate answers. Besides, LLMs frequently fail to reject answering and generate incorrect answers when none of the external documents contain relevant information. Furthermore, LLMs lack the ability to summarize from multiple documents, and therefore if multiple documents are needed to answer a question, LLMs often fail to provide an accurate answer. Finally, we found that even when the LLMs contain the required knowledge and are given warnings about potential risks in the retrieved information through instruction, they still tend to trust and prioritize the retrieved information over their own existing knowledge. The experimental results mentioned above highlight the need to resolve these important issues in the existing RAG approach. Therefore, it is crucial to exercise caution and carefully design its usage.

Generally speaking, the contributions of this paper are (our code & data: https://github.com/chen700564/RGB):

• We proposed to evaluate four capabilities for retrieval-augmented generation of LLMs and created the Retrieval-Augmented Generation Benchmark in both English and Chinese. To the best of our knowledge, it is the first benchmark designed to assess these four capabilities for retrieval-augmented generation of LLMs.
• We evaluated the existing LLMs using RGB and found their limitations in the four different abilities.
• We analyzed the responses of LLMs in RGB and identified their current shortcomings as well as suggested directions for improvement.

Related work

Retrieval-augmented models. The knowledge stored in large language models is commonly out-of-date (He, Zhang, and Roth 2022) and they also sometimes generate hallucinations (Cao et al. 2020; Raunak, Menezes, and Junczys-Dowmunt 2021; Ji et al. 2023), i.e., they may generate irrelevant or factually incorrect content. By using external knowledge as guidance, retrieval-augmented models can generate more accurate and reliable responses (Guu et al. 2020; Lewis et al. 2020; Borgeaud et al. 2022; Izacard et al. 2022; Shi et al. 2023; Ren et al. 2023). Retrieval-augmented models have achieved remarkable results in various tasks such as open-domain QA (Izacard and Grave 2021; Trivedi et al. 2023; Li et al. 2023a), dialogue (Cai et al. 2019a,b; Peng et al. 2023), domain-specific question answering (Cui et al. 2023) and code generation (Zhou et al. 2023b).
Recently, with the development of large models, a series of retrieval-enhanced tools and products have gained widespread attention, such as the ChatGPT retrieval plugin, LangChain, New Bing, etc. However, in real-world scenarios, the retrieved text inevitably contains noise. Therefore, in this paper we conduct a systematic evaluation and analysis of retrieval-augmented generation in LLMs.

Evaluation of LLMs. Evaluating LLMs has received significant attention due to their remarkable general capability (Chang et al. 2023). It enables us to gain a deeper understanding of the specific abilities and limitations of LLMs, while also providing valuable guidance for future research. In the past, benchmarks such as GLUE (Wang et al. 2019b) and SuperGLUE (Wang et al. 2019a) primarily focused on evaluating NLP tasks, particularly natural language understanding. However, these evaluations often fail to fully capture the capabilities of LLMs. MMLU (Hendrycks et al. 2021) was then proposed to measure the knowledge acquired by language models during pre-training. Recently, with the development of LLMs, a series of general evaluation benchmarks have emerged, such as AGIEval (Zhong et al. 2023), C-Eval (Huang et al. 2023), AlpacaEval (Li et al. 2023b), the OpenLLM Leaderboard (Edward Beeching 2023), etc. In addition to general abilities, there are also specific benchmarks that focus on particular capabilities of models. For example, CValues (Xu et al. 2023a) focuses on the safety and responsibility of LLMs, M3Exam (Zhang et al. 2023) focuses on human exams, and ToolBench (Qin et al. 2023) evaluates how well LLMs use external tools. Recently, Adlakha et al. (2023) evaluated the RAG of LLMs on existing QA datasets. Different from their work, we focus on 4 required abilities of RAG and create the Retrieval-Augmented Generation Benchmark to evaluate the LLMs.

Retrieval-Augmented Generation Benchmark

In this section, we first introduce the specific retrieval-augmented generation abilities we aim to evaluate. Next, we outline the process of constructing the RAG benchmark for evaluation. Lastly, we present the evaluation metrics.

Required abilities of RAG

External knowledge is the key to resolving the problems of LLMs such as hallucination and outdated knowledge, and it can make LLMs generate more accurate and reliable responses through retrieval-augmented generation (RAG). However, LLMs cannot always respond as expected with RAG. For one thing, there are numerous irrelevant documents and much false information on the Internet, and incorporating these external documents into LLMs could have a detrimental effect. For another, LLMs suffer from an unreliable generation challenge: the generation of LLMs is often unpredictable, and we cannot guarantee that they will utilize the useful information entailed in the external documents. Additionally, LLMs can easily be misled by incorrect information in the documents. To this end, we build the Retrieval-Augmented Generation Benchmark (RGB) to evaluate the retrieval-augmented generation of LLMs, and we focus on 4 specific abilities:

Noise Robustness is the robustness of LLMs to noisy documents. As retrievers are not perfect, the external knowledge they retrieve often contains a significant amount of noise, i.e., documents which are relevant to the question but do not contain any information about the answer. To effectively answer user questions, LLMs must be able to extract the necessary information from the documents despite the presence of noisy documents.

Negative Rejection is a measure of whether LLMs can decline to answer a question when none of the provided contexts contain useful information. In real-world situations, the search engine often fails to retrieve documents containing the answer. In these cases, it is important for the model to have the capability to recognize this, reject answering, and avoid generating misleading content.

Information Integration is the capacity to integrate answers from multiple documents. In many cases, the answer to a question may be contained in multiple documents. For example, for the question "Who are the champions of the U.S. Open 2022 men's and women's singles?", the two champions may be mentioned in different documents. In order to provide better answers to complex questions, it is necessary for LLMs to have the ability to integrate information.

Counterfactual Robustness refers to the capacity to handle errors in external knowledge. In the real world, there is an abundance of false information on the internet. Please note that we only evaluate the situation in which LLMs are given warnings about potential risks in the retrieved information through instruction.

In real-world scenarios, it is not possible to obtain perfect documents with all the necessary external knowledge. Therefore, evaluating these four abilities of the model becomes essential in order to measure the RAG of LLMs.
Data construction

Inspired by previous benchmarks for LLMs, RGB utilizes a question-answering format for evaluation. We evaluate the LLMs by judging their retrieval-augmented responses to the questions. To simulate real-world scenarios, we construct the question-and-answer data using actual news articles. Because of the abundance of knowledge contained within LLMs, there is a potential for bias when measuring the first three abilities. To mitigate this, the instances of RGB are constructed from the latest news articles. Additionally, we retrieve external documents from the Internet through search engines. Finally, we expand the corpus and divide it into 4 testbeds to evaluate the above basic abilities of LLMs. The overall procedure of our data construction is illustrated in Figure 2.

[Figure 2: The process of data generation. Firstly, we use models to extract (event, question, answer) from news articles. Next, we utilize search engines to retrieve relevant web pages. Finally, a dense retrieval model is employed to re-rank the content of these web pages. The figure traces one example: a news article ("The 2022 Nobel Prize for Physiology and Medicine was awarded on Monday to Swedish scientist Svante Pääbo for sequencing the genome of the Neanderthal.") is passed to the gpt-3.5-turbo API with the generation prompt "We simulate the process of a user querying and obtaining information. Suppose the user retrieves a current event news, speculate the event that the user is concerned about and the question that he/she may want to know, and generate the key information corresponding to the answer to the question. … News: …", which yields the related event "2022 Nobel Prize for Physiology and Medicine", the question "Who was awarded the 2022 Nobel Prize for Physiology and Medicine?", and the key information "Svante Pääbo and Svante Paabo". After data adjustment and filtering by humans ({"Question": "Who was awarded the 2022 Nobel Prize for Physiology and Medicine?", "Answer": ['Svante Pääbo','Svante Paabo']}), the query is sent to the Google Search API, whose results (e.g., {"link": "https://www.nobelprize.org/prizes/medicine/", "title": "The Nobel Prize in Physiology or Medicine 2022", "snippet": "The Nobel Assembly..."}) are split into chunks and re-ranked by a dense retrieval model to obtain the Top-1 … Top-30 chunks.]

QA instances generation. We first collect the latest news articles and use prompts to make ChatGPT generate events, questions, and answers for each article. For example, as shown in Figure 2, for a report about "The 2022 Nobel Prize", ChatGPT generates the corresponding event and question and provides the key information for answering it. By generating events, the model is able to preliminarily filter out news articles that do not contain any events. After generation, we manually check the answers and filter out data that is difficult to retrieve through search engines.
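As a rough illustration of this generation step, the Python sketch below is our own reconstruction (not the authors' released code); it assumes the legacy openai<1.0 Python client and reuses the generation prompt quoted in Figure 2 to ask gpt-3.5-turbo for an (event, question, key information) triple for one news article.

# Hypothetical sketch of the QA-instance generation step (not the authors' code).
# Assumes the legacy openai<1.0 Python client: pip install "openai<1".
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Generation prompt quoted in Figure 2.
GEN_PROMPT = (
    "We simulate the process of a user querying and obtaining information. "
    "Suppose the user retrieves a current event news, speculate the event that the user "
    "is concerned about and the question that he/she may want to know, and generate the "
    "key information corresponding to the answer to the question.\n"
    "News: {news}"
)

def generate_qa_instance(news_article: str) -> str:
    """Ask gpt-3.5-turbo for an (event, question, key information) triple for one article."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": GEN_PROMPT.format(news=news_article)}],
        temperature=0.0,
    )
    return response["choices"][0]["message"]["content"]

if __name__ == "__main__":
    article = ("The 2022 Nobel Prize for Physiology and Medicine was awarded on Monday to "
               "Swedish scientist Svante Pääbo for sequencing the genome of the Neanderthal.")
    print(generate_qa_instance(article))

The raw model output would still need the manual checking and filtering described above before it becomes a benchmark instance.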
Retrieve using search engine. For each query, we use Google's API to fetch 10 relevant web pages and extract the corresponding snippets of text from them. Simultaneously, we read these web pages and convert their textual content into text chunks with a maximum length of 300 tokens. Using an existing dense retrieval model (Chinese: https://huggingface.co/moka-ai/m3e-base; English: https://huggingface.co/sentence-transformers/all-mpnet-base-v2), we select the top-30 text chunks that match the query most effectively. These retrieved text chunks, along with the snippets provided by the search API, serve as our external documents. These documents are divided into positive documents and negative documents based on whether they contain the answer.
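The chunking and re-ranking step can be sketched as follows. This is an illustrative reconstruction rather than the released RGB code; it assumes the sentence-transformers package with the English all-mpnet-base-v2 model named above, and it approximates the 300-token limit with a simple word-window split.

# Illustrative sketch of chunking and dense re-ranking (not the released RGB code).
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# English dense retrieval model named in the text; m3e-base is used for Chinese.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def chunk_text(text: str, max_tokens: int = 300) -> list[str]:
    """Crude approximation of the 300-token chunking using whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def top_k_chunks(query: str, pages: list[str], k: int = 30) -> list[str]:
    """Re-rank all chunks from the fetched pages and keep the k chunks most similar to the query."""
    chunks = [c for page in pages for c in chunk_text(page)]
    query_emb = model.encode(query, convert_to_tensor=True)
    chunk_emb = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_emb)[0]
    ranked = scores.argsort(descending=True)[:k]
    return [chunks[i] for i in ranked.tolist()]

In practice a proper tokenizer would be used for the 300-token limit, but the re-ranking logic (encode query and chunks, score by cosine similarity, keep the top 30) follows the procedure described above.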
Testbeds construction for each ability. We expand the corpus and divide it into 4 testbeds to evaluate the above basic abilities of LLMs. To evaluate noise robustness, we sample varying numbers of negative documents according to the desired noise ratio. For negative rejection, all the external documents are sampled from the negative documents. For the information integration ability, we further construct data based on the above generated questions. This involves expanding or rewriting these questions so that their answers encompass multiple aspects. For example, the question "Who won the MVP of Super Bowl 2023?" can be rewritten as "Who won the MVPs of Super Bowl 2022 and 2023?". Consequently, answering such questions requires utilizing information from multiple documents. Different from the first three abilities, the data for counterfactual robustness is constructed solely from the internal knowledge of the model. Based on the generated questions mentioned above, we adopt ChatGPT to automatically generate its known knowledge: specifically, we use prompts to let the model generate both questions and answers that it already knows. For example, based on the question "Who was awarded the 2022 Nobel Prize for Physiology and Medicine?", the model generates the known question "Who was awarded the 2021 Nobel Prize in Literature?" and the answer "Abdulrazak Gurnah". We then manually verify the generated answers and retrieve relevant documents as described above. In order to make the documents contain factual errors, we manually modify the answers and replace the corresponding parts in the documents.

Finally, we collect a total of 600 base questions in RGB, plus 200 additional questions for the information integration ability and 200 additional questions for the counterfactual robustness ability. Half of the instances are in English, and the other half are in Chinese.
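To make the noise-ratio sampling concrete, the following sketch is our own illustration with hypothetical field names (not the actual RGB data schema); it assembles the external documents for one noise-robustness instance at a desired noise ratio, and shows the all-negative variant used for negative rejection.

# Hypothetical sketch of testbed instance construction (field names are ours, not the RGB schema).
import random

def build_noise_instance(question: str,
                         answers: list[str],
                         positive_docs: list[str],
                         negative_docs: list[str],
                         num_docs: int = 5,
                         noise_ratio: float = 0.4,
                         seed: int = 0) -> dict:
    """Mix positive and negative documents so that `noise_ratio` of them are noise."""
    rng = random.Random(seed)
    num_noise = int(num_docs * noise_ratio)
    docs = (rng.sample(positive_docs, num_docs - num_noise)
            + rng.sample(negative_docs, num_noise))
    rng.shuffle(docs)
    return {"question": question, "answers": answers, "docs": docs}

def build_rejection_instance(question: str,
                             answers: list[str],
                             negative_docs: list[str],
                             num_docs: int = 5,
                             seed: int = 0) -> dict:
    """Negative rejection: every external document is drawn from the negative pool."""
    rng = random.Random(seed)
    return {"question": question, "answers": answers,
            "docs": rng.sample(negative_docs, num_docs)}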
Evaluation metrics

The core of this benchmark is to evaluate whether LLMs can utilize the provided external documents to acquire knowledge and generate reasonable answers. We evaluate the responses of LLMs in order to measure the above-mentioned four abilities.

Accuracy is used to measure noise robustness and information integration. We employ an exact-matching approach: if the generated text contains an exact match to the answer, it is considered a correct answer.

Rejection rate is used to measure negative rejection. When only noisy documents are provided, the LLM should output the specific content "I can not answer the question because of the insufficient information in documents." (we use instructions to inform the model). If the model generates this content, it indicates a successful rejection.

Error detection rate measures whether the model can detect the factual errors in the documents, for counterfactual robustness. When the provided documents contain factual errors, the model should output the specific content "There are factual errors in the provided documents." (we use instructions to inform the model). If the model generates this content, it indicates that the model has detected the erroneous information in the documents.

Error correction rate measures whether the model can provide the correct answer after identifying errors, for counterfactual robustness. The model is asked to generate the correct answer after identifying the factual errors. If the model generates the correct answer, it indicates that the model is capable of correcting errors in the documents.

Considering that the model may not fully adhere to instructions, for the rejection rate and the error detection rate we also use ChatGPT to conduct an additional evaluation of the answers. Specifically, we assess the model's responses by using instructions and demonstrations to determine whether they indicate that the information is not present in the documents or identify any factual errors.
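A minimal sketch of how these string-based metrics could be computed from model responses is given below. It is our own illustration, assuming simple case-insensitive substring matching; the normalization details of the released evaluation script may differ.

# Minimal sketch of the exact-match style metrics (our illustration; the released
# evaluation script may normalize text differently).
REJECTION_TEXT = "I can not answer the question because of the insufficient information in documents."
ERROR_TEXT = "There are factual errors in the provided documents."

def is_correct(response: str, answers: list[str]) -> bool:
    """Accuracy: the response contains an exact match to any gold answer alias."""
    return any(ans.lower() in response.lower() for ans in answers)

def accuracy(responses: list[str], answer_sets: list[list[str]]) -> float:
    hits = sum(is_correct(r, a) for r, a in zip(responses, answer_sets))
    return 100.0 * hits / len(responses)

def rejection_rate(responses: list[str]) -> float:
    """Negative rejection: the response contains the required rejection sentence."""
    return 100.0 * sum(REJECTION_TEXT in r for r in responses) / len(responses)

def error_detection_rate(responses: list[str]) -> float:
    """Counterfactual robustness: the response contains the required error-warning sentence."""
    return 100.0 * sum(ERROR_TEXT in r for r in responses) / len(responses)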

Experiments

In this section, we evaluate the performance of various LLMs, analyze and discuss the results, and summarize the main challenges that existing LLMs encounter when using external knowledge.

Settings

Task formats. Due to context length limitations, we provide 5 external documents for each question. In our experiments on noise robustness, we evaluate scenarios with noise ratios ranging from 0 to 0.8. To comprehensively evaluate the overall capabilities, we adopt a unified instruction for each language, as shown in Figure 3. The experiments were conducted using an NVIDIA GeForce RTX 3090.

[Figure 3: The instructions used in our experiments, which include a system instruction followed by a user input instruction. The "{DOCS}" and "{QUERY}" will be replaced by the external documents and the question. English system instruction: "You are an accurate and reliable AI assistant that can answer questions with the help of external documents. Please note that external documents may contain noisy or factually incorrect information. If the information in the document contains the correct answer, you will give an accurate answer. If the information in the document does not contain the answer, you will generate 'I can not answer the question because of the insufficient information in documents.' If there are inconsistencies with the facts in some of the documents, please generate the response 'There are factual errors in the provided documents.' and provide the correct answer." English user input instruction: "Document:\n{DOCS} \n\nQuestion:\n{QUERY}". A parallel Chinese system instruction and user input ("文档：\n{DOCS} \n\n问题：\n{QUERY}") with the same content are used for the Chinese instances.]

Models. We conduct the evaluation on 6 state-of-the-art large language models which can generate both English and Chinese: ChatGPT (OpenAI 2022; we use the gpt-3.5-turbo API in the experiments), ChatGLM-6B (THUDM 2023a), ChatGLM2-6B (THUDM 2023b), Vicuna-7B-v1.3 (Chiang et al. 2023), Qwen-7B-Chat (QwenLM 2023), and BELLE-7B-2M (Yunjie Ji 2023).
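To make the task format concrete, here is a small sketch (ours, assuming documents are simply concatenated one per line; the authors' exact document formatting may differ) of how the Figure 3 template could be filled for one instance before it is sent to a chat model.

# Sketch of assembling the evaluation prompt from the Figure 3 template (our illustration;
# the exact document formatting used by the authors may differ).
SYSTEM_INSTRUCTION = (
    "You are an accurate and reliable AI assistant that can answer questions with the help of "
    "external documents. Please note that external documents may contain noisy or factually "
    "incorrect information. If the information in the document contains the correct answer, you "
    "will give an accurate answer. If the information in the document does not contain the answer, "
    "you will generate 'I can not answer the question because of the insufficient information in "
    "documents.' If there are inconsistencies with the facts in some of the documents, please "
    "generate the response 'There are factual errors in the provided documents.' and provide the "
    "correct answer."
)
USER_TEMPLATE = "Document:\n{DOCS} \n\nQuestion:\n{QUERY}"

def build_messages(docs: list[str], query: str) -> list[dict]:
    """Fill the {DOCS} and {QUERY} placeholders; documents are concatenated one per line."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION},
        {"role": "user", "content": USER_TEMPLATE.format(DOCS="\n".join(docs), QUERY=query)},
    ]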
Results on Noise Robustness

We evaluated the accuracy under different noise ratios in the external documents, and the results are shown in Table 1. We can see that:

(1) RAG can effectively improve the responses of LLMs. LLMs show strong performance even in the presence of noise, indicating that RAG is a promising way for LLMs to generate accurate and reliable responses.

(2) The increasing noise ratio poses a challenge for RAG in LLMs. Specifically, when the noise ratio exceeds 80%, the accuracy decreases significantly at a significance level of 0.05. For example, the performance of ChatGPT decreases from 96.33% to 76.00%, while the performance of ChatGLM2-6B decreases from 91.33% to 57.33%.

Table 1: The experimental result of noise robustness measured by accuracy (%) under different noise ratios. We can see that the increasing noise ratio poses a challenge for RAG in LLMs.

                                        English                             Chinese
Noise Ratio                     0      0.2    0.4    0.6    0.8      0      0.2    0.4    0.6    0.8
ChatGPT (OpenAI 2022)         96.33  94.67  94.00  90.00  76.00    95.67  94.67  91.00  87.67  70.67
ChatGLM-6B (THUDM 2023a)      93.67  90.67  89.33  84.67  70.67    94.33  90.67  89.00  82.33  69.00
ChatGLM2-6B (THUDM 2023b)     91.33  89.67  83.00  77.33  57.33    86.67  82.33  76.67  72.33  54.00
Vicuna-7B-v1.3 (Chiang et al. 2023)  87.67  83.33  86.00  82.33  60.33    85.67  82.67  77.00  69.33  49.67
Qwen-7B-Chat (QwenLM 2023)    94.33  91.67  91.00  87.67  73.67    94.00  92.33  88.00  84.33  68.67
BELLE-7B-2M (Yunjie Ji 2023)  83.33  81.00  79.00  71.33  64.67    92.00  88.67  85.33  78.33  67.68

Error Analysis. To better comprehend the negative impact of noise on model generation, we examined the incorrect answers and found that these errors typically originate from three causes, as shown in Table 2.

(1) Long-distance information. LLMs often face difficulty in identifying the correct answer from external documents when the information related to the question is distant from the information related to the answer. This scenario is quite common, as longer texts are frequently encountered on the internet. In such cases, it is typical for the question's information to be presented at the start of the document and subsequently referred to using pronouns. In Table 2, the question information ("Qatar Open 2022") is only mentioned once at the beginning and is far from where the answer text ("Anett Kontaveit") appears. This situation may cause LLMs to depend on information from other documents and create false impressions, i.e., hallucination.

(2) Evidence uncertainty. Before highly anticipated events, like the release of new Apple products or the announcement of the Oscars, there is often a significant amount of speculative information circulating on the internet. Although the relevant documents explicitly state that it is uncertain or speculative content, it can still affect the retrieval-augmented generation of LLMs. In Table 2, when the noise ratio increases, the content of the erroneous documents is all about people's predictions of the name of the headset ("Apple Reality Pro"). Even if there is a correct answer ("Vision Pro") in the relevant documents, LLMs can still be misled by uncertain evidence.

(3) Concept confusion. The concepts in external documents may be similar to, but different from, the concepts in the question. This can confuse LLMs and make them generate incorrect answers. In Table 2, the model answer focuses on the concept "automotive revenue" in the document rather than the "revenue" asked about in the question.

Table 2: Error cases of noise robustness; only one positive document and one negative document are shown per case. The responses are generated by ChatGLM2-6B. (In the original figure, blue text marks the parts of the documents matching the question or answer, and red text marks the non-matching parts.)

Long-distance information
  Question: Who did Iga Swiatek defeat to win the Qatar Open 2022?   Answer: Anett Kontaveit
  Positive document: "In February, Swiatek entered into the Qatar Open ... In the final, she won ... Anett Kontaveit ..."
  Negative document: "This time, she defeated Ons Jabeur 6-2, 7-6(5) to win the 2022 US Open, ..."
  Response: "Iga Swiatek defeated Ons Jabeur in the second round of the Qatar Open 2022 to win the tournament."

Evidence uncertainty
  Question: What is the name of Apple's headset?   Answer: Vision Pro
  Positive document: "Apple (AAPL.O) on Monday unveiled a costly augmented-reality headset called the Vision Pro ..."
  Negative document: "... is what Gurman believes will be called Apple Reality Pro. ..."
  Response: "According to the document, the name of Apple's headset is Apple Reality Pro."

Concept confusion
  Question: What was Tesla's revenue in Q1 2022?   Answer: 18.76 billion
  Positive document: "Tesla, Inc. (TSLA) reported Q1 FY 2022 earnings results ... detailed revenues of $18.76 billion ..."
  Negative document: "... first-quarter earnings for 2022 ... Automotive revenue reached $16.86 billion ..."
  Response: "According to the financial results provided in the article, Tesla's revenue in Q1 2022 was $16.86 billion."

Based on the analysis above, we have identified certain limitations of LLMs in retrieval-augmented generation. To effectively handle the vast amount of noise present on the internet, further detailed enhancements of the models are required, such as long-document modeling and precise concept comprehension.

Results on Negative Rejection testbed

We evaluated the rejection rate when only noisy documents were provided. The results are shown in Table 3. In addition to evaluating the rejection rate through exact matching (Rej in Table 3), we also utilize ChatGPT to determine whether the responses from the LLMs contain any rejection information (Rej* in Table 3). We can see that negative rejection poses a challenge for RAG in LLMs: the highest rejection rates for LLMs in English and Chinese were only 45% and 43.33%, respectively. This suggests that LLMs can be easily misled by noisy documents, leading to incorrect answers.

Table 3: The result of negative rejection. Rej is the rejection rate (%) and Rej* is the rejection rate evaluated by ChatGPT. We can see that negative rejection poses a challenge for RAG in LLMs.

                      English            Chinese
                    Rej     Rej*       Rej     Rej*
ChatGPT            24.67   45.00       5.33   43.33
ChatGLM-6B          9.00   25.00       6.33   17.00
ChatGLM2-6B        10.33   41.33       6.33   36.33
Vicuna-7B-v1.3     17.00   33.33       3.37   24.67
Qwen-7B-Chat       31.00   35.67       8.67   25.33
BELLE-7B-2M         5.67   32.33       5.33   13.67

In addition, by comparing Rej and Rej*, we found that LLMs fail to strictly follow instructions and often generate unpredictable responses, which makes it hard to use their outputs as state triggers (such as for recognizing rejection).

We conduct case studies in Table 4. The first error arises from evidence uncertainty: although the document only mentions contact with "Adam McKay" and does not explicitly state that he is the director of the movie, the model still concludes that he holds this role. The second error arises from concept confusion: the information provided in the answer pertains to "the 2018 Winter Olympics" instead of "the 2022 Olympics" mentioned in the question. Retrieval-augmented generation poses a greater challenge for negative rejection than answering directly, because it presents relevant documents that can potentially mislead the LLMs and result in incorrect responses. In future developments, it will be crucial for LLMs to enhance their ability to accurately match questions with the appropriate documents.

Table 4: Error cases of negative rejection generated by ChatGLM2-6B. (In the original figure, red text highlights the erroneous answers.)

  Question: Who will direct the Irredeemable film?   Answer: Jeymes Samuel
  Response: "The information provided in the document is that Fox has attached 'The Big Short' writer-director Adam McKay to its movie adaptation of 'Irredeemable' from Boom! Comics. Therefore, the answer is Adam McKay."

  Question: Which country won the most medals at the 2022 Winter Olympics?   Answer: Norway
  Response: "Based on the information provided in the document, the country that won the most medals at the 2022 Winter Olympics is Germany. Germany has won a total of 31 medals, including 14 gold medals, 14 bronze medals, and 31 silver medals."
Results on Information Integration testbed

We evaluated the accuracy under different noise ratios in the external documents, and the results are shown in Table 5. Comparing with Table 1, we observe that the models have a weak information integration ability, which in turn affects their noise robustness. We can see that:

(1) Information integration poses a challenge for RAG in LLMs. Even without noise, the highest accuracy of LLMs only reaches 60% and 67% for English and Chinese, respectively. After adding noise, the highest accuracy decreases to 43% and 55%. These results suggest that LLMs struggle to integrate information effectively and are not well-suited for directly answering complex questions.

(2) Complex questions are more challenging for RAG with noisy documents. The performance decline becomes significant when the noise ratio is 0.4, whereas for simple questions a significant decline (at a significance level of 0.05) occurs only at a noise ratio of 0.8. This indicates that complex questions are more vulnerable to interference from noise. We speculate that this is because solving complex questions requires integrating information from multiple documents, and this information can act as noise with respect to the other sub-questions, making it harder for the model to extract the relevant information from the documents.

Table 5: The experimental result of information integration measured by accuracy (%) under different noise ratios. We can see that information integration poses a challenge for RAG in LLMs.

                      English               Chinese
Noise Ratio        0     0.2    0.4      0     0.2    0.4
ChatGPT           55      51     34     63      58     47
ChatGLM-6B        45      36     35     60      53     52
ChatGLM2-6B       34      32     21     44      43     32
Vicuna-7B-v1.3    60      53     43     43      36     25
Qwen-7B-Chat      55      50     37     67      56     55
BELLE-7B-2M       40      34     24     49      41     38

Error Analysis. We conducted an error analysis on ChatGLM2-6B (noise ratio of 0). Apart from errors similar to those found in the noise robustness experiment (38% of the total), there are three additional types of unique errors. We present these cases in Table 6.

(1) Merging Error (28% of the total). The model sometimes merges the answers of the two sub-questions, resulting in an error: it mistakenly uses the answer to one sub-question to address both. In this case, the model disregards any documents related to the other sub-question. For example, in Table 6, it incorrectly states that Group D is the World Cup group for both France and Germany, while Germany was actually assigned to Group E.

(2) Ignoring Error (28% of the total). Sometimes the model ignores one of the sub-questions and only answers the other. This error occurs when the model lacks a complete understanding of the question and fails to recognize that it consists of multiple sub-questions. As a result, the model only considers the documents relevant to one sub-question when generating an answer, disregarding the other sub-question. For example, in Table 6, the model only provides the answer for the MVP of Super Bowl 2022 and does not consider 2023.

(3) Misalignment Error (6% of the total). Sometimes the model incorrectly identifies the documents for one sub-question as the documents for another sub-question, leading to misaligned answers. For example, in Table 6, the third answer contains two errors: an ignoring error and a misalignment error. First, the model only mentions the Best Picture of the 2023 (95th) Academy Awards, completely disregarding the 2022 award. Additionally, it incorrectly states that "CODA" is the Best Picture of 2023, when it was actually awarded Best Picture in 2022.

Table 6: Error cases of information integration; the responses are generated by ChatGLM2-6B. (In the original figure, blue and red text mark the answers to the two sub-questions.)

  Question: What groupings are France and Germany in World Cup 2022?   Answer: Group D; Group E
  Response: "France and Germany are in Group D of the 2022 FIFA World Cup."   (Merging Error)

  Question: Who were the MVPs of Super Bowl 2022 and 2023?   Answer: Cooper Kupp; Patrick Mahomes
  Response: "According to the provided document, the MVP of Super Bowl LVI was Cooper Kupp of the Los Angeles Rams."   (Ignoring Error)

  Question: What films won the 2022 and 2023 Academy Awards for Best Picture?   Answer: CODA; Everything Everywhere All at Once
  Response: "The film CODA won the award for Best Picture at the 95th Academy Awards ceremony held on 2023."   (Misalignment Error)

The errors mentioned above are primarily caused by a limited understanding of complex questions, which hinders the ability to effectively utilize the information relevant to the different sub-questions. The key lies in improving the model's reasoning capability. One possible solution is to use a chain-of-thought approach to break down complex questions (Zhou et al. 2023a; Xu et al. 2023b; Drozdov et al. 2023). However, these methods slow down inference and cannot provide timely responses.

Results on Counterfactual Robustness testbed

In order to ensure that the LLMs possess the relevant knowledge, we assess their performance by directly asking them the questions. However, we found that most LLMs struggle to answer them correctly. To ensure a more reasonable evaluation, we only consider LLMs that have an accuracy rate of over 70%, as this threshold is relatively high while still encompassing more LLMs. The results are shown in Table 7. We report the following metrics: accuracy without any documents, accuracy with counterfactual documents, error detection rates, and error correction rates. We can see that it is hard for LLMs to identify and correct factual errors in the documents. This suggests that the models can be easily misled by documents containing incorrect facts.

Table 7: The result of counterfactual robustness. Acc is the accuracy (%) of LLMs without external documents. Accdoc is the accuracy (%) of LLMs with counterfactual documents. ED and ED* are error detection rates evaluated by exact matching and by ChatGPT, respectively. CR is the error correction rate.

                    Acc   Accdoc   ED   ED*     CR
ChatGPT-zh           91     17      1     3   33.33
Qwen-7B-Chat-zh      77     12      5     4   25.00
ChatGPT-en           89      9      8     7   57.14

It is important to note that retrieval-augmented generation is not designed to automatically address factual errors within a given context, as this contradicts the underlying assumption that the model lacks knowledge and relies on the retrieved documents for additional information. However, this issue is crucial in practical applications due to the abundance of fake news on the internet. Existing LLMs do not have a safeguard to handle inaccurate responses caused by misinformation; in fact, they heavily depend on the information they retrieve. Even when LLMs contain the internal knowledge about the questions, they often trust the false information that is retrieved. This presents a significant challenge for the future development of RAG in LLMs.

Conclusion

In this paper, we evaluated four abilities of retrieval-augmented generation in LLMs: noise robustness, negative rejection, information integration, and counterfactual robustness. To conduct the evaluation, we built the Retrieval-Augmented Generation Benchmark (RGB). The instances of RGB are generated from the latest news articles and from external documents obtained through search engines. The experimental results suggest that current LLMs have limitations in these 4 abilities, which indicates that there is still a significant amount of work needed to effectively apply RAG to LLMs. To ensure accurate and reliable responses from LLMs, it is crucial to exercise caution and to carefully design the usage of RAG.
Acknowledgements

This research work is supported by the National Natural Science Foundation of China under Grants no. 62122077, 62106251, 62306303, and the CAS Project for Young Scientists in Basic Research under Grant No. YSBR-040. Xianpei Han is sponsored by the CCF-BaiChuan-Ebtech Foundation Model Fund.

References

Adlakha, V.; BehnamGhader, P.; Lu, X. H.; Meade, N.; and Reddy, S. 2023. Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering. arXiv:2307.16877.

Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Lovenia, H.; Ji, Z.; Yu, T.; Chung, W.; Do, Q. V.; Xu, Y.; and Fung, P. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. arXiv:2302.04023.

Bian, N.; Liu, P.; Han, X.; Lin, H.; Lu, Y.; He, B.; and Sun, L. 2023. A Drop of Ink Makes a Million Think: The Spread of False Information in Large Language Models. arXiv:2305.04812.

Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; van den Driessche, G.; Lespiau, J.-B.; Damoc, B.; Clark, A.; de Las Casas, D.; Guy, A.; Menick, J.; Ring, R.; Hennigan, T.; Huang, S.; Maggiore, L.; Jones, C.; Cassirer, A.; Brock, A.; Paganini, M.; Irving, G.; Vinyals, O.; Osindero, S.; Simonyan, K.; Rae, J. W.; Elsen, E.; and Sifre, L. 2022. Improving language models by retrieving from trillions of tokens. arXiv:2112.04426.

Cai, D.; Wang, Y.; Bi, W.; Tu, Z.; Liu, X.; Lam, W.; and Shi, S. 2019a. Skeleton-to-Response: Dialogue Generation Guided by Retrieval Memory. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 1219-1228. Minneapolis, Minnesota: Association for Computational Linguistics.

Cai, D.; Wang, Y.; Bi, W.; Tu, Z.; Liu, X.; and Shi, S. 2019b. Retrieval-guided Dialogue Response Generation via a Matching-to-Generation Framework. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1866-1875. Hong Kong, China: Association for Computational Linguistics.

Cao, M.; Dong, Y.; Wu, J.; and Cheung, J. C. K. 2020. Factual Error Correction for Abstractive Summarization Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6251-6258. Online: Association for Computational Linguistics.

Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; Ye, W.; Zhang, Y.; Chang, Y.; Yu, P. S.; Yang, Q.; and Xie, X. 2023. A Survey on Evaluation of Large Language Models. arXiv:2307.03109.

Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.

Cui, J.; Li, Z.; Yan, Y.; Chen, B.; and Yuan, L. 2023. ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases. arXiv:2306.16092.

Drozdov, A.; Schärli, N.; Akyürek, E.; Scales, N.; Song, X.; Chen, X.; Bousquet, O.; and Zhou, D. 2023. Compositional Semantic Parsing with Large Language Models. In The Eleventh International Conference on Learning Representations.

Edward Beeching, N. H. S. H. N. L. N. R. O. S. L. T. T. W., Clémentine Fourrier. 2023. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.

Guo, B.; Zhang, X.; Wang, Z.; Jiang, M.; Nie, J.; Ding, Y.; Yue, J.; and Wu, Y. 2023. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. arXiv:2301.07597.

Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; and Chang, M.-W. 2020. REALM: Retrieval-Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org.

He, H.; Zhang, H.; and Roth, D. 2022. Rethinking with Retrieval: Faithful Large Language Model Inference. arXiv:2301.00303.

Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; and Steinhardt, J. 2021. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations.

Huang, Y.; Bai, Y.; Zhu, Z.; Zhang, J.; Zhang, J.; Su, T.; Liu, J.; Lv, C.; Zhang, Y.; Lei, J.; Fu, Y.; Sun, M.; and He, J. 2023. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv preprint arXiv:2305.08322.

Izacard, G.; and Grave, E. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 874-880. Online: Association for Computational Linguistics.

Izacard, G.; Lewis, P.; Lomeli, M.; Hosseini, L.; Petroni, F.; Schick, T.; Dwivedi-Yu, J.; Joulin, A.; Riedel, S.; and Grave, E. 2022. Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv:2208.03299.

Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y. J.; Madotto, A.; and Fung, P. 2023. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv., 55(12).

Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781713829546.

Li, D.; Rawat, A. S.; Zaheer, M.; Wang, X.; Lukasik, M.; Veit, A.; Yu, F.; and Kumar, S. 2023a. Large Language Models with Controllable Working Memory. In Findings of the Association for Computational Linguistics: ACL 2023, 1774-1793. Toronto, Canada: Association for Computational Linguistics.

Li, X.; Zhang, T.; Dubois, Y.; Taori, R.; Gulrajani, I.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023b. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval.

Li, X.; Zhu, X.; Ma, Z.; Liu, X.; and Shah, S. 2023c. Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? An Examination on Several Typical Tasks. arXiv:2305.05862.

Liu, N. F.; Zhang, T.; and Liang, P. 2023. Evaluating Verifiability in Generative Search Engines. arXiv:2304.09848.

Maynez, J.; Narayan, S.; Bohnet, B.; and McDonald, R. 2020. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1906-1919. Online: Association for Computational Linguistics.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt.

Peng, B.; Galley, M.; He, P.; Cheng, H.; Xie, Y.; Hu, Y.; Huang, Q.; Liden, L.; Yu, Z.; Chen, W.; and Gao, J. 2023. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. arXiv:2302.12813.

Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; Zhao, S.; Tian, R.; Xie, R.; Zhou, J.; Gerstein, M.; Li, D.; Liu, Z.; and Sun, M. 2023. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv:2307.16789.

QwenLM. 2023. Qwen-7B. https://github.com/QwenLM/Qwen-7B.

Raunak, V.; Menezes, A.; and Junczys-Dowmunt, M. 2021. The Curious Case of Hallucinations in Neural Machine Translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1172-1183. Online: Association for Computational Linguistics.

Ren, R.; Wang, Y.; Qu, Y.; Zhao, W. X.; Liu, J.; Tian, H.; Wu, H.; Wen, J.-R.; and Wang, H. 2023. Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation. arXiv:2307.11019.

Shen, X.; Chen, Z.; Backes, M.; and Zhang, Y. 2023. In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT. arXiv:2304.08979.

Shi, W.; Min, S.; Yasunaga, M.; Seo, M.; James, R.; Lewis, M.; Zettlemoyer, L.; and tau Yih, W. 2023. REPLUG: Retrieval-Augmented Black-Box Language Models. arXiv:2301.12652.

THUDM. 2023a. ChatGLM-6B. https://github.com/THUDM/ChatGLM-6B.

THUDM. 2023b. ChatGLM2-6B. https://github.com/THUDM/ChatGLM2-6B.

Trivedi, H.; Balasubramanian, N.; Khot, T.; and Sabharwal, A. 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 10014-10037. Toronto, Canada: Association for Computational Linguistics.

Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019a. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. Red Hook, NY, USA: Curran Associates Inc.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019b. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In International Conference on Learning Representations.

Xu, G.; Liu, J.; Yan, M.; Xu, H.; Si, J.; Zhou, Z.; Yi, P.; Gao, X.; Sang, J.; Zhang, R.; Zhang, J.; Peng, C.; Huang, F.; and Zhou, J. 2023a. CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. arXiv:2307.09705.

Xu, S.; Pang, L.; Shen, H.; Cheng, X.; and Chua, T.-S. 2023b. Search-in-the-Chain: Towards Accurate, Credible and Traceable Large Language Models for Knowledge-intensive Tasks. arXiv:2304.14732.

Yunjie Ji, Y. G. Y. P. Q. N. B. M. X. L., Yong Deng. 2023. BELLE: Bloom-Enhanced Large Language model Engine. https://github.com/LianjiaTech/BELLE.

Zhang, W.; Aljunied, S. M.; Gao, C.; Chia, Y. K.; and Bing, L. 2023. M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models.

Zhong, W.; Cui, R.; Guo, Y.; Liang, Y.; Lu, S.; Wang, Y.; Saied, A.; Chen, W.; and Duan, N. 2023. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv:2304.06364.

Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q. V.; and Chi, E. H. 2023a. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations.

Zhou, S.; Alon, U.; Xu, F. F.; Jiang, Z.; and Neubig, G. 2023b. DocPrompting: Generating Code by Retrieving the Docs. In The Eleventh International Conference on Learning Representations.
