
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

Shuo Yang *1,2, Wei-Lin Chiang *1, Lianmin Zheng *1, Joseph E. Gonzalez 1, Ion Stoica 1

arXiv:2311.04850v2 [cs.CL] 11 Nov 2023

Abstract

Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets. While most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, we show that these methods are insufficient, and that simple variations of test data (e.g., paraphrasing, translation) can easily bypass these decontamination measures. Furthermore, we demonstrate that if such variations of test data are not eliminated, a 13B model can easily overfit a test benchmark and achieve drastically high performance, on par with GPT-4. We validate these observations on widely used benchmarks such as MMLU, GSM-8k, and HumanEval. To address this growing risk, we propose a stronger LLM-based decontamination method and apply it to popular pre-training and fine-tuning datasets, revealing significant previously unknown test overlap. For example, in pre-training sets such as RedPajama-Data-1T and StarCoder-Data, we identify that 8-18% of the HumanEval benchmark overlaps. Interestingly, we also find such contamination in synthetic datasets generated by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We urge the community to adopt stronger decontamination approaches when using public benchmarks, and we call for the community to actively develop fresh one-time exams to evaluate models accurately. Our decontamination tool is publicly available at https://github.com/lm-sys/llm-decontaminator.

1. Introduction

The fast-growing capabilities of large language models make their evaluation more challenging than ever (Chang et al., 2023). Although the community has established many benchmarks over a short period of time, benchmark scores do not always reflect performance on real-world tasks. There is evidence that many prevalent benchmarks may have contaminated pre-training or fine-tuning datasets. The contamination analysis in Llama-2 (Touvron et al., 2023) finds that over 10% of the MMLU test samples are highly contaminated, and GPT-4's technical report (OpenAI, 2023) shows that 25% of HumanEval was contaminated in its training data. A similar situation applies to open-source datasets: the StarCoder Data report (Li et al., 2023) shows that hundreds of test cases in a popular code pre-training set, the Stack (Kocetkov et al., 2022), are contaminated with benchmarks.

Despite being recognized as a crucial issue, accurately detecting contamination remains an open and challenging problem. The most commonly used approaches are n-gram overlap and embedding similarity search. N-gram overlap relies on string matching to detect contamination and is widely used by leading developments such as GPT-4 (OpenAI, 2023), PaLM (Anil et al., 2023), and Llama (Touvron et al., 2023); however, it suffers from limited accuracy. Embedding similarity search uses the embeddings of pre-trained models (e.g., BERT) to find similar and potentially contaminated examples, but choosing a similarity threshold that strikes a balance between recall and precision is often challenging. Moreover, there is growing interest in training models on synthetic data produced by LLMs (e.g., GPT-4) (Gunasekar et al., 2023; Taori et al., 2023; Wang et al., 2023b; Xu et al., 2023; Mukherjee et al., 2023), in which contamination may be even harder to detect by string matching. The Phi-1 report (Gunasekar et al., 2023) discovers a significant portion of its synthetic data to be similar to test samples in HumanEval yet undetectable by n-gram overlap.

To study decontamination methods, in Section 3 we propose the concept of a "rephrased sample," which has the same semantics as the original sample but is hard to detect by existing contamination tests. Rephrased samples are generated by using LLMs to paraphrase test samples or translate them into another language.

* Equal contribution. 1 UC Berkeley. 2 Shanghai Jiao Tong University. Correspondence to: Shuo Yang <andy [email protected]>, Ion Stoica <[email protected]>.


Figure 1: A failure case of existing contamination detection methods (n-gram overlap, embedding similarity) on the MMLU benchmark. We place a question mark because the embedding similarity approach struggles to distinguish the rephrased question from other questions in the same subject (high school US history). After rephrasing MMLU test cases, a Llama-2-13B trained on the rephrased test set reaches 85.9 accuracy on MMLU while remaining undetectable by n-gram overlap.

We show that if such rephrased samples are used for training, the resulting model can easily overfit and reach drastically high performance on test benchmarks. Figure 1 demonstrates this concept with a test example from the MMLU benchmark. We observe this phenomenon in popular benchmarks such as MMLU, GSM-8k, and HumanEval, where a fine-tuned 13B Llama model can match GPT-4's performance on all benchmarks while going undetected by n-gram overlap as contamination, as shown in Figure 2. Being able to detect such rephrased samples therefore becomes critical. We provide an in-depth analysis of why existing decontamination methods fail and propose a new LLM-based decontamination method in Section 4. Our method first uses embedding similarity search to get the top-k samples with the highest similarity to a given test sample and then prompts a strong LLM such as GPT-4 to examine whether any of the top-k samples is too close to the test case. Results show that our proposed LLM decontaminator works significantly better than existing methods.

Figure 2: After fine-tuning on rephrased samples, Llama 2 and CodeLlama achieve performance on par with GPT-4.

In Section 5.3, we apply our decontaminator to several widely used pre-training and fine-tuning datasets and successfully reveal previously unknown test overlap with public benchmarks. As shown in Figure 3, in pre-training sets such as RedPajama-Data-1T and StarCoder-Data, we identify that 8-18% of the HumanEval benchmark overlaps. We also find that a synthetic dataset generated by GPT-3.5, CodeAlpaca (Chaudhary, 2023), contains a significant portion (12.8%) of rephrased samples from HumanEval. This suggests a potential contamination risk when training with synthetic data generated by LLMs. We urge the community to adopt more robust decontamination methods for evaluating LLMs on public benchmarks. To address these concerns at their core, we advocate for the development of fresh, one-time exams, similar to Codeforces and Kaggle competitions, for the accurate assessment of LLMs.

2. Background

Contamination occurs when test set information leaks into the training set, resulting in an overly optimistic estimate of the model's score (accuracy, AUC, etc.). In this section, we introduce common contamination detection methods: n-gram overlap, embedding similarity search, decoding matching, and the influence function.


Figure 3: The contamination percentage of the HumanEval benchmark in each training dataset: The Stack (4G) 18.9%, StarCoder-Data (2.4G) 15.9%, CodeAlpaca 12.8%, RedPajama-Data (16G) 8.53%. Note that The Stack (4G), StarCoder-Data (2.4G), and RedPajama-Data (16G) are subsets.
Table 1 compares the computational costs of these methods and whether they require access to the training data or the model.

N-gram overlap. The most common and widely used decontamination method is n-gram overlap. The GPT-3 paper (Brown et al., 2020) defines a 13-gram overlap as contamination, and the GPT-4 report (OpenAI, 2023) defines a 50-character overlap as contamination. N-gram overlap detection is favored for its simplicity and speed, but it can yield a high false negative rate when there are even small differences between strings.

Embedding similarity search. Embedding similarity search uses transformer-generated embeddings to capture the semantics of prompts. Popular approaches use models such as Sentence-BERT (Reimers & Gurevych, 2019) to generate embeddings and employ cosine similarity to measure the relevance of prompts; high similarity between training and test prompts suggests potential contamination (Lee et al., 2023). Although this approach captures more semantic information than n-gram overlap, it requires specifying a threshold: setting it too high results in a high false negative rate, while setting it too low leads to a high false positive rate.
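
Both baselines are straightforward to sketch. The following minimal illustration assumes the sentence-transformers package and the multi-qa-MiniLM-L6-cos-v1 model used in Section 5.2; the 10-gram size and 0.5 threshold likewise mirror those experiments.

    from sentence_transformers import SentenceTransformer, util

    def ngram_overlap(train_text, test_text, n=10):
        """Flag contamination if the two texts share any word-level n-gram."""
        def ngrams(text):
            words = text.split()
            return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        return bool(ngrams(train_text) & ngrams(test_text))

    def embedding_similar(train_text, test_text, model, threshold=0.5):
        """Flag contamination if cosine similarity exceeds the threshold."""
        emb = model.encode([train_text, test_text], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item() > threshold

    model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
    test = "Janet's ducks lay 16 eggs per day. ..."
    train = "Janet's ducks produce 16 eggs each day. ..."
    print(ngram_overlap(train, test))             # False: rephrasing breaks 10-grams
    print(embedding_similar(train, test, model))  # likely True: semantics preserved
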
Decoding matching. Both n-gram overlap and embedding similarity search require access to the training data. When the training data is unavailable but the model is, decoding matching can be used as an alternative detection method (Li, 2023). The intuition is that a model trained on contaminated data is more likely to auto-complete a partial test prompt. However, an auto-completed test prompt does not necessarily mean the model was trained on contaminated data, and a model trained on variations of the test cases will not auto-complete the original prompt either. Decoding matching is therefore often not acknowledged as definitive evidence of contamination.

Influence function. When both the model and the training data are available, the influence function (Koh & Liang, 2020) can be used to identify contaminated samples. This method takes a test sample and iteratively calculates an influence factor for each training sample, quantitatively measuring how relevant each training sample is to the test sample. Sorting by influence factor yields the most relevant training examples, which humans can then judge against the contamination criteria. However, this approach is impractical because it incurs a high computational overhead.

3. Rephrased Samples

Our goal is to investigate whether simple variations of test sets included in the training set affect the resulting benchmark performance. We refer to such variations of test cases as "rephrased samples".

We consider benchmarks from various domains, including math, knowledge, and coding. Example 1 is a rephrased sample from GSM-8k that 10-gram overlap fails to detect even though the semantics are unchanged.

Example 1 (GSM-8k Rephrased Sample)

Original Test Case: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?

Rephrased Test Case: Janet's ducks produce 16 eggs each day. She consumes three of them for her morning meal and uses four to bake muffins for her friends daily. The remaining eggs are sold at the daily farmers' market for $2 per egg. What is the daily amount in dollars that she earns at the farmers' market?

3.1. Rephrasing Techniques

Rephrasing techniques differ subtly because benchmark contamination takes different forms. For text-based benchmarks, we rephrase test cases without altering semantics, for example by rearranging word order or substituting synonymous terms. For code-based benchmarks, we vary coding styles, naming conventions, and implementations, but the semantics remain unchanged.

For the rephrasing process itself, Algorithm 1 presents a simple procedure for a given test set that helps a test sample escape detection. It first employs a high-quality large language model (e.g., GPT-4) to produce a rephrased version of the test prompt, and then uses a detector such as n-gram overlap to ensure the rephrased sample cannot be detected. To encourage diverse outputs, we set a non-zero initial temperature. Applying this process to each prompt in the test set yields a rephrased test set. "RephraseLLM" denotes the high-quality LLM, such as GPT-4 or Claude, and "isContaminated" can be any contamination detection method, such as n-gram overlap or embedding similarity search.


Table 1: Contamination detection methods. M denotes the size of the training set, and N the size of the test set.

Method                       | Training data access | Model access | Computational cost
N-gram overlap               | yes                  | no           | O(MN)
Embedding similarity search  | yes                  | no           | O(MN + M + N)
Decoding matching            | no                   | yes          | O(N)
Influence function           | yes                  | yes          | O(M^2 + MN)

Algorithm 1: The algorithm for rephrasing samples

Rephrase(TestSet, MaxRetry):
 1: RephrasedSet ← ∅
 2: for t in TestSet do
 3:   s ← RephraseLLM(t)
 4:   retry ← 0
 5:   while isContaminated(s, t) do
 6:     s ← RephraseLLM(t)
 7:     retry ← retry + 1
 8:     if retry > MaxRetry then
 9:       s ← null
10:      break
11:    end if
12:  end while
13:  RephrasedSet ← RephrasedSet ∪ {s}
14: end for
15: return RephrasedSet
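
A minimal Python sketch of Algorithm 1 follows; rephrase_llm is a hypothetical wrapper around a chat model such as GPT-4 (the actual prompts are listed in Appendix A), and is_contaminated can be any detector, for example the 10-gram check sketched in Section 2.

    def rephrase_llm(prompt):
        """Hypothetical wrapper around a chat LLM (e.g., GPT-4) that returns a
        paraphrase of `prompt`; a non-zero temperature encourages diverse outputs."""
        raise NotImplementedError  # call your LLM API of choice here

    def rephrase(test_set, is_contaminated, max_retry=5):
        """Algorithm 1: rephrase each test case until the detector no longer flags it."""
        rephrased_set = []
        for t in test_set:
            s = rephrase_llm(t)
            retry = 0
            # Re-sample until the rephrased sample escapes detection or we give up.
            while is_contaminated(s, t):
                s = rephrase_llm(t)
                retry += 1
                if retry > max_retry:
                    s = None
                    break
            rephrased_set.append(s)
        return rephrased_set
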
3.2. Translation Techniques

Rephrased samples go beyond modifications in word order. Real-world datasets contain many rephrasing techniques, including translation. These techniques make rephrased samples more concealed while still helping models achieve dramatic score improvements.

Prompts with identical meanings in different languages yield different embeddings under most language models. By translating test prompts into other languages, we can therefore evade both n-gram overlap detection and standard embedding similarity search; only embedding models specifically trained on multiple languages can detect a translated sample.

For text-based data, the translation technique evades both n-gram overlap and embedding similarity search while significantly boosting the score. It capitalizes on the model's multilingual translation capabilities, effectively transforming a knowledge assessment into a translation task. Translation also works well for coding benchmarks: we can translate a program from Python to C or Java that solves the same problem. To further investigate the impact of translation techniques on coding benchmarks, we propose multi-language data augmentation.

Multi-language data augmentation. For coding benchmarks, we use multi-language data augmentation to strengthen the translation technique. By incorporating multiple languages, we improve the model's generalization ability and ensure it understands that translated and original code serve the same function. In Section 5.1, our experiments indicate that multilingual data augmentation yields better results than single-language translation.

4. LLM Decontaminator

In this section, we propose a new contamination detection method that accurately removes rephrased samples from a dataset relative to a benchmark.

4.1. Algorithm

Section 2 discusses the limitations of existing detection methods, including n-gram overlap and embedding similarity search. To address these limitations, we introduce the "LLM decontaminator" in Algorithm 2, which involves two steps. First, for each test case, it identifies the top-k training items with the highest similarity using embedding similarity search. Second, an advanced LLM, such as GPT-4, evaluates whether each pair is essentially the same. This approach determines how many rephrased samples a dataset contains at a moderate computational overhead. "Template" is a structured prompt that, when paired with a test case and a training case, instructs the "LLMDetector" to carry out a comparison and return either 'True' or 'False'; 'True' indicates that the training case may be a rephrased sample of the test case. "LLMDetector" is a high-quality LLM such as GPT-4, and "TopKSimilarity" retrieves the top-k most similar samples in the training data using embedding similarity search.


Algorithm 2: The algorithm for the LLM decontaminator

Decontaminate(TrainSet, TestSet, k, Template):
 1: Contamination ← ∅
 2: for t in TestSet do
 3:   for c in TopKSimilarity(TrainSet, t, k) do
 4:     s ← LLMDetector(Template, t, c)
 5:     if s = True then
 6:       Contamination ← Contamination ∪ {(t, c)}
 7:     end if
 8:   end for
 9: end for
10: return Contamination
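
A Python sketch of Algorithm 2 is shown below, assuming the sentence-transformers package for the top-k retrieval step; llm_judge is a hypothetical wrapper around a strong LLM such as GPT-4, and the template wording here is illustrative rather than the paper's exact prompt.

    from sentence_transformers import SentenceTransformer, util

    TEMPLATE = ("I will give you a test case and a training case. Answer 'True' "
                "if the training case is a rephrased version of the test case, "
                "otherwise answer 'False'.\nTest: {test}\nTrain: {train}")

    def llm_judge(prompt):
        """Hypothetical wrapper that sends `prompt` to a strong LLM (e.g., GPT-4)
        and parses its 'True'/'False' verdict into a boolean."""
        raise NotImplementedError

    def decontaminate(train_set, test_set, k=1,
                      model_name="multi-qa-MiniLM-L6-cos-v1"):
        """Algorithm 2: embedding search narrows candidates, then an LLM judges."""
        model = SentenceTransformer(model_name)
        train_emb = model.encode(train_set, convert_to_tensor=True)
        contamination = []
        for t in test_set:
            t_emb = model.encode(t, convert_to_tensor=True)
            # Top-k most similar training items for this test case.
            hits = util.semantic_search(t_emb, train_emb, top_k=k)[0]
            for hit in hits:
                c = train_set[hit["corpus_id"]]
                if llm_judge(TEMPLATE.format(test=t, train=c)):
                    contamination.append((t, c))
        return contamination
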
4.2. Contamination Detection Visualization

In Figure 4 we present a Venn diagram of contamination and the different detection methods. The LLM decontaminator takes advantage of embedding similarity search, which helps it rapidly filter out possible contamination, and it further relies on a strong LLM's reliable judgments. We show that n-gram overlap detection yields a higher false negative rate when detecting rephrased samples, while embedding similarity search detects many false positives at a high threshold. Notably, the LLM decontaminator shows higher accuracy in detecting rephrased samples. See Section 5.1 for comprehensive experimental results.

5. Experiments

In Section 5.1, we demonstrate that models trained on rephrased samples can achieve dramatically high scores, matching GPT-4's performance on three widely used benchmarks: MMLU, HumanEval, and GSM-8k. This suggests that rephrased samples should be considered contamination and removed from training data. In Section 5.2, we evaluate different contamination detection methods on rephrased samples of MMLU and HumanEval. In Section 5.3, we apply our decontaminator to widely used training sets and discover previously unknown contamination.

5.1. Rephrased Samples Contaminate Benchmarks

5.1.1. MMLU Knowledge Benchmark

MMLU (Hendrycks et al., 2020) is one of the benchmarks with the widest range of subjects, covering 57 disciplines from abstract algebra to professional psychology. Rephrasing MMLU requires considering a multitude of scenarios; given the complexity of MMLU and its multiple-choice format, we explain the rephrasing details involved.

False positive issue. Using n-gram overlap detection on multiple-choice questions can easily produce false positives, especially when different questions share similar option arrangements. Example 2 shows a false positive from n-gram overlap detection: even though the two questions' multiple-choice answer patterns match exactly, they are different problems. To reduce false positives, we introduce a "question only" control group in the MMLU experiments, where "Question Only" refers to rephrasing just the question stem and "Full Prompt" refers to rephrasing both the question stem and the options.

Example 2 (Multi-Choice False Positive)

• Statement 1 — Every group of order p^2, where p is prime, is Abelian.
  Statement 2 — For a fixed prime p, a Sylow p-subgroup of a group G is a normal subgroup of G if and only if it is the only Sylow p-subgroup of G.
  A. True, True   B. False, False   C. True, False   D. False, True

• Statement 1 — Every group of order 42 has a normal subgroup of order 7.
  Statement 2 — Every group of order 42 has a normal subgroup of order 8.
  A. True, True   B. False, False   C. True, False   D. False, True

Other details. Large numbers often induce character overlap. To avoid this, we change the format of large numbers, for example alternating between commas and spaces as digit separators. Proprietary terms in various domains can also trigger overlap issues; to circumvent this, we rotate between abbreviations and full terms and adjust capitalization, particularly when choices involve names or chemical formulas.

Benchmark results. We train Llama-2-7B and Llama-2-13B on the rephrased test sets for 16 epochs. As shown in Table 2, Llama-2 7B and 13B trained on rephrased samples achieve dramatically high scores on MMLU, rising from 45.3 to 88.5. This suggests rephrased samples can significantly skew benchmark numbers and should be considered contamination. The original model is evaluated 5-shot; the model trained on rephrased data is evaluated 0-shot.

5.1.2. HumanEval Coding Benchmark

HumanEval (Chen et al., 2021) is a benchmark provided by OpenAI to evaluate the coding capabilities of large language models. It gives the model an incomplete piece of code and asks the model to complete it.


Figure 4: Venn diagram depicting training data subsets and contamination detection ranges. The solid circle represents the training data and its subsets; the dashed circles enclose areas flagged as potentially contaminated by each detection method. Notably, the LLM decontaminator shows higher accuracy: embedding similarity search detects broadly but with many false positives, n-gram overlap has a limited ability to spot rephrased samples, and the LLM decontaminator refines the results of embedding similarity search using LLMs, providing a precise and efficient contamination assessment.

Table 2: Accuracy on MMLU. "Rephrased Chinese" refers to translating the questions into Chinese.

Model        | Original | Rephrased English (Question Only) | Rephrased English (Full Prompt)
Llama 2 7B   | 45.3     | 88.5                              | 82.0
Llama 2 13B  | 54.8     | 89.9                              | 85.9

Model        | Test Set | Rephrased Chinese (Question Only) | Rephrased Chinese (Full Prompt)
Llama 2 7B   | 100      | 91.1                              | 74.3
Llama 2 13B  | 100      | 93.7                              | 80.9

Table 3: Pass@1 on HumanEval.

Model          | Original | Fine-tuned on test set
CodeLlama 7B   | 32.9     | 100
CodeLlama 13B  | 36.0     | 100

Model          | Fine-tuned on rephrased Python | Rephrased C | Multi-language
CodeLlama 7B   | 67.7                           | 45.7        | 59.8
CodeLlama 13B  | 81.1                           | 48.2        | 67.1

Table 4: Accuracy on GSM-8K.

Model        | Original | Fine-tuned on test set | Fine-tuned on rephrased English
Llama 2 7B   | 14.6     | 100                    | 86.7
Llama 2 13B  | 28.7     | 100                    | 95.3

Dead code injection. Real-world coding datasets contain some unreachable instructions. Such dead code seldom affects the semantics, and it helps rephrased samples escape decontamination. Since current detection methods do not use compilers to remove dead code from coding datasets, we investigate how dead code interferes with detection methods.
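
To make this concrete, here is a small illustration of our own (not a sample drawn from the audited datasets): a few no-op statements leave the function's behavior untouched while breaking up long character and n-gram overlaps.

    def add(x: int, y: int):
        """Add two numbers x and y."""
        return x + y

    def add_rephrased(a: int, b: int):
        """Compute the sum of the two inputs."""
        _unused = 0            # dead assignment, never read
        if False:              # unreachable branch
            return _unused
        pass                   # no-op filler that disrupts n-gram matches
        return a + b
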
5.1.3. GSM-8K M ATH B ENCHMARK
Benchmark results. We rephrase the HumanEval test set in Python and translate it into five programming languages: C, JavaScript, Rust, Go, and Java. We train CodeLlama 7B and 13B on each of these sets, and we also construct a multi-programming-language dataset comprising all five languages and train on it. Table 3 shows CodeLlama's performance when fine-tuned on rephrased Python, rephrased C, and the multi-programming-language dataset. CodeLlama 7B and 13B trained on rephrased samples achieve dramatically high scores on HumanEval, rising from 32.9 to 67.7 and from 36.0 to 81.1, respectively. In contrast, GPT-4 only achieves 67.0 on HumanEval.

5.1.3. GSM-8K Math Benchmark

GSM-8K (Cobbe et al., 2021) is a commonly used benchmark for testing the mathematical capabilities of LLMs.

Benchmark results. Table 4 shows that Llama-2 7B and 13B trained on rephrased samples achieve dramatically high scores on GSM-8K, rising from 28.7 to 95.3. The original model is evaluated 5-shot; the model trained on rephrased data is evaluated 0-shot.

We explore the detection problems further in Section 6, where they relate to the "number substituted only" case.


5.2. Evaluating Contamination Detection Methods

5.2.1. MMLU

We construct a decontamination benchmark based on three MMLU subjects: abstract algebra, sociology, and US history. To compare the accuracy of detection methods against rephrased samples, we construct 200 prompt pairs from the original and rephrased test sets: 100 random pairs and 100 rephrased pairs. The F1 score on these pairs measures a detection method's ability to detect contamination, with higher values indicating more precise detection.

We use random detection as our baseline; scores significantly above it indicate an effective detection method. For n-gram overlap, we choose a 10-gram approach. Embeddings are generated by multi-qa-MiniLM-L6-cos-v1 and distiluse-base-multilingual-cased-v1 (Reimers & Gurevych, 2019), with a threshold of 0.5.
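
Scoring a detector on these pairs reduces to a standard F1 computation over binary predictions; a minimal sketch, assuming a detector that maps a (train, test) pair to a boolean:

    def f1_score(detector, pairs, labels):
        """pairs: list of (train_text, test_text); labels: True if the pair is a
        genuine rephrasing, False if it is a random pairing."""
        preds = [detector(train, test) for train, test in pairs]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum(l and not p for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)
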
HumanEval test
As shown in Table 5, all detection methods except the LLM decontaminator introduce some false positives, and both rephrased and translated samples go undetected by n-gram overlap. With multi-qa BERT, embedding similarity search proves completely ineffective against translated samples; with multilingual BERT, it struggles on the US history subject. The LLM decontaminator's reliability and precision are evidenced by the highest minimum and average F1 scores.

5.2.2. HumanEval

We now show that existing detection methods fail to detect rephrased samples of HumanEval, while the LLM decontaminator succeeds. For HumanEval, we construct 200 prompt pairs following the method outlined above for MMLU. For n-gram overlap detection, we use both 10-gram and 50-character overlap. Embeddings are generated by CodeLlama and multi-qa-MiniLM-L6-cos-v1, with thresholds of 0.9 and 0.6, respectively. We evaluate the F1 score of n-gram overlap, embedding similarity search, and the LLM decontaminator.

According to Table 6, embedding similarity search is effective for detection within the same programming language, but the effect is less noticeable after translation. Among the methods examined, only the LLM decontaminator reliably detects rephrased samples in coding datasets. The similarity between programming languages may explain why rephrased C is tougher to spot than rephrased JavaScript: JavaScript and Python are both interpreted languages with dynamic typing and some functional programming constructs, so syntactically JavaScript may be closer to Python.

5.3. Contamination in Real-World Datasets

To demonstrate the effectiveness of the LLM decontaminator, we apply it to widely used real-world datasets and identify a substantial number of rephrased samples. Table 7 displays the contamination percentage of different benchmarks in each training dataset.

CodeAlpaca (Chaudhary, 2023) is a synthetic dataset generated by OpenAI's Davinci-003 using the self-instruct technique (Wang et al., 2023b). It contains 20K instruction-following examples used for fine-tuning the CodeAlpaca model, and CodeAlpaca-20K is used to train a number of well-known models, including Tulu (Wang et al., 2023a). Employing GPT-4 for detection with k=1 as the parameter, we find 21 rephrased samples from the HumanEval test set, accounting for 12.8%. Example 3 is a rephrased sample of HumanEval found in CodeAlpaca.

Example 3 (CodeAlpaca)

HumanEval test:

    def sum_to_n(n: int):
        """sum_to_n is a function that sums numbers from 1 to n.
        >>> sum_to_n(30)
        465
        >>> sum_to_n(100)
        5050
        >>> sum_to_n(5)
        15
        >>> sum_to_n(10)
        55
        >>> sum_to_n(1)
        1
        """
        return sum(range(n + 1))

CodeAlpaca:

    """
    Create a code that summation of all numbers between 1 to n.
    """
    def sum_all_nums(n):
        res = 0
        for i in range(1, n+1):
            res += i
        return res

    print(sum_all_nums(n)) # 15

RedPajama-Data-1T (Computer, 2023) is a widely used dataset for training open-source models; both MPT (Team, 2023) and OpenLlama (Geng & Liu, 2023) use it as their pre-training dataset. In our study, we sample 16G of data from its GitHub subset and apply the LLM decontaminator, identifying 14 HumanEval rephrased samples in total.


Figure 5: Distribution of embedding similarities among questions within the same subject. Note that it is difficult to set a unified decontamination threshold because subjects differ vastly: if the threshold is set to 0.8, "Abstract Algebra" may be properly spotted, but rephrased samples in "Sociology" become difficult to identify; if it is set to 0.4, "Abstract Algebra" produces a large number of false positives.
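
To illustrate why no single threshold works across subjects, the following sketch (our own toy illustration, not the paper's code) sweeps candidate thresholds and picks the F1-maximizing one per subject:

    import numpy as np

    def best_threshold(similarities, labels):
        """Sweep thresholds and return (threshold, F1) maximizing F1.
        similarities: cosine similarities per pair; labels: 1 if truly rephrased."""
        sims, labels = np.asarray(similarities), np.asarray(labels)
        best = (0.0, 0.0)
        for thr in np.linspace(0.0, 1.0, 101):
            preds = sims > thr
            tp = np.sum(preds & (labels == 1))
            fp = np.sum(preds & (labels == 0))
            fn = np.sum(~preds & (labels == 1))
            f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
            if f1 > best[1]:
                best = (thr, f1)
        return best

    # Toy scores: algebra pairs cluster high even when unrelated; sociology does not.
    print(best_threshold([0.9, 0.85, 0.7, 0.65], [1, 1, 0, 0]))  # high threshold
    print(best_threshold([0.6, 0.55, 0.3, 0.2], [1, 1, 0, 0]))   # much lower one
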

Table 5: F1 scores of different detection methods on MMLU. The bold numbers in the original indicate that the detection is reliable. Each subject reports three settings: Test Set / Rephrased English / Rephrased Chinese.

Method                    | Algebra             | Sociology           | US History
Random                    | 0.500  0.500  0.500 | 0.500  0.500  0.500 | 0.500  0.500  0.500
10-gram                   | 0.926  0      0     | 1      0      0     | 0.816  0      0
Emb (Multi-QA BERT)       | 0.990  0.985  0.179 | 0.995  0.985  0.020 | 0.980  0.805  0
Emb (Multilingual BERT)   | 0.939  0.934  0.939 | 1      0.985  1     | 0.990  0.111  0.985
LLM Decontaminator        | 1      0.960  0.990 | 1      0.940  0.950 | 1      0.970  0.980

Table 6: F1 scores of detection methods on HumanEval.

Method                | Test Set | Rephrased Python | Rephrased C | Rephrased JS
Random                | 0.500    | 0.500            | 0.500       | 0.500
10-gram               | 1        | 0                | 0           | 0
Emb (CodeLlama)       | 0.966    | 0.903            | 0.438       | 0.503
Emb (Multi-QA BERT)   | 0.985    | 0.938            | 0.774       | 0.788
LLM Decontaminator    | 1        | 0.995            | 0.974       | 0.980

Example 4 is a rephrased sample of HumanEval found in RedPajama.

Example 4 (RedPajama)

HumanEval test:

    def change_base(x: int, base: int):
        """Change numerical base of input number x to base.
        return string representation after conversion.
        base numbers are less than 10.
        >>> change_base(8, 3)
        '22'
        ...
        """
        ret = ""
        while x > 0:
            ret = str(x % base) + ret
            x //= base
        return ret

RedPajama:

    def convert_to_base(number, base):
        digits = "0123456789ABCDEF"
        if number < base:
            return digits[number]
        else:
            return convert_to_base(number // base, base) + digits[number % base]

MATH (Hendrycks et al., 2021) is a widely recognized math training dataset that spans various mathematical domains, including algebra, geometry, and number theory. It contributes to numerous math-centric datasets, such as MathInstruct (Yue et al., 2023).[1] The LLM decontaminator reveals 79 instances of self-rephrased samples, which constitute 1.58% of the MATH test set. Example 5 is a rephrased sample of the MATH test set found in the MATH training data.

[1] The dataset was downloaded on Sep 30, 2023.


Table 7: The percentage of rephrased-sample contamination in real-world datasets.

Training Set                    | Benchmark | Train Set Size | Test Set Size | Rephrased Samples | Percentage (%)
The Stack (4G subset)           | HumanEval | 500k           | 164           | 31                | 18.9
StarCoder-Data (2.4G subset)    | HumanEval | 500k           | 164           | 26                | 15.9
CodeExercise-Python             | HumanEval | 27k            | 164           | 26                | 15.9
CodeAlpaca                      | HumanEval | 20k            | 164           | 21                | 12.8
RedPajama-Data-1T (16G subset)  | HumanEval | 1625k          | 164           | 14                | 8.5
Evol-Instruct-Code              | HumanEval | 78.3k          | 164           | 13                | 7.9
rosettacode                     | HumanEval | 4.26k          | 164           | 4                 | 2.4
MATHInstruct                    | MATH Test | 262k           | 5000          | 769               | 15.4
MATH Train                      | MATH Test | 7.5k           | 5000          | 79                | 1.6
FLAN CoT                        | MMLU      | 184k           | 14042         | 76                | 0.5
WizardLM-Evol-Instruct          | MMLU      | 143k           | 14042         | 75                | 0.5

Example 5 (MATH Self-Contamination)

(MATH test) How many three-digit positive integers are multiples of 11?
(MATH train) How many positive 3-digit numbers are divisible by 11?

FLAN (Longpre et al., 2023) is a comprehensive knowledge training dataset encompassing a wide variety of data sources. We take the CoT subset, which constitutes 1.63% of FLAN, use GPT-4 for detection, and set k=1 for the decontamination parameters. The findings show that 76 test cases, or 0.543% of the MMLU test set, are rephrased.

Example 6 (FLAN CoT)

(MMLU test)
What type of meat is on a traditional Reuben sandwich?
A. turkey
B. bologna
C. corned beef
D. pepperoni
Answer: C

(FLAN CoT)
The Reuben sandwich is an American hot sandwich composed of corned beef, Swiss cheese, sauerkraut, and Russian dressing, grilled between slices of rye bread. Several variants exist.
What is the meat in a reuben sandwich? Let's have some stream of consciousness first.

We examine more datasets and present examples in Appendix B.

6. Discussion

In this section, we first discuss potential contamination beyond rephrased samples. We then discuss the importance of the LLM decontaminator when using an LLM such as GPT-4 to generate training data. Finally, we propose suggestions for enhancing LLM evaluation (e.g., with fresh one-time exams).

6.1. Beyond Rephrased Samples

In this study, we argue that rephrased test samples should be considered contamination because including them in the training data can skew benchmark results. However, formulating a precise definition of what constitutes contamination remains challenging. For instance, we discover in the GSM-8k math benchmark a training and a test example that differ only in their numbers (see Example 7).

Example 7 (GSM-8k Number-Substituted-Only Case)

(GSM-8k test) Emil is 19 years old now. When he turns 24, he will be half the age of his dad but twice as old as his brother. What is the sum of the ages of his dad and his brother now?
(GSM-8k) When Diane turns 30, she will be half the age of Alex and twice as old as Allison. Diane is 16 years old now. What is the sum of the ages of Alex and Allison now?

If models are trained on such number-substituted cases, they tend to merely memorize the solutions and may generalize poorly beyond the seen patterns. The resulting benchmark numbers may thus not be effective in capturing a model's performance in math problem-solving. This is an open question we suggest the community debate further.
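
One simple heuristic for surfacing such pairs, which we sketch here as our own illustration rather than a method proposed in the paper, is to normalize numbers (and, optionally, capitalized names) to placeholder tokens before running an overlap check:

    import re

    def normalize(text):
        """Map every number to <NUM> and every capitalized word to <NAME>, so
        two problems differing only in those details normalize very similarly."""
        text = re.sub(r"\d+", "<NUM>", text)
        return re.sub(r"\b[A-Z][a-z]+\b", "<NAME>", text)

    test = "Emil is 19 years old now. When he turns 24, ..."
    train = "Diane is 16 years old now. When she turns 30, ..."
    # After normalization, n-gram overlap between the two rises sharply even
    # though the raw strings share few 10-grams.
    print(normalize(test))
    print(normalize(train))
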


6.2. Contamination in Synthetic Data

Unintentional contamination may occur more often as models are increasingly trained on data generated by LLMs, in which subtle benchmark contamination may be present. For instance, in Section 5.3 we discover several contaminated samples in the CodeAlpaca dataset generated by GPT, and Phi-1 (Gunasekar et al., 2023) also detected subtle contamination in its LLM-generated data. As a result, we have to be more aware of potential contamination when training models on synthetic data, and we suggest that model developers adopt stronger decontamination measures.

6.3. Enhancing Benchmarks for LLMs

While our proposed decontamination method can serve as a useful tool, detecting contamination without access to the training data remains an open problem. We propose building fresh, one-time questions to evaluate LLMs instead of relying on static benchmarks. In the coding domain, for example, one could use weekly coding competitions such as Codeforces. We suggest that benchmarks iterate as fast as model development does.

7. Related Work

There has been interest in studying how to identify or extract training data from LLMs. These works examine LLMs' memorization from the perspective of data privacy (Carlini et al., 2021; Pan et al., 2020; Zanella-Béguelin et al., 2020; Balle et al., 2022) or discuss the boundary between generalization and memorization (Zhang et al., 2017; Olson et al., 2018; Recht et al., 2019; Carlini et al., 2023), but they do not focus on benchmark contamination.

There are also studies of contamination detection methods. Some concern detecting and filtering web datasets (Dodge et al., 2021; Xu & Koehn, 2017) using traditional detection techniques such as n-gram overlap. Others explore new detection methods, similar to decoding matching, that do not require access to training data. Exchange detection (Oren et al., 2023) considers the order of test cases within a benchmark, suggesting that a model that remembers the sequence of test cases may be contaminated. Min-k prob detection (Shi et al., 2023) uses outlier tokens to estimate LLM contamination: it analyzes the token probabilities within an arbitrary text X, and if the LLM assigns excessively high probabilities to some of these tokens, it may indicate that X has been mixed into the training set.

There are also related works on benchmark enhancement through perturbations (Zong et al., 2023), which prevent LLMs from memorizing answer patterns by modifying the question and requiring the LLM to output results in a specific format. Another approach employs dynamic benchmarks (Kiela et al., 2021; Ma et al., 2021), using human-in-the-loop evaluation to reduce the risk of benchmark contamination.

8. Conclusion

In this work, we study benchmark contamination for large language models and evaluate existing decontamination methods. We show that existing detection methods cannot detect test cases with simple variations, and we demonstrate that if such variations of test data are not eliminated, a 13B model can easily overfit the test benchmark and achieve drastically high performance. To address this, we propose a new detection method, the LLM decontaminator. Applying it to real-world datasets reveals previously unknown test overlap. We urge the community to adopt stronger decontamination approaches when using public benchmarks, and we call for the community to actively develop fresh one-time exams to accurately evaluate LLMs.

Acknowledgement

We would like to express our gratitude to Ying Sheng for the early discussion on rephrased samples. We also extend our thanks to Dacheng Li, Erran Li, Hao Liu, Jacob Steinhardt, Hao Zhang, and Siyuan Zhuang for providing insightful feedback. This project is partly supported by gifts from Anyscale, Astronomer, Google, IBM, Intel, Lacework, Microsoft, MBZUAI, Samsung SDS, Uber, and VMware. Lianmin Zheng is supported by a Meta Ph.D. Fellowship.

References
through perturbations (Zong et al., 2023), which prevents


Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., et al. PaLM 2 technical report, 2023.

Balle, B., Cherubin, G., and Hayes, J. Reconstructing training data with informed adversaries. In 2022 IEEE Symposium on Security and Privacy (SP), pp. 1138-1156. IEEE, 2022.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. Language models are few-shot learners, 2020.

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., and Raffel, C. Extracting training data from large language models, 2021.

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models, 2023.

Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., Yang, L., Yi, X., Wang, C., Wang, Y., et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.

Chaudhary, S. Code Alpaca: An instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca, 2023.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Computer, T. RedPajama: An open source recipe to reproduce LLaMA training dataset, April 2023. URL https://github.com/togethercomputer/RedPajama-Data.

Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus. arXiv preprint arXiv:2104.08758, 2021.

Geng, X. and Liu, H. OpenLLaMA: An open reproduction of LLaMA, May 2023. URL https://github.com/openlm-research/open_llama.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, 2023.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.

Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., and Williams, A. Dynabench: Rethinking benchmarking in NLP, 2021.

Kocetkov, D., Li, R., Allal, L. B., Li, J., Mou, C., Ferrandis, C. M., Jernite, Y., Mitchell, M., Hughes, S., Wolf, T., Bahdanau, D., von Werra, L., and de Vries, H. The Stack: 3 TB of permissively licensed source code, 2022.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions, 2020.

Lee, A. N., Hunter, C. J., and Ruiz, N. Platypus: Quick, cheap, and powerful refinement of LLMs, 2023.

Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., et al. StarCoder: may the source be with you!, 2023.

Li, Y. Estimating contamination via perplexity: Quantifying memorisation in language model evaluation, 2023.

Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., and Roberts, A. The Flan Collection: Designing data and methods for effective instruction tuning, 2023.

Ma, Z., Ethayarajh, K., Thrush, T., Jain, S., Wu, L., Jia, R., Potts, C., Williams, A., and Kiela, D. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. Advances in Neural Information Processing Systems, 34:10351-10367, 2021.

Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., and Awadallah, A. Orca: Progressive learning from complex explanation traces of GPT-4, 2023.

Olson, M., Wyner, A., and Berk, R. Modern neural networks generalize on small data sets. Advances in Neural Information Processing Systems, 31, 2018.

OpenAI. GPT-4 technical report, 2023.

Oren, Y., Meister, N., Chatterji, N., Ladhak, F., and Hashimoto, T. B. Proving test set contamination in black box language models, 2023.

Pan, X., Zhang, M., Ji, S., and Yang, M. Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP), pp. 1314-1331. IEEE, 2020.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet?, 2019.

Reimers, N. and Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. URL https://arxiv.org/abs/1908.10084.

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretraining data from large language models, 2023.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Team, M. N. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models, 2023.

Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K. R., Wadden, D., MacMillan, K., Smith, N. A., Beltagy, I., and Hajishirzi, H. How far can camels go? Exploring the state of instruction tuning on open resources, 2023a.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-Instruct: Aligning language models with self-generated instructions, 2023b.

Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. WizardLM: Empowering large language models to follow complex instructions, 2023.

Xu, H. and Koehn, P. Zipporah: A fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2945-2950, 2017.

Yue, X., Qu, X., Zhang, G., Fu, Y., Huang, W., Sun, H., Su, Y., and Chen, W. MAmmoTH: Building math generalist models through hybrid instruction tuning, 2023.

Zanella-Béguelin, S., Wutschitz, L., Tople, S., Rühle, V., Paverd, A., Ohrimenko, O., Köpf, B., and Brockschmidt, M. Analyzing information leakage of updates to natural language models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pp. 363-375, 2020.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization, 2017.

Zong, Y., Yu, T., Zhao, B., Chavhan, R., and Hospedales, T. Fool your (vision and) language model with embarrassingly simple permutations, 2023.


A. Rephrase Instruction Prompts

We constructed the following rephrase prompt template:

• Please rephrase the following question without altering its meaning.
• Ensure that no more than ten consecutive words are repeated and try to use similar words as substitutes where possible.
• Please ensure there aren't 50 consecutive identical characters.
• When encountering mathematical formulas, please try to substitute the variable names. Ensure the formulas aren't identical to the original. For instance, you can replace 'x' with 'y' or 'a'.

MMLU Rephrase Instructions

Please rephrase the following question without altering its meaning, ensuring you adjust the word order appropriately. Ensure that no more than five consecutive words are repeated and try to use similar words as substitutes where possible. Do not change the format of the multiple-choice question. When encountering mathematical formulas, please try to substitute the variable names. Ensure the formulas aren't identical to the original. When you come across a single number or letter, consider replacing it with a sentence. When encountering a long sequence of numbers, if they are separated by spaces, you can replace the spaces with commas; if separated by commas, you can replace them with spaces. Consider the prompt and choices as a whole; there shouldn't be consecutive words. If options are challenging to rephrase, consider altering the initial letter's case.

MMLU Translate Instructions

Please translate the following question into the target language, ensuring you adjust the word order appropriately. Ensure that no more than five consecutive words are repeated and try to use similar words as substitutes where possible. Do not change the format of the multiple-choice question. When encountering mathematical formulas, please try to substitute the variable names. Ensure the formulas aren't identical to the original. When you come across a single number or letter, consider replacing it with a sentence. When encountering a long sequence of numbers, if they are separated by spaces, you can replace the spaces with commas; if separated by commas, you can replace them with spaces. If all else fails, you can directly translate the numbers and chemicals into the target language.

HumanEval Rephrase Instructions

Please make significant modifications to the program below. Make as many changes as possible by: 1. Ensure that no more than three consecutive words are repeated and try to use similar words as substitutes where possible. 2. Please ensure there aren't 50 consecutive repeated characters. 3. Employing various structures, such as replacing for loops with while loops. 4. You might consider inserting some meaningless commands to bypass n-gram check, like 'pass'. 5. Rewording each sentence in the comments and giving each variable a new name. 6. Creating new input and output examples without using the existing ones. 7. If feasible, implement the function with a different algorithm.

HumanEval Translate Instructions

Please translate the given program from Python to C. Make as many changes as possible by: 1. Ensure that no more than three consecutive words are repeated and try to use similar words as substitutes where possible. 2. Please ensure there aren't 50 consecutive repeated characters. 3. Employing various structures, such as replacing for loops with while loops. 4. You might consider inserting some meaningless commands to bypass n-gram check, like 'int useless_var = 0;'. 5. Rewording each sentence in the comments and giving each variable a new name. 6. Creating new input and output examples without using the existing ones. 7. If feasible, implement the function with a different algorithm.

B. Rephrase Examples

Below are examples of rephrased samples in other real-world datasets.


MATHInstruct Rephrased Sample (before Sep. 30, 2023)

MATH test:

• The volume of a cone is given by the formula V = (1/3)Bh, where B is the area of the base and h is the height. The area of the base of a cone is 30 square units, and its height is 6.5 units. What is the number of cubic units in its volume?
• If p(x) = 2 - x^2 and q(x) = 6/x, what is the value of p(q(2))?
• Simplify the expression (x^5 + 3x^2 + 3x^5) - (x^7 + 2x^2 + 6x^5).
• The equation of the circle that passes through (-1, 6) and which has a center at (2, 3) can be written as x^2 + y^2 + Ax + By + C = 0. Find A × B × C.

MATHInstruct:

• The volume of a cone is given by the formula V = (1/3)Bh, where B is the area of the base and h is the height. The area of the base of a cone is 30 square units, and its height is 6.5 units. What is the number of cubic units in its volume? Let's write a Python program to solve it.
• If p(x) = 2 - x^2 and q(x) = 6/x, what is the value of p(q(2))? Please write a program to solve it.
• Simplify the expression (x^5 + 3x^2 + 3x^5) - (x^7 + 2x^2 + 6x^5). Please respond by writing a program in Python.
• The equation of the circle that passes through (-1, 6) and which has a center at (2, 3) can be written as x^2 + y^2 + Ax + By + C = 0. Find A × B × C. Let's write a Python program to solve it.

Evol-Instruct-Code-80k-v1 Rephrased Sample

HumanEval test:

    def fib(n: int):
        """Return n-th Fibonacci number.
        >>> fib(10)
        55
        >>> fib(1)
        1
        >>> fib(8)
        21
        """
        if n == 0:
            return 0
        if n == 1:
            return 1
        return fib(n - 1) + fib(n - 2)

Evol-Instruct-Code-80k-v1:

    def fib(n):
        if n == 0:
            return 0
        elif n == 1:
            return 1
        else:
            return fib(n-1) + fib(n-2)

    n = int(input("Enter a positive integer: "))
    print("The nth Fibonacci number is", fib(n))


CodeExercise-Python-27k Rephrased Sample

HumanEval test:

    def digitSum(s):
        """Task
        Write a function that takes a string as input and returns
        the sum of the upper characters only' ASCII codes.
        Examples:
        digitSum("") => 0
        digitSum("abAB") => 131
        digitSum("abcCd") => 67
        digitSum("helloE") => 69
        digitSum("woArBld") => 131
        digitSum("aAaaaXa") => 153
        """
        if s == "": return 0
        return sum(ord(char) if char.isupper() else 0 for char in s)

CodeExercise-Python-27k:

    def sum_ascii_values(string):
        """
        This function takes a string as input and returns the sum of
        ASCII values of all uppercase letters in the string.
        Parameters:
        string (str): The input string.
        Returns:
        int: The sum of ASCII values of uppercase letters in the string.
        Example:
        >>> sum_ascii_values('Hello World')
        628
        """
        # Initialize sum to 0
        ascii_sum = 0
        # Iterate through each character in the input string
        for char in string:
            # Check if the character is an uppercase letter
            if char.isupper():
                # Add the ASCII value of the uppercase letter to the sum
                ascii_sum += ord(char)
        return ascii_sum

WizardLM-evol-instruct-V2 Rephrased Sample

MMLU test:

Question: The 'nuclear family' means:
A. a group of people sharing living accommodation and meals
B. a network of relatives extended within or between generations
C. the new family created when an adult leaves home and gets married
D. a two generation unit of parents and their children
Answer: D

WizardLM-evol-instruct-V2:

Question: Define the concept of a 'nuclear family'.
Answer: A nuclear family is a type of family structure consisting of a married couple and their children living together as a unit, typically in a single household. This type of family structure is often viewed as the traditional or ideal family structure in many cultures. In some cases, nuclear families may also include grandparents or other close relatives living together in the same household. However, the defining characteristic of a nuclear family is that it consists of a single parental unit and their dependent offspring.


The Stack Rephrased Sample

HumanEval test:

    def is_happy(s):
        """You are given a string s.
        Your task is to check if the string is happy or not.
        A string is happy if its length is at least 3 and
        every 3 consecutive letters are distinct.
        For example:
        is_happy(a) => False
        is_happy(aa) => False
        is_happy(abcd) => True
        is_happy(aabb) => False
        is_happy(adb) => True
        is_happy(xyy) => False
        """
        if len(s) < 3:
            return False
        for i in range(len(s) - 2):
            if s[i] == s[i+1] or s[i+1] == s[i+2] or s[i] == s[i+2]:
                return False
        return True

The Stack:

    #[PROMPT]
    def is_happy(s):
        """You are given a string s.
        Your task is to check if the string is happy or not.
        A string is happy if its length is at least 3 and
        every 3 consecutive letters are distinct.
        For example:
        is_happy(a) => False
        is_happy(aa) => False
        is_happy(abcd) => True
        is_happy(aabb) => False
        is_happy(adb) => True
        is_happy(xyy) => False
        """
    #[SOLUTION]
        if len(s) < 3:
            return False
        for i in range(len(s) - 2):
            if s[i] == s[i+1] or s[i+1] == s[i+2] or s[i] == s[i+2]:
                return False
        return True

StarCoder-Data Rephrased Sample

HumanEval test:

    def iscube(a):
        '''
        Write a function that takes an integer a and returns True
        if this ingeger is a cube of some integer number.
        Note: you may assume the input is always valid.
        Examples:
        iscube(1) ==> True
        iscube(2) ==> False
        iscube(-1) ==> True
        iscube(64) ==> True
        iscube(0) ==> True
        iscube(180) ==> False
        '''
        a = abs(a)
        return int(round(a ** (1. / 3))) ** 3 == a

StarCoder-Data:

    def iscube(a):
        '''
        Write a function that takes an integer a and returns True
        if this ingeger is a cube of some integer number.
        Note: you may assume the input is always valid.
        Examples:
        iscube(1) ==> True
        iscube(2) ==> False (the length of each side must
            be greater than zero)
        iscube(-1) ==> True
        iscube(64) ==> True
        iscube(0) ==> True
        iscube(180) ==> False

        Example solution:
        # line 1
        a = abs(a)
        # line 2
        cube_root = int(round(a ** (1. / 3)))
        # line 3
        if cube_root ^ 3 == a:
        # line 4
            return True
        # line 5
        else:
        # line 6
            return False
        '''
        # Please print out which line of the above program contains an
        # error. E.g. if the bug is on line 4 then print 4
        # END OF CONTEXT
        print("3")
        # END OF SOLUTION