
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

Shuo Yang *1,2, Wei-Lin Chiang *1, Lianmin Zheng *1, Joseph E. Gonzalez 1, Ion Stoica 1

arXiv:2311.04850v2 [cs.CL] 11 Nov 2023

Abstract

Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets. While most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, we show that these methods are insufficient, and that simple variations of test data (e.g., paraphrasing, translation) can easily bypass these decontamination measures. Furthermore, we demonstrate that if such variations of test data are not eliminated, a 13B model can easily overfit a test benchmark and achieve drastically high performance, on par with GPT-4. We validate these observations on widely used benchmarks such as MMLU, GSM-8k, and HumanEval. To address this growing risk, we propose a stronger LLM-based decontamination method and apply it to popular pre-training and fine-tuning datasets, revealing significant previously unknown test overlap. For example, in pre-training sets such as RedPajama-Data-1T and StarCoder-Data, we identify that 8-18% of the HumanEval benchmark overlaps. Interestingly, we also find such contamination in synthetic datasets generated by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We urge the community to adopt stronger decontamination approaches when using public benchmarks, and we call for the community to actively develop fresh one-time exams to evaluate models accurately. Our decontamination tool is publicly available at https://github.com/lm-sys/llm-decontaminator.

1. Introduction

The fast-growing capabilities of large language models make their evaluation more challenging than ever (Chang et al., 2023). Although the community has established many benchmarks over a short period of time, benchmark scores do not always reflect performance on real-world tasks. There is evidence that many prevalent benchmarks may have contaminated pre-training or fine-tuning datasets. The contamination analysis in Llama-2 (Touvron et al., 2023) finds that over 10% of the MMLU test samples are highly contaminated, and GPT-4's technical report (OpenAI, 2023) shows that 25% of HumanEval was contaminated in its training data. A similar situation applies to open-source datasets: the StarCoder Data report (Li et al., 2023) shows that hundreds of test cases in a popular code pre-training set, the Stack (Kocetkov et al., 2022), are contaminated with benchmarks.

Despite being recognized as a crucial issue, accurately detecting contamination remains an open and challenging problem. The most commonly used approaches are n-gram overlap and embedding similarity search. N-gram overlap relies on string matching to detect contamination and is widely used by leading developments such as GPT-4 (OpenAI, 2023), PaLM (Anil et al., 2023), and Llama (Touvron et al., 2023); however, it suffers from limited accuracy. Embedding similarity search uses the embeddings of pre-trained models (e.g., BERT) to find similar and potentially contaminated examples, but choosing a similarity threshold that strikes a balance between recall and precision is often challenging. Moreover, there is growing interest in training models on synthetic data produced by LLMs (e.g., GPT-4) (Gunasekar et al., 2023; Taori et al., 2023; Wang et al., 2023b; Xu et al., 2023; Mukherjee et al., 2023), in which contamination may be even harder to detect by string matching. The Phi-1 report (Gunasekar et al., 2023) discovers a significant portion of its synthetic data to be similar to test samples in HumanEval yet undetectable by n-gram overlap.

To study decontamination methods, in Section 3 we propose the concept of a "rephrased sample," which has the same semantics as the original sample but is hard to detect by existing contamination tests. Rephrased samples are generated by using LLMs to paraphrase test samples or translate them into another language.

* Equal contribution. 1 UC Berkeley. 2 Shanghai Jiao Tong University. Correspondence to: Shuo Yang <andy [email protected]>, Ion Stoica <[email protected]>.


Figure 1: A failure case of existing contamination detection methods (n-gram overlap, embedding similarity) on the MMLU benchmark. We place a question mark because the embedding similarity approach struggles to distinguish the rephrased question from other questions in the same subject (high school US history). After rephrasing MMLU test cases, a Llama-2-13B trained on the rephrased test set reaches 85.9 accuracy on MMLU while remaining undetectable by n-gram overlap.

We show that if such rephrased samples are used for training, the resulting model can easily overfit and reach drastically high performance on test benchmarks. Figure 1 demonstrates this concept with a test example from the MMLU benchmark. We observe this phenomenon in popular benchmarks such as MMLU, GSM-8k, and HumanEval, where a fine-tuned 13B Llama model can match GPT-4's performance on all benchmarks while going undetected by n-gram overlap as contamination, as shown in Figure 2. Being able to detect such rephrased samples therefore becomes critical. We provide an in-depth analysis of why existing decontamination methods fail and propose a new LLM-based decontamination method in Section 4. Our method first uses embedding similarity search to get the top-k samples with the highest similarity to a given test sample and then prompts a strong LLM such as GPT-4 to examine whether any of the top-k samples is too close to the test case. Results show that our proposed LLM decontaminator works significantly better than existing methods.

Figure 2: After fine-tuning on rephrased samples, Llama 2 and CodeLlama achieve performance on par with GPT-4.

In Section 5.3, we apply our decontaminator to several widely used pre-training and fine-tuning datasets and successfully reveal previously unknown test overlap with public benchmarks. As shown in Figure 3, in pre-training sets such as RedPajama-Data-1T and StarCoder-Data, we identify that 8-18% of the HumanEval benchmark overlaps. We also find that a synthetic dataset generated by GPT-3.5, CodeAlpaca (Chaudhary, 2023), contains a significant portion (12.8%) of rephrased samples from HumanEval. This suggests a potential contamination risk when training with synthetic data generated by LLMs. We urge the community to adopt more robust decontamination methods for evaluating LLMs on public benchmarks. To address these concerns at their core, we advocate for the development of fresh, one-time exams, similar to Codeforces and Kaggle competitions, for the accurate assessment of LLMs.

2. Background

Contamination occurs when test set information leaks into the training set, resulting in an overly optimistic estimate of the model's score (accuracy, AUC, etc.). In this section, we introduce common contamination detection methods: n-gram overlap, embedding similarity search, decoding matching, and the influence function.


Figure 3: The contamination percentage of the HumanEval benchmark in each training dataset: The Stack (4G) 18.9%, StarCoder-Data (2.4G) 15.9%, CodeAlpaca 12.8%, RedPajama-Data (16G) 8.53%. Note that The Stack (4G), StarCoder-Data (2.4G), and RedPajama-Data (16G) are subsets.
Table 1 compares the computational costs of these methods and whether they require access to the training data or the model.

N-gram overlap. The most common and widely used decontamination method is n-gram overlap. The GPT-3 paper (Brown et al., 2020) defines a 13-gram overlap as contamination, and the GPT-4 report (OpenAI, 2023) defines a 50-character overlap as contamination. N-gram overlap detection is favored for its simplicity and speed, but it can yield a high false negative rate when there are even small differences between strings.

Embedding similarity search. Embedding similarity search uses transformer-generated embeddings to capture the semantics of prompts. Popular approaches use models such as Sentence-BERT (Reimers & Gurevych, 2019) to generate embeddings and employ cosine similarity to measure the relevance of prompts; high similarity between training and test prompts suggests potential contamination (Lee et al., 2023). Although this approach captures more semantic information than n-gram overlap, it requires specifying a threshold: setting it too high results in a high false negative rate, while setting it too low leads to a high false positive rate.
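
Both baselines are straightforward to sketch. The following minimal illustration assumes the sentence-transformers package and the multi-qa-MiniLM-L6-cos-v1 model used in Section 5.2; the 10-gram size and 0.5 threshold likewise mirror those experiments.

    from sentence_transformers import SentenceTransformer, util

    def ngram_overlap(train_text, test_text, n=10):
        """Flag contamination if the two texts share any word-level n-gram."""
        def ngrams(text):
            words = text.split()
            return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        return bool(ngrams(train_text) & ngrams(test_text))

    def embedding_similar(train_text, test_text, model, threshold=0.5):
        """Flag contamination if cosine similarity exceeds the threshold."""
        emb = model.encode([train_text, test_text], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item() > threshold

    model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
    test = "Janet's ducks lay 16 eggs per day. ..."
    train = "Janet's ducks produce 16 eggs each day. ..."
    print(ngram_overlap(train, test))             # False: rephrasing breaks 10-grams
    print(embedding_similar(train, test, model))  # likely True: semantics preserved
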
Decoding matching. Both n-gram overlap and embedding similarity search require access to the training data. When the training data is unavailable but the model is, decoding matching can be used as an alternative detection method (Li, 2023). The intuition is that a model trained on contaminated data is more likely to auto-complete a partial test prompt. However, an auto-completed test prompt does not necessarily mean the model was trained on contaminated data, and a model trained on variations of the test cases will not auto-complete the original prompt either. Decoding matching is therefore often not acknowledged as definitive evidence of contamination.

Influence function. When both the model and the training data are available, the influence function (Koh & Liang, 2020) can be used to identify contaminated samples. This method takes a test sample and iteratively calculates an influence factor for each training sample, quantitatively measuring how relevant each training sample is to the test sample. Sorting by influence factor yields the most relevant training examples, which humans can then judge against the contamination criteria. However, this approach is impractical because it incurs a high computational overhead.

3. Rephrased Samples

Our goal is to investigate whether simple variations of test sets included in the training set affect the resulting benchmark performance. We refer to such variations of test cases as "rephrased samples".

We consider benchmarks from various domains, including math, knowledge, and coding. Example 1 is a rephrased sample from GSM-8k that 10-gram overlap fails to detect even though the semantics are unchanged.

Example 1 (GSM-8k Rephrased Sample)

Original Test Case: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?

Rephrased Test Case: Janet's ducks produce 16 eggs each day. She consumes three of them for her morning meal and uses four to bake muffins for her friends daily. The remaining eggs are sold at the daily farmers' market for $2 per egg. What is the daily amount in dollars that she earns at the farmers' market?

3.1. Rephrasing Techniques

Rephrasing techniques differ subtly because benchmark contamination takes different forms. For text-based benchmarks, we rephrase test cases without altering semantics, for example by rearranging word order or substituting synonymous terms. For code-based benchmarks, we vary coding styles, naming conventions, and implementations, but the semantics remain unchanged.

For the rephrasing process itself, Algorithm 1 presents a simple procedure for a given test set that helps a test sample escape detection. It first employs a high-quality large language model (e.g., GPT-4) to produce a rephrased version of the test prompt, and then uses a detector such as n-gram overlap to ensure the rephrased sample cannot be detected. To encourage diverse outputs, we set a non-zero initial temperature. Applying this process to each prompt in the test set yields a rephrased test set. "RephraseLLM" denotes the high-quality LLM, such as GPT-4 or Claude, and "isContaminated" can be any contamination detection method, such as n-gram overlap or embedding similarity search.


Table 1: Contamination detection methods. M denotes the size of the training set, and N the size of the test set.

Method                       | Training data access | Model access | Computational cost
N-gram overlap               | yes                  | no           | O(MN)
Embedding similarity search  | yes                  | no           | O(MN + M + N)
Decoding matching            | no                   | yes          | O(N)
Influence function           | yes                  | yes          | O(M^2 + MN)

Algorithm 1: The algorithm for rephrasing samples

Rephrase(TestSet, MaxRetry):
 1: RephrasedSet ← ∅
 2: for t in TestSet do
 3:   s ← RephraseLLM(t)
 4:   retry ← 0
 5:   while isContaminated(s, t) do
 6:     s ← RephraseLLM(t)
 7:     retry ← retry + 1
 8:     if retry > MaxRetry then
 9:       s ← null
10:      break
11:    end if
12:  end while
13:  RephrasedSet ← RephrasedSet ∪ {s}
14: end for
15: return RephrasedSet
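
A minimal Python sketch of Algorithm 1 follows; rephrase_llm is a hypothetical wrapper around a chat model such as GPT-4 (the actual prompts are listed in Appendix A), and is_contaminated can be any detector, for example the 10-gram check sketched in Section 2.

    def rephrase_llm(prompt):
        """Hypothetical wrapper around a chat LLM (e.g., GPT-4) that returns a
        paraphrase of `prompt`; a non-zero temperature encourages diverse outputs."""
        raise NotImplementedError  # call your LLM API of choice here

    def rephrase(test_set, is_contaminated, max_retry=5):
        """Algorithm 1: rephrase each test case until the detector no longer flags it."""
        rephrased_set = []
        for t in test_set:
            s = rephrase_llm(t)
            retry = 0
            # Re-sample until the rephrased sample escapes detection or we give up.
            while is_contaminated(s, t):
                s = rephrase_llm(t)
                retry += 1
                if retry > max_retry:
                    s = None
                    break
            rephrased_set.append(s)
        return rephrased_set
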
3.2. Translation Techniques

Rephrased samples go beyond modifications in word order. Real-world datasets contain many rephrasing techniques, including translation. These techniques make rephrased samples more concealed while still helping models achieve dramatic score improvements.

Prompts with identical meanings in different languages yield different embeddings under most language models. By translating test prompts into other languages, we can therefore evade both n-gram overlap detection and standard embedding similarity search; only embedding models specifically trained on multiple languages can detect a translated sample.

For text-based data, the translation technique evades both n-gram overlap and embedding similarity search while significantly boosting the score. It capitalizes on the model's multilingual translation capabilities, effectively transforming a knowledge assessment into a translation task. Translation also works well for coding benchmarks: we can translate a program from Python to C or Java that solves the same problem. To further investigate the impact of translation techniques on coding benchmarks, we propose multi-language data augmentation.

Multi-language data augmentation. For coding benchmarks, we use multi-language data augmentation to strengthen the translation technique. By incorporating multiple languages, we improve the model's generalization ability and ensure it understands that translated and original code serve the same function. In Section 5.1, our experiments indicate that multilingual data augmentation yields better results than single-language translation.

4. LLM Decontaminator

In this section, we propose a new contamination detection method that accurately removes rephrased samples from a dataset relative to a benchmark.

4.1. Algorithm

Section 2 discusses the limitations of existing detection methods, including n-gram overlap and embedding similarity search. To address these limitations, we introduce the "LLM decontaminator" in Algorithm 2, which involves two steps. First, for each test case, it identifies the top-k training items with the highest similarity using embedding similarity search. Second, an advanced LLM, such as GPT-4, evaluates whether each pair is essentially the same. This approach determines how many rephrased samples a dataset contains at a moderate computational overhead. "Template" is a structured prompt that, when paired with a test case and a training case, instructs the "LLMDetector" to carry out a comparison and return either 'True' or 'False'; 'True' indicates that the training case may be a rephrased sample of the test case. "LLMDetector" is a high-quality LLM such as GPT-4, and "TopKSimilarity" retrieves the top-k most similar samples in the training data using embedding similarity search.


Algorithm 2: The algorithm for the LLM decontaminator

Decontaminate(TrainSet, TestSet, k, Template):
 1: Contamination ← ∅
 2: for t in TestSet do
 3:   for c in TopKSimilarity(TrainSet, t, k) do
 4:     s ← LLMDetector(Template, t, c)
 5:     if s = True then
 6:       Contamination ← Contamination ∪ {(t, c)}
 7:     end if
 8:   end for
 9: end for
10: return Contamination
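
A Python sketch of Algorithm 2 is shown below, assuming the sentence-transformers package for the top-k retrieval step; llm_judge is a hypothetical wrapper around a strong LLM such as GPT-4, and the template wording here is illustrative rather than the paper's exact prompt.

    from sentence_transformers import SentenceTransformer, util

    TEMPLATE = ("I will give you a test case and a training case. Answer 'True' "
                "if the training case is a rephrased version of the test case, "
                "otherwise answer 'False'.\nTest: {test}\nTrain: {train}")

    def llm_judge(prompt):
        """Hypothetical wrapper that sends `prompt` to a strong LLM (e.g., GPT-4)
        and parses its 'True'/'False' verdict into a boolean."""
        raise NotImplementedError

    def decontaminate(train_set, test_set, k=1,
                      model_name="multi-qa-MiniLM-L6-cos-v1"):
        """Algorithm 2: embedding search narrows candidates, then an LLM judges."""
        model = SentenceTransformer(model_name)
        train_emb = model.encode(train_set, convert_to_tensor=True)
        contamination = []
        for t in test_set:
            t_emb = model.encode(t, convert_to_tensor=True)
            # Top-k most similar training items for this test case.
            hits = util.semantic_search(t_emb, train_emb, top_k=k)[0]
            for hit in hits:
                c = train_set[hit["corpus_id"]]
                if llm_judge(TEMPLATE.format(test=t, train=c)):
                    contamination.append((t, c))
        return contamination
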
4.2. Contamination Detection Visualization

In Figure 4 we present a Venn diagram of contamination and the different detection methods. The LLM decontaminator takes advantage of embedding similarity search, which helps it rapidly filter out possible contamination, and it further relies on a strong LLM's reliable judgments. We show that n-gram overlap detection yields a higher false negative rate when detecting rephrased samples, while embedding similarity search detects many false positives at a high threshold. Notably, the LLM decontaminator shows higher accuracy in detecting rephrased samples. See Section 5.1 for comprehensive experimental results.

5. Experiments

In Section 5.1, we demonstrate that models trained on rephrased samples can achieve dramatically high scores, matching GPT-4's performance on three widely used benchmarks: MMLU, HumanEval, and GSM-8k. This suggests that rephrased samples should be considered contamination and removed from training data. In Section 5.2, we evaluate different contamination detection methods on rephrased samples of MMLU and HumanEval. In Section 5.3, we apply our decontaminator to widely used training sets and discover previously unknown contamination.

5.1. Rephrased Samples Contaminate Benchmarks

5.1.1. MMLU Knowledge Benchmark

MMLU (Hendrycks et al., 2020) is one of the benchmarks with the widest range of subjects, covering 57 disciplines from abstract algebra to professional psychology. Rephrasing MMLU requires considering a multitude of scenarios; given the complexity of MMLU and its multiple-choice format, we explain the rephrasing details involved.

False positive issue. Using n-gram overlap detection on multiple-choice questions can easily produce false positives, especially when different questions share similar option arrangements. Example 2 shows a false positive from n-gram overlap detection: even though the two questions' multiple-choice answer patterns match exactly, they are different problems. To reduce false positives, we introduce a "question only" control group in the MMLU experiments, where "Question Only" refers to rephrasing just the question stem and "Full Prompt" refers to rephrasing both the question stem and the options.

Example 2 (Multi-Choice False Positive)

• Statement 1 — Every group of order p^2, where p is prime, is Abelian.
  Statement 2 — For a fixed prime p, a Sylow p-subgroup of a group G is a normal subgroup of G if and only if it is the only Sylow p-subgroup of G.
  A. True, True   B. False, False   C. True, False   D. False, True

• Statement 1 — Every group of order 42 has a normal subgroup of order 7.
  Statement 2 — Every group of order 42 has a normal subgroup of order 8.
  A. True, True   B. False, False   C. True, False   D. False, True

Other details. Large numbers often induce character overlap. To avoid this, we change the format of large numbers, for example alternating between commas and spaces as digit separators. Proprietary terms in various domains can also trigger overlap issues; to circumvent this, we rotate between abbreviations and full terms and adjust capitalization, particularly when choices involve names or chemical formulas.

Benchmark results. We train Llama-2-7B and Llama-2-13B on the rephrased test sets for 16 epochs. As shown in Table 2, Llama-2 7B and 13B trained on rephrased samples achieve dramatically high scores on MMLU, rising from 45.3 to 88.5. This suggests rephrased samples can significantly skew benchmark numbers and should be considered contamination. The original model is evaluated 5-shot; the model trained on rephrased data is evaluated 0-shot.

5.1.2. HumanEval Coding Benchmark

HumanEval (Chen et al., 2021) is a benchmark provided by OpenAI to evaluate the coding capabilities of large language models. It gives the model an incomplete piece of code and asks the model to complete it.


Figure 4: Venn diagram depicting training data subsets and contamination detection ranges. The solid circle represents the training data and its subsets; the dashed circles enclose areas flagged as potentially contaminated by each detection method. Notably, the LLM decontaminator shows higher accuracy: embedding similarity search detects broadly but with many false positives, n-gram overlap has a limited ability to spot rephrased samples, and the LLM decontaminator refines the results of embedding similarity search using LLMs, providing a precise and efficient contamination assessment.

Table 2: Accuracy on MMLU. "Rephrased Chinese" refers to translating the questions into Chinese.

Model        | Original | Rephrased English (Question Only) | Rephrased English (Full Prompt)
Llama 2 7B   | 45.3     | 88.5                              | 82.0
Llama 2 13B  | 54.8     | 89.9                              | 85.9

Model        | Test Set | Rephrased Chinese (Question Only) | Rephrased Chinese (Full Prompt)
Llama 2 7B   | 100      | 91.1                              | 74.3
Llama 2 13B  | 100      | 93.7                              | 80.9

Table 3: Pass@1 on HumanEval.

Model          | Original | Fine-tuned on test set
CodeLlama 7B   | 32.9     | 100
CodeLlama 13B  | 36.0     | 100

Model          | Fine-tuned on rephrased Python | Rephrased C | Multi-language
CodeLlama 7B   | 67.7                           | 45.7        | 59.8
CodeLlama 13B  | 81.1                           | 48.2        | 67.1

Table 4: Accuracy on GSM-8K.

Model        | Original | Fine-tuned on test set | Fine-tuned on rephrased English
Llama 2 7B   | 14.6     | 100                    | 86.7
Llama 2 13B  | 28.7     | 100                    | 95.3

Dead code injection. Real-world coding datasets contain some unreachable instructions. Such dead code seldom affects the semantics, and it helps rephrased samples escape decontamination. Since current detection methods do not use compilers to remove dead code from coding datasets, we investigate how dead code interferes with detection methods.
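
To make this concrete, here is a small illustration of our own (not a sample drawn from the audited datasets): a few no-op statements leave the function's behavior untouched while breaking up long character and n-gram overlaps.

    def add(x: int, y: int):
        """Add two numbers x and y."""
        return x + y

    def add_rephrased(a: int, b: int):
        """Compute the sum of the two inputs."""
        _unused = 0            # dead assignment, never read
        if False:              # unreachable branch
            return _unused
        pass                   # no-op filler that disrupts n-gram matches
        return a + b
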
5.1.3. GSM-8K M ATH B ENCHMARK
Benchmark results. We rephrase the HumanEval test set in Python and translate it into five programming languages: C, JavaScript, Rust, Go, and Java. We train CodeLlama 7B and 13B on each of these sets, and we also construct a multi-programming-language dataset comprising all five languages and train on it. Table 3 shows CodeLlama's performance when fine-tuned on rephrased Python, rephrased C, and the multi-programming-language dataset. CodeLlama 7B and 13B trained on rephrased samples achieve dramatically high scores on HumanEval, rising from 32.9 to 67.7 and from 36.0 to 81.1, respectively. In contrast, GPT-4 only achieves 67.0 on HumanEval.

5.1.3. GSM-8K Math Benchmark

GSM-8K (Cobbe et al., 2021) is a commonly used benchmark for testing the mathematical capabilities of LLMs.

Benchmark results. Table 4 shows that Llama-2 7B and 13B trained on rephrased samples achieve dramatically high scores on GSM-8K, rising from 28.7 to 95.3. The original model is evaluated 5-shot; the model trained on rephrased data is evaluated 0-shot.

We explore the detection problems further in Section 6, where they relate to the "number substituted only" case.


5.2. Evaluating Contamination Detection Methods

5.2.1. MMLU

We construct a decontamination benchmark based on three MMLU subjects: abstract algebra, sociology, and US history. To compare the accuracy of detection methods against rephrased samples, we construct 200 prompt pairs from the original and rephrased test sets: 100 random pairs and 100 rephrased pairs. The F1 score on these pairs measures a detection method's ability to detect contamination, with higher values indicating more precise detection.

We use random detection as our baseline; scores significantly above it indicate an effective detection method. For n-gram overlap, we choose a 10-gram approach. Embeddings are generated by multi-qa-MiniLM-L6-cos-v1 and distiluse-base-multilingual-cased-v1 (Reimers & Gurevych, 2019), with a threshold of 0.5.
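
Scoring a detector on these pairs reduces to a standard F1 computation over binary predictions; a minimal sketch, assuming a detector that maps a (train, test) pair to a boolean:

    def f1_score(detector, pairs, labels):
        """pairs: list of (train_text, test_text); labels: True if the pair is a
        genuine rephrasing, False if it is a random pairing."""
        preds = [detector(train, test) for train, test in pairs]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum(l and not p for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)
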
HumanEval test
As shown in Table 5, all detection methods except the LLM decontaminator introduce some false positives, and both rephrased and translated samples go undetected by n-gram overlap. With multi-qa BERT, embedding similarity search proves completely ineffective against translated samples; with multilingual BERT, it struggles on the US history subject. The LLM decontaminator's reliability and precision are evidenced by the highest minimum and average F1 scores.

5.2.2. HumanEval

We now show that existing detection methods fail to detect rephrased samples of HumanEval, while the LLM decontaminator succeeds. For HumanEval, we construct 200 prompt pairs following the method outlined above for MMLU. For n-gram overlap detection, we use both 10-gram and 50-character overlap. Embeddings are generated by CodeLlama and multi-qa-MiniLM-L6-cos-v1, with thresholds of 0.9 and 0.6, respectively. We evaluate the F1 score of n-gram overlap, embedding similarity search, and the LLM decontaminator.

According to Table 6, embedding similarity search is effective for detection within the same programming language, but the effect is less noticeable after translation. Among the methods examined, only the LLM decontaminator reliably detects rephrased samples in coding datasets. The similarity between programming languages may explain why rephrased C is tougher to spot than rephrased JavaScript: JavaScript and Python are both interpreted languages with dynamic typing and some functional programming constructs, so syntactically JavaScript may be closer to Python.

5.3. Contamination in Real-World Datasets

To demonstrate the effectiveness of the LLM decontaminator, we apply it to widely used real-world datasets and identify a substantial number of rephrased samples. Table 7 displays the contamination percentage of different benchmarks in each training dataset.

CodeAlpaca (Chaudhary, 2023) is a synthetic dataset generated by OpenAI's Davinci-003 using the self-instruct technique (Wang et al., 2023b). It contains 20K instruction-following examples used for fine-tuning the CodeAlpaca model, and CodeAlpaca-20K is used to train a number of well-known models, including Tulu (Wang et al., 2023a). Employing GPT-4 for detection with k=1 as the parameter, we find 21 rephrased samples from the HumanEval test set, accounting for 12.8%. Example 3 is a rephrased sample of HumanEval found in CodeAlpaca.

Example 3 (CodeAlpaca)

HumanEval test:

    def sum_to_n(n: int):
        """sum_to_n is a function that sums numbers from 1 to n.
        >>> sum_to_n(30)
        465
        >>> sum_to_n(100)
        5050
        >>> sum_to_n(5)
        15
        >>> sum_to_n(10)
        55
        >>> sum_to_n(1)
        1
        """
        return sum(range(n + 1))

CodeAlpaca:

    """
    Create a code that summation of all numbers between 1 to n.
    """
    def sum_all_nums(n):
        res = 0
        for i in range(1, n+1):
            res += i
        return res

    print(sum_all_nums(n)) # 15

RedPajama-Data-1T (Computer, 2023) is a widely used dataset for training open-source models; both MPT (Team, 2023) and OpenLlama (Geng & Liu, 2023) use it as their pre-training dataset. In our study, we sample 16G of data from its GitHub subset and apply the LLM decontaminator, identifying 14 HumanEval rephrased samples in total.


Figure 5: Distribution of embedding similarities among questions within the same subject. Note that it is difficult to set a unified decontamination threshold because subjects differ vastly: if the threshold is set to 0.8, "Abstract Algebra" may be properly spotted, but rephrased samples in "Sociology" become difficult to identify; if it is set to 0.4, "Abstract Algebra" produces a large number of false positives.
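
To illustrate why no single threshold works across subjects, the following sketch (our own toy illustration, not the paper's code) sweeps candidate thresholds and picks the F1-maximizing one per subject:

    import numpy as np

    def best_threshold(similarities, labels):
        """Sweep thresholds and return (threshold, F1) maximizing F1.
        similarities: cosine similarities per pair; labels: 1 if truly rephrased."""
        sims, labels = np.asarray(similarities), np.asarray(labels)
        best = (0.0, 0.0)
        for thr in np.linspace(0.0, 1.0, 101):
            preds = sims > thr
            tp = np.sum(preds & (labels == 1))
            fp = np.sum(preds & (labels == 0))
            fn = np.sum(~preds & (labels == 1))
            f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
            if f1 > best[1]:
                best = (thr, f1)
        return best

    # Toy scores: algebra pairs cluster high even when unrelated; sociology does not.
    print(best_threshold([0.9, 0.85, 0.7, 0.65], [1, 1, 0, 0]))  # high threshold
    print(best_threshold([0.6, 0.55, 0.3, 0.2], [1, 1, 0, 0]))   # much lower one
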

Table 5: F1 scores of different detection methods on MMLU. The bold numbers in the original indicate that the detection is reliable. Each subject reports three settings: Test Set / Rephrased English / Rephrased Chinese.

Method                    | Algebra             | Sociology           | US History
Random                    | 0.500  0.500  0.500 | 0.500  0.500  0.500 | 0.500  0.500  0.500
10-gram                   | 0.926  0      0     | 1      0      0     | 0.816  0      0
Emb (Multi-QA BERT)       | 0.990  0.985  0.179 | 0.995  0.985  0.020 | 0.980  0.805  0
Emb (Multilingual BERT)   | 0.939  0.934  0.939 | 1      0.985  1     | 0.990  0.111  0.985
LLM Decontaminator        | 1      0.960  0.990 | 1      0.940  0.950 | 1      0.970  0.980

Table 6: F1 scores of detection methods on HumanEval.

Method                | Test Set | Rephrased Python | Rephrased C | Rephrased JS
Random                | 0.500    | 0.500            | 0.500       | 0.500
10-gram               | 1        | 0                | 0           | 0
Emb (CodeLlama)       | 0.966    | 0.903            | 0.438       | 0.503
Emb (Multi-QA BERT)   | 0.985    | 0.938            | 0.774       | 0.788
LLM Decontaminator    | 1        | 0.995            | 0.974       | 0.980

Example 4 is a rephrased sample of HumanEval found in RedPajama.

Example 4 (RedPajama)

HumanEval test:

    def change_base(x: int, base: int):
        """Change numerical base of input number x to base.
        return string representation after conversion.
        base numbers are less than 10.
        >>> change_base(8, 3)
        '22'
        ...
        """
        ret = ""
        while x > 0:
            ret = str(x % base) + ret
            x //= base
        return ret

RedPajama:

    def convert_to_base(number, base):
        digits = "0123456789ABCDEF"
        if number < base:
            return digits[number]
        else:
            return convert_to_base(number // base, base) + digits[number % base]

MATH (Hendrycks et al., 2021) is a widely recognized math training dataset that spans various mathematical domains, including algebra, geometry, and number theory. It contributes to numerous math-centric datasets, such as MathInstruct (Yue et al., 2023).[1] The LLM decontaminator reveals 79 instances of self-rephrased samples, which constitute 1.58% of the MATH test set. Example 5 is a rephrased sample of the MATH test set found in the MATH training data.

[1] The dataset was downloaded on Sep 30, 2023.


Table 7: The percentage of rephrased-sample contamination in real-world datasets.

Training Set                    | Benchmark | Train Set Size | Test Set Size | Rephrased Samples | Percentage (%)
The Stack (4G subset)           | HumanEval | 500k           | 164           | 31                | 18.9
StarCoder-Data (2.4G subset)    | HumanEval | 500k           | 164           | 26                | 15.9
CodeExercise-Python             | HumanEval | 27k            | 164           | 26                | 15.9
CodeAlpaca                      | HumanEval | 20k            | 164           | 21                | 12.8
RedPajama-Data-1T (16G subset)  | HumanEval | 1625k          | 164           | 14                | 8.5
Evol-Instruct-Code              | HumanEval | 78.3k          | 164           | 13                | 7.9
rosettacode                     | HumanEval | 4.26k          | 164           | 4                 | 2.4
MATHInstruct                    | MATH Test | 262k           | 5000          | 769               | 15.4
MATH Train                      | MATH Test | 7.5k           | 5000          | 79                | 1.6
FLAN CoT                        | MMLU      | 184k           | 14042         | 76                | 0.5
WizardLM-Evol-Instruct          | MMLU      | 143k           | 14042         | 75                | 0.5

Example 5 (MATH Self-Contamination)

(MATH test) How many three-digit positive integers are multiples of 11?
(MATH train) How many positive 3-digit numbers are divisible by 11?

FLAN (Longpre et al., 2023) is a comprehensive knowledge training dataset encompassing a wide variety of data sources. We take the CoT subset, which constitutes 1.63% of FLAN, use GPT-4 for detection, and set k=1 for the decontamination parameters. The findings show that 76 test cases, or 0.543% of the MMLU test set, are rephrased.

Example 6 (FLAN CoT)

(MMLU test)
What type of meat is on a traditional Reuben sandwich?
A. turkey
B. bologna
C. corned beef
D. pepperoni
Answer: C

(FLAN CoT)
The Reuben sandwich is an American hot sandwich composed of corned beef, Swiss cheese, sauerkraut, and Russian dressing, grilled between slices of rye bread. Several variants exist.
What is the meat in a reuben sandwich? Let's have some stream of consciousness first.

We examine more datasets and present examples in Appendix B.

6. Discussion

In this section, we first discuss potential contamination beyond rephrased samples. We then discuss the importance of the LLM decontaminator when using an LLM such as GPT-4 to generate training data. Finally, we propose suggestions for enhancing LLM evaluation (e.g., with fresh one-time exams).

6.1. Beyond Rephrased Samples

In this study, we argue that rephrased test samples should be considered contamination because including them in the training data can skew benchmark results. However, formulating a precise definition of what constitutes contamination remains challenging. For instance, we discover in the GSM-8k math benchmark a training and a test example that differ only in their numbers (see Example 7).

Example 7 (GSM-8k Number-Substituted-Only Case)

(GSM-8k test) Emil is 19 years old now. When he turns 24, he will be half the age of his dad but twice as old as his brother. What is the sum of the ages of his dad and his brother now?
(GSM-8k) When Diane turns 30, she will be half the age of Alex and twice as old as Allison. Diane is 16 years old now. What is the sum of the ages of Alex and Allison now?

If models are trained on such number-substituted cases, they tend to merely memorize the solutions and may generalize poorly beyond the seen patterns. The resulting benchmark numbers may thus not be effective in capturing a model's performance in math problem-solving. This is an open question we suggest the community debate further.
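
One simple heuristic for surfacing such pairs, which we sketch here as our own illustration rather than a method proposed in the paper, is to normalize numbers (and, optionally, capitalized names) to placeholder tokens before running an overlap check:

    import re

    def normalize(text):
        """Map every number to <NUM> and every capitalized word to <NAME>, so
        two problems differing only in those details normalize very similarly."""
        text = re.sub(r"\d+", "<NUM>", text)
        return re.sub(r"\b[A-Z][a-z]+\b", "<NAME>", text)

    test = "Emil is 19 years old now. When he turns 24, ..."
    train = "Diane is 16 years old now. When she turns 30, ..."
    # After normalization, n-gram overlap between the two rises sharply even
    # though the raw strings share few 10-grams.
    print(normalize(test))
    print(normalize(train))
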


6.2. Contamination in Synthetic Data

Unintentional contamination may occur more often as models are increasingly trained on data generated by LLMs, in which subtle benchmark contamination may be present. For instance, in Section 5.3 we discover several contaminated samples in the CodeAlpaca dataset generated by GPT, and Phi-1 (Gunasekar et al., 2023) also detected subtle contamination in its LLM-generated data. As a result, we have to be more aware of potential contamination when training models on synthetic data, and we suggest that model developers adopt stronger decontamination measures.

6.3. Enhancing Benchmarks for LLMs

While our proposed decontamination method can serve as a useful tool, detecting contamination without access to the training data remains an open problem. We propose building fresh, one-time questions to evaluate LLMs instead of relying on static benchmarks. In the coding domain, for example, one could use weekly coding competitions such as Codeforces. We suggest that benchmarks iterate as fast as model development does.

7. Related Work

There has been interest in studying how to identify or extract training data from LLMs. These works examine LLMs' memorization from the perspective of data privacy (Carlini et al., 2021; Pan et al., 2020; Zanella-Béguelin et al., 2020; Balle et al., 2022) or discuss the boundary between generalization and memorization (Zhang et al., 2017; Olson et al., 2018; Recht et al., 2019; Carlini et al., 2023), but they do not focus on benchmark contamination.

There are also studies of contamination detection methods. Some concern detecting and filtering web datasets (Dodge et al., 2021; Xu & Koehn, 2017) using traditional detection techniques such as n-gram overlap. Others explore new detection methods, similar to decoding matching, that do not require access to training data. Exchange detection (Oren et al., 2023) considers the order of test cases within a benchmark, suggesting that a model that remembers the sequence of test cases may be contaminated. Min-k prob detection (Shi et al., 2023) uses outlier tokens to estimate LLM contamination: it analyzes the token probabilities within an arbitrary text X, and if the LLM assigns excessively high probabilities to some of these tokens, it may indicate that X has been mixed into the training set.

There are also related works on benchmark enhancement through perturbations (Zong et al., 2023), which prevent LLMs from memorizing answer patterns by modifying the question and requiring the LLM to output results in a specific format. Another approach employs dynamic benchmarks (Kiela et al., 2021; Ma et al., 2021), using human-in-the-loop evaluation to reduce the risk of benchmark contamination.

8. Conclusion

In this work, we study benchmark contamination for large language models and evaluate existing decontamination methods. We show that existing detection methods cannot detect test cases with simple variations, and we demonstrate that if such variations of test data are not eliminated, a 13B model can easily overfit the test benchmark and achieve drastically high performance. To address this, we propose a new detection method, the LLM decontaminator. Applying it to real-world datasets reveals previously unknown test overlap. We urge the community to adopt stronger decontamination approaches when using public benchmarks, and we call for the community to actively develop fresh one-time exams to accurately evaluate LLMs.

Acknowledgement

We would like to express our gratitude to Ying Sheng for the early discussion on rephrased samples. We also extend our thanks to Dacheng Li, Erran Li, Hao Liu, Jacob Steinhardt, Hao Zhang, and Siyuan Zhuang for providing insightful feedback. This project is partly supported by gifts from Anyscale, Astronomer, Google, IBM, Intel, Lacework, Microsoft, MBZUAI, Samsung SDS, Uber, and VMware. Lianmin Zheng is supported by a Meta Ph.D. Fellowship.

References
through perturbations (Zong et al., 2023), which prevents


Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., et al. PaLM 2 technical report, 2023.

Balle, B., Cherubin, G., and Hayes, J. Reconstructing training data with informed adversaries. In 2022 IEEE Symposium on Security and Privacy (SP), pp. 1138-1156. IEEE, 2022.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. Language models are few-shot learners, 2020.

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., and Raffel, C. Extracting training data from large language models, 2021.

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models, 2023.

Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., Yang, L., Yi, X., Wang, C., Wang, Y., et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.

Chaudhary, S. Code Alpaca: An instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca, 2023.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Computer, T. RedPajama: An open source recipe to reproduce LLaMA training dataset, April 2023. URL https://github.com/togethercomputer/RedPajama-Data.

Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus. arXiv preprint arXiv:2104.08758, 2021.

Geng, X. and Liu, H. OpenLLaMA: An open reproduction of LLaMA, May 2023. URL https://github.com/openlm-research/open_llama.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, 2023.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.

Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., and Williams, A. Dynabench: Rethinking benchmarking in NLP, 2021.

Kocetkov, D., Li, R., Allal, L. B., Li, J., Mou, C., Ferrandis, C. M., Jernite, Y., Mitchell, M., Hughes, S., Wolf, T., Bahdanau, D., von Werra, L., and de Vries, H. The Stack: 3 TB of permissively licensed source code, 2022.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions, 2020.

Lee, A. N., Hunter, C. J., and Ruiz, N. Platypus: Quick, cheap, and powerful refinement of LLMs, 2023.

Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., et al. StarCoder: may the source be with you!, 2023.

Li, Y. Estimating contamination via perplexity: Quantifying memorisation in language model evaluation, 2023.

Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., and Roberts, A. The Flan Collection: Designing data and methods for effective instruction tuning, 2023.

Ma, Z., Ethayarajh, K., Thrush, T., Jain, S., Wu, L., Jia, R., Potts, C., Williams, A., and Kiela, D. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. Advances in Neural Information Processing Systems, 34:10351-10367, 2021.

Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., and Awadallah, A. Orca: Progressive learning from complex explanation traces of GPT-4, 2023.

Olson, M., Wyner, A., and Berk, R. Modern neural networks generalize on small data sets. Advances in Neural Information Processing Systems, 31, 2018.

OpenAI. GPT-4 technical report, 2023.

Oren, Y., Meister, N., Chatterji, N., Ladhak, F., and Hashimoto, T. B. Proving test set contamination in black box language models, 2023.

Pan, X., Zhang, M., Ji, S., and Yang, M. Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP), pp. 1314-1331. IEEE, 2020.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet?, 2019.

Reimers, N. and Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. URL https://arxiv.org/abs/1908.10084.

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretraining data from large language models, 2023.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Team, M. N. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models, 2023.

Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K. R., Wadden, D., MacMillan, K., Smith, N. A., Beltagy, I., and Hajishirzi, H. How far can camels go? Exploring the state of instruction tuning on open resources, 2023a.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-Instruct: Aligning language models with self-generated instructions, 2023b.

Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. WizardLM: Empowering large language models to follow complex instructions, 2023.

Xu, H. and Koehn, P. Zipporah: A fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2945-2950, 2017.

Yue, X., Qu, X., Zhang, G., Fu, Y., Huang, W., Sun, H., Su, Y., and Chen, W. MAmmoTH: Building math generalist models through hybrid instruction tuning, 2023.

Zanella-Béguelin, S., Wutschitz, L., Tople, S., Rühle, V., Paverd, A., Ohrimenko, O., Köpf, B., and Brockschmidt, M. Analyzing information leakage of updates to natural language models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pp. 363-375, 2020.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization, 2017.

Zong, Y., Yu, T., Zhao, B., Chavhan, R., and Hospedales, T. Fool your (vision and) language model with embarrassingly simple permutations, 2023.


A. Rephrase Instruction Prompts

We constructed the following rephrase prompt template:

• Please rephrase the following question without altering its meaning.
• Ensure that no more than ten consecutive words are repeated and try to use similar words as substitutes where possible.
• Please ensure there aren't 50 consecutive identical characters.
• When encountering mathematical formulas, please try to substitute the variable names. Ensure the formulas aren't identical to the original. For instance, you can replace 'x' with 'y' or 'a'.

MMLU Rephrase Instructions

Please rephrase the following question without altering its meaning, ensuring you adjust the word order appropriately. Ensure that no more than five consecutive words are repeated and try to use similar words as substitutes where possible. Do not change the format of the multiple-choice question. When encountering mathematical formulas, please try to substitute the variable names. Ensure the formulas aren't identical to the original. When you come across a single number or letter, consider replacing it with a sentence. When encountering a long sequence of numbers, if they are separated by spaces, you can replace the spaces with commas; if separated by commas, you can replace them with spaces. Consider the prompt and choices as a whole; there shouldn't be consecutive words. If options are challenging to rephrase, consider altering the initial letter's case.

MMLU Translate Instructions

Please translate the following question into the target language, ensuring you adjust the word order appropriately. Ensure that no more than five consecutive words are repeated and try to use similar words as substitutes where possible. Do not change the format of the multiple-choice question. When encountering mathematical formulas, please try to substitute the variable names. Ensure the formulas aren't identical to the original. When you come across a single number or letter, consider replacing it with a sentence. When encountering a long sequence of numbers, if they are separated by spaces, you can replace the spaces with commas; if separated by commas, you can replace them with spaces. If all else fails, you can directly translate the numbers and chemicals into the target language.

HumanEval Rephrase Instructions

Please make significant modifications to the program below. Make as many changes as possible by: 1. Ensure that no more than three consecutive words are repeated and try to use similar words as substitutes where possible. 2. Please ensure there aren't 50 consecutive repeated characters. 3. Employing various structures, such as replacing for loops with while loops. 4. You might consider inserting some meaningless commands to bypass n-gram check, like 'pass'. 5. Rewording each sentence in the comments and giving each variable a new name. 6. Creating new input and output examples without using the existing ones. 7. If feasible, implement the function with a different algorithm.

HumanEval Translate Instructions

Please translate the given program from Python to C. Make as many changes as possible by: 1. Ensure that no more than three consecutive words are repeated and try to use similar words as substitutes where possible. 2. Please ensure there aren't 50 consecutive repeated characters. 3. Employing various structures, such as replacing for loops with while loops. 4. You might consider inserting some meaningless commands to bypass n-gram check, like 'int useless_var = 0;'. 5. Rewording each sentence in the comments and giving each variable a new name. 6. Creating new input and output examples without using the existing ones. 7. If feasible, implement the function with a different algorithm.

B. Rephrase Examples

Below are examples of rephrased samples in other real-world datasets.


MATHInstruct Rephrased Sample (before Sep. 30, 2023)

MATH test:

• The volume of a cone is given by the formula V = (1/3)Bh, where B is the area of the base and h is the height. The area of the base of a cone is 30 square units, and its height is 6.5 units. What is the number of cubic units in its volume?
• If p(x) = 2 - x^2 and q(x) = 6/x, what is the value of p(q(2))?
• Simplify the expression (x^5 + 3x^2 + 3x^5) - (x^7 + 2x^2 + 6x^5).
• The equation of the circle that passes through (-1, 6) and which has a center at (2, 3) can be written as x^2 + y^2 + Ax + By + C = 0. Find A × B × C.

MATHInstruct:

• The volume of a cone is given by the formula V = (1/3)Bh, where B is the area of the base and h is the height. The area of the base of a cone is 30 square units, and its height is 6.5 units. What is the number of cubic units in its volume? Let's write a Python program to solve it.
• If p(x) = 2 - x^2 and q(x) = 6/x, what is the value of p(q(2))? Please write a program to solve it.
• Simplify the expression (x^5 + 3x^2 + 3x^5) - (x^7 + 2x^2 + 6x^5). Please respond by writing a program in Python.
• The equation of the circle that passes through (-1, 6) and which has a center at (2, 3) can be written as x^2 + y^2 + Ax + By + C = 0. Find A × B × C. Let's write a Python program to solve it.

Evol-Instruct-Code-80k-v1 Rephrased Sample

HumanEval test:

    def fib(n: int):
        """Return n-th Fibonacci number.
        >>> fib(10)
        55
        >>> fib(1)
        1
        >>> fib(8)
        21
        """
        if n == 0:
            return 0
        if n == 1:
            return 1
        return fib(n - 1) + fib(n - 2)

Evol-Instruct-Code-80k-v1:

    def fib(n):
        if n == 0:
            return 0
        elif n == 1:
            return 1
        else:
            return fib(n-1) + fib(n-2)

    n = int(input("Enter a positive integer: "))
    print("The nth Fibonacci number is", fib(n))


CodeExercise-Python-27k Rephrased Sample

HumanEval test:

    def digitSum(s):
        """Task
        Write a function that takes a string as input and returns
        the sum of the upper characters only' ASCII codes.
        Examples:
        digitSum("") => 0
        digitSum("abAB") => 131
        digitSum("abcCd") => 67
        digitSum("helloE") => 69
        digitSum("woArBld") => 131
        digitSum("aAaaaXa") => 153
        """
        if s == "": return 0
        return sum(ord(char) if char.isupper() else 0 for char in s)

CodeExercise-Python-27k:

    def sum_ascii_values(string):
        """
        This function takes a string as input and returns the sum of
        ASCII values of all uppercase letters in the string.
        Parameters:
        string (str): The input string.
        Returns:
        int: The sum of ASCII values of uppercase letters in the string.
        Example:
        >>> sum_ascii_values('Hello World')
        628
        """
        # Initialize sum to 0
        ascii_sum = 0
        # Iterate through each character in the input string
        for char in string:
            # Check if the character is an uppercase letter
            if char.isupper():
                # Add the ASCII value of the uppercase letter to the sum
                ascii_sum += ord(char)
        return ascii_sum

WizardLM-evol-instruct-V2 Rephrased Sample

MMLU test:

Question: The 'nuclear family' means:
A. a group of people sharing living accommodation and meals
B. a network of relatives extended within or between generations
C. the new family created when an adult leaves home and gets married
D. a two generation unit of parents and their children
Answer: D

WizardLM-evol-instruct-V2:

Question: Define the concept of a 'nuclear family'.
Answer: A nuclear family is a type of family structure consisting of a married couple and their children living together as a unit, typically in a single household. This type of family structure is often viewed as the traditional or ideal family structure in many cultures. In some cases, nuclear families may also include grandparents or other close relatives living together in the same household. However, the defining characteristic of a nuclear family is that it consists of a single parental unit and their dependent offspring.


The Stack Rephrased Sample

HumanEval test:

    def is_happy(s):
        """You are given a string s.
        Your task is to check if the string is happy or not.
        A string is happy if its length is at least 3 and
        every 3 consecutive letters are distinct.
        For example:
        is_happy(a) => False
        is_happy(aa) => False
        is_happy(abcd) => True
        is_happy(aabb) => False
        is_happy(adb) => True
        is_happy(xyy) => False
        """
        if len(s) < 3:
            return False
        for i in range(len(s) - 2):
            if s[i] == s[i+1] or s[i+1] == s[i+2] or s[i] == s[i+2]:
                return False
        return True

The Stack:

    #[PROMPT]
    def is_happy(s):
        """You are given a string s.
        Your task is to check if the string is happy or not.
        A string is happy if its length is at least 3 and
        every 3 consecutive letters are distinct.
        For example:
        is_happy(a) => False
        is_happy(aa) => False
        is_happy(abcd) => True
        is_happy(aabb) => False
        is_happy(adb) => True
        is_happy(xyy) => False
        """
    #[SOLUTION]
        if len(s) < 3:
            return False
        for i in range(len(s) - 2):
            if s[i] == s[i+1] or s[i+1] == s[i+2] or s[i] == s[i+2]:
                return False
        return True

StarCoder-Data Rephrased Sample

HumanEval test:

    def iscube(a):
        '''
        Write a function that takes an integer a and returns True
        if this ingeger is a cube of some integer number.
        Note: you may assume the input is always valid.
        Examples:
        iscube(1) ==> True
        iscube(2) ==> False
        iscube(-1) ==> True
        iscube(64) ==> True
        iscube(0) ==> True
        iscube(180) ==> False
        '''
        a = abs(a)
        return int(round(a ** (1. / 3))) ** 3 == a

StarCoder-Data:

    def iscube(a):
        '''
        Write a function that takes an integer a and returns True
        if this ingeger is a cube of some integer number.
        Note: you may assume the input is always valid.
        Examples:
        iscube(1) ==> True
        iscube(2) ==> False (the length of each side must
            be greater than zero)
        iscube(-1) ==> True
        iscube(64) ==> True
        iscube(0) ==> True
        iscube(180) ==> False

        Example solution:
        # line 1
        a = abs(a)
        # line 2
        cube_root = int(round(a ** (1. / 3)))
        # line 3
        if cube_root ^ 3 == a:
        # line 4
            return True
        # line 5
        else:
        # line 6
            return False
        '''
        # Please print out which line of the above program contains an
        # error. E.g. if the bug is on line 4 then print 4
        # END OF CONTEXT
        print("3")
        # END OF SOLUTION