Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

Shuo Yang* 1 2, Wei-Lin Chiang* 1, Lianmin Zheng* 1, Joseph E. Gonzalez 1, Ion Stoica 1

* Equal contribution. 1 UC Berkeley. 2 Shanghai Jiao Tong University. Correspondence to: Shuo Yang <andy [email protected]>, Ion Stoica <[email protected]>.

1. Introduction

To study decontamination methods, in Section 3 we propose the concept of a "rephrased sample", which has the same semantics as the original sample but is hard to detect by existing contamination tests. Rephrased samples are generated by rewriting a test case with an LLM or by translating it into another language while preserving its semantics (see Section 3).
Figure 1: A failure case of existing contamination detection methods (n-gram overlap, embedding similarity) on MMLU
benchmark. We place a question mark since the embedding similarity approach struggles to distinguish the rephrased
question from other questions in the same subject (high school US history). After rephrasing MMLU test cases, a Llama-2-
13B trained on a rephrased test set can reach 85.9 accuracy on MMLU while being undetectable by n-gram overlap.
[Figure: percentage of rephrased samples detected in real-world training sets — The Stack (4G) 18.9%, StarCoder-Data (2.4G) 15.9%, CodeAlpaca 12.8%, RedPajama-Data (16G) 8.53%.]

The influence function (Koh & Liang, 2020) can be used to provide the most relevant training examples, where humans can then judge whether these training examples meet the contamination criteria. However, this approach is impractical because it induces a high computational overhead.
Table 1: Contamination detection methods. M denotes the size of the training set, and N indicates the size of the test set.

Method                        Requires access to training data   Requires access to model   Computational cost
N-gram overlap                yes                                no                         O(MN)
Embedding similarity search   yes                                no                         O(MN + M + N)
Decoding matching             no                                 yes                        O(N)
Influence function            yes                                yes                        O(M^2 + MN)
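To make the costs in Table 1 concrete, a minimal n-gram overlap test can be written in a few lines. The sketch below is illustrative: the word-level tokenization and the 13-gram window are our assumptions, not the paper's exact configuration.

def ngrams(text, n):
    # Word-level n-grams of `text`, lowercased for case-insensitive matching.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap_contaminated(train_example, test_set, n=13):
    # Flag the training example if it shares any n-gram with any test
    # example. Scanning all M training examples against N test examples
    # gives the O(MN) cost listed in Table 1.
    train_grams = ngrams(train_example, n)
    return any(train_grams & ngrams(test_example, n) for test_example in test_set)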
"RephraseLLM" is a high-quality LLM, such as GPT-4 or Claude. "isContaminated" can refer to any contamination detection method, such as n-gram overlap or embedding similarity search.

Algorithm 1 The algorithm for rephrasing samples

Ensure: Rephrase(TestSet, MaxRetry)
 1: RephrasedSet ← ∅
 2: for t in TestSet do
 3:   s ← RephraseLLM(t)
 4:   retry ← 0
 5:   while isContaminated(s, t) do
 6:     s ← RephraseLLM(t)
 7:     retry ← retry + 1
 8:     if retry > MaxRetry then
 9:       s ← null
10:       break
11:     end if
12:   end while
13:   RephrasedSet ← RephrasedSet ∪ {s}
14: end for
15: return RephrasedSet
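Algorithm 1 translates directly into Python. In the sketch below, rephrase_llm and is_contaminated are assumed wrappers around the rephrasing model and the chosen detection test; they stand in for "RephraseLLM" and "isContaminated" above.

def rephrase(test_set, max_retry, rephrase_llm, is_contaminated):
    rephrased_set = []
    for t in test_set:
        s = rephrase_llm(t)
        retry = 0
        # Keep rephrasing until the sample evades the detection test.
        while is_contaminated(s, t):
            s = rephrase_llm(t)
            retry += 1
            if retry > max_retry:
                s = None  # give up on this test case
                break
        rephrased_set.append(s)
    return rephrased_set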
3.2. Translation Techniques

There are other kinds of rephrased samples beyond modifications in word order. In real-world datasets, many rephrasing techniques appear, including the translation technique. By employing these techniques, rephrased samples become more concealed while still helping models achieve dramatic score improvements.

Prompts with identical meanings in different languages yield different embeddings in most language models. By translating test prompts into other languages, we can evade n-gram overlap detection and standard embedding similarity searches; only embedding models specifically trained on multiple languages can detect a translated sample.

For text-based data, the translation technique enables evasion of both n-gram overlap and embedding similarity search, while significantly boosting the score. This method capitalizes on the model's multilingual translation capabilities, effectively transforming a knowledge assessment into a translation task. For coding benchmarks, the translation technique also works well: we can translate a program from Python to C or Java that solves the same problem. To further investigate the impact of translation techniques on coding benchmarks, we propose multi-language data augmentation.

Multi-language data augmentation. For coding benchmarks, we use multi-language data augmentation to enhance the translation technique. By incorporating multiple languages, we improve the model's generalization ability and ensure it comprehends that translated and original code serve the same function. In Section 5.1, our experiments indicate that multi-language data augmentation yields better results than single-language translation.
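As an illustration, the augmentation can be as simple as emitting one training sample per target language for each problem. The data layout and the translate helper below are hypothetical, not the paper's released pipeline.

def build_multilang_set(problems, translate):
    # `translate(code, lang)` is an assumed helper that rewrites `code`
    # in the target language, e.g., by prompting an LLM.
    target_langs = ["c", "javascript", "rust", "go", "java"]
    augmented = []
    for p in problems:
        for lang in target_langs:
            augmented.append({
                "prompt": p["prompt"],
                "solution": translate(p["solution"], lang),
                "language": lang,
            })
    return augmented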
4. LLM Decontaminator

In this section, we propose a new contamination detection method that accurately removes rephrased samples from a dataset relative to a benchmark.

4.1. Algorithm

In Section 2, we discuss the limitations of existing detection methods, including n-gram overlap and embedding similarity search. To address these limitations, we introduce the "LLM decontaminator" in Algorithm 2. This method involves two steps: first, for each test case, it identifies the top-k training items with the highest similarity using embedding similarity search; then, an advanced LLM such as GPT-4 evaluates whether each pair is the same. This approach determines how many rephrased samples a dataset contains with a moderate computational overhead. "Template" is a structured prompt that, when paired with a test case and a training case, instructs the "LLMDetector" to carry out a comparison and return either 'True' or 'False'; 'True' indicates that the training case might be a rephrased sample of the test case. "LLMDetector" is a high-quality LLM like GPT-4. "TopKSimilarity" identifies the top k most similar samples in the training data using embedding similarity search.
Algorithm 2 The algorithm for LLM decontaminator

Ensure: Decontaminate(TrainSet, TestSet, k, Template)
 1: Contamination ← ∅
 2: for t in TestSet do
 3:   for c in TopKSimilarity(TrainSet, t, k) do
 4:     s ← LLMDetector(Template, t, c)
 5:     if s = True then
 6:       Contamination ← Contamination ∪ {(t, c)}
 7:     end if
 8:   end for
 9: end for
10: return Contamination
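Algorithm 2 also has a compact Python rendering. In the sketch below, top_k_similarity and llm_detector are assumed wrappers for the embedding search and the GPT-4 comparison prompt; the string-valued verdict mirrors the 'True'/'False' output described in Section 4.1.

def decontaminate(train_set, test_set, k, template, top_k_similarity, llm_detector):
    contamination = []
    for t in test_set:
        # Step 1: embedding search narrows each test case to k candidates.
        for c in top_k_similarity(train_set, t, k):
            # Step 2: a strong LLM judges whether the pair is a rephrasing.
            if llm_detector(template, t, c) == "True":
                contamination.append((t, c))
    return contamination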
4.2. Contamination Detection Visualization

In Figure 4 we present a Venn diagram of contamination and different detection methods. The LLM decontaminator takes advantage of embedding similarity search, which helps it rapidly filter out possible contamination. In addition, it utilizes the strong LLM's reliable judgments. We show that n-gram overlap detection can result in a higher false negative rate when detecting rephrased samples, and that embedding similarity search detects many false positives with a high threshold. Notably, the LLM decontaminator showcases higher accuracy when detecting rephrased samples. See Section 5.1 for comprehensive experimental results.
5. Experiments

In Section 5.1, we demonstrate that models trained on rephrased samples can achieve dramatically high scores, reaching GPT-4-level performance on three widely used benchmarks: MMLU, HumanEval, and GSM-8K. This suggests that rephrased samples should be considered contamination and should be removed from training data. In Section 5.2, we evaluate different contamination detection methods based on rephrased samples in MMLU/HumanEval. In Section 5.3, we apply our decontaminator to widely used training sets and discover previously unknown contamination.

5.1. Rephrased Samples Contaminate Benchmarks

5.1.1. MMLU Knowledge Benchmark

MMLU (Hendrycks et al., 2020) is one of the benchmarks with the widest range of subjects, covering 57 disciplines from abstract algebra to professional psychology. Rephrasing MMLU requires considering a multitude of scenarios. Given the complexity of MMLU and its multiple-choice format, it is necessary to explain the rephrasing details involved.

False positive issue. The use of n-gram overlap detection on multiple-choice questions can easily result in false positives, especially when different questions share similar option arrangements. Example 2 is a false positive example from n-gram overlap detection: even though the two questions' multiple-choice answer patterns match exactly, they are indeed different problems. To reduce false positive issues, we introduce a "question only" control group in the MMLU experiments. "Question Only" refers to rephrasing just the question stem, while "Full Prompt" refers to rephrasing both the question stem and the options.

Example 2 (Multi-Choice False Positive)

• Statement 1 — Every group of order p^2, where p is prime, is Abelian.
  Statement 2 — For a fixed prime p, a Sylow p-subgroup of a group G is a normal subgroup of G if and only if it is the only Sylow p-subgroup of G.
  A. True, True
  B. False, False
  C. True, False
  D. False, True

• Statement 1 — Every group of order 42 has a normal subgroup of order 7.
  Statement 2 — Every group of order 42 has a normal subgroup of order 8.
  A. True, True
  B. False, False
  C. True, False
  D. False, True

Other details. Large numbers often induce character overlap. To avoid this, we change the format of large numbers, such as alternating between commas and spaces, as sketched below. Proprietary terms in various domains can also trigger overlap issues. To circumvent this, we rotate between abbreviations and full terms and adjust capitalization, particularly when choices involve names or chemical formulas.
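The number-format rotation, for instance, can be a short regular-expression rewrite. The helper below is an illustrative sketch of alternating between comma and space digit grouping; the paper's exact rules are not published.

import re

def rotate_number_format(text):
    # Swap the digit-group separator of large numbers, e.g.,
    # "1,234,567" -> "1 234 567" and "1 234 567" -> "1,234,567".
    def swap(match):
        number = match.group(0)
        return number.replace(",", " ") if "," in number else number.replace(" ", ",")
    return re.sub(r"\d{1,3}(?:[ ,]\d{3})+", swap, text)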
Benchmark results. We train Llama-2-7B and Llama-2-13B on the rephrased test sets for 16 epochs. As shown in Table 2, Llama-2 7B and 13B trained on rephrased samples achieve dramatically high scores on MMLU, from 45.3 up to 88.5. This suggests that rephrased samples can significantly skew benchmark numbers and should be considered contamination. The original model is tested 5-shot, and the model trained on rephrased data is tested 0-shot.

5.1.2. HumanEval Coding Benchmark

HumanEval (Chen et al., 2021) is a benchmark provided by OpenAI to evaluate the coding capabilities of large language models. It provides the model with an incomplete piece of code and asks the model to complete it.
Figure 4: Venn diagram depicting training data subsets and contamination detection ranges. The solid circle represents the
training data and its subsets. The dashed circles enclose areas flagged for potential contamination by detection methods within
the dataset. Notably, the LLM decontaminator showcases higher accuracy. Embedding similarity search detects broadly but
with many false positives. N-gram overlap has a limited ability to spot rephrased samples. The LLM decontaminator refines
the results from embedding similarity search using LLMs, providing a precise and efficient contamination assessment.
Dead code injection. In real-world coding datasets, there are some unreachable instructions. This dead code seldom affects the semantics, and it helps rephrased samples escape decontamination. Given that current detection methods do not use compilers to remove dead code from coding datasets, we investigate how dead code interferes with detection methods.
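As a concrete illustration, the snippet below injects a branch that never executes into a solution. The injection pattern is our hypothetical example; the paper does not publish its templates.

DEAD_CODE = (
    "    if False:\n"
    "        unused = [i * i for i in range(10)]  # unreachable\n"
)

def inject_dead_code(function_source):
    # Insert the dead block right after the (assumed single-line) def
    # header. The function's behavior is unchanged, but the token
    # sequence no longer matches the original, defeating n-gram overlap.
    header, _, body = function_source.partition("\n")
    return header + "\n" + DEAD_CODE + body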
Benchmark results. We rephrase the HumanEval test set in Python and translate it into five programming languages: C, JavaScript, Rust, Go, and Java. We train CodeLlama 7B and 13B on each of these sets respectively. Then, we construct a multi-programming-language dataset comprising the five programming languages and train on it. Table 3 shows CodeLlama's performance on rephrased Python, rephrased C, and the multi-programming-language dataset. CodeLlama 7B and 13B trained on rephrased samples achieve dramatically high scores on HumanEval, from 32.9 to 67.7 and from 36.0 to 81.1, respectively. In contrast, GPT-4 can only achieve 67.0 on HumanEval.

5.1.3. GSM-8K Math Benchmark

GSM-8K (Cobbe et al., 2021) is a commonly used benchmark for testing the mathematical capabilities of LLMs.

Table 4: GSM-8K accuracy of Llama 2 models.

Model         Original   Fine-tune on test set   Fine-tune on rephrased English
Llama 2 7B    14.6       100                     86.7
Llama 2 13B   28.7       100                     95.3

Benchmark results. Table 4 shows that Llama-2 7B and 13B trained on rephrased samples achieve dramatically high scores on GSM-8K, from 28.7 to 95.3. The original model is tested 5-shot, and the model trained on rephrased data is tested 0-shot.

We will explore the detection problems with GSM-8K in Section 6 as they relate to the "number substituted only" case.
Figure 5: Distribution of embedding similarities among questions within the same subject. Note that it is difficult to set a
unified threshold to decontaminate due to the vast differences between subjects. For example, if we adjust the threshold to
0.8, “Abstract Algebra” may be properly spotted, but rephrased samples in “Sociology” become difficult to identify. If the
threshold is set to 0.4, “Abstract Algebra” will produce a large number of false positives.
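This kind of analysis can be reproduced with an off-the-shelf sentence embedding model. The sketch below uses the sentence-transformers library (Reimers & Gurevych, 2019); the model name and the 0.6 threshold are placeholder choices, and Figure 5 shows why no single threshold works across subjects.

from sentence_transformers import SentenceTransformer, util

def embedding_flagged_pairs(train_questions, test_questions, threshold=0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    train_emb = model.encode(train_questions, convert_to_tensor=True)
    test_emb = model.encode(test_questions, convert_to_tensor=True)
    # Cosine similarity matrix of shape (num_test, num_train).
    scores = util.cos_sim(test_emb, train_emb)
    return [
        (i, j, float(scores[i][j]))
        for i in range(len(test_questions))
        for j in range(len(train_questions))
        if scores[i][j] > threshold
    ]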
Table 5: F1 scores of different detection methods on MMLU. The bold numbers indicate that the detection is reliable.

Table 6: F1 scores of detection methods on HumanEval.

1 The dataset was downloaded on Sep 30, 2023.
Training Set                      Benchmark   Train Set Size   Test Set Size   Rephrased Samples   Percentage (%)
The Stack (4G subset)             HumanEval   500k             164             31                  18.9
StarCoder-Data (2.4G subset)      HumanEval   500k             164             26                  15.9
CodeExercise-Python               HumanEval   27k              164             26                  15.9
CodeAlpaca                        HumanEval   20k              164             21                  12.8
RedPajama-Data-1T (16G subset)    HumanEval   1625k            164             14                  8.5
Evol-Instruct-Code                HumanEval   78.3k            164             13                  7.9
rossetacode                       HumanEval   4.26k            164             4                   2.4
MATHInstruct                      MATH Test   262k             5000            769                 15.4
MATH Train                        MATH Test   7.5k             5000            79                  1.6
FLAN CoT                          MMLU        184k             14042           76                  0.5
WizardLM-Evol-Instruct            MMLU        143k             14042           75                  0.5
FLAN (Longpre et al., 2023) is a comprehensive knowledge training dataset, encompassing a wide variety of data sources. We take the CoT subset, which constitutes 1.63% of FLAN. We use GPT-4 for detection and set k = 1 for the decontamination parameters. The findings show that 76 test cases, or 0.543% of the MMLU test set, are rephrased.

Example 6 (FLAN CoT)

(MMLU test)
What type of meat is on a traditional Reuben sandwich?
A. turkey
B. bologna
C. corned beef
D. pepperoni
Answer: C

(FLAN CoT)
The Reuben sandwich is an American hot sandwich composed of corned beef, Swiss cheese, sauerkraut, and Russian dressing, grilled between slices of rye bread. Several variants exist.
What is the meat in a reuben sandwich? Let's have some stream of consciousness first.

We examine more datasets and present examples in Appendix B.

6.1. Beyond rephrased samples

In this study, we argue that rephrased test samples should be considered contamination because including them in the training data can skew the benchmark results. However, formulating a precise definition of what constitutes contamination remains challenging. For instance, we discover that in the GSM-8K math benchmark, a training example and a test example may differ only in their numbers (see Example 7).

Example 7 (GSM-8K Number Substituted Only Case)

(GSM-8K test)
Emil is 19 years old now. When he turns 24, he will be half the age of his dad but twice as old as his brother. What is the sum of the ages of his dad and his brother now?

(GSM-8K)
When Diane turns 30, she will be half the age of Alex and twice as old as Allison. Diane is 16 years old now. What is the sum of the ages of Alex and Allison now?
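Such variants are easy to produce mechanically. The sketch below swaps every integer in a problem for a nearby value, which is all it takes to create a "number substituted only" pair; recomputing the answer is left out and would be required in a real pipeline.

import random
import re

def substitute_numbers(problem, seed=0):
    rng = random.Random(seed)
    def fresh_number(match):
        value = int(match.group(0))
        # Shift by a small nonzero amount so the problem stays plausible
        # but the surface form changes.
        return str(max(1, value + rng.randint(1, 9)))
    return re.sub(r"\d+", fresh_number, problem)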
If models are trained on such number-substituted cases, they tend to only memorize the solutions and may generalize poorly beyond the seen patterns. Thus, the resulting benchmark numbers may not be effective in capturing the model's performance in math problem-solving. This is an open question we suggest the community debate further.
6.2. Contamination in Synthetic Data

The issue of unintentional contamination may occur more often as models are increasingly trained on data generated by LLMs, in which subtle benchmark contamination may be present. For instance, in Section 5.3 we discover several contaminated samples in the CodeAlpaca dataset generated by GPT. Phi-1 (Gunasekar et al., 2023) also detected subtle contamination in LLM-generated data. As a result, we have to be more aware of potential contamination when training models on synthetic data, and we suggest that model developers adopt stronger measures for decontamination.

6.3. Enhancing Benchmarks for LLMs

While our proposed decontamination method can serve as a useful tool, how to detect contamination without access to training data remains an open problem. We propose building fresh one-time questions to evaluate LLMs instead of relying on static benchmarks. For example, in the coding domain, one could consider using weekly coding competitions such as Codeforces. We suggest that benchmarks should iterate as fast as model development.

7. Related Work

There has been interest in studying how to identify or extract training data from LLMs. These works examine LLMs' memorization from the perspective of data privacy (Carlini et al., 2021; Pan et al., 2020; Zanella-Béguelin et al., 2020; Balle et al., 2022) or discuss the boundary between generalization and memorization (Zhang et al., 2017; Olson et al., 2018; Recht et al., 2019; Carlini et al., 2023), but they do not focus on the context of benchmark contamination.

Some studies on contamination detection methods have been conducted as well. Some are concerned with detecting and filtering web datasets (Dodge et al., 2021; Xu & Koehn, 2017), employing traditional detection techniques such as n-gram overlap. Others explore new detection methods, similar to decoding matching, that work without access to training data. Exchange detection (Oren et al., 2023) considers the order of test cases within a benchmark, suggesting that if a model remembers the sequence of test cases, it may be contaminated. Min-k prob detection (Shi et al., 2023) uses outlier tokens to estimate LLM contamination: it analyzes the token probabilities within an arbitrary text X, and if the LLM exhibits excessively high probabilities for some of these tokens, it may indicate that text X has been mixed into the training set.

There are also related works on benchmark enhancement through perturbations (Zong et al., 2023), which prevent LLMs from memorizing answer patterns. This method involves making modifications to the question and requires the LLM to output results in a specific format. Another approach is to employ dynamic benchmarks (Kiela et al., 2021; Ma et al., 2021), using human-in-the-loop evaluations to reduce the risk of benchmark contamination.

8. Conclusion

In this work, we study benchmark contamination in the context of large language models and evaluate existing decontamination methods. We show that existing detection methods cannot detect test cases with simple variations. We demonstrate that if such variations of test data are not eliminated, a 13B model can easily overfit the test benchmark and achieve drastically high performance. To address this, we propose a new detection method, the LLM decontaminator. We apply it to real-world datasets and reveal previously unknown test overlap. We urge the community to adopt stronger decontamination approaches when using public benchmarks, and we call for the community to actively develop fresh one-time exams to accurately evaluate LLMs.

Acknowledgement

We would like to express our gratitude to Ying Sheng for the early discussion on rephrased samples. We also extend our thanks to Dacheng Li, Erran Li, Hao Liu, Jacob Steinhardt, Hao Zhang, and Siyuan Zhuang for providing insightful feedback. This project is partly supported by gifts from Anyscale, Astronomer, Google, IBM, Intel, Lacework, Microsoft, MBZUAI, Samsung SDS, Uber, and VMware. Lianmin Zheng is supported by a Meta Ph.D. Fellowship.

References

Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., Chu, E., Clark, J. H., Shafey, L. E., Huang, Y., Meier-Hellstern, K., Mishra, G., Moreira, E., Omernick, M., Robinson, K., Ruder, S., Tay, Y., Xiao, K., Xu, Y., Zhang, Y., Abrego, G. H., Ahn, J., Austin, J., Barham, P., Botha, J., Bradbury, J., Brahma, S., Brooks, K., Catasta, M., Cheng, Y., Cherry, C., Choquette-Choo, C. A., Chowdhery, A., Crepy, C., Dave, S., Dehghani, M., Dev, S., Devlin, J., Díaz, M., Du, N., Dyer, E., Feinberg, V., Feng, F., Fienber, V., Freitag, M., Garcia, X., Gehrmann, S., Gonzalez, L., Gur-Ari, G., Hand, S., Hashemi, H., Hou, L., Howland, J., Hu, A., Hui, J., Hurwitz, J., Isard, M., Ittycheriah, A., Jagielski, M., Jia, W., Kenealy, K., Krikun, M., Kudugunta, S., Lan, C., Lee, K., Lee, B., Li, E., Li, M., Li, W., Li, Y., Li, J., Lim, H., Lin, H., Liu, Z., Liu, F., Maggioni, M., Mahendru, A., Maynez, J., Misra, V., Moussalem, M., Nado, Z., Nham, J., Ni, E., Nystrom, A., Parrish, A., Pellat, M., Polacek, M., Polozov, A., Pope, R., Qiao, S., Reif, E., Richter, B., Riley, P., Ros, A. C., Roy, A., Saeta, B., Samuel, R., Shelby, R., Slone, A., Smilkov, D., So, D. R., Sohn, D., Tokumine, S., Valter, D., Vasudevan, V., Vodrahalli, K., Wang, X., Wang, P., Wang, Z., Wang, T., Wieting, J., Wu, Y., Xu, K., Xu, Y., Xue, L., Yin, P., Yu, J., Zhang, Q., Zheng, S., Zheng, C., Zhou, W., Zhou, D., Petrov, S., and Wu, Y. Palm 2 technical report, 2023.
Balle, B., Cherubin, G., and Hayes, J. Reconstructing training data with informed adversaries. In 2022 IEEE Symposium on Security and Privacy (SP), pp. 1138–1156. IEEE, 2022.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., and Raffel, C. Extracting training data from large language models, 2021.

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models, 2023.

Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., Yang, L., Yi, X., Wang, C., Wang, Y., et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.

Chaudhary, S. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Computer, T. Redpajama: An open source recipe to reproduce llama training dataset, April 2023. URL https://github.com/togethercomputer/RedPajama-Data.

Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758, 2021.

Geng, X. and Liu, H. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, 2023.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021.

Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., and Williams, A. Dynabench: Rethinking benchmarking in nlp, 2021.

Kocetkov, D., Li, R., Allal, L. B., Li, J., Mou, C., Ferrandis, C. M., Jernite, Y., Mitchell, M., Hughes, S., Wolf, T., Bahdanau, D., von Werra, L., and de Vries, H. The stack: 3 tb of permissively licensed source code, 2022.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions, 2020.

Lee, A. N., Hunter, C. J., and Ruiz, N. Platypus: Quick, cheap, and powerful refinement of llms, 2023.

Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., Liu, Q., Zheltonozhskii, E., Zhuo, T. Y., Wang, T., Dehaene, O., Davaadorj, M., Lamy-Poirier, J., Monteiro, J., Shliazhko, O., Gontier, N., Meade, N., Zebaze, A., Yee, M.-H., Umapathi, L. K., Zhu, J., Lipkin, B., Oblokulov, M., Wang, Z., Murthy, R., Stillerman, J., Patel, S. S., Abulkhanov, D., Zocca, M., Dey, M., Zhang, Z., Fahmy, N., Bhattacharyya, U., Yu, W., Singh, S., Luccioni, S., Villegas, P., Kunakov, M., Zhdanov, F., Romero, M., Lee, T., Timor, N., Ding, J., Schlesinger, C., Schoelkopf, H., Ebert, J., Dao, T., Mishra, M., Gu, A., Robinson, J., Anderson, C. J., Dolan-Gavitt, B., Contractor, D., Reddy, S., Fried, D., Bahdanau, D., Jernite, Y., Ferrandis, C. M., Hughes, S., Wolf, T., Guha, A., von Werra, L., and de Vries, H. Starcoder: may the source be with you!, 2023.

Li, Y. Estimating contamination via perplexity: Quantifying memorisation in language model evaluation, 2023.

Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., and Roberts, A. The flan collection: Designing data and methods for effective instruction tuning, 2023.

Ma, Z., Ethayarajh, K., Thrush, T., Jain, S., Wu, L., Jia, R., Potts, C., Williams, A., and Kiela, D. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. Advances in Neural Information Processing Systems, 34:10351–10367, 2021.

Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., and Awadallah, A. Orca: Progressive learning from complex explanation traces of gpt-4, 2023.

Olson, M., Wyner, A., and Berk, R. Modern neural networks generalize on small data sets. Advances in Neural Information Processing Systems, 31, 2018.

OpenAI. Gpt-4 technical report, 2023.

Oren, Y., Meister, N., Chatterji, N., Ladhak, F., and Hashimoto, T. B. Proving test set contamination in black box language models, 2023.

Pan, X., Zhang, M., Ji, S., and Yang, M. Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP), pp. 1314–1331. IEEE, 2020.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do imagenet classifiers generalize to imagenet?, 2019.

Reimers, N. and Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. URL https://arxiv.org/abs/1908.10084.

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretraining data from large language models, 2023.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Team, M. N. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023.

Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K. R., Wadden, D., MacMillan, K., Smith, N. A., Beltagy, I., and Hajishirzi, H. How far can camels go? exploring the state of instruction tuning on open resources, 2023a.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions, 2023b.

Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. Wizardlm: Empowering large language models to follow complex instructions, 2023.

Xu, H. and Koehn, P. Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2945–2950, 2017.

Yue, X., Qu, X., Zhang, G., Fu, Y., Huang, W., Sun, H., Su, Y., and Chen, W. Mammoth: Building math generalist models through hybrid instruction tuning, 2023.

Zanella-Béguelin, S., Wutschitz, L., Tople, S., Rühle, V., Paverd, A., Ohrimenko, O., Köpf, B., and Brockschmidt, M. Analyzing information leakage of updates to natural language models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pp. 363–375, 2020.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization, 2017.

Zong, Y., Yu, T., Zhao, B., Chavhan, R., and Hospedales, T. Fool your (vision and) language model with embarrassingly simple permutations, 2023.
B. More Examples from Real-World Datasets

MMLU test

Question:
The 'nuclear family' means:
A. a group of people sharing living accommodation and meals
B. a network of relatives extended within or between generations
C. the new family created when an adult leaves home and gets married
D. a two generation unit of parents and their children
Answer: D

WizardLM-evol-instruct-V2

Question:
Define the concept of a 'nuclear family'.
Answer:
A nuclear family is a type of family structure consisting of a married couple and their children living together as a unit, typically in a single household. This type of family structure is often viewed as the traditional or ideal family structure in many cultures. In some cases, nuclear families may also include grandparents or other close relatives living together in the same household. However, the defining characteristic of a nuclear family is that it consists of a single parental unit and their dependent offspring.

HumanEval test

def digitSum(s):
    """Task
    Write a function that takes a string as input and returns
    the sum of the upper characters only' ASCII codes.
    Examples:
        digitSum("") => 0
        digitSum("abAB") => 131
        digitSum("abcCd") => 67
        digitSum("helloE") => 69
        digitSum("woArBld") => 131
        digitSum("aAaaaXa") => 153
    """
    if s == "":
        return 0
    return sum(ord(char) if char.isupper() else 0 for char in s)

CodeExercise-Python-27k

def sum_ascii_values(string):
    """
    This function takes a string as input and returns the sum of
    ASCII values of all uppercase letters in the string.
    Parameters:
        string (str): The input string.
    Returns:
        int: The sum of ASCII values of uppercase letters in the string.
    Example:
        >>> sum_ascii_values('Hello World')
        628
    """
    # Initialize sum to 0
    ascii_sum = 0
    # Iterate through each character in the input string
    for char in string:
        # Check if the character is an uppercase letter
        if char.isupper():
            # Add the ASCII value of the uppercase letter to the sum
            ascii_sum += ord(char)
    return ascii_sum
The Stack

#[PROMPT]
def is_happy(s):
    """You are given a string s.
    Your task is to check if the string is happy or not.
    A string is happy if its length is at least 3 and every 3
    consecutive letters are distinct.
    For example:
    is_happy(a) => False
    is_happy(aa) => False
    is_happy(abcd) => True
    is_happy(aabb) => False
    is_happy(adb) => True
    is_happy(xyy) => False
    """
#[SOLUTION]
    if len(s) < 3:
        return False

    for i in range(len(s) - 2):
        if s[i] == s[i+1] or s[i+1] == s[i+2] or s[i] == s[i+2]:
            return False
    return True

StarCoder-Data

def iscube(a):
    '''
    Write a function that takes an integer a and returns True
    if this ingeger is a cube of some integer number.
    Note: you may assume the input is always valid.
    Examples:
    iscube(1) ==> True
    iscube(2) ==> False (the length of each side must be greater than zero)
    iscube(-1) ==> True
    iscube(64) ==> True
    iscube(0) ==> True
    iscube(180) ==> False

    Example solution:
    # line 1
    a = abs(a)
    # line 2
    cube_root = int(round(a ** (1. / 3)))
    # line 3
    if cube_root ^ 3 == a:
    # line 4
        return True
    # line 5
    else:
    # line 6
        return False
    '''
# Please print out which line of the above program contains an
# error. E.g. if the bug is on line 4 then print 4
# END OF CONTEXT
print("3")
# END OF SOLUTION