Dated Data: Tracing Knowledge Cutoffs in Large Language Models

Cheng, Jeffrey; Marone, Marc; Weller, Orion; Lawrie, Dawn; Khashabi, Daniel; Van Durme, Benjamin

arXiv:2403.12958 (cs)

[Submitted on 19 Mar 2024 (v1), last revised 17 Sep 2024 (this version, v2)]

Abstract: Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff date, or the dates at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated knowledge for these subsets closely align with their cutoff dates? In this work, we define the notion of an effective cutoff. This is distinct from the cutoff reported by the LLM designer and applies separately to sub-resources and topics. We propose a simple approach to estimate effective cutoffs on the resource-level temporal alignment of an LLM by probing across versions of the data. Using this analysis, we find that effective cutoffs often differ from reported cutoffs. To understand the root cause of this observation, we conduct a direct large-scale analysis on open pre-training datasets. Our analysis reveals two reasons for these inconsistencies: (1) temporal biases of CommonCrawl data due to non-trivial amounts of old data in new dumps and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates. Overall, our results show that knowledge cutoffs are not as simple as they have seemed and that care must be taken both by LLM dataset curators and by practitioners who seek to use information from these models.
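
The idea of probing across versions of the data can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes perplexity is the probe, that per-document perplexities have already been computed for each dated version of a resource, and that the effective cutoff is taken to be the version with the lowest average perplexity. The function name `estimate_effective_cutoff` and the numbers in the usage example are hypothetical.

```python
from statistics import mean


def estimate_effective_cutoff(perplexities_by_version):
    """Return the version label with the lowest mean perplexity.

    perplexities_by_version maps a version label (for example "2023-01")
    to a list of per-document perplexities measured on that version.
    Assumption: lower perplexity means the model is better aligned with
    that version of the resource.
    """
    # Average the per-document perplexities for each non-empty version.
    averages = {
        version: mean(values)
        for version, values in perplexities_by_version.items()
        if values
    }
    if not averages:
        raise ValueError("no perplexity measurements provided")
    # The version with the smallest average is the estimated effective cutoff.
    return min(averages, key=averages.get)


# Made-up numbers, for illustration only.
toy_measurements = {
    "2022-07": [12.0, 14.0],
    "2023-01": [9.0, 11.0],
    "2023-07": [13.0, 15.0],
}
estimated = estimate_effective_cutoff(toy_measurements)
```

With these toy numbers the estimate is "2023-01", even though a later version ("2023-07") is present, which mirrors the abstract's point that an effective cutoff can differ from a reported one.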

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2403.12958 [cs.CL]
	(or arXiv:2403.12958v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.12958

Submission history

From: Jeffrey Cheng
[v1] Tue, 19 Mar 2024 17:57:58 UTC (2,386 KB)
[v2] Tue, 17 Sep 2024 17:25:40 UTC (2,412 KB)
