INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Large Language Models

Renduchintala, H S V N S Kowndinya; Killamsetty, Krishnateja; Bhatia, Sumit; Aggarwal, Milan; Ramakrishnan, Ganesh; Iyer, Rishabh; Krishnamurthy, Balaji

arXiv:2305.06677v1 (cs)

[Submitted on 11 May 2023 (this version), latest version 19 Oct 2023 (v2)]

Abstract: A salient characteristic of large pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and the emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, extortionate computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question that we ask is whether it is possible to train PTLMs by employing only highly informative subsets of the training data while maintaining downstream performance. Building upon the recent progress in informative data subset selection, we show how we can employ submodular optimization to select highly representative subsets of the training corpora. Our results demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of data while retaining up to $\sim 99\%$ of the performance of the fully-trained models.
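
As an illustration of what selecting a representative subset by submodular optimization can look like, the sketch below greedily maximizes a facility-location objective, f(S) = sum_i max_{j in S} sim(i, j), over toy feature vectors. The abstract does not specify which submodular function, similarity measure, or features the framework uses; the facility-location objective, the cosine similarity, the function names, and the toy data are all assumptions made here for illustration only, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_facility_location(features, budget):
    """Greedily pick up to `budget` indices maximizing
    f(S) = sum_i max_{j in S} sim(i, j)  (facility location; an assumed objective)."""
    n = len(features)
    budget = min(budget, n)
    # Pairwise similarities, clipped at 0 so that f is monotone with f(empty set) = 0.
    sim = [[max(0.0, cosine_similarity(features[i], features[j])) for j in range(n)]
           for i in range(n)]
    best_cover = [0.0] * n  # similarity of each point to its closest selected point
    selected = []
    for _ in range(budget):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much it improves each point's coverage.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Hypothetical toy data: two tight clusters of 2-D "embeddings".
toy_features = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],   # cluster A (indices 0-2)
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],   # cluster B (indices 3-5)
]
subset = greedy_facility_location(toy_features, budget=2)
print(subset)
```

With a budget of two, the greedy procedure picks one point from each cluster, because a second point from an already covered cluster adds almost nothing to the objective. For monotone submodular objectives such as this one, the greedy algorithm is known to achieve at least a (1 - 1/e) fraction of the optimal value under a cardinality constraint. This naive version builds the full pairwise similarity matrix, so it is only practical for small inputs.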

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2305.06677 [cs.CL]
	(or arXiv:2305.06677v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.06677

Submission history

From: Sumit Bhatia
[v1] Thu, 11 May 2023 09:24:41 UTC (1,552 KB)
[v2] Thu, 19 Oct 2023 19:55:20 UTC (789 KB)
