A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data

Harvey, Ethan; Chen, Wansu; Kent, David M.; Hughes, Michael C.

Computer Science > Machine Learning

arXiv:2311.18025 (cs)

[Submitted on 29 Nov 2023]

Abstract: Practitioners building classifiers often start with a smaller pilot dataset and plan to grow to larger data in the near future. Such projects need a toolkit for extrapolating how much classifier accuracy may improve from a 2x, 10x, or 50x increase in data size. While existing work has focused on finding a single "best-fit" curve using various functional forms like power laws, we argue that modeling and assessing the uncertainty of predictions is critical yet has seen less attention. In this paper, we propose a Gaussian process model to obtain probabilistic extrapolations of accuracy or similar performance metrics as dataset size increases. We evaluate our approach in terms of error, likelihood, and coverage across six datasets. Though we focus on medical tasks and image modalities, our open source approach generalizes to any kind of classifier.
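To make the idea of a probabilistic extrapolation concrete, the following is a minimal sketch, not the authors' model or code. It assumes a plain Gaussian process with a squared-exponential kernel over log2 dataset size, fitted in logit-accuracy space on top of a least-squares linear trend, with hand-picked hyperparameters. The function names (`rbf_kernel`, `extrapolate_accuracy`), the hyperparameter values, and the pilot numbers are all hypothetical and chosen only for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale, signal_std):
    """Squared-exponential covariance between 1-D arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_std**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def extrapolate_accuracy(pilot_sizes, pilot_acc, query_sizes,
                         length_scale=2.0, signal_std=0.5, noise_std=0.05):
    """Return (median, lower, upper, logit_std) at each query size.

    Assumptions of this sketch: accuracies lie strictly in (0, 1), the
    hyperparameters are fixed rather than learned, and the interval is a
    95% band for the latent curve (observation noise is not added back).
    """
    x = np.log2(np.asarray(pilot_sizes, dtype=float))
    xq = np.log2(np.asarray(query_sizes, dtype=float))
    acc = np.asarray(pilot_acc, dtype=float)
    y = np.log(acc / (1.0 - acc))  # logit keeps predictions inside (0, 1)

    # Mean function: straight line in (log2 size, logit accuracy).
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)

    # Standard GP regression on the residuals via a Cholesky factorization.
    K = rbf_kernel(x, x, length_scale, signal_std) + noise_std**2 * np.eye(len(x))
    Ks = rbf_kernel(xq, x, length_scale, signal_std)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, resid))
    v = np.linalg.solve(L, Ks.T)

    logit_mean = slope * xq + intercept + Ks @ alpha
    logit_var = signal_std**2 - np.sum(v**2, axis=0)
    logit_std = np.sqrt(np.clip(logit_var, 0.0, None))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    return (sigmoid(logit_mean),
            sigmoid(logit_mean - 1.96 * logit_std),
            sigmoid(logit_mean + 1.96 * logit_std),
            logit_std)

# Illustrative pilot learning curve (made-up numbers, not from the paper).
pilot_sizes = [60, 120, 240, 480]
pilot_acc = [0.70, 0.76, 0.80, 0.83]
query_sizes = [2 * 480, 10 * 480, 50 * 480]  # 2x, 10x, 50x the largest pilot size

median, lower, upper, logit_std = extrapolate_accuracy(pilot_sizes, pilot_acc, query_sizes)
for n, m, lo, hi in zip(query_sizes, median, lower, upper):
    print(f"n={n}: median {m:.3f}, 95% band [{lo:.3f}, {hi:.3f}]")
```

In this sketch the posterior standard deviation in logit space is larger at 50x than at 2x, which is the behaviour a single best-fit curve cannot express. Whether the resulting bands are well calibrated depends on the kernel, mean function, and hyperparameters, all of which are placeholders here.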

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2311.18025 [cs.LG]
	(or arXiv:2311.18025v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2311.18025

Submission history

From: Ethan Harvey
[v1] Wed, 29 Nov 2023 19:10:15 UTC (2,536 KB)
