Computer Science > Computer Vision and Pattern Recognition
[Submitted on 24 Dec 2024 (v1), last revised 28 May 2025 (this version, v2)]
Title: The Impact of the Single-Label Assumption in Image Recognition Benchmarking
Abstract: Deep neural networks (DNNs) are typically evaluated under the assumption that each image has a single correct label. However, many images in benchmarks like ImageNet contain multiple valid labels, creating a mismatch between evaluation protocols and the actual complexity of visual data. This mismatch can penalize DNNs for predicting correct but unannotated labels, which may partly explain reported accuracy drops, such as the widely cited 11 to 14 percent top-1 accuracy decline on ImageNetV2, a replication test set for ImageNet. This raises the question: do such drops reflect genuine generalization failures or artifacts of restrictive evaluation metrics? We rigorously assess the impact of multi-label characteristics on reported accuracy gaps. To evaluate the multi-label prediction capability (MLPC) of single-label-trained models, we introduce a variable top-$k$ evaluation, where $k$ matches the number of valid labels per image. Applied to 315 ImageNet-trained models, our analyses demonstrate that conventional top-1 accuracy disproportionately penalizes valid but secondary predictions. We also propose Aggregate Subgroup Model Accuracy (ASMA) to better capture multi-label performance across model subgroups. Our results reveal wide variability in MLPC, with some models consistently ranking multiple correct labels higher. Under this evaluation, the perceived gap between ImageNet and ImageNetV2 narrows substantially. To further isolate multi-label recognition performance from contextual cues, we introduce PatchML, a synthetic dataset containing systematically combined object patches. PatchML demonstrates that many models trained with single-label supervision nonetheless recognize multiple objects. Altogether, these findings highlight limitations in single-label evaluation and reveal that modern DNNs have stronger multi-label capabilities than standard metrics suggest.
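The variable top-$k$ idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says that $k$ matches the number of valid labels per image, so the scoring rule used here (the fraction of an image's valid labels that appear among its top-$k$ predictions) is an assumption, and the function names `variable_top_k_recall` and `top1_accuracy`, along with the toy scores and labels, are hypothetical.

```python
import numpy as np


def top1_accuracy(logits, primary_labels):
    """Conventional top-1 accuracy against a single annotated label per image."""
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == np.asarray(primary_labels)))


def variable_top_k_recall(logits, valid_labels):
    """Mean fraction of valid labels found in each image's top-k predictions.

    k is set per image to the number of valid labels for that image.
    ASSUMPTION: this scoring rule is one plausible reading of the abstract,
    not a definition taken from the paper.
    """
    per_image = []
    for scores, labels in zip(logits, valid_labels):
        k = len(labels)
        # Indices of the k highest-scoring classes for this image.
        top_k = np.argsort(scores)[::-1][:k]
        hits = len(set(top_k.tolist()) & set(labels))
        per_image.append(hits / k)
    return float(np.mean(per_image))


# Toy data (hypothetical): 3 images, 4 classes.
logits = np.array([
    [0.1, 0.7, 0.2, 0.0],  # one valid label {1}; model ranks it first
    [0.5, 0.1, 0.4, 0.0],  # valid {0, 2}, annotated as 2; model ranks 0 first
    [0.2, 0.1, 0.3, 0.4],  # valid {0, 1}; model misses both
])
valid = [{1}, {0, 2}, {0, 1}]   # all valid labels per image
primary = [1, 2, 0]             # the single annotated label per image

print("top-1 accuracy:", top1_accuracy(logits, primary))
print("variable top-k recall:", variable_top_k_recall(logits, valid))
```

On this toy data the second image counts as a top-1 error even though the model's first choice is a valid label, which is the kind of penalty the abstract describes. Under the variable top-$k$ scoring assumed here, that image is fully correct.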
Submission history
From: Esla Timothy Anzaku
[v1] Tue, 24 Dec 2024 12:55:31 UTC (11,808 KB)
[v2] Wed, 28 May 2025 01:15:13 UTC (3,227 KB)