Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Zhang, Jiangning; Chen, Xuhai; Wang, Yabiao; Wang, Chengjie; Liu, Yong; Li, Xiangtai; Yang, Ming-Hsuan; Tao, Dacheng

arXiv:2312.07495 (cs)

[Submitted on 12 Dec 2023 (v1), last revised 11 Aug 2024 (this version, v2)]

Abstract: This work studies a challenging and practical problem known as multi-class unsupervised anomaly detection (MUAD). The setting uses only normal images for training, while testing covers both normal and anomalous images across multiple classes at once. Existing reconstruction-based methods typically adopt pyramidal networks as encoders and decoders to obtain multi-resolution features, often involving complex sub-modules with extensive handcrafted engineering. In contrast, a plain Vision Transformer (ViT), with its more straightforward architecture, has proven effective in multiple domains, including detection and segmentation tasks. It is simpler, more effective, and elegant. Following this spirit, we explore the use of only plain ViT features for MUAD. We first abstract a Meta-AD concept by synthesizing current reconstruction-based methods. Subsequently, we instantiate a novel ViT-based ViTAD structure, designed incrementally from both global and local perspectives. This model provides a strong baseline to facilitate future research. Additionally, this paper uncovers several intriguing findings for further investigation. Finally, we comprehensively and fairly benchmark various approaches using eight metrics. Utilizing a basic training regimen with only an MSE loss, ViTAD achieves state-of-the-art results and efficiency on the MVTec AD, VisA, and Uni-Medical datasets. For example, it achieves 85.4 mAD on the MVTec AD dataset, surpassing UniAD by +3.0, and it requires only 1.1 hours and 2.3G of GPU memory to complete model training on a single V100. Full code is available at this https URL.
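
As a rough illustration of the reconstruction idea the abstract describes (train with only an MSE loss on normal images, then flag regions that reconstruct poorly), the following is a minimal sketch in plain Python. It is not the authors' ViTAD implementation: the function names, the toy two-dimensional "patch features", and the choice of the maximum patch error as the image-level score are all assumptions made for illustration.

```python
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def anomaly_map(target: List[List[float]], recon: List[List[float]]) -> List[float]:
    """Per-patch reconstruction error: one score per patch token."""
    if len(target) != len(recon):
        raise ValueError("target and reconstruction need the same patch count")
    return [mse(t, r) for t, r in zip(target, recon)]


def training_loss(target: List[List[float]], recon: List[List[float]]) -> float:
    """MSE training objective, averaged over all patches (assumed form)."""
    scores = anomaly_map(target, recon)
    return sum(scores) / len(scores)


def image_score(scores: List[float]) -> float:
    """Image-level anomaly score; taking the max is an illustrative choice."""
    return max(scores)


# Hypothetical features for three patches: the third is reconstructed badly.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recon = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

scores = anomaly_map(target, recon)
print(scores)               # per-patch errors
print(image_score(scores))  # image-level score
```

In this toy case only the third patch has a non-zero error, so it alone would be highlighted as anomalous. In a real model the target and reconstructed features would come from the encoder and decoder respectively.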

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.07495 [cs.CV]
	(or arXiv:2312.07495v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.07495

Submission history

From: Jiangning Zhang
[v1] Tue, 12 Dec 2023 18:28:59 UTC (26,582 KB)
[v2] Sun, 11 Aug 2024 14:27:16 UTC (24,182 KB)
