Industrial Language-Image Dataset (ILID): Adapting Vision Foundation Models for Industrial Settings

Moenck, Keno; Thieu, Duc Trung; Koch, Julian; Schüppstuhl, Thorsten

Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.09637 (cs)

[Submitted on 14 Jun 2024]




Abstract: In recent years, the rise of Large Language Models (LLMs) has also encouraged the computer vision community to build substantial multimodal datasets and to train models at scale in a self- or semi-supervised manner, resulting in Vision Foundation Models (VFMs) such as Contrastive Language-Image Pre-training (CLIP). These models generalize well and perform outstandingly on everyday objects and scenes, even on downstream tasks they have not been trained on, whereas their application in specialized domains, such as an industrial context, remains an open research question. There, fine-tuning the models or transfer learning on domain-specific data is unavoidable when aiming for adequate performance. In this work, we, on the one hand, introduce a pipeline to generate the Industrial Language-Image Dataset (ILID) from web-crawled data; on the other hand, we demonstrate effective self-supervised transfer learning and discuss downstream tasks after training on the cheaply acquired ILID, which requires no human labeling or intervention. With the proposed approach, we contribute by transferring approaches from state-of-the-art research on foundation models, transfer learning strategies, and applications to the industrial domain.
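The abstract does not state the training objective. As a rough illustration of why web-crawled image-text pairs can supervise each other without human labels, the following sketches the symmetric contrastive (InfoNCE) loss used in the original CLIP work. It is an assumption that transfer learning on ILID uses this or a closely related objective; the function names and the temperature of 0.07 (a common initial value in CLIP-style training) are illustrative choices, and the pure-Python form is for readability only.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def logsumexp(xs: List[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def clip_contrastive_loss(image_emb: List[Vector], text_emb: List[Vector],
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss (illustrative sketch, not the authors' code).

    The i-th image and the i-th text form the positive pair; every other
    image-text combination in the batch acts as a negative, so no extra
    labels are needed beyond the pairing itself.
    """
    assert len(image_emb) == len(text_emb) > 0
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by the temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: cross-entropy over each row, target index i
    loss_i2t = sum(logsumexp(logits[i]) - logits[i][i] for i in range(n)) / n
    # Text-to-image: cross-entropy over each column, target index j
    loss_t2i = sum(logsumexp([logits[i][j] for i in range(n)]) - logits[j][j]
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [matched[1], matched[0]]
print(clip_contrastive_loss(images, matched) < clip_contrastive_loss(images, swapped))  # True
```

Correctly paired embeddings yield a lower loss than mismatched ones; minimizing this loss pulls matching image and text embeddings together and pushes non-matching ones apart within each batch.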

Comments:	Dataset at this https URL; training- and evaluation-related code at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.09637 [cs.CV]
	(or arXiv:2406.09637v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.09637

Submission history

From: Keno Moenck
[v1] Fri, 14 Jun 2024 00:06:52 UTC (8,583 KB)
