ArtemisSearch: A Multimodal Search Engine
1 Introduction
The surge in multimedia content, especially online video, has driven the devel-
opment of AI models for faster, more efficient data querying. Retrieving video
frames based on textual descriptions of specific moments has become a key re-
search area, advancing global information retrieval [1]. As people increasingly
wish to revisit specific scenes from the vast amounts of video they consume,
the need for advanced, rapid retrieval systems has grown. This demand calls for
solutions that not only offer faster query speeds, but are also adaptable across
various platforms [2], [3].
A text-based multimodal search engine is a system that enables users to
search for multimedia content, such as images, videos, or audio, using text
queries. Unlike traditional search engines that rely solely on textual metadata,
a text-based multimodal search engine processes a combination of textual in-
formation (such as descriptions, captions, or keywords) and other content-based
features like visual, audio, or even contextual data [4], [5], [6]. For instance, in
a video retrieval scenario, users might input a text query like "a cat playing
piano", and the system would search for relevant videos not only by matching
textual metadata but also by analyzing the visual and audio content of videos.
The engine might identify visual elements (the cat and the piano) and detect
piano sounds. This makes the search more powerful and contextually accurate,
even for content that may not be fully or accurately labeled with metadata.
According to research on Vision-Language Pre-training (VLP) [7], numer-
ous advanced vision-language models have emerged, narrowing the gap between
pre-trained textual and visual modalities. Transformer models, in particular, have demonstrated superior capabilities in processing both language and images: they learn deep representations that fully exploit the connections between image and text features, and thus combine and align information from the two data types efficiently. In this
context, ViT H/14 [8] and BEiT3 [9] emerge as powerful and versatile choices
for vision-language tasks. ViT H/14 [8], a variant of the Contrastive Language-
Image Pretraining (CLIP) [10] model, effectively balances performance and ac-
cessibility. By learning from a large dataset of image-text pairs and optimizing
the cosine similarity between text and image embedding vectors, ViT H/14 [8]
not only shows excellent suitability for text-based video retrieval tasks but also
excels in computational efficiency. With its optimized design, this model can
operate smoothly on diverse hardware systems, from low-configuration comput-
ers to those without dedicated GPUs, expanding technology accessibility to a
broader audience. Meanwhile, BEiT3 [9] draws its strength from deeply integrating the language and visual modalities. Using a "masked image modeling" mechanism, analogous to how language models learn representations of masked words, BEiT3 [9] learns semantically rich image representations that effectively combine image context with text. This makes BEiT3 [9] well suited to problems requiring complex multimodal processing.
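To make the scoring mechanism described above concrete, the sketch below shows how a text query can be matched against keyframes by cosine similarity of CLIP ViT-H/14 embeddings. The open_clip library, the "laion2b_s32b_b79k" checkpoint, and the example file names are illustrative assumptions; the paper does not specify the exact implementation used by ArtemisSearch.

```python
# Minimal sketch: ranking keyframes against a text query with CLIP ViT-H/14
# via the open_clip library (library and checkpoint choice are assumptions).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

def embed_images(paths):
    """Encode keyframes into L2-normalized embedding vectors."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(query):
    """Encode a free-text query into a normalized embedding vector."""
    with torch.no_grad():
        feats = model.encode_text(tokenizer([query]))
    return feats / feats.norm(dim=-1, keepdim=True)

# After normalization, cosine similarity reduces to a dot product.
image_feats = embed_images(["frame_001.jpg", "frame_002.jpg"])  # hypothetical keyframes
text_feat = embed_text("a cat playing piano")
scores = (image_feats @ text_feat.T).squeeze(1)                 # one score per frame
ranking = scores.argsort(descending=True)                       # best-matching frames first
```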
Taking inspiration from these studies, we introduce ArtemisSearch, an advanced system that leverages the CLIP-ViT-H/14 [8] and BEiT3 [9] models to extract abstract, semantically rich features from the videos in the dataset while remaining deployable on a wide range of hardware. Although highly performant, these powerful models have not yet reached their full potential for accuracy.
2 Methodology
This section offers an overview of video processing and the architecture employed
in our ArtemisSearch system.
These components work together to form a comprehensive tool for searching and
analyzing video content based on textual or image queries.
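As a concrete illustration of the video-processing side of this pipeline, the sketch below extracts candidate keyframes with FFmpeg and removes near-duplicates with perceptual hashing, as mentioned in the later summary of the preprocessing stage. The scene-change threshold (0.3), the Hamming-distance cutoff (5), and the use of the imagehash library are illustrative assumptions, not values reported for ArtemisSearch.

```python
# Minimal sketch: keyframe extraction with FFmpeg plus perceptual-hash
# deduplication. Thresholds and library choices are assumptions.
import subprocess
from pathlib import Path
from PIL import Image
import imagehash

def extract_keyframes(video_path, out_dir):
    """Write frames at detected scene changes to out_dir as JPEGs."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vf", "select='gt(scene,0.3)'", "-vsync", "vfr",
        f"{out_dir}/%05d.jpg",
    ], check=True)

def deduplicate(frame_dir, max_distance=5):
    """Keep a frame only if its pHash differs enough from every kept frame."""
    kept, hashes = [], []
    for path in sorted(Path(frame_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        if all(h - other > max_distance for other in hashes):
            kept.append(path)
            hashes.append(h)
        else:
            path.unlink()  # drop redundant near-duplicate frame
    return kept
```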
In this section, we showcase examples of how our system retrieves relevant videos
from a large collection using text queries from the 2024 Ho Chi Minh City AI Challenge.
The 2024 Ho Chi Minh City AI Challenge dataset. The dataset used for
this year’s competition comprises news and events reported across various media
channels within the past 18 months. The dataset includes:
– Video: The total duration of video content is 500 hours, divided into three
batches. Batch 1 consists of 100 hours, while both Batch 2 and Batch 3
contain 200 hours each.
– Keyframe: These frames serve as representative snapshots capturing spe-
cific events at particular time points.
– Metadata: The dataset also includes descriptive, spatial, and temporal in-
formation that corresponds to the videos.
The OCR integration allows the search engine to effectively parse and utilize
textual information embedded in images, thereby providing a more comprehen-
sive and context-aware search capability.
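The sketch below illustrates one way such OCR-based indexing could be wired up: recognized on-screen text is stored in Elasticsearch and queried with full-text matching. The OCR engine (pytesseract), the "vie+eng" language setting, and the index and field names are assumptions; the paper does not state which OCR tool or schema ArtemisSearch actually uses.

```python
# Minimal sketch: OCR a keyframe and index the recognized text in
# Elasticsearch for keyword search. Engine, languages, and schema are
# illustrative assumptions.
import pytesseract
from PIL import Image
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_keyframe_text(frame_path, video_id, timestamp):
    """OCR a keyframe and store the recognized text with its video location."""
    text = pytesseract.image_to_string(Image.open(frame_path), lang="vie+eng")
    es.index(index="keyframe_ocr", document={
        "video_id": video_id,
        "timestamp": timestamp,
        "ocr_text": text.strip(),
    })

def search_ocr(query, k=10):
    """Full-text match over the OCR field, returning the top-k frames."""
    resp = es.search(index="keyframe_ocr",
                     query={"match": {"ocr_text": query}}, size=k)
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```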
Figure: (a) Single query: target frame at rank 28; (b) multiple queries: target frame at rank 1; (c) query descriptions for sequential events.
4 Conclusion
This paper presents ArtemisSearch, an innovative multimodal video retrieval sys-
tem that effectively addresses the growing challenges in managing and search-
ing large-scale video content. Our system makes several key contributions to
the field of multimedia information retrieval. First, we successfully integrated
state-of-the-art vision-language models (CLIP-ViT-H/14 [8] and BEiT3 [9]) with
OCR capabilities to create a comprehensive retrieval solution. The combination
of these technologies enables our system to understand both visual semantics
and textual information present in video frames, significantly improving search
accuracy and relevance. Second, our temporal-aware score combination and re-
ranking approach demonstrates the importance of considering temporal rela-
tionships [11] between video frames. This novel scoring mechanism, enhanced by
OCR-based refinement, helps deliver more contextually relevant search results
while maintaining computational efficiency. Third, the system’s architecture,
built on Milvus for vector similarity search and Elasticsearch for text indexing,
proves to be both scalable and efficient. The preprocessing pipeline, utilizing
FFmpeg and perceptual hashing, effectively handles the challenge of extracting
and managing representative keyframes while eliminating redundancy. Overall,
ArtemisSearch represents a significant step forward in making video content
more accessible and searchable, particularly beneficial for applications requiring
precise moment retrieval in large video collections.
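For readers interested in the storage layer summarized above, the following sketch shows how keyframe embeddings could be stored and queried in Milvus. The collection name, field names, local server URI, and the 1024-dimensional vector size (the CLIP ViT-H/14 output) are illustrative assumptions; the extra metadata keys rely on Milvus' dynamic-field support and are not necessarily the schema used by ArtemisSearch.

```python
# Minimal sketch: storing keyframe embeddings in Milvus and retrieving the
# nearest frames for a query embedding. Names and sizes are assumptions.
import random
from pymilvus import MilvusClient

DIM = 1024  # CLIP ViT-H/14 embedding size
client = MilvusClient(uri="http://localhost:19530")

client.create_collection(
    collection_name="keyframes",
    dimension=DIM,
    metric_type="COSINE",  # matches cosine-similarity scoring of CLIP features
)

# Insert a few hypothetical keyframe embeddings with their video locations.
rows = [
    {
        "id": i,
        "vector": [random.random() for _ in range(DIM)],  # stand-in for a real embedding
        "video_id": f"L01_V{i:03d}",
        "timestamp": 10.0 * i,
    }
    for i in range(3)
]
client.insert(collection_name="keyframes", data=rows)

# Search with the embedding of a text query (another random stand-in here).
query_vec = [random.random() for _ in range(DIM)]
hits = client.search(
    collection_name="keyframes",
    data=[query_vec],
    limit=2,
    output_fields=["video_id", "timestamp"],
)
for hit in hits[0]:
    print(hit["entity"]["video_id"], hit["entity"]["timestamp"], hit["distance"])
```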
5 Future Work
While our current approach demonstrates promising results, several compelling
research directions remain to be explored. One particularly intriguing avenue in-
volves the integration of audio-based search capabilities into our existing frame-
work. As we have observed in our preliminary investigations, many queries in-
herently contain voice-related information that could potentially enhance both
search speed and precision. Building upon the groundbreaking work of Le et
al. [12] in audio-based information retrieval, we envision developing a multimodal
search system that seamlessly combines textual and audio features. Furthermore,
we recognize the potential for enhancing query-information relationships through
advanced query reformulation strategies. Drawing inspiration from the innova-
tive approach proposed by Lokoč et al. [13], we plan to implement a context-
aware query expansion mechanism. Preliminary experiments suggest that such
reformulation strategies could significantly improve search accuracy.
References
1. Newton Spolaôr, Huei Diana Lee, Weber Shoity Resende Takaki, Leandro Augusto
Ensina, Claudio Saddy Rodrigues Coy, and Feng Chung Wu. A systematic review