ArtemisSearch: A Multimodal Search Engine
1 Introduction
The surge in multimedia content, especially online video, has driven the devel-
opment of AI models for faster, more efficient data querying. Retrieving video
frames based on textual descriptions of specific moments has become a key re-
search area, advancing global information retrieval [1]. As people increasingly
wish to revisit specific scenes from the vast amounts of video they consume,
the need for advanced, rapid retrieval systems has grown. This demand calls for
solutions that not only offer faster query speeds, but are also adaptable across
various platforms [2], [3].
A text-based multimodal search engine is a system that enables users to
search for multimedia content, such as images, videos, or audio, using text
queries. Unlike traditional search engines that rely solely on textual metadata,
a text-based multimodal search engine processes a combination of textual in-
formation (such as descriptions, captions, or keywords) and other content-based
features like visual, audio, or even contextual data [4], [5], [6]. For instance, in
a video retrieval scenario, users might input a text query like "a cat playing
piano", and the system would search for relevant videos not only by matching
textual metadata but also by analyzing the visual and audio content of videos.
The engine might identify visual elements (the cat and the piano) and detect
piano sounds. This makes the search more powerful and contextually accurate,
even for content that may not be fully or accurately labeled with metadata.
According to research on Vision-Language Pre-training (VLP) [7], numer-
ous advanced vision-language models have emerged, narrowing the gap between
pre-trained textual and visual modalities. Transformer models, in particular, have demonstrated superior capabilities in processing both language and images: they learn deep representations that fully exploit the connections between image and text features, and thus combine and align information from the two data types efficiently. In this
context, ViT H/14 [8] and BEiT3 [9] emerge as powerful and versatile choices
for vision-language tasks. ViT H/14 [8], a variant of the Contrastive Language-
Image Pretraining (CLIP) [10] model, effectively balances performance and ac-
cessibility. By learning from a large dataset of image-text pairs and optimizing
the cosine similarity between text and image embedding vectors, ViT H/14 [8]
not only shows excellent suitability for text-based video retrieval tasks but also
excels in computational efficiency. With its optimized design, this model can
operate smoothly on diverse hardware systems, from low-configuration comput-
ers to those without dedicated GPUs, expanding technology accessibility to a
broader audience. Meanwhile, BEiT3 [9] draws its strength from deeply integrating the language and visual modalities. Using a "masked image modeling" mechanism, analogous to how language models learn representations of masked words, BEiT3 [9] learns semantically rich image representations that effectively combine image context with text. This makes BEiT3 [9] well suited to problems requiring complex multimodal processing.
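To make the scoring mechanism described above concrete, the sketch below shows how a text query can be matched against keyframes by cosine similarity of CLIP ViT-H/14 embeddings. The open_clip library, the "laion2b_s32b_b79k" checkpoint, and the example file names are illustrative assumptions; the paper does not specify the exact implementation used by ArtemisSearch.

```python
# Minimal sketch: ranking keyframes against a text query with CLIP ViT-H/14
# via the open_clip library (library and checkpoint choice are assumptions).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

def embed_images(paths):
    """Encode keyframes into L2-normalized embedding vectors."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(query):
    """Encode a free-text query into a normalized embedding vector."""
    with torch.no_grad():
        feats = model.encode_text(tokenizer([query]))
    return feats / feats.norm(dim=-1, keepdim=True)

# After normalization, cosine similarity reduces to a dot product.
image_feats = embed_images(["frame_001.jpg", "frame_002.jpg"])  # hypothetical keyframes
text_feat = embed_text("a cat playing piano")
scores = (image_feats @ text_feat.T).squeeze(1)                 # one score per frame
ranking = scores.argsort(descending=True)                       # best-matching frames first
```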
Taking inspiration from these studies, we introduce ArtemisSearch, an advanced system that leverages the CLIP-ViT-H/14 [8] and BEiT3 [9] models to extract abstract, semantically rich features from the videos in the dataset while remaining deployable on a wide range of hardware. Although highly performant, these powerful models have not yet reached their full potential for accuracy.
2 Methodology
This section offers an overview of video processing and the architecture employed
in our ArtemisSearch system.
These components work together to form a comprehensive tool for searching and
analyzing video content based on textual or image queries.
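As a concrete illustration of the video-processing side of this pipeline, the sketch below extracts candidate keyframes with FFmpeg and removes near-duplicates with perceptual hashing, as mentioned in the later summary of the preprocessing stage. The scene-change threshold (0.3), the Hamming-distance cutoff (5), and the use of the imagehash library are illustrative assumptions, not values reported for ArtemisSearch.

```python
# Minimal sketch: keyframe extraction with FFmpeg plus perceptual-hash
# deduplication. Thresholds and library choices are assumptions.
import subprocess
from pathlib import Path
from PIL import Image
import imagehash

def extract_keyframes(video_path, out_dir):
    """Write frames at detected scene changes to out_dir as JPEGs."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vf", "select='gt(scene,0.3)'", "-vsync", "vfr",
        f"{out_dir}/%05d.jpg",
    ], check=True)

def deduplicate(frame_dir, max_distance=5):
    """Keep a frame only if its pHash differs enough from every kept frame."""
    kept, hashes = [], []
    for path in sorted(Path(frame_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        if all(h - other > max_distance for other in hashes):
            kept.append(path)
            hashes.append(h)
        else:
            path.unlink()  # drop redundant near-duplicate frame
    return kept
```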
In this section, we showcase examples of how our system retrieves relevant videos
from a large collection using text queries from the 2024 Ho Chi Minh City AI Challenge.
The 2024 Ho Chi Minh City AI Challenge dataset. The dataset used for
this year’s competition comprises news and events reported across various media
channels within the past 18 months. The dataset includes:
– Video: The total duration of video content is 500 hours, divided into three
batches. Batch 1 consists of 100 hours, while both Batch 2 and Batch 3
contain 200 hours each.
– Keyframe: These frames serve as representative snapshots capturing spe-
cific events at particular time points.
– Metadata: The dataset also includes descriptive, spatial, and temporal in-
formation that corresponds to the videos.
The OCR integration allows the search engine to effectively parse and utilize
textual information embedded in images, thereby providing a more comprehen-
sive and context-aware search capability.
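The sketch below illustrates one way such OCR-based indexing could be wired up: recognized on-screen text is stored in Elasticsearch and queried with full-text matching. The OCR engine (pytesseract), the "vie+eng" language setting, and the index and field names are assumptions; the paper does not state which OCR tool or schema ArtemisSearch actually uses.

```python
# Minimal sketch: OCR a keyframe and index the recognized text in
# Elasticsearch for keyword search. Engine, languages, and schema are
# illustrative assumptions.
import pytesseract
from PIL import Image
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_keyframe_text(frame_path, video_id, timestamp):
    """OCR a keyframe and store the recognized text with its video location."""
    text = pytesseract.image_to_string(Image.open(frame_path), lang="vie+eng")
    es.index(index="keyframe_ocr", document={
        "video_id": video_id,
        "timestamp": timestamp,
        "ocr_text": text.strip(),
    })

def search_ocr(query, k=10):
    """Full-text match over the OCR field, returning the top-k frames."""
    resp = es.search(index="keyframe_ocr",
                     query={"match": {"ocr_text": query}}, size=k)
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```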
Figure: (a) Single query: target frame at rank 28; (b) multiple queries: target frame at rank 1; (c) query descriptions for sequential events.
4 Conclusion
This paper presents ArtemisSearch, an innovative multimodal video retrieval sys-
tem that effectively addresses the growing challenges in managing and search-
ing large-scale video content. Our system makes several key contributions to
the field of multimedia information retrieval. First, we successfully integrated
state-of-the-art vision-language models (CLIP-ViT-H/14 [8] and BEiT3 [9]) with
OCR capabilities to create a comprehensive retrieval solution. The combination
of these technologies enables our system to understand both visual semantics
and textual information present in video frames, significantly improving search
accuracy and relevance. Second, our temporal-aware score combination and re-
ranking approach demonstrates the importance of considering temporal rela-
tionships [11] between video frames. This novel scoring mechanism, enhanced by
OCR-based refinement, helps deliver more contextually relevant search results
while maintaining computational efficiency. Third, the system’s architecture,
built on Milvus for vector similarity search and Elasticsearch for text indexing,
proves to be both scalable and efficient. The preprocessing pipeline, utilizing
FFmpeg and perceptual hashing, effectively handles the challenge of extracting
and managing representative keyframes while eliminating redundancy. Overall,
ArtemisSearch represents a significant step forward in making video content
more accessible and searchable, particularly beneficial for applications requiring
precise moment retrieval in large video collections.
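For readers interested in the storage layer summarized above, the following sketch shows how keyframe embeddings could be stored and queried in Milvus. The collection name, field names, local server URI, and the 1024-dimensional vector size (the CLIP ViT-H/14 output) are illustrative assumptions; the extra metadata keys rely on Milvus' dynamic-field support and are not necessarily the schema used by ArtemisSearch.

```python
# Minimal sketch: storing keyframe embeddings in Milvus and retrieving the
# nearest frames for a query embedding. Names and sizes are assumptions.
import random
from pymilvus import MilvusClient

DIM = 1024  # CLIP ViT-H/14 embedding size
client = MilvusClient(uri="http://localhost:19530")

client.create_collection(
    collection_name="keyframes",
    dimension=DIM,
    metric_type="COSINE",  # matches cosine-similarity scoring of CLIP features
)

# Insert a few hypothetical keyframe embeddings with their video locations.
rows = [
    {
        "id": i,
        "vector": [random.random() for _ in range(DIM)],  # stand-in for a real embedding
        "video_id": f"L01_V{i:03d}",
        "timestamp": 10.0 * i,
    }
    for i in range(3)
]
client.insert(collection_name="keyframes", data=rows)

# Search with the embedding of a text query (another random stand-in here).
query_vec = [random.random() for _ in range(DIM)]
hits = client.search(
    collection_name="keyframes",
    data=[query_vec],
    limit=2,
    output_fields=["video_id", "timestamp"],
)
for hit in hits[0]:
    print(hit["entity"]["video_id"], hit["entity"]["timestamp"], hit["distance"])
```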
5 Future Work
While our current approach demonstrates promising results, several compelling
research directions remain to be explored. One particularly intriguing avenue in-
volves the integration of audio-based search capabilities into our existing frame-
work. As we have observed in our preliminary investigations, many queries in-
herently contain voice-related information that could potentially enhance both
search speed and precision. Building upon the groundbreaking work of Le et
al. [12] in audio-based information retrieval, we envision developing a multimodal
search system that seamlessly combines textual and audio features. Furthermore,
we recognize the potential for enhancing query-information relationships through
advanced query reformulation strategies. Drawing inspiration from the innova-
tive approach proposed by Lokoč et al. [13], we plan to implement a context-
aware query expansion mechanism. Preliminary experiments suggest that such
reformulation strategies could significantly improve search accuracy.
References
1. Newton Spolaôr, Huei Diana Lee, Weber Shoity Resende Takaki, Leandro Augusto
Ensina, Claudio Saddy Rodrigues Coy, and Feng Chung Wu. A systematic review