Enhancing Video Retrieval with Robust CLIP-Based Multimodal System

Minh-Dung Le-Quynh∗ (Lazada Vietnam, Ho Chi Minh, Viet Nam)
Anh-Tuan Nguyen∗ (University of Science, Ho Chi Minh, Viet Nam)
Anh-Tuan Quang-Hoang∗ (Ford Motor, Los Angeles, United States)
Van-Huy Dinh∗ (HUTECH University, Ho Chi Minh, Viet Nam)
Tien-Huy Nguyen∗ (University of Information Technology, Ho Chi Minh, Viet Nam)
Hoang-Bach Ngo (University of Science, Ho Chi Minh, Viet Nam)
Minh-Hung An (FPT Telecom, Ho Chi Minh, Viet Nam)
ABSTRACT
In the rapidly evolving landscape of multimedia data, the need for efficient content-based video retrieval has become increasingly vital. To tackle this challenge, we introduce an interactive video retrieval system designed to retrieve data from vast online video collections efficiently. Our solution encompasses rich textual-to-visual descriptions, advanced human detection capabilities, and a novel sketch-text retrieval mechanism, rendering the search process comprehensive and precise. At its core, the system leverages the Contrastive Language-Image Pretraining (CLIP) model, renowned for its proficiency in bridging the gap between visual and textual data. Our user-friendly web application allows users to create queries, explore top results, find similar images, preview short video clips, and select and export pertinent data, enhancing the effectiveness and accessibility of content-based video retrieval.

CCS CONCEPTS
• Information systems → Information retrieval diversity.

KEYWORDS
multimodal retrieval, text-based image retrieval, sketch-based image retrieval, interactive video retrieval

ACM Reference Format:
Minh-Dung Le-Quynh, Anh-Tuan Nguyen, Anh-Tuan Quang-Hoang, Van-Huy Dinh, Tien-Huy Nguyen, Hoang-Bach Ngo, and Minh-Hung An. 2023. Enhancing Video Retrieval with Robust CLIP-Based Multimodal System. In The 12th International Symposium on Information and Communication Technology (SOICT 2023), December 7–8, 2023, Ho Chi Minh, Vietnam. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3628797.3629011

∗ All authors contributed equally to this paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SOICT 2023, December 7–8, 2023, Ho Chi Minh, Vietnam
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0891-6/23/12...$15.00
https://doi.org/10.1145/3628797.3629011

1 INTRODUCTION
The exponential growth of multimedia data, particularly video content on the internet, has ushered in an era where the effective and efficient retrieval of relevant information presents a pressing challenge. Content-based video retrieval, the task of retrieving video frames based on textual queries, has gained significant attention in recent years. User demand is steadily increasing, requiring faster query speeds and less time to locate a specific frame within a vast collection of videos based on the provided textual query.
In recent times, groundbreaking advancements in multimodal research have emerged in the form of powerful vision-language models [7, 11, 16] that bridge the gap between textual and visual modalities. Among them, Contrastive Language-Image Pretraining (CLIP) [14] has emerged as a powerful pretraining backbone for a broad range of text-image-related tasks. By learning from a vast collection of text-image pairs and maximizing the cosine similarity between matched text and image embedding vectors, CLIP has proven to be particularly well-suited for retrieving videos based on textual queries. The advent of CLIP has opened up many possibilities for developing robust and powerful content-based video retrieval pipelines.
In this paper, we introduce our system, which harnesses the power of the CLIP model to extract abstract, content-rich features from the videos in the dataset. To further enhance the capabilities of the CLIP model, we have integrated various supporting models and techniques, including human detection and retrieval from sketches and text. To facilitate rapid and robust retrieval, we efficiently index these features using Faiss, a state-of-the-art similarity search library, in conjunction with our database systems. This strategic integration empowers our system to provide faster and more reliable video retrieval, ensuring users receive accurate and relevant results. We participated in the Ho Chi Minh AI 2023 competition, a formidable challenge that requires the retrieval of pertinent video data from a massive 500+ hour video collection.
In the following sections, we delve into the architecture and components of our system, shedding light on how it addresses the challenges posed by the burgeoning volume of video data on the internet.

2 RELATED RESEARCH
Our daily lives produce a growing stream of personal data: actions captured in images, events we have participated in, knowledge we have acquired, personal information, and information about various organizations. This data serves to aid our memory, enabling us to search for past knowledge, retrace the paths we have traveled, and revisit images from specific events [3]. This concept has significantly contributed to the global growth of information retrieval, and people are increasingly inclined to retrieve and revisit data from their past.
In "Content-based image retrieval using the Tesseract OCR engine and Levenshtein algorithm" [1], the authors present a technique for conducting text-based OCR queries within images or videos. This approach is particularly valuable for large datasets containing small details written in multiple languages or in traditional characters specific to a particular nation. The extracted characters are organized into word groups, resembling a Bag of Words (BOW) [17] structure. This enables users to effectively leverage scene-specific attributes, such as street names or license plates, to determine the context they are searching for. The accuracy of this approach is high when the input is a high-resolution image with clearly represented features.
For users searching for songs while remembering only the lyrics, and sometimes unable to recall the song title or the artist's name, ASR models [5] transform the audio of a video into word clusters or sentences, which are then used to build a text-based search over speech. In the paper by Liu et al. [8], the authors focus on transfer learning methods to improve the efficiency of automatic speech recognition for Vietnamese. Specifically, they investigate pre-training and fine-tuning (PT/FT) methods [18, 20, 21], the Prognets architecture, and bottleneck features.
The AI Challenge (AIC) has organized a competition focused on data retrieval, emphasizing creativity and implementation among participants. In this paper, we leverage OpenAI's state-of-the-art model, CLIP [10, 14], to establish semantic congruence between text and images. CLIP maps text and images into a shared embedding space for feature comparison. Despite its high performance, this powerful model has not yet reached its maximum potential for accurately retrieving the required information. To unlock CLIP's capabilities and enhance its performance, we augment it with Faiss [4]-based search functionality. Faiss narrows the search space over feature vectors, especially for small or less accessible images within our current configuration.
This encompasses our preparation process for both text-based and video-preview queries. Since over half of the participating teams in the competition use CLIP as their primary model, we propose an additional solution: sketching remembered details to refine the results. Image search through sketching on modern touch devices has gained traction in recent years, and sketch-based image retrieval [2, 9, 13, 19] has become a hot topic. However, this type of search yields a vast result set, making it challenging to select the desired items. In our system, we combine sketching with text to narrow the search scope, aiming to develop a more compact representation, increase matching speed, and enhance the system's search performance [15]. Sketching proves adept at handling image details that cannot be queried with text alone. Developed on top of CLIP, it enables users to rediscover past images or forgotten memories. Moreover, our system allows users to search for the portions before or after the image in question. The system's architecture is depicted in Figure 1, illustrating the overall process, and is discussed in detail in the following sections.

3 VIDEO PREPROCESSING
TransNet is a powerful tool for shot-transition detection. In the initial stage of our system pipeline (Figure 1), known as video preprocessing, we use the TransNet model to segment long video sequences into a list of scenes. Following this, we employ FFmpeg¹, an open-source software framework, to extract three main keyframes, denoted as p_i, for each scene using Formula 1, where s and e represent the first and last frame positions of the scene, respectively.

p_i ∈ { s + (e − s) × (0.5i − 0.5) | i ∈ [1, 2, 3] }   (1)

Keyframe extraction from videos is a crucial stage, serving as a cornerstone for the effectiveness of our retrieval system. It enables concise information filtering and minimizes memory usage within the retrieval system. These keyframes undergo further processing in the subsequent indexing stages, as elaborated in Section 4.

¹ https://github.com/FFmpeg/FFmpeg
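To make Formula 1 concrete, the sketch below computes the three keyframe positions of a scene (its first, middle, and last frames) and hands them to FFmpeg for extraction. This is a minimal illustration rather than the paper's actual preprocessing script; the output naming, the 0-based frame numbering, and the use of FFmpeg's select filter are our own assumptions.

```python
import subprocess

def keyframe_positions(s: int, e: int) -> list[int]:
    """Formula 1: p_i = s + (e - s) * (0.5*i - 0.5) for i in {1, 2, 3}."""
    return [round(s + (e - s) * (0.5 * i - 0.5)) for i in (1, 2, 3)]

def extract_keyframes(video_path: str, s: int, e: int, out_prefix: str) -> None:
    # Extract each selected frame with FFmpeg (frame numbers assumed 0-based).
    for rank, p in enumerate(keyframe_positions(s, e), start=1):
        subprocess.run(
            [
                "ffmpeg", "-y", "-i", video_path,
                "-vf", f"select=eq(n\\,{p})", "-vframes", "1",
                f"{out_prefix}_{rank:02d}.jpg",
            ],
            check=True,
        )

# Example: a scene spanning frames 120..360 yields keyframes 120, 240, 360.
print(keyframe_positions(120, 360))  # [120, 240, 360]
```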
4 MULTIMODAL RETRIEVAL

4.1 Text-based Retrieval
The remarkable feature of the CLIP model [14] lies in its exceptional ability to connect text comprehension with image recognition, effectively bridging the gap between the two modalities. This has made CLIP a potent tool for the significant challenge of locating images and videos within extensive online multimedia content. Users can input text queries in natural language, enabling them to achieve their search objectives through a more intuitive and user-centric approach.
Our system applies the CLIP model for text-based retrieval through the following steps (a minimal code sketch of this flow follows the list):
• All keyframes collected from the videos are embedded into vectors, and these features are stored in Faiss before retrieval (indexing part in Figure 1).
• When a user inputs a text description query, it is embedded using the CLIP model to create a text-embedding vector. This step converts the text into a numerical vector that captures the query features, ready for the image-retrieval stage.
• After obtaining the text embedding of the query, we perform a cosine-similarity search between the text-embedding vector and all the keyframe-embedding vectors stored in Faiss (retrieving part in Figure 1). The returned result consists of the top-K closest vectors in Faiss, i.e., those with the highest similarity scores.
• To improve the performance of the embedding model, we utilize the latest large CLIP model, specifically the Vision Transformer pretrained at a 336-pixel resolution (ViT-L/14@336px).
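The sketch below illustrates this indexing-and-search flow with the open-source CLIP and Faiss libraries. It is an outline under our own naming and file handling, not the paper's exact implementation; an inner-product index over L2-normalized vectors is assumed as the cosine-similarity search.

```python
import clip          # https://github.com/openai/CLIP
import faiss
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

def embed_images(paths: list[str]) -> np.ndarray:
    """Embed keyframes with CLIP; L2-normalize so inner product equals cosine similarity."""
    with torch.no_grad():
        batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
        feats = model.encode_image(batch)
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy().astype("float32")

def build_index(keyframe_paths: list[str]) -> faiss.Index:
    vectors = embed_images(keyframe_paths)
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product on normalized vectors
    index.add(vectors)
    return index

def search(index: faiss.Index, query: str, top_k: int = 100):
    """Return (keyframe id, similarity) pairs for the top-K closest keyframes."""
    with torch.no_grad():
        text = clip.tokenize([query]).to(device)
        q = model.encode_text(text)
        q = q / q.norm(dim=-1, keepdim=True)
    scores, ids = index.search(q.cpu().numpy().astype("float32"), top_k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```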

Figure 1: Video retrieval system architecture. Our video retrieval system architecture comprises three distinct phases. The initial
phase, known as video preprocessing, involves scene detection using the TransNet model. Subsequently, FFmpeg is utilized
to extract keyframes from the video, and these keyframes are stored in a MongoDB database. In the indexing phase, each
keyframe in the database undergoes vector embedding. These embeddings are then indexed using the Faiss library. Additionally,
we employ a Human Detection and Gender Recognition model to gather crucial information for re-ranking. Lastly, in the
retrieval phase, we encode the user’s query into an embedding using specialized models. Leveraging the k-nearest neighbors
(kNN) algorithm, we identify the most relevant videos. To optimize retrieval accuracy, we implement a simple yet effective
re-ranking algorithm based on human detection.

CLIP brings easy application, cost-efficiency, and high effectiveness to image retrieval. Its natural-language query capability simplifies the process, reducing complexity and making it user-friendly. The low implementation cost makes it accessible across various applications, and its efficiency in processing natural-language queries ensures rapid and accurate image retrieval.

4.2 Image Retrieval With Text And Sketch
Retrieving images of events through text-based queries has been a fundamental capability of image retrieval systems, enabling users to describe events using words and phrases, facilitated by the powerful state-of-the-art CLIP model in vision-language processing. Although relying solely on text queries is a common approach, it depends heavily on the effectiveness of the pre-trained CLIP model and on how users phrase their queries. Given the same visual information about an event, each user may compose the written query in a different way, and each phrasing can yield results of varying quality. To address this, our system introduces a retrieval approach that combines text and sketch queries. This integration allows for a richer representation of information, enhancing the accuracy of retrieving images related to the event being searched. In the case of sketch queries, users draw a specific shape representing the image they aim to retrieve. These sketched images remain consistent regardless of the user, providing a more precise means of conveying retrieval intentions.

We employ a pre-trained model called TASK-former [15] to construct this query method. TASK-former is trained on top of CLIP (ViT-B/16) and is designed for combined query operations, taking two inputs: a sketch image and text. The model is optimized to handle even poorly drawn sketches, proving more effective than traditional text-only image retrieval.
To implement this query method within the system, we follow a specific process. First, we extract features for all images in the database, save them to a NumPy file, and compile them into a binary file for storage in Faiss (indexing part in Figure 1). When a query is entered, we encode both the text and the sketch using the encoding functions of the TASK-former model, which yields one feature vector for each query type. We then combine these two vectors into one using the function TASK-former provides for this purpose. Let z_i and z_j denote the embedding vectors obtained from the text query and the sketch, respectively; we apply Formula 2 from TASK-former:

y = (z_i + z_j) / 2   (2)

The resulting merged vector, representing both queries, is then searched in Faiss to find the closest vectors and return the appropriate results (retrieving part in Figure 1).
This combination signifies a promising paradigm shift in image retrieval, aiming to bridge the gap between the linguistic and visual aspects of image search. By harnessing natural language processing for text-based queries and computer vision for sketch-based queries, we strive to pave a new avenue for event image retrieval.
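A minimal sketch of this combination step is shown below. It assumes the text and sketch encoders already return L2-normalized vectors (the paper relies on TASK-former's own encoding and fusion functions, which are not reproduced here); only the averaging of Formula 2 and the subsequent Faiss lookup are illustrated.

```python
import faiss
import numpy as np

def fuse_embeddings(z_text: np.ndarray, z_sketch: np.ndarray) -> np.ndarray:
    """Formula 2: average the text and sketch embeddings.
    Re-normalizing afterwards is our own assumption, not part of Formula 2."""
    y = 0.5 * (z_text + z_sketch)
    return y / np.linalg.norm(y)

def sketch_text_search(index: faiss.Index,
                       z_text: np.ndarray,
                       z_sketch: np.ndarray,
                       top_k: int = 100):
    # Search the fused query vector against the indexed keyframe features.
    y = fuse_embeddings(z_text, z_sketch).astype("float32").reshape(1, -1)
    scores, ids = index.search(y, top_k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```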
4.3 Human Filter And Re-Ranking The Results
While the CLIP model demonstrates promising performance in video retrieval tasks, it has certain limitations. Notably, it struggles with counting objects and with accurately measuring distances between objects in an image. In the AIC contest, some text queries provide minimal contextual information, focusing primarily on the number of people, their gender, and their appearance in a short video clip.
To mitigate these challenges while still capitalizing on the strengths of the embedding model, we have implemented a human filter to re-rank the results generated by the CLIP model. We employ MiVOLO² [6], a state-of-the-art model for gender recognition pre-trained on the IMDB-Cleaned dataset, to detect and count the humans, males, and females appearing in keyframes. This information is stored locally. Users can then select a filter (Female, Male, or Both) and specify a particular number to re-rank the results of text or sketch-text retrieval accordingly.
² https://github.com/WildChlamydia/MiVOLO
Taking the female filter as an example, see Algorithm 1. The re-ranking algorithm compares the query number n with the female count detected by the MiVOLO model and assigns each keyframe to one of three lists: the equal list (E), the larger list (L), or the smaller list (S). These lists are then concatenated in order to create a single list (B) that is displayed on the screen.

Algorithm 1 Re-ranking on female filter
Require: List of retrieval keyframes A, query number n
Ensure: List of re-ranked keyframes B
 1: Initialize three empty lists: the equal-list E, the larger-list L, and the smaller-list S
 2: for each keyframe a in A do
 3:   if the female count in keyframe a is equal to query number n then
 4:     Add a to E
 5:   else if the female count in keyframe a is larger than query number n then
 6:     Add a to L
 7:   else if the female count in keyframe a is smaller than query number n then
 8:     Add a to S
 9:   end if
10: end for
11: Concatenate the three lists E, L, and S, in that order, into a single list B

We have similar filters for males and for both genders, employing the same approach as the female filter above. This reordering technique is simple but significantly improves the order of CLIP retrieval results, particularly when users know precisely how many people (male, female, or both) appear in the frame.
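The re-ranking step maps directly to a few lines of Python. The sketch below assumes each retrieved keyframe carries a pre-computed female_count field produced by the gender-recognition model (a hypothetical field name); keeping the insertion order within each bucket preserves the original CLIP ranking, which is how we read Algorithm 1.

```python
from dataclasses import dataclass

@dataclass
class Keyframe:
    keyframe_id: int
    female_count: int  # pre-computed by the gender-recognition model and stored locally

def rerank_by_female_count(retrieved: list[Keyframe], n: int) -> list[Keyframe]:
    """Algorithm 1: bucket keyframes by comparing their female count to the query number n."""
    equal, larger, smaller = [], [], []
    for kf in retrieved:               # `retrieved` is already sorted by CLIP similarity
        if kf.female_count == n:
            equal.append(kf)
        elif kf.female_count > n:
            larger.append(kf)
        else:
            smaller.append(kf)
    return equal + larger + smaller    # exact matches first, then over-counts, then under-counts
```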
5 SYSTEM OVERVIEW
Our content-based video retrieval system is built upon three fundamental building blocks: the system architecture dedicated to data processing and retrieval, a data service block managing data storage and API operations, and a user interface facilitating interactions with the search system. To initiate the retrieval process, videos are first processed and their frames compressed for optimal performance. We have opted for MongoDB as our data storage solution because of its efficient key-value access to the stored data. Once all video frames are stored in the database, the system retrieves video frame information using either the sketch or the CLIP model via a PyMongo connection. These video frames are then converted into a binary format and presented as results on the user interface through API calls.

5.1 System Architecture
The infrastructure serves as the primary component for handling data inputs and outputs. We have selected the Flask framework to optimize performance and API traffic between the system and the user interface within the infrastructure service layer. After the frames are extracted and their information is converted into binary files using the sketch and CLIP models, these files are stored within the system to compute and obtain the correct image indices.

Figure 2: The user interface consists of three major components. Component A allows users to input text, use a filter system,
and input sketch data. Once values are provided in Component A, Component B displays the corresponding images based
on these inputs. Users can expand the images in Component B for more details and utilize the K-Nearest Neighbors (KNN)
algorithm to search for similar images by clicking on a preferred image. Additionally, Component B includes a feature to
display subsequent frames (Component D), both preceding and following the selected image when using the KNN search.
Component C is designed for selected images, which can be chosen by clicking checkboxes in Component B.

Before returning the image indices, the system processes the information coming from the user interface, namely the text input and the binary sketch data. Depending on the text input, the system translates it into English for the CLIP model. These inputs are then vectorized for computation, aiding in the retrieval of indices from the binary feature files (sketch and CLIP). The system architecture uses CLIP with a ViT-L/14@336px backbone for text and ViT-B/16 for sketches, converting readable information into vectors for Faiss processing.
Faiss performs the critical task of searching and computing the similarities between the input text or sketch vectors and the stored feature files, returning the image indices. These image indices serve as keys to retrieve images from the MongoDB data server. The PyMongo framework, acting as the connection between the MongoDB server and the system architecture, fetches the images from the database based on the input indices. The binary images are then retrieved, stored in JSON format, and pushed to the user interface for display.
The filter system, also referred to as the re-ranking system, filters on the number of males, females, or both in the video keyframes. Once the indices are received from Faiss, they pass through this filter layer before the final indices are returned for processing. The filter layer counts the number of males, females, or both based on the values provided by the user interface. In Algorithm 1, there are two input parameters: a list of retrieval keyframes and the query number n (representing the count of males, females, or both). After passing through the conditions in the filter, a final list of indices B in a specific order is returned. These ordered indices are then used to retrieve images from the MongoDB database.
The API traffic system is structured using the Flask framework and features three primary API routes: get-text-search, get-sketch-search, and get-image-search. Each route serves a specific function within the system. The get-text-search route encodes the input text into vectors for further processing and ultimately returns the corresponding image indices. The get-sketch-search route handles both sketch data and input text: in this pipeline, the data undergoes preprocessing, is encoded into numerical vectors, and is normalized to ensure consistent scaling before being passed to Faiss, a powerful search engine over image indices. Acting as a gateway to the k-nearest neighbors (KNN) search functionality, the get-image-search route allows users to input an image index, prompting Faiss to execute a search and return the indices of images closely resembling the input image.
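As an illustration of how these three routes could be wired together, the sketch below exposes them in Flask and delegates the search to a Faiss index and the encoders built in earlier sections. The route names match the paper; the request/response field names (query, sketch, image_id, k) and the encode_text/encode_sketch_and_text helpers are our own assumptions.

```python
from flask import Flask, jsonify, request

def create_app(index, encode_text, encode_sketch_and_text):
    """index: a Faiss index over keyframe embeddings; the two encoders are assumed to
    return L2-normalized float32 arrays of shape (1, d). All three are built at startup."""
    app = Flask(__name__)

    @app.route("/get-text-search", methods=["POST"])
    def get_text_search():
        body = request.get_json()
        q = encode_text(body["query"])                               # CLIP text embedding
        _, ids = index.search(q, int(body.get("k", 100)))
        return jsonify({"indices": ids[0].tolist()})

    @app.route("/get-sketch-search", methods=["POST"])
    def get_sketch_search():
        body = request.get_json()
        q = encode_sketch_and_text(body["sketch"], body["query"])    # fused embedding (Formula 2)
        _, ids = index.search(q, int(body.get("k", 100)))
        return jsonify({"indices": ids[0].tolist()})

    @app.route("/get-image-search", methods=["POST"])
    def get_image_search():
        body = request.get_json()
        q = index.reconstruct(int(body["image_id"])).reshape(1, -1)  # reuse the stored keyframe vector
        _, ids = index.search(q, int(body.get("k", 100)))
        return jsonify({"indices": ids[0].tolist()})

    return app
```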
5.2 Data Management
All extracted frames and their associated metadata are stored in MongoDB. The 'id' field serves as the key for PyMongo to efficiently search and return an array of objects for the provided input IDs. 'FileName' stores the video name corresponding to the respective frame index, and 'ImageData' stores the binary format of the image linked to that frame index. The keyframes, extracted using TransNet and FFmpeg, are stored consecutively and indexed from 0 to the total number of keyframes (Table 1).

Table 1: Data format in MongoDB

Type    Key        Value (Example)
Number  id         0
String  FileName   L01_V001/000021
Binary  ImageData  data:image/png;base64,iVBORw0KG
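A small sketch of this document layout and the id-based lookup is given below; the connection string and the database and collection names (aic2023, keyframes) are placeholders, and only the three fields from Table 1 are stored.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
keyframes = client["aic2023"]["keyframes"]          # placeholder db/collection names

def insert_keyframe(frame_id: int, filename: str, image_data_b64: str) -> None:
    """Store one keyframe document following the schema of Table 1."""
    keyframes.insert_one({
        "id": frame_id,                # consecutive index, starting at 0
        "FileName": filename,          # e.g. "L01_V001/000021"
        "ImageData": image_data_b64,   # base64-encoded PNG, e.g. "data:image/png;base64,..."
    })

def fetch_by_ids(ids: list[int]) -> list[dict]:
    """Return keyframe documents for the indices produced by the Faiss search."""
    cursor = keyframes.find({"id": {"$in": ids}}, {"_id": 0})
    by_id = {doc["id"]: doc for doc in cursor}
    return [by_id[i] for i in ids if i in by_id]     # preserve the ranked order
```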
5.3 Graphical User Interface
The graphical user interface comprises three main components. Part A is designated for inputting queries or sketching images, with an integrated face recognition filter system. Part B is responsible for displaying the image results for the text query or sketch provided as input; the number of images displayed is controlled by the 'k' value mentioned in the system architecture.

Figure 3: In text-based search with filters, we start by entering a text query in (1) and then clicking the search button to view
the retrieval results in (2). Initially, the target scene is ranked 53rd. To improve its ranking, we apply the "Male" and "Both"
filters and specify the presence of only one man in the scene by choosing the option "1." Clicking the "Search" button again, we
see the results in (4), and now the target scene is ranked ninth.

Figure 4: Search based on text and sketch involves two steps. First, we input a text query and a sketch (1). Second, the search results are displayed and the target frame is selected (2).

Part C is dedicated to selected keyframes, allowing users to select keyframes by clicking checkboxes in Part B.
In Part A, users can input text queries in either Vietnamese or English for text-based image retrieval. Additionally, there is a filter system capable of filtering on the number of males, females, or both by utilizing the re-ranking method. Upon clicking the search button, the input text and filter values are transmitted to the system architecture for processing, which returns an array of image objects that are displayed in Part B. In Figure 2, the filter system in the user interface accepts values for males, females, or both, with checkboxes next to each value to enable or disable them for processing and searching. Regarding sketches, users can toggle the sketch feature on or off, providing additional information for CLIP to re-rank similar images based on the sketch. The image, text, and filter values are sent to the system architecture for processing, which in turn returns an array of image objects for display in Part B. Beyond these capabilities, our system offers a seamless experience for exploring and analyzing visual data. The 'subsequent frames' feature, mentioned previously, allows users to view not only the current frame but also its surrounding context: by presenting 20 frames before and 20 frames after the selected frame during a KNN search, it offers an extensive chronological perspective and ensures that the context of the chosen frame is presented clearly.
Part C is designated for the selection of frames for submission. After clicking checkboxes to select images in Part B, the chosen frames are displayed in Part C for review before submission. If users wish to remove frames, the graphical user interface allows them to click and remove frames in Part C. Users can then submit all selected frames to the system or download a CSV file to their local storage.

Figure 5: Image-by-example search comprises three steps: Step 1 involves making a text query or a text-sketch query, as shown
in (1). In Step 2, we double-click the reference scene in the retrieval results displayed in (2). Step 3 involves checking if the
target scene appears on the screen, as depicted in (3). If it does not, we return to Step 1.

5.4 System Utilization
Text retrieval is the most commonly employed function in textual KIS (Known-Item Search). Figure 3 provides an example of this function. In the first step, we input the text query: "A market controller is looking at information about a pair of sports shoes. This person is looking at the information under the shoe insole." The target frame is initially ranked 53rd. To achieve a higher ranking, we employ the human filter in step (3) to re-rank the results based on the number of men and the total number of people (both equal to 1). As a result, the target frame now occupies the ninth position, a significantly improved placement.
When retrieving a short video clip from the database to solve video KIS, the sketch function can be used to describe the video with text and sketches. To illustrate this function, refer to Figure 4. The video features three Samsung phones during a product launch. In step (1), we enter the query "The video shows 3 phones," and in step (2), we sketch the basic outlines of the three phones as three rectangles stacked on top of each other, ensuring that the proportions of the outlines on the canvas match the positioning of the three phones in the video's aspect ratio. After clicking the search button, the results display images that closely resemble the sketched image and the entered description.
Beyond the two query methods above, we propose an additional querying method, illustrated in Figure 5. After entering the query "A girl with her hair tied up stands against a yellow pillar. Nearly half of the frame is covered by the wall. On the ground are grids woven from leaves, stacked on top of each other into many piles" in step (1), the results returned in step (2) only contain related objects mentioned in the query but not the correct frame. To find the exact answer, we select images that are likely to contain it and examine the Subsequent Frames in Component D of Figure 2. If the target is not found, we repeat step (1) until we find the correct frame containing the girl standing next to the yellow pillar and the stacked leaf grids, as described in step (3).

6 FUTURE WORK
We plan to implement voice queries as a new feature to reduce typing time for text queries. This gives users an opportunity to access information as seamlessly as possible, and the feature also lends itself to convenient mobile searching.
Furthermore, some keyframes contain a substantial amount of text, a challenge known as the scene-text problem. Leveraging the information within these texts through optical character recognition techniques [12] would enhance query performance.
Finally, there is the concept of a system that suggests classes for the result set of the current query being displayed [10]. Such a system helps users identify missing words and ideas for describing video frames, ultimately facilitating the generation of improved queries. This concept may be a subject of future work.

ACKNOWLEDGMENTS
This research is supported by AI VIETNAM.

REFERENCES
[1] Charles Adjetey and Kofi Sarpong Adu-Manu. 2021. Content-based image retrieval using Tesseract OCR engine and Levenshtein algorithm. International Journal of Advanced Computer Science and Applications 12, 7 (2021).
[2] Ayan Kumar Bhunia, Yongxin Yang, Timothy M Hospedales, Tao Xiang, and Yi-Zhe Song. 2020. Sketch less for more: On-the-fly fine-grained sketch-based image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9779–9788.
[3] Mariona Carós, Maite Garolera, Petia Radeva, and Xavier Giro-i Nieto. 2020. Automatic reminiscence therapy for dementia. In Proceedings of the 2020 International Conference on Multimedia Retrieval. 383–387.
[4] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
[5] Peter Kitzing, Andreas Maier, and Viveka Lyberg Åhlander. 2009. Automatic speech recognition (ASR) and its use as a tool for assessment or therapy of voice, speech, and language disorders. Logopedics Phoniatrics Vocology 34, 2 (2009), 91–96.
[6] Maksim Kuprashevich and Irina Tolstykh. 2023. MiVOLO: Multi-input Transformer for Age and Gender Estimation. (2023). arXiv:2307.04616

[7] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML.
[8] Danyang Liu, Ji Xu, Pengyuan Zhang, and Yonghong Yan. 2019. Investigation of knowledge transfer approaches to improve the acoustic modeling of Vietnamese ASR system. IEEE/CAA Journal of Automatica Sinica 6, 5 (2019), 1187–1195.
[9] Li Liu, Fumin Shen, Yuming Shen, Xianglong Liu, and Ling Shao. 2017. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2862–2871.
[10] Jakub Lokoč, Zuzana Vopálková, Patrik Dokoupil, and Ladislav Peška. 2023. Video Search with CLIP and Interactive Text Query Reformulation. In International Conference on Multimedia Modeling. Springer, 628–633.
[11] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems. 13–23.
[12] Ravina Mithe, Supriya Indalkar, and Nilam Divekar. 2013. Optical character recognition. International Journal of Recent Technology and Engineering (IJRTE) 2, 1 (2013), 72–75.
[13] Yonggang Qi, Yi-Zhe Song, Honggang Zhang, and Jun Liu. 2016. Sketch-based image retrieval via Siamese convolutional neural network. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2460–2464.
[14] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[15] Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, and James Hays. 2022. A sketch is worth a thousand words: Image retrieval with text and sketch. In European Conference on Computer Vision. Springer, 251–267.
[16] Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
[17] Chih-Fong Tsai. 2012. Bag-of-words representation in image annotation: A review. International Scholarly Research Notices 2012 (2012).
[18] Keiji Yanai and Yoshiyuki Kawano. 2015. Food image recognition using deep convolutional network with pre-training and fine-tuning. In 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 1–6.
[19] Sasi Kiran Yelamarthi, Shiva Krishna Reddy, Ashish Mishra, and Anurag Mittal. 2018. A zero-shot framework for sketch based image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV). 300–317.
[20] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems 27 (2014).
[21] Dong Yu, Li Deng, and George Dahl. 2010. Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition. In Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
