Major Project (F)
Project Report
on
Smart Event Timestamping: Query-Driven Video Temporal Grounding
Submitted for partial fulfilment of the requirements for the award of the degree
of
BACHELOR OF ENGINEERING
in
Computer Science and Engineering
By
D Mohammad Saquib (2451-20-733-132)
Surya Danturty (2451-20-733-163)
Certificate
This is to certify that the project work entitled “Smart Event Timestamping: Query-Driven Video Temporal Grounding” is a bonafide work carried out by Mr. D Mohammad Saquib (2451-20-733-132) and Mr. Surya Danturty (2451-20-733-163) in partial fulfilment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science and Engineering from Maturi Venkata Subba Rao (MVSR) Engineering College, affiliated to OSMANIA UNIVERSITY, Hyderabad, during the Academic Year 2023-24 under our guidance and supervision.
The results embodied in this report have not been submitted to any other university or
institute for the award of any degree or diploma to the best of our knowledge and belief.
External Examiner
DECLARATION
This is to certify that the work reported in the present project entitled “Smart Event
Timestamping: Query-Driven Video Temporal Grounding” is a record of
bonafide work done by us in the Department of Computer Science and Engineering,
Maturi Venkata Subba Rao (MVSR) Engineering College, Osmania University
during the Academic Year 2023-24. The reports are based on the project work done
entirely by us and not copied from any other source. The results embodied in this
project report have not been submitted to any other University or Institute for the
award of any degree or diploma.
ACKNOWLEDGEMENTS
We would like to express our sincere gratitude and indebtedness to our project guide
Mr. K Murali Krishna for his valuable suggestions and interest throughout the course
of this project.
We are also thankful to our principal Dr. Vijaya Gunturu and Mr. J Prasanna
Kumar, Professor and Head, Department of Computer Science and Engineering,
Maturi Venkata Subba Rao Engineering College, Hyderabad for providing excellent
infrastructure for completing this project successfully as a part of our B.E. Degree
(CSE). We would like to thank our project coordinator for his constant monitoring,
guidance and support.
We convey our heartfelt thanks to the lab staff for allowing us to use the required
equipment whenever needed. We sincerely acknowledge and thank all those who gave
directly or indirectly their support in the completion of this work.
VISION
• To impart technical education of the highest standards, producing competent
COURSE OBJECTIVES AND OUTCOMES
Course Outcomes:
CO1: Summarize the survey of the recent advancements to infer the problem
statements with applications towards society.
CO2: Design a software-based solution within the scope of the project.
CO3: Implement, test and deploy using contemporary technologies and tools.
CO4: Demonstrate qualities necessary for working in a team.
CO5: Generate a suitable technical document for the project.
ABSTRACT
Video Temporal Grounding (VTG) aims to ground target clips from videos, such as consecutive intervals or disjoint shots, according to customized user queries. Our approach redefines the VTG landscape by integrating diverse labels and tasks into a unified formulation, enabling robust model training across different VTG
applications. We utilize progressive data annotation techniques and scalable pseudo-
supervision methods to create extensive and versatile training datasets. This strategy
allows our models to generalize effectively across multiple VTG tasks, enhancing their
capability to handle various query types and video content.
Extensive experiments on three key VTG tasks across seven diverse datasets,
including QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights,
TVSum, and QFVS, validate the effectiveness and flexibility of our framework. Our
project aims to revolutionize video content accessibility, providing users with a
powerful tool to navigate and discover relevant video segments effortlessly.
TABLE OF CONTENTS
PAGE NOS.
Certificate .......................................................................................................... i
Declaration ........................................................................................................ ii
Acknowledgements ........................................................................................... iii
Vision & Mission, PEOs, POs and PSOs ......................................................... iv
Abstract.............................................................................................................. vii
Table of contents................................................................................................ viii
List of Figures ................................................................................................... x
List of Tables ................................................................................... x
CONTENTS
CHAPTER I
1. INTRODUCTION 01 – 05
1.1 PROBLEM STATEMENT 02
1.2 OBJECTIVE 02
1.3 MOTIVATION 02
1.4 SCOPE OF SMART EVENT TIMESTAMPING 03
1.5 SOFTWARE AND HARDWARE REQUIREMENTS 04
CHAPTER II
2. LITERATURE SURVEY 06 – 07
2.1 SURVEY OF SMART EVENT TIMESTAMPING 06
CHAPTER III
3. SYSTEM DESIGN 08 – 16
3.1 FLOW CHARTS 08
3.2 SYSTEM ARCHITECTURE 08
3.3 UML DIAGRAMS 11
3.4 PROJECT PLAN 13
CHAPTER IV
4. SYSTEM IMPLEMENTATION & METHODOLOGIES 17 – 35
4.1 SYSTEM IMPLEMENTATION 17
4.2 DATASET UTILIZATION 18
4.3 VIDEO TEMPORAL GROUNDING 20
4.4 VISION LANGUAGE PRE-TRAINING 21
4.5 UNIFIED FORMULATION 22
4.6 UNIFIED MODEL 26
4.7 INFERENCE 29
4.8 INTEGRATION AND DEPLOYMENT 29
4.9 USER INTERFACE 33
CHAPTER V
5. TESTING AND RESULTS 36 – 44
5.1 EVALUATION METRICS 36
5.2 COMPARISON 37
5.3 TEST CASES 38
CHAPTER VI 45 – 46
6. CONCLUSION & FUTURE ENHANCEMENTS 45
REFERENCES 47
APPENDIX 48
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
INTRODUCTION
In today's digital world, where social media reigns supreme, video content has
become a staple, providing a treasure trove of knowledge and entertainment. From
educational how-to guides to elaborate vlogs, the variety of online videos is immense and
constantly growing. However, this bounty also presents a challenge: how do users
constantly growing. However, this bounty also presents a challenge: how do users
efficiently sift through the vastness of videos to find ones that align with their interests
and questions?
The ability to pinpoint relevant moments within videos based on user queries has
emerged as a critical capability in enhancing the video browsing experience. This
capability, known as Video Temporal Grounding (VTG), enables users to seamlessly
access specific segments of video content aligned with their unique preferences and
queries. Tasks such as moment retrieval, highlight detection, and video summarization
have been developed to address this need, each focusing on different aspects of
temporal grounding.
This project represents the culmination of our efforts to improve video browsing.
Our goal is to enhance the experience, making it more fluid and personalized based on
user queries and preferences. Ultimately, we aim to revolutionize video browsing,
providing users with a seamless and customized experience that meets their unique
needs.
1.2 Objective
The primary goal of this project is to transform the way users interact with video
content on social media platforms. By implementing Unified Video Temporal
Grounding, we aim to provide users with a browsing experience that is both seamless
and highly personalized. This objective stems from the challenges users encounter
when trying to sift through the vast amount of video content available online, often
leading to frustration and a sense of being overwhelmed. Through advanced model
training techniques and the integration of innovative query-based technology, our
project seeks to address these challenges head-on. Our ultimate aim is to empower users
to input their specific queries and have the system dynamically search through video
content to deliver precise moments of events that match their interests and preferences.
By achieving this objective, we aspire to create a tool that enhances the overall user
experience on social media platforms, making video browsing more efficient,
enjoyable, and engaging for all users.
1.3 Motivation
Our motivation springs from a fascinating blend of real-world applications and
formidable challenges inherent in the realm of social media video browsing. Picture
this: a digital universe where users effortlessly unearth content that speaks directly to
their passions, seamlessly navigating through a seemingly boundless sea of videos. This
vision serves as the driving force behind our endeavour to harness the potential of
Unified Video Temporal Grounding, effectively bridging the gap between users' intent
and the vast array of video content available online.
In essence, our motivation is rooted in the desire to empower users, offering them a
bespoke video browsing experience that not only meets but exceeds their expectations.
We are committed to surmounting the challenges posed by the abundance of video
content and the fluid nature of social media trends, ultimately enhancing user
engagement and satisfaction in the digital sphere.
1.5 Software and Hardware Requirements
• Deep Learning Frameworks: TensorFlow and PyTorch are widely used deep
learning frameworks for building and training neural networks. Both frameworks
offer support for GPU acceleration and provide extensive documentation and
community support.
• Text Processing Libraries: Natural Language Processing (NLP) tasks, such as query
interpretation, may require libraries like NLTK (Natural Language Toolkit) or
spaCy for text processing and analysis.
• Visual Studio Code: Visual Studio Code is a versatile code editor with a rich set of
features and extensions. It provides an integrated development environment for
writing, debugging, and managing the project codebase.
• Git: Git is a distributed version control system that facilitates collaboration, code
management, and tracking of changes in the project. It is used to maintain the
project code repository, manage branches, and track the project's evolution over
time.
CHAPTER 2
LITERATURE SURVEY
2.1 Survey of Smart Event Timestamping
The literature survey for this project reviewed research papers, scholarly
articles, and conference proceedings in computer vision, machine learning, and
information retrieval. Key advancements in video understanding, query-based
browsing, event timestamping, personalization, efficiency, and scalability were
identified.
In video understanding, significant work by Hendricks et al. and Gao et al. has
enhanced moment retrieval using natural language queries. Studies by Cao et al. and
Badamdorj et al. advanced highlight detection through joint visual and audio learning
techniques. Unified frameworks, like the Slowfast networks proposed by Feichtenhofer
et al., leverage spatial and temporal information for superior video recognition.
Escorcia et al. extended this by integrating natural language processing for temporal
localization.
Query-based video browsing, explored by Chen et al. and Ghosh et al., improves
user interaction by matching natural language queries with video segments, enhancing
personalization. Alwassel et al. introduced temporally-sensitive pretraining for better
localization, while Bain et al. proposed joint video and image encoders for efficient
multimedia retrieval.
| S.No | Author | Yr. of Pub | Technique | Summary | Limitation |
|------|--------|------------|-----------|---------|------------|
| 1 | Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell | 2017 | Localizing moments in video with natural language | Proposed a method to map textual descriptions to video segments. | Requires precise and clear natural language input. |
| 2 | Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell | 2019 | Temporal localization of moments in video collections with natural language | Advanced the understanding of moment retrieval in large video collections. | Scalability and efficiency issues in larger datasets. |
| 3 | Max Bain, Arsha Nagrani, Gul Varol, Andrew Zisserman | 2021 | Frozen in Time: Joint video and image encoder for end-to-end retrieval | Demonstrated the potential of multimodal pretraining for improved video and image retrieval. | Complex training process and high computational requirements. |
| 4 | Soham Ghosh, Anuva Agarwal, Zarana Parekh, Alexander G. Hauptmann | 2019 | EXCL: Extractive clip localization using natural language descriptions | Enhanced the efficiency and accuracy of finding specific video clips based on user queries. | Limited by the quality of natural language descriptions provided by users. |
| 5 | Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, Yuexian Zou | 2022 | LOCVTP: Video-Text Pretraining for Temporal Localization | Pushed the boundaries of temporal localization tasks by leveraging extensive pretraining on video-text pairs. | High dependency on the quality and diversity of pretraining data. |
CHAPTER 3
SYSTEM DESIGN
3.1 Flow Chart
Flowcharts will serve as visual representations of the sequential process in our
project, Smart Event Timestamping: Query-Driven Video Temporal Grounding. The
flowcharts will depict the step-by-step workflow, commencing from user queries input
to the retrieval of relevant video segments. Each stage, including query interpretation,
video search, event timestamping, and result presentation, will be delineated using
standard symbols and arrows, elucidating the logical flow and decision points. These
visual aids will facilitate comprehension of the system's operation, pinpoint potential
bottlenecks, and streamline communication among project stakeholders, fostering a
cohesive development process.
3.2 System Architecture
1. Video Encoder
Purpose: Converts input video clips into a rich set of feature representations.
Components: Utilizes convolutional neural networks (CNNs) which are adept at
capturing spatial and temporal features in video data.
Process: Each clip V is passed through the CNN layers to extract features that capture
the visual content, including motion and objects present in the clip.
Output: Produces a set of feature vectors that represent different aspects of the visual
data.
2. Text Encoder
Purpose: Processes the free-form text query to understand its semantic content.
Components: Composed of feed-forward networks and multi-head self-attention
layers.
Process: The query Q, which consists of Lq tokens, is embedded and passed through
the encoder to generate a contextualized representation of the text.
Output: Generates a text feature vector that encapsulates the meaning and context of
the query.
3. Feature Fusion
Purpose: Integrates the video and text features to create a unified representation for
further processing.
Components: Involves operations like dot product and concatenation.
Process:
i. Saliency Path: Combines video and text features using a dot product to calculate the
relevance of each video clip to the query.
ii. Offsets Path: Concatenates video and text features, providing input for the module
calculating temporal boundaries.
iii. Indicator Path: Similarly concatenates features to determine if a clip is relevant to
the query.
Output: Produces fused features that are contextually aligned with the query for each
video clip.
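To make the three fusion paths concrete, the sketch below shows one minimal way they could be wired up in PyTorch. The tensor shapes, the helper name `fuse_features`, and the use of a single pooled text vector are illustrative assumptions, not the project's exact implementation.

```python
# A minimal sketch (not the exact project code) of the fusion paths described
# above, assuming D-dimensional clip features `vid` of shape (Lv, D) and a
# pooled text feature `txt` of shape (D,).
import torch

def fuse_features(vid: torch.Tensor, txt: torch.Tensor):
    # Saliency path: dot product measures how relevant each clip is to the query
    saliency_scores = vid @ txt                                # (Lv,)

    # Offsets / indicator paths: concatenate the text feature to every clip
    # feature, giving the boundary and foreground heads a query-conditioned input
    txt_expanded = txt.unsqueeze(0).expand(vid.size(0), -1)    # (Lv, D)
    fused = torch.cat([vid, txt_expanded], dim=-1)             # (Lv, 2D)
    return saliency_scores, fused

vid = torch.randn(75, 512)   # e.g., 75 two-second clips
txt = torch.randn(512)
saliency, fused = fuse_features(vid, txt)
```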
4. Attention Pooler:
Purpose: Enhances the model's focus on relevant portions of the video clips based on
the query.
Components: Uses multi-head self-attention mechanisms.
Process: Applies attention pooling to the fused features, allowing the model to weigh
the importance of different parts of the video in relation to the query.
Output: Refined feature vectors that emphasize the most relevant aspects of the video
clips.
5. Output Modules
i. Foreground Indicator
Purpose: Binary classifier to indicate whether a clip is part of the foreground (relevant
to the query).
Components: Utilizes feed-forward neural networks.
Output: A binary value (0 or 1) for each clip.
ii. Boundary Offsets
Purpose: Calculates the temporal boundaries (start and end times) of the relevant video
segments.
Components: Composed of neural networks that estimate the distance of the clip's
timestamp to the interval boundaries.
Output: A pair of values representing the distances to the start and end boundaries.
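The following is a minimal PyTorch sketch of the two output modules described above, applied per clip to the fused, query-conditioned features. The class name `OutputHeads` and the layer sizes are assumptions made for illustration.

```python
# A minimal sketch, under assumed layer sizes, of the foreground indicator and
# boundary-offset regressor described above.
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    def __init__(self, dim: int = 1024, hidden: int = 256):
        super().__init__()
        # Binary classifier: is this clip part of the queried foreground?
        self.foreground = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Regressor: distances from the clip's timestamp to the interval start/end
        self.offsets = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, fused):                                   # fused: (Lv, dim)
        f = torch.sigmoid(self.foreground(fused)).squeeze(-1)   # (Lv,) foreground prob
        b = self.offsets(fused)                                  # (Lv, 2) boundary offsets
        return f, b

heads = OutputHeads()
f, b = heads(torch.randn(75, 1024))
```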
3.3 UML Diagrams
State diagram:
The state diagram for "Smart Event Timestamping: Query-Driven Video
Temporal Grounding" illustrates the various states and transitions within the system. It
encapsulates the system's behavior and the conditions under which it transitions
between different states. By visually representing the system's dynamic behavior, the
state diagram provides insights into how the system responds to user inputs, processes
queries, searches for relevant video segments, and presents results. Through a series of
defined states and transitions, the diagram offers a comprehensive overview of the
system's operational flow, facilitating the understanding of its functionality and
behavior.
Sequence diagram:
A sequence diagram visually represents the interaction between various
components of the 'Smart Event Timestamping: Query-Driven Video Temporal
Grounding' system. It begins with the user inputting a query through the user interface.
The diagram then illustrates the sequential flow of events as the system interprets the
query, conducts searches for relevant video segments, detects events within these
segments, extracts timestamps, and finally presents the results back to the user. Each
step in the process is represented by a lifeline corresponding to the involved
components, and arrows indicate the messages exchanged between them. The sequence
diagram captures the flow of method calls, message exchanges, and data flows between
components, providing a clear visualization of the system's operational workflow.
3.4 Project Plan
Objective: Establish the project foundation and prepare the development environment.
Tasks:
• Conduct a project kickoff meeting to align the team on objectives, roles, and
responsibilities.
• Create a new Conda environment and install necessary dependencies from the
`requirements.txt`.
• Configure the development environment for GPU usage to ensure optimal
performance.
• Set up version control with Git and establish a repository for the project.
Objective: Organize the datasets and prepare the data for processing.
Tasks:
• Download and extract datasets including QVHighlights, Charades-STA,
TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS.
• Organize the extracted data into a structured directory format (e.g., metadata,
text clips, video clips, and video features).
• Implement scripts to preprocess and standardize the datasets.
• Verify the integrity and correctness of the datasets by performing initial
exploratory data analysis.
Objective: Set up and configure the model architecture and supporting tools.
Tasks:
• Define the model architecture and configure the parameters (e.g., model
version, output feature size, clip length).
• Implement argument parsing to handle command-line inputs and
configurations.
• Integrate necessary libraries and modules for model setup (e.g., PyTorch,
Gradio).
• Perform initial testing of the model configuration to ensure it is correctly set up.
Objective: Extract features from the datasets and initiate model training.
Tasks:
• Implement and test feature extraction functions for both video and text data.
• Extract features from the prepared datasets and store them in the designated
directories.
• Begin the initial phase of model training using the extracted features.
• Monitor training progress, evaluate model performance, and fine-tune
hyperparameters as necessary.
Objective: Evaluate and optimize the trained model for better performance.
Tasks:
Objective: Design and develop the user interface for the system.
Tasks:
• Design the layout and user interface components using Gradio.
• Implement functionalities for video input, feature extraction, and text query
submission.
• Ensure the interface is intuitive and easy to navigate for end-users.
• Integrate the model with the user interface to enable real-time processing and
feedback.
Objective: Integrate all components and conduct thorough testing of the system.
Tasks:
• Integrate the feature extraction, model, and user interface components into a
cohesive system.
• Perform end-to-end testing to ensure all components work seamlessly together.
• Identify and fix any bugs or issues that arise during integration testing.
• Conduct user acceptance testing with a small group of users to gather feedback
and make necessary adjustments.
Objective: Deploy the Smart Event Timestamping system and conclude the project.
Tasks:
• Prepare the deployment environment and configure server settings for hosting
the application.
• Deploy the system to a cloud platform or local server for public access.
• Provide documentation for system usage, including a user manual and technical
documentation.
• Conduct a final project review meeting to discuss achievements, challenges, and
future improvements.
CHAPTER 4
SYSTEM IMPLEMENTATION & METHODOLOGIES
4.1 System Implementation
With the increasing interest in sharing daily lives, video has emerged as the most
informative yet diverse visual form on social media. These videos are collected in a
variety of settings, including untrimmed instructional videos, and well-edited vlogs.
With massive scales and diverse video forms, automatically identifying relevant
moments based on user queries has become a critical capability in the industry for
efficient video browsing. This significant demand has given rise to a number of video
understanding tasks, including moment retrieval, highlight detection, and video
summarization. As depicted, moment retrieval tends to localize consecutive temporal
windows (interval-level) by giving natural sentences; highlight detection aims to pick
out the key segment with highest worthiness (curve-level) that best reflects the video
gist; video summarization collects a set of disjoint shots (point-level) to summarize the
video, with general or user-specific queries. Although task-specific datasets and models
have been developed, these tasks are typically studied separately. In general, they share
a common objective of grounding clips of various scales based on customized user
queries, which we refer to as Video Temporal Grounding (VTG).
4.2 Dataset Utilization
• TACoS: By utilizing TACoS, researchers can further validate the effectiveness of
their models in localizing and retrieving relevant moments from videos.
• Ego4D: This dataset focuses on moment retrieval tasks, particularly emphasizing
point-level annotations within videos. By incorporating Ego4D, researchers gain
insights into the fine-grained temporal dynamics within videos, allowing for more
nuanced analysis and evaluation of moment retrieval algorithms.
• YouTube Highlights: Tailored for highlight detection tasks, the YouTube
Highlights dataset comprises a diverse range of videos with annotated key
segments. Researchers utilize this dataset to develop algorithms capable of
automatically identifying and extracting salient highlights from videos, catering to
user preferences and interests.
• TVSum: Similar to YouTube Highlights, the TVSum dataset is utilized for
highlight detection tasks and features annotated key segments within videos. By
leveraging TVSum, researchers can further evaluate the performance of highlight
detection algorithms in identifying and summarizing important segments within
video content.
• QFVS: Serving as a dataset for video summarization tasks, QFVS provides
annotated summaries of video content. Researchers utilize QFVS to develop
algorithms capable of generating concise and informative summaries of videos
based on user queries or preferences, facilitating efficient video browsing and
content analysis.
4.3 Video Temporal Grounding
i. Moment Retrieval
Moment retrieval aims to localize target moments, i.e., one or many continuous
intervals within a video, given a language query, as shown in Fig. 4.1 (b). Previous
methods fall into two categories: proposal-based and proposal-free. Proposal-based
methods employ a two-stage process of scanning the entire video to generate candidate
proposals, which are then ranked based on how well they match the text query. In
contrast, proposal-free methods learn to regress the start and end boundaries directly
without requiring proposal candidates. Our VTG approach borrows from proposal-free
methods but extends them by incorporating diverse temporal labels and tasks with a
concise design.
iii. Summarization
Video summarization aims to summarize the whole video with a set of shots to
provide a quick overview, e.g., Fig. 4.1 (a). It takes two forms: generic video
summarization, which captures the important scenes using visual cues alone, and
query-focused video summarization, which allows users to customize the summary by
specifying text keywords (e.g., tree and cars). The latter is closer to practical usage,
hence we focus on it. Recently, IntentVizor proposed an interactive approach allowing
users to adjust their intents to obtain a better summary. In general, each of the three
tasks represents a specific form of VTG that grounds different scales of clips from
videos (e.g., a consecutive clip set, a single clip, or a disjoint clip set) given customized
text queries (e.g., sentences, titles, or keywords). However, previous methods address
only some of these subtasks. Based on this insight, our goal is to develop a unified
framework that handles all of them.
By extrapolating from existing elements, we can infer and predict unknown components
with a high degree of accuracy. This approach allows us to effectively handle various
VTG tasks and labels, ensuring comprehensive coverage and adaptability in addressing
user queries across diverse video content.
The temporal interval with specific target boundaries serves as a common label
for moment retrieval. However, annotating these intervals necessitates manual review
of the entire video, incurring significant cost. While Automated Speech Recognition
(ASR) can provide start and end timestamps, it often suffers from noise and poor
alignment with visual content, making it suboptimal. Alternatively, visual captions tend
to be descriptive and suitable as grounding queries. Leveraging this, VideoCC emerges
as a viable option, initially developed for video-level pretraining but explored here for
temporal grounding pretraining.
Once intervals are obtained, they are converted into the proposed formulation.
Clips not within the target interval are defined as $f_i = 0$ and $s_i = 0$, while those within
the target interval are assigned $f_i = 1$, with $s_i > 0$ assumed.
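This conversion can be pictured with the small sketch below, which turns an annotated interval into per-clip labels under the formulation above. The helper name, the clip length, and the constant saliency value assigned inside the interval are illustrative assumptions.

```python
# A small illustrative sketch (not the project's annotation code): map an
# annotated interval [t_start, t_end] to per-clip labels, with f_i = 0 and
# s_i = 0 outside the interval, f_i = 1 (and an assumed s_i > 0) inside it,
# and b_i the distances from the clip's centre to the two boundaries.
def interval_to_clip_labels(t_start, t_end, num_clips, clip_len=2.0, s_inside=1.0):
    labels = []
    for i in range(num_clips):
        t_i = (i + 0.5) * clip_len               # timestamp of clip i (its centre)
        inside = t_start <= t_i <= t_end
        f_i = 1 if inside else 0
        s_i = s_inside if inside else 0.0
        b_i = (t_i - t_start, t_end - t_i) if inside else None
        labels.append({"f": f_i, "s": s_i, "b": b_i})
    return labels

# e.g. a 20 s video of ten 2 s clips with a target interval of 6 s to 14 s
labels = interval_to_clip_labels(6.0, 14.0, num_clips=10)
```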
Due to their scalability, these point labels are suitable for large-scale pretraining. Recent
efforts have focused on leveraging point-wise annotations to enhance video-text
representation and augment natural language query (NLQ) baselines. However, these
methods primarily operate within the same domain.
4.6 Unified Model
4.6.1 Overview
As shown in Fig. 4.3, our model mainly comprises a frozen video encoder, a
frozen text encoder, and a multi-modal encoder. The video and text encoders are kept
consistent with Moment-DETR, which employs the concatenation of CLIP (ViT-B/32)
and SlowFast (R-50) features as the video representation and uses the CLIP text encoder
to extract token-level features. Our multi-modal encoder contains k self-attention blocks
followed by three task-specific heads that decode the predictions. Given an input video V
with $L_v$ clips and a language query Q with $L_q$ tokens, we first apply the video encoder
and the text encoder to encode the video and text respectively, then project them to the
same dimension D with two Feed-Forward Networks (FFN), obtaining video features
$V = \{v_i\}_{i=1}^{L_v} \in \mathbb{R}^{L_v \times D}$ and text features $Q = \{q_j\}_{j=1}^{L_q} \in \mathbb{R}^{L_q \times D}$. Next, we design
two pathways for cross-modal alignment and cross-modal interaction.
ii. For cross-modal interaction, learnable position embeddings $E_{\text{pos}}$ and modality-
type embeddings $E_{\text{type}}$ are added to each modality to retain both positional and
modality information:
$\tilde{V} = V + E_{\text{pos}}^{V} + E_{\text{type}}^{V}, \qquad \tilde{Q} = Q + E_{\text{pos}}^{T} + E_{\text{type}}^{T}$
Next, the text and video tokens are concatenated into a joint input
$Z^{0} = [\tilde{V}; \tilde{Q}] \in \mathbb{R}^{L \times D}$, where $L = L_v + L_q$. $Z^{0}$ is then fed into the multi-modal
encoder $E_m$, which contains k transformer layers, each consisting of a multi-headed
self-attention block and an FFN block.
From the encoder output $Z^{k} = [\tilde{V}^{k}; \tilde{Q}^{k}] \in \mathbb{R}^{L \times D}$ we take the video tokens
$\tilde{V}^{k} \in \mathbb{R}^{L_v \times D}$ and feed them into the following heads for prediction.
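As a rough illustration of this interaction pathway, the sketch below adds learnable position and modality-type embeddings, concatenates the video and text tokens into $Z^0$, and passes them through k self-attention layers. The dimensions, the class name `CrossModalEncoder`, and the use of `nn.TransformerEncoder` are assumptions for illustration only, not the project's exact modules.

```python
# A minimal sketch of the cross-modal interaction path with assumed dimensions.
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    def __init__(self, d=256, k=4, heads=8, max_len=512):
        super().__init__()
        self.pos = nn.Embedding(max_len, d)          # E_pos (shared, indexed by position)
        self.type = nn.Embedding(2, d)               # E_type: 0 = video, 1 = text
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=k)

    def forward(self, vid, txt):                     # vid: (B, Lv, d), txt: (B, Lq, d)
        Lv, Lq = vid.shape[1], txt.shape[1]
        vid = vid + self.pos(torch.arange(Lv)) + self.type(torch.zeros(Lv, dtype=torch.long))
        txt = txt + self.pos(torch.arange(Lq)) + self.type(torch.ones(Lq, dtype=torch.long))
        z0 = torch.cat([vid, txt], dim=1)            # Z^0 = [V~; Q~], shape (B, Lv+Lq, d)
        zk = self.encoder(z0)                        # k self-attention + FFN layers
        return zk[:, :Lv], zk[:, Lv:]                # video tokens V~^k, text tokens Q~^k

enc = CrossModalEncoder()
v_k, q_k = enc(torch.randn(1, 75, 256), torch.randn(1, 12, 256))
```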
Notably, the boundary-offset regression objective is only applied to foreground clips,
i.e., those with $f_i = 1$. The total training objective is therefore the combination of each
head's loss over all clips in the training set.
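A hedged sketch of how such a combined objective might look is given below. The specific loss functions (binary cross-entropy, L1, and a margin-based saliency term) and the weights are assumptions made for illustration, not the exact losses used in training.

```python
# Illustrative combination of the three head losses; assumes f_pred holds
# per-clip foreground probabilities, b_pred the (Lv, 2) boundary offsets, and
# s_pred the saliency scores, with matching ground-truth tensors.
import torch
import torch.nn.functional as F

def total_loss(f_pred, b_pred, s_pred, f_gt, b_gt, s_gt, w_f=1.0, w_b=1.0, w_s=1.0):
    # Foreground indicator: per-clip binary classification
    loss_f = F.binary_cross_entropy(f_pred, f_gt.float())

    # Boundary offsets: regress only where the clip is foreground (f_i = 1)
    fg = f_gt.bool()
    loss_b = F.l1_loss(b_pred[fg], b_gt[fg]) if fg.any() else b_pred.sum() * 0.0

    # Saliency: encourage foreground clips to score higher than background ones
    margin = 0.2
    pos = s_pred[fg].mean() if fg.any() else s_pred.sum() * 0.0
    neg = s_pred[~fg].mean() if (~fg).any() else s_pred.sum() * 0.0
    loss_s = F.relu(margin + neg - pos)

    return w_f * loss_f + w_b * loss_b + w_s * loss_s
```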
4.7 Inference
During inference, given a video V and a language query Q, we feed them forward
through the model to obtain $\{\tilde{f}_i, \tilde{b}_i, \tilde{s}_i\}_{i=1}^{L_v}$ for each clip $v_i$ from the three heads.
We then describe how the output is produced for each individual VTG task.
Moment Retrieval:
We rank the clips' predicted boundaries $\{\tilde{b}_i\}_{i=1}^{L_v}$ by their foreground probabilities
$\{\tilde{f}_i\}_{i=1}^{L_v}$. Since the $L_v$ predicted boundaries are dense, we apply a 1-D Non-Maximum
Suppression (NMS) with a threshold of 0.7 to remove highly overlapping boundaries,
yielding the final prediction.
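A minimal sketch of the 1-D NMS step is shown below. The function names and the interval-IoU helper are illustrative, but the procedure (keep the best-scoring window, suppress remaining windows whose overlap with a kept window exceeds the threshold) follows the description above.

```python
# Minimal 1-D NMS over (start, end) windows with associated confidence scores.
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])      # hull == union for overlapping intervals
    return inter / union if union > 0 else 0.0

def nms_1d(windows, scores, iou_threshold=0.7):
    order = sorted(range(len(windows)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep this window only if it does not overlap too much with any kept one
        if all(temporal_iou(windows[i], windows[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return [windows[i] for i in keep]

# e.g. two heavily overlapping candidates collapse to one prediction
print(nms_1d([(10, 20), (11, 21), (40, 50)], [0.9, 0.8, 0.7]))
```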
Highlight Detection:
To fully utilize the foreground and saliency terms, we rank all clips by their
$\{\tilde{f}_i + \tilde{s}_i\}_{i=1}^{L_v}$ scores and return the top few clips (e.g., Top-1) as predictions.
Video Summarization:
Using the same preprocessing settings, the video is first divided into multiple
segments via the KTS algorithm. The clip scores within each segment are then computed
and integrated. We rank all clips by their foreground scores $\{\tilde{f}_i\}_{i=1}^{L_v}$ and return the
Top-2% of clips as the video summary.
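The two ranking rules can be sketched as follows; KTS segmentation is omitted, and the function names are illustrative rather than the project's own.

```python
# Highlight detection: Top-1 clip by f_i + s_i.
# Summarization: Top-2% of clips by f_i.
import torch

def top1_highlight(f, s, clip_len=2.0):
    idx = torch.argmax(f + s).item()
    return idx * clip_len                      # start time (seconds) of the highlight clip

def summary_clips(f, ratio=0.02):
    k = max(1, int(round(ratio * f.numel())))
    return torch.topk(f, k).indices.tolist()   # indices of clips kept in the summary

f = torch.rand(300)   # e.g. foreground scores for a 10-minute video of 2 s clips
s = torch.rand(300)
print(top1_highlight(f, s), summary_clips(f))
```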
4.8 Integration and Deployment
Step 1: Setting Up the Environment
Once the environment is created, activate it and proceed to install the necessary
Python packages. These packages are listed in a file named requirements.txt. Installing
these packages will set up all the dependencies required to run Smart Event
Timestamping, including libraries for machine learning, data processing, and other
utilities.
This step is crucial because it isolates the project's dependencies from the system's
global environment, reducing the risk of version conflicts and ensuring that the project
uses the exact versions of libraries it was developed and tested with. This isolation
makes the development process more manageable and helps avoid common pitfalls
associated with dependency management.
Step 2: Preparing the Data
The next step involves preparing the dataset required for Smart Event
Timestamping. Begin by unzipping the downloaded tar file containing the dataset. After
extracting the files, move the data into the appropriate directories as specified by the
Smart Event Timestamping structure. This step is crucial for ensuring that the program
can locate and access the data efficiently.
For those using VideoCC Slowfast features, additional steps are required. You
need to group multiple sub-zips into a single file and then unzip it to prepare the features
for use. This process consolidates the data into a format that Smart Event Timestamping
can handle effectively.
The organization of data into a specific directory structure allows the system to
easily access and manage the different components of the dataset. Proper data
organization is vital for efficient data handling, ensuring that all necessary files are
available in the expected locations when the system is executed.
Beyond the initial setup, you may need to install additional packages to fully prepare
the environment for executing Smart Event Timestamping. These packages might not
be covered by requirements.txt but are necessary for specific functionalities within the
system.
A function to load the model should be implemented, ensuring that the model is
set up with the appropriate configurations and is ready to process data. This function
typically involves setting up logging for monitoring the process, configuring CUDA
settings for GPU acceleration, and initializing the model with the specified parameters.
The model configuration includes defining the architecture and parameters that
the model will use to process video and text data. This setup is crucial for ensuring that
the model operates correctly and efficiently, providing accurate results based on the
input data.
Processing the data also involves creating masks and timestamps, which are
used by the model to understand the temporal aspects of the video data. Properly
loading and processing the data is crucial for accurate and efficient model performance.
Normalization and preparation of data ensure that the model receives input in a
standardized format, reducing variability and improving the reliability of the model's
outputs. This step is critical for achieving high accuracy in the model's predictions.
Similarly, text feature extraction involves processing the text query to generate
feature vectors that the model can compare with video features. This allows the system
to understand the query and identify relevant video segments.
Feature extraction is a critical step as it converts raw video and text data into a
form that the model can interpret and analyze. This step involves using pre-trained
models and algorithms to extract meaningful representations from the data, which are
then used for querying and analysis.
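As an illustration of this step on the text side, the sketch below extracts and normalizes a CLIP query embedding with the OpenAI CLIP package. It assumes the ViT-B/32 checkpoint used elsewhere in this report, and the helper name `extract_text_features` is ours; the project's own helpers (vid2clip and txt2clip in the appendix) wrap similar calls.

```python
# Hedged sketch of query-side feature extraction with OpenAI CLIP (ViT-B/32).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_text_features(query: str) -> torch.Tensor:
    tokens = clip.tokenize([query]).to(device)       # tokenize the free-form query
    with torch.no_grad():
        feats = model.encode_text(tokens)            # (1, 512) text embedding
    return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize, as in load_data()

q_feat = extract_text_features("a girl opening and reading a book")
```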
The system processes the user input, extracts necessary features, and uses the
model to find relevant video segments based on the query. The results are then presented
to the user, including the top intervals and highlights identified by the model.
The layout is structured to provide clear sections for video input, feature extraction,
and query interaction, making it easy to navigate and utilize the system's capabilities.
When a video is uploaded or a YouTube link is provided, the video will appear
in the placeholder, ready for feature extraction and analysis. This visual feedback
ensures that users can verify the correct video has been selected before proceeding with
further actions.
The button's placement and clear labeling ensure that users can easily find and
use it, facilitating a smooth workflow from video input to feature extraction. This step
is essential for preparing the video data for subsequent queries, making the feature
extraction button a pivotal element of the interface.
Below the text field, there are two buttons: "Submit" and "Clear."
Submit Button: After entering their query into the text field, users can click the
"Submit" button to send their query to the system. The system then processes the
request and provides relevant results, which are displayed within the chatbot
conversation window. This interaction allows users to receive detailed and specific
information about the video based on their queries.
Clear Button: The "Clear" button is provided to erase the contents of the text field.
This is particularly useful if the user wishes to rephrase their question or start a new
query without any remnants of previous input. The clear functionality ensures that the
input area is reset quickly, maintaining a clean and organized interaction space.
The interface uses simple, intuitive controls with clear labels, making it
accessible to users with varying levels of technical expertise. The use of familiar web
elements like buttons, text fields, and chat windows ensures that users can quickly
understand and navigate the application. By placing the video input section on the left
and the chatbot interaction area on the right, the layout utilizes the screen space
effectively, providing clear and distinct areas for different tasks. This separation helps
users focus on one task at a time without feeling overwhelmed.
CHAPTER 5
TESTING AND RESULTS
5.1 Evaluation Metrics
The effectiveness of the methods is evaluated across four video temporal
grounding (VTG) tasks using seven datasets. For joint moment retrieval and highlight
detection tasks, the QVHighlights dataset is utilized. The evaluation metrics include
Recall@1 with IoU thresholds of 0.5 and 0.7, mean average precision (mAP) with IoU
thresholds of 0.5 and 0.75, and the average mAP over a series of IoU thresholds ranging
from 0.5 to 0.95. For highlight detection, metrics such as mAP and HIT@1 are
employed, considering a clip as a true positive if it has a saliency score of "Very Good."
For the moment retrieval task, datasets such as Charades-STA, NLQ, and
TACoS are assessed using Recall@1 with IoU thresholds of 0.3, 0.5, and 0.7, as well
as mIoU. For highlight detection tasks on YouTube Highlights and TVSum, mAP and
Top-5 mAP metrics are used, respectively. Lastly, for the video summarization task on
the QFVS dataset, the evaluation is based on the F1-score per video as well as the
average F1-score.
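For reference, Recall@1 at a given IoU threshold can be computed as in the short sketch below. This is illustrative only; the reported numbers come from the standard evaluation code of each benchmark.

```python
# Recall@1 at an IoU threshold: the fraction of queries whose top-ranked
# predicted window overlaps the ground-truth window by at least the threshold.
def iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, threshold=0.5):
    hits = sum(1 for p, g in zip(top1_preds, gts) if iou(p, g) >= threshold)
    return hits / len(gts)

# e.g. two queries, one correct at IoU 0.5 -> R@1 = 0.5
print(recall_at_1([(8.0, 16.0), (30.0, 35.0)], [(9.0, 15.0), (50.0, 60.0)], 0.5))
```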
Table 5.1 Dataset Statistics
All compared baselines and Smart Event Timestamping utilize the same video and text
features. Results for highlight detection and video summarization are reported according
to established benchmarks, ensuring consistency in evaluation.
5.2 Comparison with the State of the Art
i. Joint Moment Retrieval and Highlight Detection
The Smart Event Timestamping system is evaluated on the QVHighlights test
split, showing comparable performance to Moment-DETR and UMT without
pretraining, demonstrating its superior design for joint task optimization. With large-
scale pretraining, it exhibits significant improvements across all metrics, such as an
increase of +8.16 in Avg. mAP and +5.32 in HIT@1, surpassing all baselines by a
substantial margin. Notably, even with the introduction of audio modality and ASR
pretraining by UMT, it outperforms UMT by an Avg. mAP of 5.55 and HIT@1 of 3.89.
Furthermore, its large-scale pretraining allows for effective zero-shot grounding,
outperforming several supervised baselines without any training samples.
5.3 Test Cases
i. Test Case I
Objective: Verify the system's ability to identify the specific interval and highlight
where a buffet tray with Mexican hamburgers is seen.
Description: This test case evaluates the system's performance in identifying a specific
interval and highlight within a 9-minute video where a buffet tray with Mexican
hamburgers is observed.
ii. Test Case II
Objective: Verify the system's ability to identify the specific interval and highlight
where a ship is seen in the water.
Description: This test case evaluates the system's performance in identifying a specific
interval and highlight within a 9-minute video where a ship is observed in the water.
iii. Test Case III
Objective: Verify the system's ability to identify the specific interval and highlight
where the countries Sweden and Norway are labeled on a map.
Description: This test case evaluates the system's performance in identifying a specific
interval and highlight within a video where the countries Sweden and Norway are
observed on a map.
iv. Test Case IV
Objective: Verify the system's ability to identify the specific interval and highlight
where a girl is seen opening and reading a book.
Description: This test case evaluates the system's performance in identifying a specific
interval and highlight within a 9-minute video where a girl is observed opening and
reading a book.
v. Test Case V
Objective: Verify the system's ability to identify the specific interval and highlight
where a girl is seen coming inside a room.
Description: This test case evaluates the performance in identifying a specific interval
and highlight within a 30-second video where a girl is observed coming inside a room.
Test Results
Table 5.2 Results

| Test Case | Objective | Expected Output | Our Output | Pass/Fail |
|-----------|-----------|-----------------|------------|-----------|
| I | To identify the specific interval and highlight where a buffet tray with Mexican hamburgers is seen. | Interval: "8:44 to 8:52 min"; Highlight: "8:46 min" | Interval: "8:44 to 8:52 min"; Highlight: "8:46 min" | Pass |
| II | To identify the specific interval and highlight where a ship is seen in the water. | Interval: "0:45 to 0:57 sec"; Highlight: "0:54 sec" | Interval: "0:45 to 0:57 sec"; Highlight: "0:54 sec" | Pass |
| III | To identify the specific interval and highlight where the countries Sweden and Norway are labeled on a map in a video. | Interval: "5:40 to 5:59 min"; Highlight: "2:12 min" | Interval: "5:40 to 5:59 min"; Highlight: "2:12 min" | Pass |
| IV | To identify the specific interval and highlight where a girl is seen opening and reading a book. | Interval: "7:35 to 7:40 min"; Highlight: "7:38 min" | Interval: "7:35 to 7:40 min"; Highlight: "7:38 min" | Pass |
| V | To identify the specific interval and highlight where a girl is seen coming inside a room. | Interval: "0:00 to 0:06 min"; Highlight: "0:00 min" | Interval: "0:00 to 0:07 min"; Highlight: "0:00 min" | Pass |
CHAPTER 6
CONCLUSION AND FUTURE ENHANCEMENTS
The development of this web-based application marks a significant milestone
in the field of video analysis and user interaction. The system's ability to process video
content, extract relevant features, and respond to user queries with precise intervals and
highlights demonstrates the potential of integrating advanced machine learning models
with practical user interfaces.
The integration of video feature extraction with natural language processing has
yielded a highly user-friendly web application, facilitating seamless video analysis
tasks across various domains. By enabling easy video upload, precise content
extraction, and efficient navigation, the system has demonstrated tangible
advancements in enhancing digital content accessibility and engagement. With a
practical interface and effective machine learning integration, it has significantly
boosted productivity and streamlined processes in fields ranging from media production
to education.
Future Enhancements
As we look ahead, there are several exciting enhancements that can be made to
the project to expand its capabilities, improve its accuracy, and provide an even more
seamless user experience.
1. Multi-Language Support:
Currently, the system primarily supports queries in English. Introducing multi-language
support would make the system accessible to a broader audience. Leveraging advanced
natural language processing models capable of understanding and processing multiple
languages can cater to users worldwide. This enhancement would require integrating
translation services or multi-language NLP models, ensuring the system maintains its
accuracy and efficiency across various languages.
4. Offline Capabilities:
Developing offline capabilities would ensure that users can use the system even without
an internet connection. This would be particularly useful for users in remote locations
or with limited internet access. Offline mode could involve downloading essential
components of the system and allowing local video analysis, with results synchronized
once the connection is restored.
By focusing on these future enhancements, the project can evolve into a more
robust, versatile, and user-friendly tool, catering to a wide range of applications and
user needs.
REFERENCES
[1] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and
Bryan Russell. Localizing moments in video with natural language. In ICCV, pages
5803–5812, 2017.
[2] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity
localization via language query. In ICCV, pages 5267–5275, 2017.
[3] Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. Temporally
grounding natural sentence in video. In EMNLP, pages 162–171, 2018.
[4] Humam Alwassel, Silvio Giancola, and Bernard Ghanem. Tsp: Temporally-sensitive
pretraining of video encoders for localization tasks. In ICCV, pages 3173–3183, 2021.
[5] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast
networks for video recognition. In ICCV, pages 6202–6211, 2019.
[6] Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, and Bryan Russell.
Temporal localization of moments in video collections with natural language. arXiv
preprint arXiv:1907.12763, 2019.
[7] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles.
Activitynet: A large-scale video benchmark for human activity understanding. In
CVPR, pages 961–970, 2015.
[8] Max Bain, Arsha Nagrani, Gul Varol, and Andrew Zisserman. Frozen in time: A
joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738,
2021.
[9] Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, and Yuexian Zou.
Locvtp: Video-text pretraining for temporal localization. In ECCV, pages 38–56, 2022.
[10] Taivanbat Badamdorj, Mrigank Rochan, Yang Wang, and Li Cheng. Joint visual
and audio learning for video highlight detection. In ICCV, pages 8127–8137, 2021.
[11] Soham Ghosh, Anuva Agarwal, Zarana Parekh, and Alexander G. Hauptmann.
Excl: Extractive clip localization using natural language descriptions. In NAACL-HLT,
pages 1984–1990, 2019.
APPENDIX
File name: main.py

import os
import pdb
import time
import argparse
import subprocess

import torch
import gradio as gr
import numpy as np

from run_on_video import clip, vid2clip, txt2clip

# Command-line arguments
parser = argparse.ArgumentParser(description='')
parser.add_argument('--save_dir', type=str, default='./tmp')
parser.add_argument('--resume', type=str, default='./results/omni/model_best.ckpt')
parser.add_argument("--gpu_id", type=int, default=2)
args = parser.parse_args()
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id)

#################################
# CLIP feature-extraction settings
model_version = "ViT-B/32"
output_feat_size = 512
clip_len = 2
overwrite = True
num_decoding_thread = 4
half_precision = False

clip_model, _ = clip.load(model_version, device=args.gpu_id, jit=False)

import logging
import torch.backends.cudnn as cudnn
from main.config import TestOptions, setup_model
from utils.basic_utils import l2_normalize_np_array

logger = logging.getLogger(__name__)
logging.basicConfig(
    format="%(asctime)s.%(msecs)03d:%(levelname)s:%(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO)


def load_model():
    # Parse the test options and prepare CUDA settings before building the model
    logger.info("Setup config, data and model...")
    opt = TestOptions().parse(args)
    # pdb.set_trace()
    cudnn.benchmark = True
    cudnn.deterministic = False
    if opt.lr_warmup > 0:
        total_steps = opt.n_epoch
        warmup_steps = opt.lr_warmup if opt.lr_warmup > 1 else int(opt.lr_warmup * total_steps)
        opt.lr_warmup = [warmup_steps, total_steps]
    # (the original source continues here by constructing the model with
    #  setup_model(opt), loading the --resume checkpoint, and returning the model)


vtg_model = load_model()


def convert_to_hms(seconds):
    # Convert a number of seconds into an HH:MM:SS string
    return time.strftime('%H:%M:%S', time.gmtime(seconds))


def load_data(save_dir):
    # Load the cached video/text features, L2-normalize them, and append
    # temporal endpoint features to each clip
    vid = np.load(os.path.join(save_dir, 'vid.npz'))['features'].astype(np.float32)
    txt = np.load(os.path.join(save_dir, 'txt.npz'))['features'].astype(np.float32)

    vid = torch.from_numpy(l2_normalize_np_array(vid))
    txt = torch.from_numpy(l2_normalize_np_array(txt))
    clip_len = 2
    ctx_l = vid.shape[0]

    if True:
        tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l
        tef_ed = tef_st + 1.0 / ctx_l
        tef = torch.stack([tef_st, tef_ed], dim=1)  # (Lv, 2)
        vid = torch.cat([vid, tef], dim=1)          # (Lv, Dv+2)

    src_vid = vid.unsqueeze(0).cuda()
    src_txt = txt.unsqueeze(0).cuda()
    src_vid_mask = torch.ones(src_vid.shape[0], src_vid.shape[1]).cuda()
    src_txt_mask = torch.ones(src_txt.shape[0], src_txt.shape[1]).cuda()
    return src_vid, src_txt, src_vid_mask, src_txt_mask


def forward(model, save_dir, query):
    # Run the grounding model on the cached features for one video/query pair
    # (signature inferred from the call site in submit_message below)
    src_vid, src_txt, src_vid_mask, src_txt_mask = load_data(save_dir)
    src_vid = src_vid.cuda(args.gpu_id)
    src_txt = src_txt.cuda(args.gpu_id)
    src_vid_mask = src_vid_mask.cuda(args.gpu_id)
    src_txt_mask = src_txt_mask.cuda(args.gpu_id)

    model.eval()
    with torch.no_grad():
        output = model(src_vid=src_vid, src_txt=src_txt,
                       src_vid_mask=src_vid_mask, src_txt_mask=src_txt_mask)

    # grounding
    # (pred_windows, pred_confidence, pred_saliency and the q_response /
    #  mr_response strings are derived from `output` and the query in the
    #  original source; that post-processing is not included in this listing)
    top1_window = pred_windows[torch.argmax(pred_confidence)].tolist()
    top5_values, top5_indices = torch.topk(pred_confidence.flatten(), k=5)
    top5_windows = pred_windows[top5_indices].tolist()

    hl_res = convert_to_hms(torch.argmax(pred_saliency) * clip_len)
    hl_response = f"The Top-1 highlight is: {hl_res}"
    return '\n'.join([q_response, mr_response, hl_response])


def extract_txt(txt):
    # Extract CLIP text features for the query and cache them in save_dir
    txt_features = txt2clip(clip_model, txt, args.save_dir)
    return


def download_video(vid):
    # Download a YouTube video by id and return the local path
    # (signature inferred from the call in subvid_fn; `save_path` and the
    #  download command `cmd` are built earlier in the original source)
    if not os.path.exists(save_path):
        try:
            subprocess.call(cmd, shell=True)
        except:
            return None
    return save_path


def get_empty_state():
    return {"total_tokens": 0, "messages": []}


def submit_message(prompt, state):
    # Chat handler: run the text query through the grounding model and append
    # the answer to the conversation history
    # (function name and signature reconstructed from the surrounding code)
    history = state['messages']
    prompt_msg = {"role": "user", "content": prompt}

    if not prompt:
        return gr.update(value=''), [(history[i]['content'], history[i+1]['content'])
                                     for i in range(0, len(history)-1, 2)], state
    try:
        history.append(prompt_msg)
        # answer = vlogger.chat2video(prompt)
        # answer = prompt
        extract_txt(prompt)
        answer = forward(vtg_model, args.save_dir, prompt)
        history.append({"role": "system", "content": answer})
    except Exception as e:
        history.append(prompt_msg)
        history.append({
            "role": "system",
            "content": f"Error: {e}"
        })

    chat_messages = [(history[i]['content'], history[i+1]['content'])
                     for i in range(0, len(history)-1, 2)]
    return '', chat_messages, state


def clear_conversation():
    return (gr.update(value=None, visible=True),
            gr.update(value=None, interactive=True),
            None,
            gr.update(value=None, visible=True),
            get_empty_state())


def subvid_fn(vid):
    save_path = download_video(vid)
    return gr.update(value=save_path)


css = """
#col-container {max-width: 80%; margin-left: auto; margin-right: auto;}
#video_inp {min-height: 100px}
#chatbox {min-height: 100px;}
#header {text-align: center;}
#hint {font-size: 1.0em; padding: 0.5em; margin: 0;}
.message { font-size: 1.2em; }
"""

# (the gr.Blocks context is reconstructed; it is omitted in the extracted listing)
with gr.Blocks(css=css) as demo:
    state = gr.State(get_empty_state())
    with gr.Column(elem_id="col-container"):
        gr.Markdown("""## 🤖 Smart Event Timestamping: Query-Driven Video Temporal Grounding
                    Given a video and a text query, return the relevant window and highlight.""",
                    elem_id="header")
        with gr.Row():
            with gr.Column():
                video_inp = gr.Video(label="video_input")
                gr.Markdown("👋 **Step 1**: Select a video in Examples (bottom) or input a "
                            "YouTube video_id in this textbox, *e.g.* *G7zJK6lcbyU* for "
                            "https://www.youtube.com/watch?v=G7zJK6lcbyU",
                            elem_id="hint")
                with gr.Row():
                    video_id = gr.Textbox(value="", placeholder="Youtube video url",
                                          show_label=False)
                    vidsub_btn = gr.Button("(Optional) Submit Youtube id")
            with gr.Column():
                total_tokens_str = gr.Markdown(elem_id="total_tokens_str")
                chatbot = gr.Chatbot(elem_id="chatbox")
                input_message = gr.Textbox(show_label=False,
                                           placeholder="Enter text query and press enter",
                                           visible=True).style(container=False)
                btn_submit = gr.Button("Step 3: Enter your text query")
                btn_clear_conversation = gr.Button("🔃 Clear")
        examples = gr.Examples(
            examples=[
                ["./examples/charades.mp4"],
            ],
            inputs=[video_inp],
        )

# (event bindings for btn_submit, btn_clear_conversation and vidsub_btn are
#  defined in the original source)
demo.load(queue=False)
demo.queue(concurrency_count=10)
demo.launch(height='800px', server_port=2253, debug=True, share=True)