
A

Project Report
on

SMART EVENT TIMESTAMPING: QUERY-DRIVEN


VIDEO TEMPORAL GROUNDING

Submitted in partial fulfilment of the requirements for the award of the degree
of

BACHELOR OF ENGINEERING

In

COMPUTER SCIENCE AND ENGINEERING

By

Mr. D Mohammad Saquib (2451-20-733-132)


Mr. Surya Danturty (2451-20-733-163)

Under the guidance of

Mr. K. Murali Krishna


Assistant Professor
Department of CSE

MATURI VENKATA SUBBA RAO (MVSR) ENGINEERING COLLEGE


Department of Computer Science and Engineering
(Affiliated to Osmania University & Recognized by AICTE)
Nadergul, Saroor Nagar Mandal, Hyderabad – 501 510
Academic Year: 2023-24
Maturi Venkata Subba Rao Engineering College
(Affiliated to Osmania University, Hyderabad)
Nadergul(V), Hyderabad-501510

Certificate
This is to certify that the project work entitled “Smart Event Timestamping: Query-
Driven Video Temporal Grounding” is a bonafide work carried out by
Mr. D Mohammad Saquib (2451-20-733-132) and Mr. Surya Danturty (2451-20-733-
163) in partial fulfilment of the requirements for the award of the degree of Bachelor of
Engineering in Computer Science and Engineering from Maturi Venkata Subba Rao
(MVSR) Engineering College, affiliated to OSMANIA UNIVERSITY, Hyderabad,
during the Academic Year 2023-24 under our guidance and supervision.

The results embodied in this report have not been submitted to any other university or
institute for the award of any degree or diploma to the best of our knowledge and belief.

Internal Guide:                              Head of the Department:
Mr. K Murali Krishna                         Prof. J Prasanna Kumar
Assistant Professor                          Professor
Department of CSE, MVSREC                    Department of CSE, MVSREC

External Examiner

DECLARATION

This is to certify that the work reported in the present project entitled “Smart Event
Timestamping: Query-Driven Video Temporal Grounding” is a record of
bonafide work done by us in the Department of Computer Science and Engineering,
Maturi Venkata Subba Rao (MVSR) Engineering College, Osmania University
during the Academic Year 2023-24. The reports are based on the project work done
entirely by us and not copied from any other source. The results embodied in this
project report have not been submitted to any other University or Institute for the
award of any degree or diploma.

Mr. D Mohammad Saquib Mr. Surya Danturty


2451-20-733-132 2451-20-733-163

ACKNOWLEDGEMENTS

We would like to express our sincere gratitude and indebtedness to our project guide
Mr. K Murali Krishna for his valuable suggestions and interest throughout the course
of this project.

We are also thankful to our principal Dr. Vijaya Gunturu and Mr. J Prasanna
Kumar, Professor and Head, Department of Computer Science and Engineering,
Maturi Venkata Subba Rao Engineering College, Hyderabad for providing excellent
infrastructure for completing this project successfully as a part of our B.E. Degree
(CSE). We would like to thank our project coordinator for his constant monitoring,
guidance and support.

We convey our heartfelt thanks to the lab staff for allowing us to use the required
equipment whenever needed. We sincerely acknowledge and thank all those who gave
directly or indirectly their support in the completion of this work.

Mr. D Mohammad Saquib (2451-20-733-132)


Mr. Surya Danturty (2451-20-733-163)

VISION
• To impart technical education of the highest standards, producing competent

and confident engineers with an ability to use computer science knowledge to


solve societal problems.
MISSION
• To make the learning process exciting, stimulating and interesting.
• To impart adequate fundamental knowledge and soft skills to students.
• To expose students to advanced computer technologies in order to excel in
engineering practices by bringing out the creativity in students.
• To develop economically feasible and socially acceptable software.
PEOs:

PEO-1: Achieve recognition through demonstration of technical competence for
successful execution of software projects to meet customer business objectives.
PEO-2: Practice life-long learning by pursuing professional certifications, higher
education or research in the emerging areas of information processing and intelligent
systems at a global level.
PEO-3: Contribute to society by understanding the impact of computing using a
multidisciplinary and ethical approach.

PROGRAM OUTCOMES (POs)


At the end of the program the students (Engineering Graduates) will be able to:
1. Engineering knowledge: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialisation for the solution of
complex engineering problems.
2. Problem analysis: Identify, formulate, research literature, and analyse complex
engineering problems reaching substantiated conclusions using first principles
of mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified
needs with appropriate consideration for public health and safety, and cultural,
societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge
and research methods including design of experiments, analysis and
interpretation of data, and synthesis of the information to provide valid
conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and
modelling to complex engineering activities with an understanding of the
limitations.
6. The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal, and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate
the knowledge of, and the need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
9. Individual and teamwork: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities
with the engineering community and with the society at large, such as being
able to comprehend and write effective reports and design documentation, make
effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and
understanding of the engineering and management principles and apply these to
one’s work, as a member and leader in a team, to manage projects and in
multidisciplinary environments.
12. Lifelong learning: Recognise the need for, and have the preparation and ability
to engage in independent and life-long learning in the broadest context of
technological change.

PROGRAM SPECIFIC OUTCOMES (PSOs)


13. (PSO-1) Demonstrate competence to build effective solutions for computational
real-world problems using software and hardware across multi-disciplinary
domains.
14. (PSO-2) Adapt to current computing trends for meeting the industrial and societal
needs through a holistic professional development leading to pioneering careers
or entrepreneurship.

COURSE OBJECTIVES AND OUTCOMES

Course Code: PW 861 CS


Course Objectives:
• To enhance practical and professional skills.
• To familiarize students with tools and techniques of systematic literature survey and
documentation.
• To expose the students to industry practices and teamwork.
• To encourage students to work with innovative and entrepreneurial ideas.

Course Outcomes:
CO1: Summarize the survey of the recent advancements to infer the problem
statements with applications towards society.
CO2: Design a software-based solution within the scope of the project.
CO3: Implement, test and deploy using contemporary technologies and tools.
CO4: Demonstrate qualities necessary for working in a team.
CO5: Generate a suitable technical document for the project.

ABSTRACT

The widespread availability of video content on social media platforms poses a


significant challenge for users seeking to find relevant segments that match their
interests. To tackle this issue, we introduce a project focused on Smart Event
Timestamping: Query-Driven Video Temporal Grounding (VTG), an innovative
solution designed to enhance the video browsing experience by accurately aligning
video segments with custom language queries. This project aims to unify various VTG
tasks, such as moment retrieval, highlight detection, and video summarization, within
a single flexible framework.

Our approach redefines the VTG landscape by integrating diverse labels and tasks
into a unified formulation, enabling robust model training across different VTG
applications. We utilize progressive data annotation techniques and scalable pseudo-
supervision methods to create extensive and versatile training datasets. This strategy
allows our models to generalize effectively across multiple VTG tasks, enhancing their
capability to handle various query types and video content.

The project leverages an effective and flexible grounding model capable of


addressing each VTG task, making full use of the unified labels. The model's
architecture is designed to harness large-scale, diverse datasets, facilitating advanced
features such as zero-shot grounding, where the model can identify relevant video
segments without prior exposure to specific query types.

To make this technology accessible and user-friendly, we are developing a dynamic


website that allows users to input their queries and seamlessly search through vast
amounts of video content. The system's advanced capabilities ensure precise
identification of specific moments corresponding to the queries, delivering a highly
personalized and efficient browsing experience.

Extensive experiments on three key VTG tasks across seven diverse datasets,
including QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights,
TVSum, and QFVS, validate the effectiveness and flexibility of our framework. Our
project aims to revolutionize video content accessibility, providing users with a
powerful tool to navigate and discover relevant video segments effortlessly.

TABLE OF CONTENTS

PAGE NOS.
Certificate .......................................................................................................... i
Declaration ........................................................................................................ ii
Acknowledgements ........................................................................................... iii
Vision & Mission, PEOs, POs and PSOs .......................................................... iv
Abstract.............................................................................................................. vii
Table of contents................................................................................................ viii
List of Figures ................................................................................................... x
List of Tables .................................................................................... x

CONTENTS

CHAPTER I
1. INTRODUCTION 01 – 05
1.1 PROBLEM STATEMENT 02
1.2 OBJECTIVE 02
1.3 MOTIVATION 02
1.4 SCOPE OF SMART EVENT TIMESTAMPING 03
1.5 SOFTWARE AND HARDWARE REQUIREMENTS 04

CHAPTER II
2. LITERATURE SURVEY 06 – 07
2.1 SURVEY OF SMART EVENT TIMESTAMPING 06

CHAPTER III
3. SYSTEM DESIGN 08 – 16
3.1 FLOW CHARTS 08
3.2 SYSTEM ARCHITECTURE 08
3.3 UML DIAGRAMS 11
3.4 PROJECT PLAN 13

CHAPTER IV
4. SYSTEM IMPLEMENTATION & METHODOLOGIES 17 – 35
4.1 SYSTEM IMPLEMENTATION 17
4.2 DATASET UTILIZATION 18
4.3 VIDEO TEMPORAL GROUNDING 20
4.4 VISION LANGUAGE PRE-TRAINING 21
4.5 UNIFIED FORMULATION 22
4.6 UNIFIED MODEL 26
4.7 INFERENCE 29
4.8 INTEGRATION AND DEPLOYMENT 29
4.9 USER INTERFACE 33

CHAPTER V
5. TESTING AND RESULTS 36 – 44
5.1 EVALUATION METRICS 36
5.2 COMPARISON 37
5.3 TEST CASES 38

CHAPTER VI 45 – 46
6. CONCLUSION & FUTURE ENHANCEMENTS 45

REFERENCES 47
APPENDIX 48

LIST OF FIGURES

Figure No. Figure Name Page No.


Fig 3.1 Data Flow 08
Fig 3.2 System Architecture 09
Fig 3.3 Use Case Diagram 11
Fig 3.4 State Diagram 12
Fig 3.5 Sequence Diagram 13
Fig 4.1 Video Temporal Grounding 20
Fig 4.2 Unified Formulation 22
Fig 4.3 VTG Model 23
Fig 4.4 Process Of Using CLIP To Produce Labels 24
Fig 4.5 User Interface 33
Fig 5.1 Feature Extraction Of The Video 38
Fig 5.2 Test Case I 39
Fig 5.3 Test Case II 40
Fig 5.4 Test Case III 41
Fig 5.5 Test Case IV 42
Fig 5.6 Test Case V 43

LIST OF TABLES

Table No. Table Name Page No.


Table 2.1 Survey of the project 06
Table 5.1 Dataset Statistics 36
Table 5.2 Result 44


CHAPTER 1
INTRODUCTION

In today's digital world, where social media reigns supreme, video content has
become a staple, providing a treasure trove of knowledge and entertainment. From
educational how-to videos to elaborate vlogs, the variety of online videos is immense and
constantly growing. However, this bounty also presents a challenge: how do users
efficiently sift through the vastness of videos to find ones that align with their interests
and questions?

The ability to pinpoint relevant moments within videos based on user queries has
emerged as a critical capability in enhancing the video browsing experience. This
capability, known as Video Temporal Grounding (VTG), enables users to seamlessly
access specific segments of video content aligned with their unique preferences and
queries. Tasks such as moment retrieval, highlight detection, and video summarization
have been developed to address this need, each focusing on different aspects of
temporal grounding.

In response to these challenges, our project, titled "Smart Event Timestamping:


Query-Driven Video Temporal Grounding," aims to develop a unified framework for
efficiently timestamping events within videos based on user queries. By leveraging
advanced machine learning techniques and comprehensive datasets, our framework
seeks to bridge the gap between user intent and the vast landscape of available video
content. Through the implementation of progressive data annotation techniques and
scalable model training methods, we endeavour to create a robust and user-friendly tool
that empowers users to effortlessly access relevant video content tailored to their
preferences.

This project represents the culmination of our efforts to improve video browsing.
Our goal is to enhance the experience, making it more fluid and personalized based on
user queries and preferences. Ultimately, we aim to revolutionize video browsing,
providing users with a seamless and customized experience that meets their unique
needs.


1.1 Problem Statement


With the rise of the internet, a surge in video content has transformed the way we
share knowledge and enjoy ourselves. However, this abundance of videos presents a
significant challenge: how to efficiently navigate through this vast array to find content
that aligns with individual interests and preferences. Users often find themselves
overwhelmed by the sheer volume and diversity of videos, making it difficult to
pinpoint specific moments or events within them. Traditional video browsing methods
fall short in providing the necessary granularity and personalization to tackle this
challenge effectively. Hence, there is a pressing need for a robust solution that offers
personalized video browsing tailored to individual preferences.

1.2 Objective
The primary goal of this project is to transform the way users interact with video
content on social media platforms. By implementing Unified Video Temporal
Grounding, we aim to provide users with a browsing experience that is both seamless
and highly personalized. This objective stems from the challenges users encounter
when trying to sift through the vast amount of video content available online, often
leading to frustration and a sense of being overwhelmed. Through advanced model
training techniques and the integration of innovative query-based technology, our
project seeks to address these challenges head-on. Our ultimate aim is to empower users
to input their specific queries and have the system dynamically search through video
content to deliver precise moments of events that match their interests and preferences.
By achieving this objective, we aspire to create a tool that enhances the overall user
experience on social media platforms, making video browsing more efficient,
enjoyable, and engaging for all users.

1.3 Motivation
Our motivation springs from a fascinating blend of real-world applications and
formidable challenges inherent in the realm of social media video browsing. Picture
this: a digital universe where users effortlessly unearth content that speaks directly to
their passions, seamlessly navigating through a seemingly boundless sea of videos. This
vision serves as the driving force behind our endeavour to harness the potential of
Unified Video Temporal Grounding, effectively bridging the gap between users' intent
and the vast array of video content available online.


In today's digital landscape, video content reigns supreme, captivating audiences


across diverse platforms with its immersive storytelling and engaging visuals. Yet,
amidst this abundance lies a significant challenge: users often find themselves adrift,
struggling to discover videos that truly resonate with their interests. The sheer volume
of available videos can be overwhelming, leading to frustration and disengagement.
Our motivation lies in confronting this challenge head-on, seeking to enhance the user
experience by providing personalized and efficient video browsing solutions that cut
through the noise.

Moreover, our project is driven by a deep-seated recognition of the need to innovate


in response to the ever-evolving dynamics of social media. As user preferences shift
and content trends evolve, staying relevant becomes increasingly vital. By embracing
cutting-edge technologies and methodologies, we endeavour to navigate the ever-
changing landscape of social media video browsing, ensuring that users continue to
enjoy seamless access to captivating content tailored to their interests and preferences.

In essence, our motivation is rooted in the desire to empower users, offering them a
bespoke video browsing experience that not only meets but exceeds their expectations.
We are committed to surmounting the challenges posed by the abundance of video
content and the fluid nature of social media trends, ultimately enhancing user
engagement and satisfaction in the digital sphere.

1.4 Scope of the Smart Event Timestamping


The scope of this project includes the development of a comprehensive solution for
personalized video browsing, focusing on addressing various video understanding tasks
within a unified framework. Central to this endeavour is the implementation of
advanced algorithms for event timestamping, enabling users to efficiently pinpoint
specific moments or events within videos. Moreover, the system will support natural
language queries, allowing users to search for video content using intuitive and user-
friendly interfaces.
Personalization will be a key aspect of the project, with a primary goal of tailoring
search results and recommendations based on user preferences. Additionally, the project
will prioritize efficiency in video browsing to ensure fast and seamless access to
relevant video content.


The project will conclude with the development of a user-friendly interface


designed to enhance the video browsing experience. Through these efforts, the project
aims to revolutionize the way users interact with video content, offering a seamless and
personalized browsing experience that caters to individual preferences and needs.

1.5 Software and Hardware Requirements


The project requires the following hardware requirements:
• Processor: A multi-core processor with sufficient processing power is
recommended to handle video processing tasks efficiently. An Intel Core i7 or AMD
Ryzen processor would be suitable for most applications.
• Memory (RAM): Video processing and machine learning tasks often require a large
amount of memory to handle datasets and model training. A minimum of 16GB of
RAM is recommended, although 32GB or more may be beneficial for larger
projects.
• Graphics Processing Unit (GPU): For accelerated deep learning tasks, a dedicated
GPU is essential. NVIDIA GPUs, such as the GeForce RTX or Quadro series, are
commonly used for deep learning applications due to their parallel processing
capabilities.
• Storage: Adequate storage space is necessary to store video datasets and trained
models. Solid-state drives (SSDs) are preferred for faster data access, especially
during model training and inference.
• Internet Connection: A stable internet connection is required for downloading
datasets, software packages, and updates, as well as for accessing cloud computing
resources if needed.
The project relies on the following software tools and technologies:
• Python: Python programming language provides a versatile and extensive
ecosystem for machine learning, data analysis, and web development. It serves as
the primary language for implementing the project components.
• Gradio: Gradio is a user-friendly library for creating interactive web interfaces for
machine learning models. It simplifies the process of deploying models and allows
users to interact with them through a web browser.
• Video Processing Libraries: OpenCV (Open Source Computer Vision Library) is a
popular open-source library for video processing tasks such as video loading, frame
manipulation, and feature extraction.


• Deep Learning Frameworks: TensorFlow and PyTorch are widely used deep
learning frameworks for building and training neural networks. Both frameworks
offer support for GPU acceleration and provide extensive documentation and
community support.
• Text Processing Libraries: Natural Language Processing (NLP) tasks, such as query
interpretation, may require libraries like NLTK (Natural Language Toolkit) or
spaCy for text processing and analysis.
• Visual Studio Code: Visual Studio Code is a versatile code editor with a rich set of
features and extensions. It provides an integrated development environment for
writing, debugging, and managing the project codebase.
• Git: Git is a distributed version control system that facilitates collaboration, code
management, and tracking of changes in the project. It is used to maintain the
project code repository, manage branches, and track the project's evolution over
time.

These tools and technologies were chosen based on their effectiveness,


compatibility with the project requirements, and the availability of extensive
community support and resources. They provide a robust and efficient foundation for
implementing the Smart Event Timestamping: Query-Driven Video Temporal
Grounding system, enabling effective video understanding, query-based video
browsing, and event timestamping. With these tools and technologies, our project aims
to revolutionize the way users interact with video content, offering a seamless and
personalized video browsing experience driven by user queries and preferences.
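
As a small illustration of the video processing layer described above, the following Python sketch uses OpenCV to split a video into fixed-length clips and sample representative frames from each clip. It is a minimal sketch under stated assumptions, not the project's actual code; the clip length and frames-per-clip values are illustrative choices.

import cv2

def sample_clip_frames(video_path, clip_len_sec=2.0, frames_per_clip=1):
    # Split a video into fixed-length clips and sample frame(s) from each clip.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames_per_window = max(1, int(round(fps * clip_len_sec)))
    clips, buffer = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        buffer.append(frame)
        if len(buffer) == frames_per_window:
            step = max(1, len(buffer) // frames_per_clip)
            clips.append(buffer[::step][:frames_per_clip])
            buffer = []
    cap.release()
    return clips  # list of clips, each a list of BGR frames

Fixed-length clips such as these form the basic units that later stages of the pipeline annotate with foreground, offset, and saliency values.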


CHAPTER 2
LITERATURE SURVEY
2.1 Survey of Smart Event Timestamping
The literature survey for this project reviewed research papers, scholarly
articles, and conference proceedings in computer vision, machine learning, and
information retrieval. Key advancements in video understanding, query-based
browsing, event timestamping, personalization, efficiency, and scalability were
identified.

In video understanding, significant work by Hendricks et al. and Gao et al. has
enhanced moment retrieval using natural language queries. Studies by Cao et al. and
Badamdorj et al. advanced highlight detection through joint visual and audio learning
techniques. Unified frameworks, like the SlowFast networks proposed by Feichtenhofer
et al., leverage spatial and temporal information for superior video recognition.
Escorcia et al. extended this by integrating natural language processing for temporal
localization.

Query-based video browsing, explored by Chen et al. and Ghosh et al., improves
user interaction by matching natural language queries with video segments, enhancing
personalization. Alwassel et al. introduced temporally-sensitive pretraining for better
localization, while Bain et al. proposed joint video and image encoders for efficient
multimedia retrieval.

Table 2.1 Survey of the project

1. Authors: Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
   Year of Publication: 2017
   Technique: Localizing moments in video with natural language
   Summary: Proposed a method to map textual descriptions to video segments.
   Limitation: Requires precise and clear natural language input.

2. Authors: Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell
   Year of Publication: 2019
   Technique: Temporal localization of moments in video collections with natural language
   Summary: Advanced the understanding of moment retrieval in large video collections.
   Limitation: Scalability and efficiency issues in larger datasets.

3. Authors: Max Bain, Arsha Nagrani, Gul Varol, Andrew Zisserman
   Year of Publication: 2021
   Technique: Frozen in Time: joint video and image encoder for end-to-end retrieval
   Summary: Demonstrated the potential of multimodal pretraining for improved video and image retrieval.
   Limitation: Complex training process and high computational requirements.

4. Authors: Soham Ghosh, Anuva Agarwal, Zarana Parekh, Alexander G. Hauptmann
   Year of Publication: 2019
   Technique: EXCL: Extractive clip localization using natural language descriptions
   Summary: Enhanced the efficiency and accuracy of finding specific video clips based on user queries.
   Limitation: Limited by the quality of natural language descriptions provided by users.

5. Authors: Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, Yuexian Zou
   Year of Publication: 2022
   Technique: LOCVTP: Video-text pretraining for temporal localization
   Summary: Pushed the boundaries of temporal localization tasks by leveraging extensive pretraining on video-text pairs.
   Limitation: High dependency on the quality and diversity of pretraining data.

The literature survey underscores the multifaceted nature of video


understanding research, encompassing moment retrieval, highlight detection, unified
frameworks, query-based browsing, event timestamping, and scalability. Through a
synthesis of diverse methodologies and approaches, researchers continue to push the
boundaries of video analysis and browsing, paving the way for more intelligent,
personalized, and efficient video retrieval systems. Moving forward, continued
exploration and innovation in these areas hold the promise of further enriching the
landscape of video understanding, empowering users to navigate and interact with
video content in increasingly seamless and intuitive ways.


CHAPTER 3
SYSTEM DESIGN
3.1 Flow Chart
Flowcharts serve as visual representations of the sequential process in our
project, Smart Event Timestamping: Query-Driven Video Temporal Grounding. The
flowcharts will depict the step-by-step workflow, commencing from user queries input
to the retrieval of relevant video segments. Each stage, including query interpretation,
video search, event timestamping, and result presentation, will be delineated using
standard symbols and arrows, elucidating the logical flow and decision points. These
visual aids will facilitate comprehension of the system's operation, pinpoint potential
bottlenecks, and streamline communication among project stakeholders, fostering a
cohesive development process.

Fig 3.1 Data flow

3.2 System Architecture


The system architecture for "Smart Event Timestamping: Query-Driven Video
Temporal Grounding" is carefully designed to facilitate precise and personalized video
browsing experiences. At its core, the architecture comprises several interconnected
components aimed at seamlessly guiding users from query input to the retrieval of
relevant video segments. The user interface serves as the entry point, providing an
intuitive platform for users to input their queries. These queries are then processed by
the Query Interpretation Module, which extracts key terms and phrases to guide the
search process effectively. The Video Search Engine conducts searches across a
comprehensive database of videos, retrieving segments aligned with user preferences.
Subsequently, the Event Detection and Timestamping Module analyzes these segments,


accurately identifying specific events or moments of interest. Timestamps are extracted


to enable precise navigation, allowing users to seamlessly explore the desired content.

Fig 3.2 System Architecture

1. Video Encoder
Purpose: Converts input video clips into a rich set of feature representations.
Components: Utilizes convolutional neural networks (CNNs) which are adept at
capturing spatial and temporal features in video data.
Process: Each clip V is passed through the CNN layers to extract features that capture
the visual content, including motion and objects present in the clip.
Output: Produces a set of feature vectors that represent different aspects of the visual
data.

2. Text Encoder
Purpose: Processes the free-form text query to understand its semantic content.
Components: Composed of feed-forward networks and multi-head self-attention
layers.
Process: The query Q, which consists of Lq tokens, is embedded and passed through
the encoder to generate a contextualized representation of the text.
Output: Generates a text feature vector that encapsulates the meaning and context of
the query.


3. Feature Fusion
Purpose: Integrates the video and text features to create a unified representation for
further processing.
Components: Involves operations like dot product and concatenation.
Process:
i. Saliency Path: Combines video and text features using a dot product to calculate the
relevance of each video clip to the query.
ii. Offsets Path: Concatenates video and text features, providing input for the module
calculating temporal boundaries.
iii. Indicator Path: Similarly concatenates features to determine if a clip is relevant to
the query.
Output: Produces fused features that are contextually aligned with the query for each
video clip.

4. Attention Pooler:
Purpose: Enhances the model's focus on relevant portions of the video clips based on
the query.
Components: Uses multi-head self-attention mechanisms.
Process: Applies attention pooling to the fused features, allowing the model to weigh
the importance of different parts of the video in relation to the query.
Output: Refined feature vectors that emphasize the most relevant aspects of the video
clips.

5. Output Modules
i. Foreground Indicator
Purpose: Binary classifier to indicate whether a clip is part of the foreground (relevant
to the query).
Components: Utilizes feed-forward neural networks.
Output: A binary value (0 or 1) for each clip.
ii. Boundary Offsets
Purpose: Calculates the temporal boundaries (start and end times) of the relevant video
segments.
Components: Composed of neural networks that estimate the distance of the clip's
timestamp to the interval boundaries.
Output: A pair of values representing the distances to the start and end boundaries.


iii. Saliency Score


Purpose: Determines the relevance score of each clip to the query on a continuous
scale.
Components: Uses a scoring function based on the dot product of fused features.
Output: A score between 0 and 1 indicating the degree of relevance.

This architecture is designed to efficiently handle the Video Temporal


Grounding (VTG) task by integrating advanced neural network techniques to align
video content with textual queries, identify relevant segments, and accurately
timestamp them.
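
To make the data flow concrete, the following PyTorch sketch shows one way the fusion paths and output modules described above could fit together. It is an illustrative simplification rather than the exact architecture: the hidden size, the mean-pooled query representation, and the single-layer heads are assumptions made for brevity.

import torch
import torch.nn as nn

class GroundingHeads(nn.Module):
    # Illustrative grounding heads over pre-extracted clip and query features.
    def __init__(self, video_dim=512, text_dim=512, hidden=256, heads=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.fuse_proj = nn.Linear(2 * hidden, hidden)
        # Attention pooler: multi-head attention over the fused clip sequence.
        self.attn = nn.MultiheadAttention(hidden, num_heads=heads, batch_first=True)
        # Foreground indicator (binary) and boundary offsets (start/end distances).
        self.indicator_head = nn.Linear(hidden, 1)
        self.offset_head = nn.Linear(hidden, 2)

    def forward(self, clip_feats, query_feats):
        # clip_feats: (B, Lv, video_dim); query_feats: (B, Lq, text_dim)
        v = self.video_proj(clip_feats)                            # (B, Lv, hidden)
        q = self.text_proj(query_feats).mean(dim=1, keepdim=True)  # pooled query (B, 1, hidden)
        # Saliency path: dot product between each clip and the pooled query.
        saliency = torch.sigmoid((v * q).sum(-1))                  # (B, Lv), scores in [0, 1]
        # Offsets/indicator paths: concatenate clip and query features,
        # then refine the fused sequence with the attention pooler.
        fused = self.fuse_proj(torch.cat([v, q.expand_as(v)], dim=-1))
        fused, _ = self.attn(fused, fused, fused)
        fg_logits = self.indicator_head(fused).squeeze(-1)         # foreground indicator f_i
        offsets = self.offset_head(fused)                          # boundary offsets (d_s, d_e)
        return fg_logits, offsets, saliency

In this sketch the saliency path mirrors the dot-product fusion, while the indicator and offsets paths operate on concatenated features refined by the attention pooler, echoing the three paths described in the Feature Fusion component.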

3.3 UML Diagrams

Use case diagram:


The use case diagram for "Smart Event Timestamping: Query-Driven Video
Temporal Grounding" visually depicts user interactions and system functionalities. It
showcases actors such as users initiating queries and essential use cases like query
interpretation, video searching, event detection, timestamp extraction, and result
presentation, ensuring a comprehensive overview of the system's operations.

Fig 3.3 Use Case Diagram


State diagram:
The state diagram for "Smart Event Timestamping: Query-Driven Video
Temporal Grounding" illustrates the various states and transitions within the system. It
encapsulates the system's behavior and the conditions under which it transitions
between different states. By visually representing the system's dynamic behavior, the
state diagram provides insights into how the system responds to user inputs, processes
queries, searches for relevant video segments, and presents results. Through a series of
defined states and transitions, the diagram offers a comprehensive overview of the
system's operational flow, facilitating the understanding of its functionality and
behavior.

Fig 3.4 State Diagram


Sequence diagram:
A sequence diagram visually represents the interaction between various
components of the 'Smart Event Timestamping: Query-Driven Video Temporal
Grounding' system. It begins with the user inputting a query through the user interface.
The diagram then illustrates the sequential flow of events as the system interprets the
query, conducts searches for relevant video segments, detects events within these
segments, extracts timestamps, and finally presents the results back to the user. Each
step in the process is represented by a lifeline corresponding to the involved
components, and arrows indicate the messages exchanged between them. The sequence
diagram captures the flow of method calls, message exchanges, and data flows between
components, providing a clear visualization of the system's operational workflow.

Fig 3.5 Sequence Diagram


3.4 Project Plan

Our 8-week project aims to develop Smart Event Timestamping: Query-Driven


Video Temporal Grounding, an innovative system enhancing video browsing by
aligning video segments with text queries. We'll unify various VTG tasks, configure
and train the model, organize datasets, and develop a user-friendly interface, ensuring
efficient video content accessibility and navigation.


Week 1: Project Initiation and Environment Setup

Objective: Establish the project foundation and prepare the development environment.

Tasks:
• Conduct a project kickoff meeting to align the team on objectives, roles, and
responsibilities.
• Create a new Conda environment and install necessary dependencies from the
`requirements.txt`.
• Configure the development environment for GPU usage to ensure optimal
performance.
• Set up version control with Git and establish a repository for the project.

Week 2: Data Preparation and Directory Organization

Objective: Organize the datasets and prepare the data for processing.

Tasks:
• Download and extract datasets including QVHighlights, Charades-STA,
TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS.
• Organize the extracted data into a structured directory format (e.g., metadata,
text clips, video clips, and video features); an illustrative snippet follows this list.
• Implement scripts to preprocess and standardize the datasets.
• Verify the integrity and correctness of the datasets by performing initial
exploratory data analysis.
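
For illustration only, a directory layout matching the structure described in the tasks above could be created with a short helper such as the one below; the dataset folder and sub-directory names are assumptions, not the project's fixed layout.

from pathlib import Path

DATASETS = ["qvhighlights", "charades_sta", "tacos", "ego4d",
            "youtube_highlights", "tvsum", "qfvs"]
SUBDIRS = ["metadata", "txt_clips", "vid_clips", "vid_features"]

def prepare_dirs(root="data"):
    # Create data/<dataset>/<subdir> folders if they do not already exist.
    for name in DATASETS:
        for sub in SUBDIRS:
            Path(root, name, sub).mkdir(parents=True, exist_ok=True)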

Week 3: Model Configuration and Initial Setup

Objective: Set up and configure the Smart Event Timestamping model.

Tasks:
• Define the model architecture and configure the parameters (e.g., model
version, output feature size, clip length).
• Implement argument parsing to handle command-line inputs and
configurations.
• Integrate necessary libraries and modules for model setup (e.g., PyTorch,
Gradio).
• Perform initial testing of the model configuration to ensure it is correctly set up.


Week 4: Feature Extraction and Model Training

Objective: Extract features from the datasets and initiate model training.

Tasks:
• Implement and test feature extraction functions for both video and text data (see the sketch after this list).
• Extract features from the prepared datasets and store them in the designated
directories.
• Begin the initial phase of model training using the extracted features.
• Monitor training progress, evaluate model performance, and fine-tune
hyperparameters as necessary.
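
A minimal sketch of such feature extraction functions is shown below, assuming a frozen CLIP checkpoint from the Hugging Face transformers library encodes one sampled frame per clip and each text query; the backbone choice and feature dimensions are assumptions and may differ from the project's actual setup.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_video_features(frames):
    # frames: list of BGR numpy arrays, one sampled frame per clip.
    images = [Image.fromarray(f[:, :, ::-1].copy()) for f in frames]  # BGR -> RGB
    inputs = processor(images=images, return_tensors="pt")
    return model.get_image_features(**inputs)   # (num_clips, feature_dim)

@torch.no_grad()
def clip_text_features(queries):
    # queries: list of text strings.
    inputs = processor(text=queries, return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)    # (num_queries, feature_dim)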

Week 5: Model Evaluation and Optimization

Objective: Evaluate and optimize the trained model for better performance.

Tasks:

• Implement evaluation metrics (e.g., mAP, HIT@1, Recall@1) to assess model
performance; an illustrative metric sketch follows this list.
• Conduct comprehensive testing on validation datasets to evaluate model
accuracy and robustness.
• Optimize the model by adjusting hyperparameters and refining the training
process.
• Document the evaluation results and identify areas for improvement.
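
For reference, Recall@1 at a temporal IoU threshold, one of the metrics listed above, can be computed as sketched below; mAP and HIT@1 would be evaluated analogously over ranked predictions.

def temporal_iou(pred, gt):
    # IoU between two temporal windows given as (start, end) in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_predictions, ground_truths, iou_threshold=0.5):
    # Fraction of queries whose top-1 predicted window reaches the IoU threshold.
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_predictions, ground_truths))
    return hits / len(ground_truths)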

Week 6: User Interface Development

Objective: Develop a user-friendly web-based interface for the Smart Event


Timestamping system.

Tasks:
• Design the layout and user interface components using Gradio.
• Implement functionalities for video input, feature extraction, and text query
submission (see the sketch after this list).
• Ensure the interface is intuitive and easy to navigate for end-users.
• Integrate the model with the user interface to enable real-time processing and
feedback.
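
A minimal Gradio sketch of such an interface is shown below; the callback is a placeholder standing in for the full feature extraction and grounding pipeline, and the component labels are illustrative.

import gradio as gr

def ground_query(video_path, query):
    # Placeholder: the real system would extract features, run the grounding
    # model, and return predicted (start, end) timestamps with saliency scores.
    return f"Query '{query}' received for video: {video_path}"

demo = gr.Interface(
    fn=ground_query,
    inputs=[gr.Video(label="Input video"), gr.Textbox(label="Text query")],
    outputs=gr.Textbox(label="Predicted moments"),
    title="Smart Event Timestamping",
)

if __name__ == "__main__":
    demo.launch()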


Week 7: System Integration and Testing

Objective: Integrate all components and conduct thorough testing of the system.

Tasks:
• Integrate the feature extraction, model, and user interface components into a
cohesive system.
• Perform end-to-end testing to ensure all components work seamlessly together.
• Identify and fix any bugs or issues that arise during integration testing.
• Conduct user acceptance testing with a small group of users to gather feedback
and make necessary adjustments.

Week 8: Deployment and Project Closure

Objective: Deploy the Smart Event Timestamping system and conclude the project.

Tasks:
• Prepare the deployment environment and configure server settings for hosting
the application.
• Deploy the system to a cloud platform or local server for public access.
• Provide documentation for system usage, including a user manual and technical
documentation.
• Conduct a final project review meeting to discuss achievements, challenges, and
future improvements.


CHAPTER 4
SYSTEM IMPLEMENTATION & METHODOLOGIES
4.1 System Implementation
With the increasing interest in sharing daily lives, video has emerged as the most
informative yet diverse visual form on social media. These videos are collected in a
variety of settings, including untrimmed instructional videos, and well-edited vlogs.
With massive scales and diverse video forms, automatically identifying relevant
moments based on user queries has become a critical capability in the industry for
efficient video browsing. This significant demand has given rise to a number of video
understanding tasks, including moment retrieval, highlight detection, and video
summarization. As depicted in Fig. 4.1, moment retrieval localizes consecutive temporal
windows (interval-level) given natural-language sentences; highlight detection picks
out the key segments with the highest worthiness (curve-level) that best reflect the video
gist; and video summarization collects a set of disjoint shots (point-level) to summarize the
video, with general or user-specific queries. Although task-specific datasets and models
have been developed, these tasks are typically studied separately. In general, these tasks
share a common objective of grounding clips of various scales based on customized user
queries, which we refer to as Video Temporal Grounding (VTG).

In the implementation of Unified Video Temporal Grounding, our
understanding of the video content landscape on social media platforms lays the
groundwork for our approach. The need for automated methods that identify relevant
moments in videos based on user queries is the central focus of our system, and it
directs the design and implementation of each component, ensuring that our solution
addresses the critical demand for efficient video browsing in the industry.

i. Video Temporal Grounding (VTG):


The goal of our system, Smart Event Timestamping, is to ground target clips
from videos, such as consecutive intervals or disjoint shots, based on custom language
queries. This functionality is essential for enhancing video browsing experiences on
social media platforms. Unlike traditional approaches, which often rely on task-specific
models trained with type-specific labels, our system is designed to generalize across
various VTG tasks and labels, thus offering more versatility and adaptability.


ii. Unification of VTG Labels and Tasks:


In our proposal, Smart Event Timestamping aims to unify diverse VTG labels
and tasks along three key directions. Firstly, we revisit a wide range of VTG labels and
tasks to define a unified formulation. This allows us to develop data annotation schemes
for creating scalable pseudo supervision.
iii. Development of an Effective Grounding Model:
Secondly, we develop an effective and flexible grounding model capable of
addressing each VTG task and making full use of each label. This model is designed to
decode key elements of the query-conditional elements and effectively address each
VTG task.

iv. Temporal Grounding Pretraining:


Lastly, leveraging the unified framework, we unlock temporal grounding
pretraining from large-scale diverse labels. This allows us to develop stronger
grounding abilities, including zero-shot grounding. Extensive experiments conducted
on three tasks (moment retrieval, highlight detection, and video summarization) across
seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights,
TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed
framework.

4.2 Dataset Utilization:


In our system implementation, we leverage a diverse range of datasets to
validate the effectiveness and flexibility of Smart Event Timestamping across different
tasks and domains. These datasets include:
• QVHighlights: This dataset, introduced as the first unified benchmark for moment
retrieval and highlight detection, provides a standardized platform for evaluating
algorithms in these tasks. By encompassing both moment retrieval and highlight
detection, QVHighlights enables researchers to assess the performance of their
models comprehensively across multiple related tasks.
• Charades-STA: Designed specifically for moment retrieval tasks, the Charades-
STA dataset offers a diverse collection of annotated videos. Researchers leverage
this dataset to develop and evaluate algorithms that can accurately retrieve specific
moments within videos based on user queries or predefined criteria.
• TACoS: Similar to Charades-STA, the TACoS dataset is utilized for moment
retrieval tasks and provides annotated temporal segments within videos. By


utilizing TACoS, researchers can further validate the effectiveness of their models
in localizing and retrieving relevant moments from videos.
• Ego4D: This dataset focuses on moment retrieval tasks, particularly emphasizing
point-level annotations within videos. By incorporating Ego4D, researchers gain
insights into the fine-grained temporal dynamics within videos, allowing for more
nuanced analysis and evaluation of moment retrieval algorithms.
• YouTube Highlights: Tailored for highlight detection tasks, the YouTube
Highlights dataset comprises a diverse range of videos with annotated key
segments. Researchers utilize this dataset to develop algorithms capable of
automatically identifying and extracting salient highlights from videos, catering to
user preferences and interests.
• TVSum: Similar to YouTube Highlights, the TVSum dataset is utilized for
highlight detection tasks and features annotated key segments within videos. By
leveraging TVSum, researchers can further evaluate the performance of highlight
detection algorithms in identifying and summarizing important segments within
video content.
• QFVS: Serving as a dataset for video summarization tasks, QFVS provides
annotated summaries of video content. Researchers utilize QFVS to develop
algorithms capable of generating concise and informative summaries of videos
based on user queries or preferences, facilitating efficient video browsing and
content analysis.

The implementation of Smart Event Timestamping, a unified video temporal


grounding system, represents a transformative advancement in video browsing. By
seamlessly integrating moment retrieval, highlight detection, and video summarization
tasks, Smart Event Timestamping offers users a comprehensive solution for navigating
video content. Leveraging a diverse array of datasets including QVHighlights and
Charades-STA, Smart Event Timestamping demonstrates robust performance across
various domains. Its innovative model development in temporal grounding and
utilization of pretraining techniques significantly enhance its adaptability and
effectiveness in real-world scenarios. The system's remarkable generalization capability
across diverse temporal labels highlights its potential for broader applications and
advancements in video understanding technology, promising users a more streamlined
and personalized video browsing experience.


4.3 Video Temporal Grounding


We review three VTG tasks: moment retrieval, highlight detection, and video
summarization, and compare them as different variations of a common problem.

i. Moment Retrieval
Moment Retrieval aims to localize target moments, i.e., one or many continuous
intervals within a video, given a language query, as shown in Fig. 4.1 (b). Previous methods
fall into two categories: proposal-based and proposal-free. The proposal-based methods
employ a two-stage process of scanning the entire video to generate candidate
proposals, which are then ranked based on their match to the text query. In contrast,
the proposal-free methods learn to regress the start and end boundaries directly without
requiring proposal candidates. Our VTG borrows from proposal-free approaches but
extends them by incorporating diverse temporal labels and tasks with a concise design.

ii. Highlight Detection


Highlight Detection aims to assign a worthiness score to each video segment,
e.g., Fig. 4.1 (c), and then return the highest-scoring segments as the highlights.
Previous highlight detection datasets tend to be domain-specific and query-agnostic,
and many methods treat this task as a visual or visual-audio scoring problem. Nevertheless,
video highlights typically have a theme, which is often reflected in the video titles or
topics, e.g., “surfing”. Recently, QVHighlights, a joint moment retrieval and highlight
detection benchmark, was proposed; it enables users to produce various highlights for
one video conditioned on different text queries.

Fig 4.1 Video Temporal Grounding


iii. Summarization
Video Summarization aims to summarize the whole video by a set of shots to
provide a quick overview, e.g., Fig. 4.1 (a). It takes two forms: generic video
summarization, which captures the important scenes using visual cues alone, and
query-focused video summarization, which allows users to customize the summary by
specifying text keywords (e.g., trees and cars). The latter is closer to practical usage,
hence we focus on it. Recently, IntentVizor proposed an interactive approach allowing
users to adjust their intents to obtain a superior summary. In general, each of the three
tasks represents a specific form of VTG that grounds clips of different scales from
videos (e.g., a consecutive clip set, a single clip or a disjoint clip set) given
customized text queries (e.g., sentences, titles or keywords). However, previous
methods address only some of these subtasks. Based on this insight, our goal is to develop a
unified framework to handle all of them.

4.4 Vision-Language Pretraining


The emergence of large-scale vision-language datasets has paved the way for
the development of VLP to enhance video-text representation for various vision-
language tasks. The representative CLIP has shown that image-level visual
representations can be effectively learned using large-scale noisy image-text pairs.
Furthermore, GLIP works along the spatial axis, leveraging various
image annotations, such as image labels, captions, and bounding boxes, to develop
strong region-level understanding capacity for spatial grounding tasks. However, due
to the expensive manual cost of fine-grained temporal-level annotations, i.e., temporal
bounding boxes, this grounding pretraining has not been extended to the temporal axis in
videos, limiting its progress relative to its spatial counterpart. To address this limitation,
we explore alternative approaches that leverage accessible timestamp narrations and
derive pseudo supervision as the pretraining corpus. On the other hand, several
efforts have been made to perform temporal-friendly video pretraining to pursue a better
video representation for grounding tasks. But the resulting pretraining model still
requires an additional grounding model such as 2D-TAN to perform video grounding.
In contrast, powered by our unified framework and scalable pseudo annotations, we can
directly conduct VLP with grounding as a pretraining task. This eliminates the
need for additional grounding models and enables zero-shot grounding capacity.


4.5 Unified Formulation


The Unified VTG pipeline is displayed in Fig. 4.2. In this section, we start by
introducing the unified formulation.

Fig 4.2 Unified Formulation

Towards Unified VTG: Tasks and Labels


Given a video V and a language query Q, we first divide V into a sequence of Lv
fixed-length clips {v1,··· ,vLv}, where each clip vi is of length l and has a centered
timestamp ti. The free-form text query Q has Lq tokens, denoted as Q ={q1, · · · , qLq }.
We then define three elements for each clip vi = (fi, di, si), described as follows:
• Foreground indicator fi ∈ {0, 1}: a binary value indicating whether the i-th clip vi
belongs to the foreground or not. If clip vi is the foreground of Q, then fi = 1,
otherwise fi = 0.
• Boundary offsets di = [dsi, dei] ∈ R2: the temporal distance that converts the clip
timestamp ti to its interval boundaries. Here, di is valid when fi = 1. The dsi is the
distance between the starting of the interval and ti, whereas dei is the distance
between the ending and ti. Thus, the whole temporal interval bi of vi can be
represented as
bi = [ti − dsi, ti + dei]
• Saliency score si ∈ [0, 1]: a continuous score determining the relevance between
the visual content of clip vi and the query Q. If the clip and query are highly
correlated, si = 1; If they are totally irrelevant, then si = 0. Notably, it is reasonable
to assume that si > 0 if a clip is in the foreground of Q, otherwise si = 0. In Fig.4.2
(a), we draw a schematic diagram to represent these three elements of clip vi in our
definition.
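
The three elements above map directly onto a small per-clip data structure. The sketch below is illustrative; the field names are assumptions, and interval() simply applies the relation bi = [ti − dsi, ti + dei].

from dataclasses import dataclass

@dataclass
class ClipLabel:
    t: float                 # centred timestamp t_i of the clip (seconds)
    f: int                   # foreground indicator f_i, 0 or 1
    d: tuple                 # boundary offsets (d_s, d_e); valid only when f == 1
    s: float                 # saliency score s_i in [0, 1]

    def interval(self):
        # Temporal interval b_i = [t_i - d_s, t_i + d_e] of the clip's moment.
        d_s, d_e = self.d
        return (self.t - d_s, self.t + d_e)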


Fig 4.3 VTG model

4.5.1 Revisiting Various VTG Tasks and Labels


In the context of Unified Video Temporal Grounding, we conceptualize video
clips as the fundamental building blocks, forming the atom composition of a video. The
Video Temporal Grounding (VTG) problem is defined as the task of collecting a target
clip set M = {vi ∈ V | Q} from the video set V, conditioned on a language query Q.
Extending this definition to various tasks and labels involves addressing two key
questions:

i. Scalable Label Corpus for Pretraining


To collect a scalable label corpus for pretraining, we employ innovative data
annotation schemes. These schemes enable the creation of large-scale, diverse pseudo
labels, facilitating effective pretraining of VTG models. By leveraging these annotated
labels, we can enhance the robustness and generalization capabilities of our models
across different video understanding tasks.

ii. Obtaining Unknown Elements with Unified Formulation


When using the unified formulation, obtaining unknown elements based on the
available ones involves leveraging the principles of our formulation. By decoding and


extrapolating from existing elements, we can infer and predict unknown components
with a high degree of accuracy. This approach allows us to effectively handle various
VTG tasks and labels, ensuring comprehensive coverage and adaptability in addressing
user queries across diverse video content.

Fig 4.4 Process of using CLIP to produce temporal labels

4.5.2 Moment Retrieval and Interval-wise Label


Moment retrieval aims to localize one or many intervals in a video
corresponding to a given sentence query. As represented in Fig. 4.2 (right, blue), moment
retrieval selects m consecutive clip sets, denoted as M = M1 ∪ · · · ∪ Mm, where m ≥
1, and Mj represents the j-th target moment. Mathematically, M can be simplified as the
boundary set of foreground clips, denoted as {bi | fi = 1}, where fi = 1 indicates inclusion
in the target moment.

The temporal interval with specific target boundaries serves as a common label
for moment retrieval. However, annotating these intervals necessitates manual review
of the entire video, incurring significant cost. While Automated Speech Recognition
(ASR) can provide start and end timestamps, it often suffers from noise and poor
alignment with visual content, making it suboptimal. Alternatively, visual captions tend
to be descriptive and suitable as grounding queries. Leveraging this, VideoCC emerges
as a viable option, initially developed for video-level pretraining but explored here for
temporal grounding pretraining.

Once intervals are obtained, they are converted into the proposed formulation.
Clips not within the target interval are defined as fi = 0 and si = 0, while those within
the target interval are assigned fi = 1, with assumed si > 0.
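
For illustration, a minimal sketch (a hypothetical helper with assumed values, not the actual
annotation pipeline) of how one annotated interval is converted into the per-clip labels
described above:

def interval_to_clip_labels(start, end, num_clips, clip_len=2.0):
    """Convert an annotated interval [start, end] (seconds) into per-clip labels
    (f_i, d_i, s_i) under the unified formulation; foreground saliency is only
    assumed to be positive, so it is simply set to 1.0 here."""
    labels = []
    for i in range(num_clips):
        t_i = (i + 0.5) * clip_len              # clip centre timestamp
        if start <= t_i <= end:                 # clip lies inside the target interval
            labels.append((1, (t_i - start, end - t_i), 1.0))
        else:                                   # background clip
            labels.append((0, None, 0.0))
    return labels

# A 20 s video split into 2 s clips, with a target interval of 6 s - 12 s
print(interval_to_clip_labels(6.0, 12.0, num_clips=10))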


4.5.3 Highlight Detection and Curve-wise Label


Highlight detection involves assigning an importance score to each video clip,
effectively creating annotations resembling a curve. The task then entails selecting the
few clips with the highest scores as the highlights, with or without a language query
provided. In instances where language queries are absent, video titles or domain names can
serve as substitutes, as they are closely related to the video's topic. Mathematically,
this translates to selecting clips with the top-K highest saliency scores, represented as
M = {vi | si ∈ top-K}.

Highlight detection labels, resembling curves, are characterized by subjectivity,
often requiring multiple annotators to eliminate bias. Consequently, curve labels are
both costly and informative. To address this, an efficient method for producing scalable
curve labels is sought. Intuitively, the interestingness of each clip reflects its relevance
to the video's gist. Utilizing an open-world detection class list, a concept bank is
defined, and CLIP is employed as a teacher to compute clip-level cosine similarities for
each concept. The top five concepts are selected as the video's gist, with their CLIP
similarities saved as pseudo curve labels.

Following the acquisition of curve labels, clips are assigned fi = 1 if si exceeds a
threshold τ, otherwise fi = 0. The threshold τ is estimated based on the similarity of
each video, with further details provided in the supplementary materials. Additionally,
offsets di are defined as the distance between foreground clips and their nearest
neighbouring clips where fi = 0.
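
A rough sketch of this CLIP-teacher labelling scheme using the open-source CLIP package (the
concept bank, the per-clip frame sampling, and the gist-selection rule below are simplified
assumptions, not the exact pipeline):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

concept_bank = ["cooking", "surfing", "dog", "concert", "car"]   # stand-in for the open-world class list
text_tokens = clip.tokenize(concept_bank).to(device)

def pseudo_curve_labels(frame_paths, top_k=5):
    """Compute clip-level CLIP similarities against a concept bank and keep the
    top-k concepts as the video gist; their similarities act as pseudo curve labels."""
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(images)        # (num_clips, D)
        txt_feat = model.encode_text(text_tokens)    # (num_concepts, D)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = img_feat @ txt_feat.T                      # cosine similarity per clip and concept
    gist = sim.mean(dim=0).topk(min(top_k, len(concept_bank))).indices
    return sim[:, gist]                              # (num_clips, top_k) curve labels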

4.5.4 Video Summarization and Point-wise Label


Query-focused video summarization aims to condense an entire video into a set
of shots, providing a concise overview tailored to user-specific concepts (e.g., trees,
cars). This task, defined by keywords as Q, involves selecting a set of clips
M = {vi | fi = 1}, ensuring that the size of M does not exceed a specified percentage α
of the original video length (e.g., α = 2%). Annotations in datasets like QFVS utilize
point labels, indicating whether each shot belongs to the concept or not. Compared to
interval and curve labels, point labels are more cost-effective, as annotators only need
to identify specific timestamps.

The Ego4D dataset employs point labelling extensively, associating narrations with exact
timestamps (e.g., "I am opening the washing machine" at ti = 2.30 sec). Due
to their scalability, these point labels are suitable for large-scale pretraining. Recent
efforts have focused on leveraging point-wise annotations to enhance video-text
representation and augment natural language query (NLQ) baselines. However, these
methods primarily operate within the same domain.

For point labels, si > 0 is assumed if a clip has fi = 1; otherwise, si = 0. During
pretraining, the temporal label bi is estimated based on the average distance between
consecutive narrations within the video.
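
A minimal sketch of this estimation (a simplified assumption of the actual rule):

def point_to_intervals(narration_times, video_duration):
    """Expand each narration timestamp into an interval whose width equals the
    average gap between consecutive narrations in the same video."""
    gaps = [b - a for a, b in zip(narration_times[:-1], narration_times[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else video_duration
    half = avg_gap / 2.0
    return [(max(0.0, t - half), min(video_duration, t + half)) for t in narration_times]

# Narrations at 2.3 s, 7.1 s and 11.8 s in a 20 s video
print(point_to_intervals([2.3, 7.1, 11.8], video_duration=20.0))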

4.6 Unified Model


Here we introduce our unified model, which seamlessly inherits the proposed unified
formulation.

4.6.1. Overview
As shown in Fig. 4.3, our model mainly comprises a frozen video encoder, a frozen text
encoder, and a multi-modal encoder. The video and text encoders are kept consistent with
Moment-DETR, which employs the concatenation of CLIP (ViT-B/32) and SlowFast (R-50) features
as the video representation and uses the CLIP text encoder to extract token-level features.
Our multi-modal encoder contains k self-attention blocks followed by three task-specific
heads to decode the predictions. Given an input video V with Lv clips and a language query Q
with Lq tokens, we first apply the video encoder and the text encoder to encode the video
and text respectively, then project them to the same dimension D by two Feed-Forward
Networks (FFN), obtaining video features V = {vi} ∈ R^(Lv×D) and text features
Q = {qj} ∈ R^(Lq×D). Next, we design two pathways for cross-modal alignment and
cross-modal interaction.

i. For cross-modal alignment, we first adopt an attentive pooling operator to aggregate
the query tokens Q ∈ R^(Lq×D) into a sentence representation S ∈ R^(1×D). Specifically,
S = AQ,
where the weight A = Softmax(WQ^T) ∈ R^(1×Lq) and W ∈ R^(1×D) is a learnable
embedding. Then V and S are used to perform contrastive learning (see the sketch after
this list).

ii. For cross-modal interaction, learnable position embeddings Epos and modality-
type embeddings Etype are added to each modality to retain both positional and
modality information:
˜Q = Q + Epos^T + Etype^T,    ˜V = V + Epos^V + Etype^V

Next, the text and video tokens are concatenated into a joint input Z0 = [˜V; ˜Q] ∈ R^(L×D),
where L = Lv + Lq. Z0 is then fed into the multi-modal encoder, which contains k
transformer layers, each consisting of Multi-headed Self-Attention and FFN blocks.

The encoder output is Zk = [˜Vk; ˜Qk] ∈ R^(L×D); we take the video tokens ˜Vk ∈ R^(Lv×D)
from the multi-modal encoder Em and feed them into the following heads for prediction.
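
For illustration, a minimal PyTorch sketch of the two pathways (our own simplification:
layer sizes, embeddings, and the pooling weight are assumptions, not the exact released
configuration):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePooling(nn.Module):
    """Alignment pathway: aggregate Lq token features Q into one sentence vector S = AQ."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, 1, bias=False)           # learnable embedding W

    def forward(self, q):                                # q: (Lq, D)
        attn = F.softmax(self.w(q).transpose(0, 1), -1)  # A: (1, Lq)
        return attn @ q                                  # S: (1, D)

class CrossModalEncoder(nn.Module):
    """Interaction pathway: add positional/modality-type embeddings, concatenate video
    and text tokens, and run k self-attention layers."""
    def __init__(self, dim=512, k=4, heads=8, max_len=512):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)            # E_pos
        self.mod = nn.Embedding(2, dim)                  # E_type: 0 = video, 1 = text
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, k)

    def forward(self, vid, txt):                         # vid: (B, Lv, D), txt: (B, Lq, D)
        Lv, Lq, dev = vid.shape[1], txt.shape[1], vid.device
        vid = vid + self.pos(torch.arange(Lv, device=dev)) + self.mod(torch.zeros(Lv, dtype=torch.long, device=dev))
        txt = txt + self.pos(torch.arange(Lq, device=dev)) + self.mod(torch.ones(Lq, dtype=torch.long, device=dev))
        z = self.encoder(torch.cat([vid, txt], dim=1))   # Zk = encoder([V~; Q~])
        return z[:, :Lv], z[:, Lv:]                      # V~k (for the heads), Q~k

# toy usage: 75 video clips and 12 query tokens, D = 512
pool, enc = AttentivePooling(512), CrossModalEncoder()
S = pool(torch.randn(12, 512))                           # sentence representation (1, 512)
v_k, q_k = enc(torch.randn(1, 75, 512), torch.randn(1, 12, 512))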

4.6.2 Pretraining Objectives


To match the previous unified formulation, i.e., (fi, di, si), we devise three different
heads to decode each element respectively, each corresponding to one capability.

Foreground head for Matching


Taking the output ˜Vk ∈ R^(Lv×D) from the multi-modal encoder, this head applies three 1×3
Conv layers, each with D filters and followed by a ReLU activation. Finally, a sigmoid
activation is attached to output the prediction ˜fi per clip. We use the binary
cross-entropy loss as the training objective.

Boundary head for Localization


The design of this head is similar to the foreground head except for the last layer,
which has 2 output channels for the left and right offsets. Taking ˜Vk ∈ R^(Lv×D), this
head outputs offsets {˜di} per clip. We then derive the predicted boundary ˜bi and use the
combination of the smooth L1 loss [12] and the generalized IoU loss as our training
objectives.

Notably, this regression objective is only applied to foreground clips, i.e., fi = 1.
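
A minimal sketch of these two heads (the exact layer layout is an assumption based on the
description above, not the released implementation):

import torch
import torch.nn as nn

def conv_head(dim, out_channels):
    """Stack of 1x3 Conv1d layers with ReLU, ending in a projection to `out_channels`:
    1 channel for the foreground head, 2 channels (left/right offsets) for the boundary head."""
    return nn.Sequential(
        nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(dim, out_channels, kernel_size=3, padding=1),
    )

D, Lv = 512, 75
tokens = torch.randn(1, Lv, D).transpose(1, 2)                  # Conv1d expects (B, D, Lv)
fg_prob = torch.sigmoid(conv_head(D, 1)(tokens)).squeeze(1)     # ~f_i per clip -> BCE loss
offsets = conv_head(D, 2)(tokens).transpose(1, 2)               # ~d_i = (d_s, d_e) -> L1 + gIoU loss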


Saliency head for Contrasting


Since we define saliency as the relevance between the visual content and the text query,
it is natural to interpret this score as a similarity measurement between the video and
text modalities. Taking the video tokens V = {vi} ∈ R^(Lv×D) and the sentence
representation S ∈ R^(1×D), we define the predicted saliency score ˜si between clip vi and
text query Q as their cosine similarity:

˜si = cos(vi, S) = (vi S^T) / (∥vi∥2 ∥S∥2),

where ∥·∥2 represents the L2-norm of a vector.

For each video V, we randomly sample a foreground clip vp with fp = 1 and sp > 0 as a
positive sample; we treat other clips in the same video vj with saliency sj less than sp as
negative samples, i.e., Ω = {j | sj < sp, 1 ≤ j ≤ Lv}, and perform intra-video contrastive
learning:

Lintra = −log [ exp(˜sp/τ) / ( exp(˜sp/τ) + Σ_{j∈Ω} exp(˜sj/τ) ) ],

where τ is a temperature parameter set as 0.07.

Besides, we regard sentences from other samples within the batch k ∈ B as negative
samples, and develop inter-video contrastive learning for cross-sample supervision:

Linter = −log [ exp(˜sp/τ) / ( exp(˜sp/τ) + Σ_{k∈B} exp(˜sp^k/τ) ) ],

where B is the training batch and ˜sp^k = cos(vp, Sk).

Our saliency head training loss is the combination of the inter- and intra-video
contrastive terms:

Ls = Lintra + Linter.

To this end, our total training objective is the combination of each head loss over all
clips in the training set:

L = (1/N) Σ_i ( Lf + 1[fi = 1] · Lb + Ls ),

where N is the clip number of the training set, and Lf, Lb, and Ls denote the foreground,
boundary, and saliency losses defined above.
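
For illustration, a minimal PyTorch sketch of the intra- and inter-video contrastive terms
(our own simplification; tensor shapes, sampling, and naming are assumptions rather than the
exact training code):

import torch
import torch.nn.functional as F

def saliency_losses(video_tokens, sentence, other_sentences, pos_idx, saliency, tau=0.07):
    """video_tokens: (Lv, D) clip features of one video; sentence: (D,) its query;
    other_sentences: (B-1, D) queries of the other samples in the batch;
    pos_idx: index of the sampled foreground clip; saliency: (Lv,) pseudo labels."""
    scores = F.cosine_similarity(video_tokens, sentence.unsqueeze(0), dim=-1) / tau   # ~s_i / tau
    pos = scores[pos_idx]

    # intra-video: clips of the same video with lower saliency act as negatives
    neg = scores[saliency < saliency[pos_idx]]
    intra = -(pos - torch.logsumexp(torch.cat([pos.view(1), neg]), dim=0))

    # inter-video: the same clip scored against other samples' sentences acts as negatives
    cross = F.cosine_similarity(video_tokens[pos_idx].unsqueeze(0), other_sentences, dim=-1) / tau
    inter = -(pos - torch.logsumexp(torch.cat([pos.view(1), cross]), dim=0))
    return intra + inter

loss = saliency_losses(torch.randn(75, 512), torch.randn(512),
                       torch.randn(7, 512), pos_idx=10, saliency=torch.rand(75))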


4.7 Inference
During inference, given a video V and a language query Q, we first feed the model forward
to obtain (˜fi, ˜bi, ˜si) for each clip vi from the three heads. Next, we describe how the
outputs are produced for each individual VTG task.
Moment Retrieval
We rank the predicted boundaries ˜bi based on their foreground probabilities ˜fi. Since the
predicted Lv boundaries are dense, we adopt a 1-D Non-Maximum Suppression (NMS) with a
threshold of 0.7 to remove highly overlapping boundaries, yielding the final prediction.

Highlight Detection:
To fully utilize the foreground and saliency terms, we rank all clips based on their
combined scores ˜fi + ˜si, and then return the top few clips (e.g., Top-1) as predictions.

Video Summarization:
Using the same preprocessing settings, the videos are first divided into multiple segments
via the KTS algorithm. The clip scores within each segment are then computed and
integrated. We rank all clips based on their foreground scores ˜fi and return the Top-2% of
clips as the video summary.

4.8 Integration & Deployment


Integration and deployment of the Smart Event Timestamping system involve several
crucial steps. Smart Event Timestamping is a sophisticated video-language temporal
grounding system designed to deliver relevant video segments and highlights based on
text queries. The process includes setting up the environment, installing necessary
dependencies, organizing the data, and executing the code to ensure seamless operation.

Step 1: Setting Up the Environment


To begin the setup of Smart Event Timestamping, it is essential to create an
isolated environment to manage dependencies effectively. Conda, a package and
environment management system, is recommended for this purpose. You will create a
new environment named smart_event_timestamping with Python version 3.8. This
ensures that the specific versions of libraries required for Smart Event Timestamping
do not conflict with other projects or system-wide installations.


Once the environment is created, activate it and proceed to install the necessary
Python packages. These packages are listed in a file named requirements.txt. Installing
these packages will set up all the dependencies required to run Smart Event
Timestamping, including libraries for machine learning, data processing, and other
utilities.

This step is crucial because it isolates the project's dependencies from the system's
global environment, reducing the risk of version conflicts and ensuring that the project
uses the exact versions of libraries it was developed and tested with. This isolation
makes the development process more manageable and helps avoid common pitfalls
associated with dependency management.
Step 2: Preparing the Data
The next step involves preparing the dataset required for Smart Event
Timestamping. Begin by unzipping the downloaded tar file containing the dataset. After
extracting the files, move the data into the appropriate directories as specified by the
Smart Event Timestamping structure. This step is crucial for ensuring that the program
can locate and access the data efficiently.

For those using VideoCC Slowfast features, additional steps are required. You
need to group multiple sub-zips into a single file and then unzip it to prepare the features
for use. This process consolidates the data into a format that Smart Event Timestamping
can handle effectively.

The organization of data into a specific directory structure allows the system to
easily access and manage the different components of the dataset. Proper data
organization is vital for efficient data handling, ensuring that all necessary files are
available in the expected locations when the system is executed.

Step 3: Organizing the Directory Structure


Proper organization of the directory structure is essential for the smooth
operation of Smart Event Timestamping. The extracted data and features should be
arranged in a specific hierarchical structure. This structure includes directories for
evaluation data, various datasets (such as QFVS, TVSum, YouTube, TACoS, Ego4D,
Charades, and QVHighlights), and subdirectories for metadata, text clips, video clips,
and video features.


Maintaining this organization ensures that the Smart Event Timestamping system can easily
access the required files and datasets during execution, reducing the
risk of errors and improving efficiency. Each dataset needs to be placed in its designated
directory, with clear demarcations for different types of data like metadata, video clips,
and text clips. This clear separation helps in managing and accessing data without
confusion.

Step 4: Installing Additional Packages

Beyond the initial setup, you may need to install additional packages to fully prepare
the environment for executing Smart Event Timestamping. These packages might not
be covered by requirements.txt but are necessary for specific functionalities within the
system.

This step involves importing necessary libraries, setting up argument parsing to handle
various command-line arguments, and configuring the environment for GPU
usage if available. Properly setting these configurations ensures that Smart Event
Timestamping can leverage hardware acceleration and run optimally.

In particular, setting up for GPU usage can significantly enhance the performance of Smart
Event Timestamping, allowing it to process large volumes of
video data and complex models more efficiently. Ensuring that all necessary packages
and configurations are in place is key to achieving optimal performance.

Step 5: Loading and Setting Up the Model


Loading and setting up the model is a critical step in deploying Smart Event
Timestamping. You need to configure the model settings, including the version of the
model to use, the size of the output features, the length of the video clips, and other
relevant parameters. This involves setting up the environment, preparing the model for
use with specific hardware (such as GPUs), and ensuring that all dependencies are
correctly installed. Proper configuration ensures that the model performs optimally and
efficiently processes video and text data to provide accurate and relevant insights.

A function to load the model should be implemented, ensuring that the model is
set up with the appropriate configurations and is ready to process data. This function
typically involves setting up logging for monitoring the process, configuring CUDA
settings for GPU acceleration, and initializing the model with the specified parameters.


The model configuration includes defining the architecture and parameters that
the model will use to process video and text data. This setup is crucial for ensuring that
the model operates correctly and efficiently, providing accurate results based on the
input data.

Step 6: Data Loading and Processing


With the model set up, the next step is to load and process the data. Implement
a function to load video and text data, normalizing and preparing it for model input.
This involves loading features from preprocessed data files, normalizing the data to
ensure consistency, and structuring it in a format that the model can process.

Processing the data also involves creating masks and timestamps, which are
used by the model to understand the temporal aspects of the video data. Properly
loading and processing the data is crucial for accurate and efficient model performance.

Normalization and preparation of data ensure that the model receives input in a
standardized format, reducing variability and improving the reliability of the model's
outputs. This step is critical for achieving high accuracy in the model's predictions.

Step 7: Video and Text Feature Extraction


Extracting features from both video and text data is essential for the functioning
of Smart Event Timestamping. You need to implement functions to extract features
from video files and text queries. The video feature extraction function processes the
video to generate feature vectors that represent different segments of the video.

Similarly, text feature extraction involves processing the text query to generate
feature vectors that the model can compare with video features. This allows the system
to understand the query and identify relevant video segments.

Additionally, you need to implement functionality to handle user input, such as video files
or YouTube video IDs. This includes downloading videos, extracting
features, and preparing the data for querying.

Feature extraction is a critical step as it converts raw video and text data into a
form that the model can interpret and analyze. This step involves using pre-trained
models and algorithms to extract meaningful representations from the data, which are
then used for querying and analysis.
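
For illustration, a rough sketch of what such feature extraction can look like with the
open-source CLIP package (the actual vid2clip/txt2clip helpers used by the system are not
reproduced here; the frame sampling and model choice below are assumptions):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_video_features(frame_paths):
    """Encode one sampled frame per 2-second clip into an L2-normalised feature matrix."""
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(frames)                 # (num_clips, 512)
    return feats / feats.norm(dim=-1, keepdim=True)

def extract_text_features(query):
    """Encode the text query into an L2-normalised feature vector."""
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)                  # (1, 512)
    return feats / feats.norm(dim=-1, keepdim=True)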


Step 8: User Interaction and Query Processing


Smart Event Timestamping includes an interactive component where users can
input text queries and receive relevant video segments and highlights. This involves
setting up a user interface using tools like Gradio, allowing users to upload videos, input
text queries, and receive responses from the system.

The system processes the user input, extracts necessary features, and uses the
model to find relevant video segments based on the query. The results are then presented
to the user, including the top intervals and highlights identified by the model.

Following these steps ensures the successful integration and deployment of Smart Event
Timestamping. By setting up the environment, preparing and organizing
data, configuring the model, extracting features, and implementing user interaction, you
can deploy a robust video-language temporal grounding system that efficiently
processes video and text data to provide relevant insights. This structured approach to
setting up and deploying the system not only ensures efficient operation but also makes
it easier to maintain and extend in the future.

4.9 User Interface


The user interface of the Smart Event Timestamping web-based application is
designed to be intuitive and user-friendly, ensuring a seamless interaction for users who
wish to perform video analysis through moment retrieval, highlight detection, and video
summarization. The layout is structured to provide clear sections for video input,
feature extraction, and query interaction, making it easy to navigate and utilize the
system's capabilities.

Fig 4.5 User Interface

i. Video Input Section


The left side of the page is dedicated to the video input section. This section includes a
placeholder where the video is displayed. Users can either upload a local video file
directly into this section or provide a YouTube video link. Below the video display area,
there is a text field where users can paste the YouTube video URL. This functionality
allows users to analyze a wide range of videos from different sources, providing
flexibility and convenience.

When a video is uploaded or a YouTube link is provided, the video will appear
in the placeholder, ready for feature extraction and analysis. This visual feedback
ensures that users can verify the correct video has been selected before proceeding with
further actions.

ii. Feature Extraction Button


On the right side of the video input section, there is a prominently placed button
labeled "Extract Video Features." This button is critical for initiating the analysis
process. When pressed, it triggers the system to extract the video features of the
currently displayed video, whether it is an uploaded file or a YouTube video. The
extraction process involves breaking down the video into analyzable components,
enabling the system to process and respond to user queries effectively.

The button's placement and clear labeling ensure that users can easily find and
use it, facilitating a smooth workflow from video input to feature extraction. This step
is essential for preparing the video data for subsequent queries, making the feature
extraction button a pivotal element of the interface.

iii. Chatbot Interaction Area


The right side of the page hosts the chatbot interaction area. This section is
designed to handle user queries related to video analysis. The chatbot interface includes
a text field where users can type their questions or commands. Users can ask the chatbot
various queries regarding moment retrieval, highlight detection, and video
summarization. For example, users might inquire about specific moments within the
video, request a summary, or ask the system to highlight the most significant segments.


Below the text field, there are two buttons: "Submit" and "Clear."

Submit Button: After entering their query into the text field, users can click the
"Submit" button to send their query to the system. The system then processes the
request and provides relevant results, which are displayed within the chatbot
conversation window. This interaction allows users to receive detailed and specific
information about the video based on their queries.

Clear Button: The "Clear" button is provided to erase the contents of the text field.
This is particularly useful if the user wishes to rephrase their question or start a new
query without any remnants of previous input. The clear functionality ensures that the
input area is reset quickly, maintaining a clean and organized interaction space.

The interface uses simple, intuitive controls with clear labels, making it
accessible to users with varying levels of technical expertise. The use of familiar web
elements like buttons, text fields, and chat windows ensures that users can quickly
understand and navigate the application. By placing the video input section on the left
and the chatbot interaction area on the right, the layout utilizes the screen space
effectively, providing clear and distinct areas for different tasks. This separation helps
users focus on one task at a time without feeling overwhelmed.


CHAPTER 5
TESTING AND RESULTS
5.1 Evaluation Metrics
The effectiveness of the methods is evaluated across four video temporal
grounding (VTG) tasks using seven datasets. For joint moment retrieval and highlight
detection tasks, the QVHighlights dataset is utilized. The evaluation metrics include
Recall@1 with IoU thresholds of 0.5 and 0.7, mean average precision (mAP) with IoU
thresholds of 0.5 and 0.75, and the average mAP over a series of IoU thresholds ranging
from 0.5 to 0.95. For highlight detection, metrics such as mAP and HIT@1 are
employed, considering a clip as a true positive if it has a saliency score of "Very Good."

For the moment retrieval task, datasets such as Charades-STA, NLQ, and
TACoS are assessed using Recall@1 with IoU thresholds of 0.3, 0.5, and 0.7, as well
as mIoU. For highlight detection tasks on YouTube Highlights and TVSum, mAP and
Top-5 mAP metrics are used, respectively. Lastly, for the video summarization task on
the QFVS dataset, the evaluation is based on the F1-score per video as well as the
average F1-score.
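
For reference, a minimal sketch of how the temporal IoU and Recall@1 metrics can be computed
(an illustrative implementation, not the official evaluation code):

def temporal_iou(pred, gt):
    """IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, iou_threshold=0.5):
    """Fraction of queries whose Top-1 predicted interval reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

preds = [[10.0, 15.0], [40.0, 48.0]]
gts = [[11.0, 16.0], [20.0, 30.0]]
print(recall_at_1(preds, gts, iou_threshold=0.5))   # -> 0.5
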
Table 5.1 Dataset Statistics

Dataset         Label              # Samples   Domain
Ego4D           Point              1.8M        Egocentric
VideoCC         Interval           0.9M        Web
CLIP teacher    Curve              1.5M        Open
QVHighlights    Interval + Curve   10.3K       VLog, News
NLQ             Interval           15.1K       Egocentric
Charades-STA    Interval           16.1K       Indoor
TACoS           Interval           18.2K       Kitchens
YoutubeHL       Curve              600         Web
TVSum           Curve              50          Web
QFVS            Point              4           Egocentric

The Smart Event Timestamping system employs a multi-modal transformer encoder with four
layers, each having a hidden size of 1024 and eight attention heads. The drop
path rates are set to 0.1 for transformer layers and 0.5 for input FFN projectors. The
pretraining experiments are conducted using eight A100 GPUs, while downstream tasks
are performed on a single GPU. For moment retrieval, all baselines and Smart Event
Timestamping utilize the same video and text features. Results for highlight detection
and video summarization are reported according to established benchmarks, ensuring
consistency in evaluation.
5.2 Comparison with State-of-the-Arts
i. Joint Moment Retrieval and Highlight Detection
The Smart Event Timestamping system is evaluated on the QVHighlights test
split, showing comparable performance to Moment-DETR and UMT without
pretraining, demonstrating its superior design for joint task optimization. With large-
scale pretraining, it exhibits significant improvements across all metrics, such as an
increase of +8.16 in Avg. mAP and +5.32 in HIT@1, surpassing all baselines by a
substantial margin. Notably, even with the introduction of audio modality and ASR
pretraining by UMT, it outperforms UMT by an Avg. mAP of 5.55 and HIT@1 of 3.89.
Furthermore, its large-scale pretraining allows for effective zero-shot grounding,
outperforming several supervised baselines without any training samples.

ii. Moment Retrieval


In moment retrieval tasks, the Smart Event Timestamping system is compared
with mainstream methods on three widely-used benchmarks. Without pretraining, it
already outperforms other methods, highlighting the effectiveness of its architecture.
With large-scale grounding pretraining, it shows significant improvements, with mIoU
increases of +2.97 in NLQ, +2.07 in Charades-STA, and +5.03 in TACoS. The zero-
shot results in NLQ outperform all baseline methods due to the close pretraining
domain. However, the zero-shot performance on TACoS is inferior, likely due to the
similar scenes in videos, which pose challenges for zero-shot methods.

iii. Highlight Detection


Highlight detection experiments are conducted on YouTube Highlights and
TVSum datasets. Grounding pretraining enhances the Smart Event Timestamping
system, allowing it to surpass all baselines in Avg. mAP. In TVSum, the gain
discrepancy among domains might stem from its smaller scale (50 samples) and scoring
subjectivity. Conversely, the larger YouTube Highlights dataset (600 videos) yields
more consistent pretraining gains. In a zero-shot setting, the Smart Event Timestamping
system outperforms several video-only baselines, demonstrating its robust
performance.


iv. Video Summarization


On the QFVS benchmark, the pretrained Smart Event Timestamping system
achieves a 0.8% higher Avg. F1-score compared to IntentVizor, an interactive method
tailored for video summarization. This result underscores the generalization capability
of the Smart Event Timestamping system in the video summarization task, affirming its
effectiveness across various VTG tasks and datasets.

5.3 Test Cases


i. Test Case I
Objective: Verify the system's ability to identify the specific interval and highlight
where a buffet tray with Mexican hamburgers labeled "Mexican Hamburger" is seen.

Description: This test case evaluates the system's performance in identifying a specific
interval and highlight within a 9-minute video where a buffet tray with Mexican
hamburgers is observed.

Fig 5.1 Feature Extraction of the video

Step 1: Upload/Link Video


Action: Ensure the 9-minute video is displayed in the video placeholder.
Expected Result: The video appears in the video placeholder, ready for analysis.

Step 2: Extract Video Features


Action: Click the "Extract Video Features" button.
Expected Result: A confirmation message appears indicating successful feature
extraction. The chatbot is ready for queries.


Step 3: Enter Query


Action: Enter the query "Show the interval and highlight where a buffet tray with
Mexican hamburgers is seen" into the chatbot text field.
Expected Result: The query is accepted by the chatbot.

Step 4: Submit Query


Action: Click the "Submit" button.
Expected Result: The chatbot processes the query and returns the relevant interval and
highlight.

Step 5: Expected Output


Interval: "8:44 to 8:52 min"
Highlight: "8:46 min"

Fig 5.2 Test Case I

ii. Test Case II


Objective: Verify the system's ability to identify the specific interval and highlight
where a ship is seen in the water.

Description: This test case evaluates the system's performance in identifying a specific
interval and highlight within a 9-minute video where a ship is observed in the water.

Step 1: Upload/Link Video


Action: Ensure the 9-minute video is displayed in the video placeholder.
Expected Result: The video appears in the video placeholder, ready for analysis.


Step 2: Extract Video Features


Action: Click the "Extract Video Features" button.
Expected Result: A confirmation message appears indicating successful feature
extraction. The chatbot is ready for queries.

Step 3: Enter Query


Action: Enter the query "Show the interval and highlight where a ship is seen in the
water" into the chatbot text field.
Expected Result: The query is accepted by the chatbot.

Step 4: Submit Query


Action: Click the "Submit" button.
Expected Result: The chatbot processes the query and returns the relevant interval and
highlight.

Step 5: Expected Output


Interval: "0:45 to 0:57 sec"
Highlight: "0:54 sec"

Fig 5.3 Test Case II


iii. Test Case III
Objective: Verify the system's ability to identify the specific interval and highlight
where the countries Sweden and Norway are labelled on a map in a 12-minute video from
YouTube.

Description: This test case evaluates the system's performance in identifying a specific
interval and highlight within a video where the countries Sweden and Norway are
observed on a map.


Step 1: Upload/Link Video


Action: Ensure the video is displayed in the video placeholder.
Expected Result: The video appears in the video placeholder, ready for analysis.

Step 2: Extract Video Features


Action: Click the "Extract Video Features" button.
Expected Result: A confirmation message appears indicating successful feature
extraction. The chatbot is ready for queries.

Step 3: Enter Query


Action: Enter the queries "Sweden" and "Norway" into the chatbot text field.
Expected Result: The queries are accepted by the chatbot.
Step 4: Submit Query
Action: Click the "Submit" button.
Expected Result: The chatbot processes the queries and returns the relevant interval and
highlight.
Step 5: Expected Output
Interval: "5:40 to 5:59 min"
Highlight: "2:12 min"

Fig 5.4 Test Case III

iv. Test Case IV


Objective: Verify the system's ability to identify the specific interval and highlight
where a girl is seen opening and reading a book.


Description: This test case evaluates the system's performance in identifying a specific
interval and highlight within a 9-minute video where a girl is observed opening and
reading a book.

Step 1: Upload/Link Video


Action: Ensure the 9-minute video is displayed in the video placeholder.
Expected Result: The video appears in the video placeholder, ready for analysis.

Step 2: Extract Video Features


Action: Click the "Extract Video Features" button.
Expected Result: A confirmation message appears indicating successful feature
extraction. The chatbot is ready for queries.

Step 3: Enter Query


Action: Enter the query "When did the girl open the book?" into the chatbot text field.
Expected Result: The query is accepted by the chatbot.

Step 4: Submit Query


Action: Click the "Submit" button.
Expected Result: The chatbot processes the query and returns the relevant interval and
highlight.

Step 5: Expected Output


Interval: "7:35 to 7:40 min"
Highlight: "7:38 min"

Fig 5.5 Test Case IV


v. Test Case V
Objective: Verify the system's ability to identify the specific interval and highlight
where a girl is seen coming inside a room.

Description: This test case evaluates the performance in identifying a specific interval
and highlight within a 30-second video where a girl is observed coming inside a room.

Step 1: Upload/Link Video


Action: Ensure the 30-second video is displayed in the video placeholder.
Expected Result: The video appears in the video placeholder, ready for analysis.

Step 2: Extract Video Features


Action: Click the "Extract Video Features" button.
Expected Result: A confirmation message appears indicating successful feature
extraction. The chatbot is ready for queries.

Step 3: Enter Query


Action: Enter the query "When did she come inside the room?".
Expected Result: The query is accepted by the chatbot.

Step 4: Submit Query


Action: Click the "Submit" button.
Expected Result: The chatbot processes the query and returns the relevant interval and
highlight.

Step 5: Expected Output


Interval: "0:00 to 0:07 min"
Highlight: "0:00 min"

Fig 5.6 Test Case V


Test Results
Table 5.2 Results

Test Case I
Objective: To identify the specific interval and highlight where a buffet tray with Mexican hamburgers is seen.
Expected Output: Interval: "8:44 to 8:52 min"; Highlight: "8:46 min"
Our Output: Interval: "8:44 to 8:52 min"; Highlight: "8:46 min"
Pass/Fail: Pass

Test Case II
Objective: To identify the specific interval and highlight where a ship is seen in the water.
Expected Output: Interval: "0:45 to 0:57 sec"; Highlight: "0:54 sec"
Our Output: Interval: "0:45 to 0:57 sec"; Highlight: "0:54 sec"
Pass/Fail: Pass

Test Case III
Objective: To identify the specific interval and highlight where the countries Sweden and Norway are labeled on a map in a video.
Expected Output: Interval: "5:40 to 5:59 min"; Highlight: "2:12 min"
Our Output: Interval: "5:40 to 5:59 min"; Highlight: "2:12 min"
Pass/Fail: Pass

Test Case IV
Objective: To identify the specific interval and highlight where a girl is seen opening and reading a book.
Expected Output: Interval: "7:35 to 7:40 min"; Highlight: "7:38 min"
Our Output: Interval: "7:35 to 7:40 min"; Highlight: "7:38 min"
Pass/Fail: Pass

Test Case V
Objective: To identify the specific interval and highlight where a girl is seen coming inside a room.
Expected Output: Interval: "0:00 to 0:06 min"; Highlight: "0:00 min"
Our Output: Interval: "0:00 to 0:07 min"; Highlight: "0:00 min"
Pass/Fail: Pass


CHAPTER 6
CONCLUSION AND FUTURE ENHANCEMENTS
The development of this web-based application marks a significant milestone
in the field of video analysis and user interaction. The system's ability to process video
content, extract relevant features, and respond to user queries with precise intervals and
highlights demonstrates the potential of integrating advanced machine learning models
with practical user interfaces.
The integration of video feature extraction with natural language processing has
yielded a highly user-friendly web application, facilitating seamless video analysis
tasks across various domains. By enabling easy video upload, precise content
extraction, and efficient navigation, the system has demonstrated tangible
advancements in enhancing digital content accessibility and engagement. With a
practical interface and effective machine learning integration, it has significantly
boosted productivity and streamlined processes in fields ranging from media production
to education.
Future Enhancements
As we look ahead, there are several exciting enhancements that can be made to
the project to expand its capabilities, improve its accuracy, and provide an even more
seamless user experience.

1. Multi-Language Support:
Currently, the system primarily supports queries in English. Introducing multi-language
support would make the system accessible to a broader audience. Leveraging advanced
natural language processing models capable of understanding and processing multiple
languages can cater to users worldwide. This enhancement would require integrating
translation services or multi-language NLP models, ensuring the system maintains its
accuracy and efficiency across various languages.

2. Real-Time Video Processing:


Introducing real-time video processing capabilities would significantly enhance the
system's utility. Users could analyze live video feeds or ongoing events, making the
system valuable for applications in security, live event monitoring, and more. Achieving
this would involve optimizing the feature extraction process to handle streaming data
and ensuring the backend infrastructure supports real-time data flow and processing.


3. Advanced Visualization Tools:


Developing advanced visualization tools to present the results in a more user-friendly
manner can enhance the overall experience. This could include interactive timelines,
heatmaps showing areas of interest in the video, and more detailed metadata about the
extracted features. These tools would help users better understand the analysis results
and make informed decisions based on the insights provided.

4. Offline Capabilities:
Developing offline capabilities would ensure that users can use the system even without
an internet connection. This would be particularly useful for users in remote locations
or with limited internet access. Offline mode could involve downloading essential
components of the system and allowing local video analysis, with results synchronized
once the connection is restored.

By focusing on these future enhancements, the project can evolve into a more
robust, versatile, and user-friendly tool, catering to a wide range of applications and
user needs.


REFERENCES

[1] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and
Bryan Russell. Localizing moments in video with natural language. In ICCV, pages
5803–5812, 2017.

[2] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity
localization via language query. In ICCV, pages 5267–5275, 2017.

[3] Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. Temporally
grounding natural sentence in video. In EMNLP, pages 162–171, 2018.

[4] Humam Alwassel, Silvio Giancola, and Bernard Ghanem. Tsp: Temporally-
sensitive pretraining of video encoders for localization tasks. In ICCV, pages 3173–
3183, 2021.

[5] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast
networks for video recognition. In ICCV, pages 6202–6211, 2019.

[6] Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, and Bryan Russell.
Temporal localization of moments in video collections with natural language. arXiv
preprint arXiv:1907.12763, 2019.

[7] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles.
Activitynet: A large-scale video benchmark for human activity understanding. In
CVPR, pages 961–970, 2015.

[8] Max Bain, Arsha Nagrani, Gul Varol, and Andrew Zisserman. Frozen in time: A
joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738,
2021.

[9] Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, and Yuexian Zou.
Locvtp: Video-text pretraining for temporal localization. In ECCV, pages 38–56, 2022.

[10] Taivanbat Badamdorj, Mrigank Rochan, Yang Wang, and Li Cheng. Joint visual
and audio learning for video highlight detection. In ICCV, pages 8127–8137, 2021.

[11] Soham Ghosh, Anuva Agarwal, Zarana Parekh, and Alexander G. Hauptmann.
Excl: Extractive clip localization using natural language descriptions. In NAACL-HLT,
pages 1984–1990, 2019.


APPENDIX

File name:main.py

import os
import pdb
import time
import torch
import gradio as gr
import numpy as np
import argparse
import subprocess
from run_on_video import clip, vid2clip, txt2clip

parser = argparse.ArgumentParser(description='')
parser.add_argument('--save_dir', type=str, default='./tmp')
parser.add_argument('--resume', type=str,
default='./results/omni/model_best.ckpt')
parser.add_argument("--gpu_id", type=int, default=2)
args = parser.parse_args()
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id)

#################################
model_version = "ViT-B/32"
output_feat_size = 512
clip_len = 2
overwrite = True
num_decoding_thread = 4
half_precision = False
clip_model, _ = clip.load(model_version, device=args.gpu_id,
jit=False)

import logging
import torch.backends.cudnn as cudnn
from main.config import TestOptions, setup_model
from utils.basic_utils import l2_normalize_np_array

logger = logging.getLogger(__name__)
logging.basicConfig(
    format="%(asctime)s.%(msecs)03d:%(levelname)s:%(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)

def load_model():
logger.info("Setup config, data and model...")
opt = TestOptions().parse(args)
# pdb.set_trace()
cudnn.benchmark = True
cudnn.deterministic = False


if opt.lr_warmup > 0:
total_steps = opt.n_epoch
warmup_steps = opt.lr_warmup if opt.lr_warmup > 1 else
int(opt.lr_warmup * total_steps)
opt.lr_warmup = [warmup_steps, total_steps]

model, criterion, _, _ = setup_model(opt)


return model

vtg_model = load_model()

def convert_to_hms(seconds):
return time.strftime('%H:%M:%S', time.gmtime(seconds))

def load_data(save_dir):
    vid = np.load(os.path.join(save_dir, 'vid.npz'))['features'].astype(np.float32)
    txt = np.load(os.path.join(save_dir, 'txt.npz'))['features'].astype(np.float32)

vid = torch.from_numpy(l2_normalize_np_array(vid))
txt = torch.from_numpy(l2_normalize_np_array(txt))
clip_len = 2
ctx_l = vid.shape[0]

    timestamp = ((torch.arange(0, ctx_l) + clip_len / 2) / ctx_l).unsqueeze(1).repeat(1, 2)

if True:
tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l
tef_ed = tef_st + 1.0 / ctx_l
tef = torch.stack([tef_st, tef_ed], dim=1) # (Lv, 2)
vid = torch.cat([vid, tef], dim=1) # (Lv, Dv+2)

src_vid = vid.unsqueeze(0).cuda()
src_txt = txt.unsqueeze(0).cuda()
    src_vid_mask = torch.ones(src_vid.shape[0], src_vid.shape[1]).cuda()
    src_txt_mask = torch.ones(src_txt.shape[0], src_txt.shape[1]).cuda()

    return src_vid, src_txt, src_vid_mask, src_txt_mask, timestamp, ctx_l

def forward(model, save_dir, query):
    src_vid, src_txt, src_vid_mask, src_txt_mask, timestamp, ctx_l = load_data(save_dir)


src_vid = src_vid.cuda(args.gpu_id)
src_txt = src_txt.cuda(args.gpu_id)
src_vid_mask = src_vid_mask.cuda(args.gpu_id)
src_txt_mask = src_txt_mask.cuda(args.gpu_id)

model.eval()
with torch.no_grad():
output = model(src_vid=src_vid, src_txt=src_txt,
src_vid_mask=src_vid_mask, src_txt_mask=src_txt_mask)

    # prepare the model prediction
    pred_logits = output['pred_logits'][0].cpu()
    pred_spans = output['pred_spans'][0].cpu()
    pred_saliency = output['saliency_scores'].cpu()

    # convert normalized spans into absolute timestamps (seconds)
    pred_windows = (pred_spans + timestamp) * ctx_l * clip_len
    pred_confidence = pred_logits

    # grounding
    top1_window = pred_windows[torch.argmax(pred_confidence)].tolist()
    top5_values, top5_indices = torch.topk(pred_confidence.flatten(), k=5)
    top5_windows = pred_windows[top5_indices].tolist()

    q_response = f"For query: {query}"

    mr_res = " - ".join([convert_to_hms(int(i)) for i in top1_window])
    mr_response = f"The Top-1 interval is: {mr_res}"

    hl_res = convert_to_hms(int(torch.argmax(pred_saliency)) * clip_len)
    hl_response = f"The Top-1 highlight is: {hl_res}"
    return '\n'.join([q_response, mr_response, hl_response])

def extract_vid(vid_path, state):


history = state['messages']
vid_features = vid2clip(clip_model, vid_path, args.save_dir)
    history.append({"role": "user", "content": "Finish extracting video features."})
    history.append({"role": "system", "content": "Please Enter the text query."})
    chat_messages = [(history[i]['content'], history[i+1]['content'])
                     for i in range(0, len(history), 2)]
return '', chat_messages, state


def extract_txt(txt):
txt_features = txt2clip(clip_model, txt, args.save_dir)
return

def download_video(url, save_dir='./examples', size=768):
    save_path = f'{save_dir}/{url}.mp4'
    cmd = (f'yt-dlp -S ext:mp4:m4a --throttled-rate 5M '
           f'-f "best[width<={size}][height<={size}]" '
           f'--output {save_path} --merge-output-format mp4 '
           f'https://www.youtube.com/embed/{url}')

if not os.path.exists(save_path):
try:
subprocess.call(cmd, shell=True)
except:
return None
return save_path

def get_empty_state():
return {"total_tokens": 0, "messages": []}

def submit_message(prompt, state):


history = state['messages']

    if not prompt:
        return gr.update(value=''), [(history[i]['content'], history[i+1]['content'])
                                     for i in range(0, len(history)-1, 2)], state

prompt_msg = { "role": "user", "content": prompt }

try:
history.append(prompt_msg)
# answer = vlogger.chat2video(prompt)
# answer = prompt
extract_txt(prompt)
answer = forward(vtg_model, args.save_dir, prompt)
history.append({"role": "system", "content": answer})

except Exception as e:
history.append(prompt_msg)
history.append({
"role": "system",
"content": f"Error: {e}"
})
    chat_messages = [(history[i]['content'], history[i+1]['content'])
                     for i in range(0, len(history)-1, 2)]
return '', chat_messages, state


def clear_conversation():
    return (gr.update(value=None, visible=True),
            gr.update(value=None, interactive=True),
            None,
            gr.update(value=None, visible=True),
            get_empty_state())

def subvid_fn(vid):
save_path = download_video(vid)
return gr.update(value=save_path)

css = """
#col-container {max-width: 80%; margin-left: auto; margin-right: auto;}
#video_inp {min-height: 100px}
#chatbox {min-height: 100px;}
#header {text-align: center;}
#hint {font-size: 1.0em; padding: 0.5em; margin: 0;}
.message { font-size: 1.2em; }
"""

with gr.Blocks(css=css) as demo:

state = gr.State(get_empty_state())
with gr.Column(elem_id="col-container"):
        gr.Markdown("""## 🤖 Smart Event Timestamping: Query-Driven Video Temporal Grounding
Given a video and a text query, return the relevant window and highlight.""", elem_id="header")

with gr.Row():
with gr.Column():
video_inp = gr.Video(label="video_input")
                gr.Markdown("👋 **Step1**: Select a video in Examples (bottom) or input a "
                            "youtube video_id in this textbox, *e.g.* *G7zJK6lcbyU* for "
                            "https://www.youtube.com/watch?v=G7zJK6lcbyU", elem_id="hint")
                with gr.Row():
                    video_id = gr.Textbox(value="", placeholder="Youtube video url",
                                          show_label=False)
                    vidsub_btn = gr.Button("(Optional) Submit Youtube id")

with gr.Column():


                vid_ext = gr.Button("Step2: Extract video feature, may take a while")
                total_tokens_str = gr.Markdown(elem_id="total_tokens_str")
                chatbot = gr.Chatbot(elem_id="chatbox")
                input_message = gr.Textbox(show_label=False,
                                           placeholder="Enter text query and press enter",
                                           visible=True).style(container=False)
                btn_submit = gr.Button("Step3: Enter your text query")
                btn_clear_conversation = gr.Button("🔃 Clear")

examples = gr.Examples(
examples=[
["./examples/charades.mp4"],
],
inputs=[video_inp],
)

    btn_submit.click(submit_message, [input_message, state],
                     [input_message, chatbot])
    input_message.submit(submit_message, [input_message, state],
                         [input_message, chatbot])
    btn_clear_conversation.click(clear_conversation, [],
                                 [input_message, video_inp, chatbot, state])
    vid_ext.click(extract_vid, [video_inp, state],
                  [input_message, chatbot])
    vidsub_btn.click(subvid_fn, [video_id], [video_inp])

demo.load(queue=False)

demo.queue(concurrency_count=10)
demo.launch(height='800px', server_port=2253, debug=True, share=True)
