
Multi-Scale Contrastive Learning for Video Temporal Grounding

Thong Thanh Nguyen¹, Yi Bin¹,²*, Xiaobao Wu³, Zhiyuan Hu¹, Cong-Duy Nguyen³, See-Kiong Ng¹, Anh Tuan Luu³

¹ Institute of Data Science (IDS), National University of Singapore, Singapore
² Tongji University, China
³ Nanyang Technological University (NTU), Singapore

* Yi Bin is the corresponding author, [email protected]
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels undergo downsampling to accommodate increasing moment lengths, their capacity to capture information is reduced, which leads to degraded information in moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself, requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable such an extension, we introduce a sampling process to draw multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding.

Introduction

Temporal video grounding aims to localize moments of interest in an untrimmed video given a free-form textual description. It is a challenging multimodal task, since it involves understanding temporal information in videos and reasoning about their connections to semantic information in texts. Recently, temporal grounding has drawn increasing attention (Mu, Mo, and Li 2024; Jung et al. 2023; Xu et al. 2023; Pan et al. 2023) due to its wide range of applications such as surveillance (Zhang, Zhu, and Roy-Chowdhury 2016), robotics (Burgner-Kahrs, Rucker, and Choset 2015), and autonomous driving (Claussmann et al. 2019).

Previous methods (Zhang et al. 2020b; Soldan et al. 2021; Zhang et al. 2020a) for temporal grounding concentrate on grounding merely a few queries in short video snippets. However, the growing availability of long videos, e.g. on streaming platforms, and the demand to query their rich content have recently necessitated productive grounding of large volumes of queries in long videos. Because of this short-to-long video paradigm shift, the latest methods (Zhang, Wu, and Li 2022; Mu, Mo, and Li 2024) utilize local self-attention to restrict attention within a local window, following the intuition that temporal context beyond a certain range is less helpful for moment localization.

To capture moments at different temporal scales without enlarging the window size of the local self-attention, recent methods (Zhang, Wu, and Li 2022; Mu, Mo, and Li 2024) combine several Transformer blocks with downsampling between every two blocks, resulting in a feature pyramid of moment representations, as illustrated in Figure 1 (left). Unfortunately, due to this downsampling operation, when moment representations are propagated from the lower levels of short-range (local) moments to the higher levels of long-range (global) moments, the information contained in representations of longer moments gradually degrades (Guo et al. 2020; Yang et al. 2023). This could explain why the performance of these methods tends to degrade as the duration of target moments increases, as shown qualitatively in Figure 1 (right) and statistically with the Intersection-over-Union (IoU) results in Figure 2.

To enrich information in video moment representations, recent works (Panta et al. 2024; Xiao et al. 2024; Ji et al. 2024; Liu et al. 2024) have employed contrastive learning for temporal grounding. The intuition is to capture mutual information between video moments and textual queries in order to preserve salient semantics in moment representations. These works mainly involve query-moment pairs in which queries relate to video moments of distinct videos, hence the learned semantics among moment representations are independent of each other. However, such an approach might not be suitable for the latest scalable video-centric methods (Zhang, Wu, and Li 2022; Mu, Mo, and Li 2024), in which multiple textual queries are related to one video: if the groundings of two textual queries overlap temporally, there might be a conflict in the compact moment representations (An et al. 2023). Furthermore, focusing upon moment-query relations limits these works to the feature space of the final encoder layer, which cannot effectively utilize all hidden representations across encoding layers. For multi-scale temporal grounding, such cross-scale representations should be fully used, since they express semantics of video moments of various lengths.
[Figure 1: the left panel depicts the feature pyramid (the video input passes through a feature extractor and stacked Local Self-Attention and Pooling/Downsampling blocks); the right panel shows two Ego4D-NLQ examples (EgoVLP features, video id 245f3b76-ef46-48d1-b37c-afe73efbf1cf). For the query "Where did I put the T-spanner?" with a 1s ground truth (336.01s to 337.01s), SnAG predicts 335.86s to 337.13s and our model predicts 335.69s to 336.82s. For the query "How long did I smoothen the wood?" with a 64s ground truth (88.28s to 152.28s), SnAG predicts 355.62s to 384.01s, while our model predicts 93.33s to 146.11s.]

Figure 1: (Left) Illustration of the feature pyramid used to encode video moments of different lengths; (Right) An example where the recent method SnAG (Mu, Mo, and Li 2024) accurately localizes a short video moment but fails on a long moment.

[Figure 2: four panels plotting IoU (y-axis, 0.0 to 1.0) against target moment length in seconds (x-axis, binned roughly from 0.8s to 119.5s on Ego4D-NLQ and from 0.2s to 79.6s on TACoS).]

Figure 2: First and Second: IoU results with respect to target video moment length on Ego4D-NLQ (Grauman et al. 2022) for the baseline SnAG (Mu, Mo, and Li 2024) and our model. Third and Fourth: IoU results with respect to target video moment length on the TACoS (Regneri et al. 2013) dataset for the baseline SnAG (Mu, Mo, and Li 2024) and our model.

To resolve the above issues, in this paper we propose a multi-scale contrastive learning framework for multi-scale temporal grounding. In our framework, instead of leveraging moment-query relationships, we utilize the association among video moments. In particular, to avoid representation conflict among video moments, we introduce a query-centric contrastive approach that draws temporally separate video moments corresponding to a common textual query. A central component of our framework is the creation of positive and negative video moment samples, for which previous works primarily apply data augmentation (Kim et al. 2022; Xing et al. 2023). However, because most long-form videos consist of a high volume of video moments, choosing an augmentation strategy that suits every moment is a non-trivial and lengthy tuning step. Another common approach is to introduce a memory bank that stores positive or negative samples' representations, created by aggregating input representations iteratively during training (Panta et al. 2024; Han et al. 2023). Nevertheless, a memory bank introduces additional hyperparameters, such as the bank size and update frequency, which demand laborious tuning effort (Wang et al. 2021b).

To avoid these problems, we directly draw samples from the feature space of the video moment encoder. Specifically, we take advantage of internal, intermediate representations of video moments that are readily available through the feed-forward pass of the encoder, without relying on external steps such as data augmentation or online storing of samples in memory banks. Accordingly, we introduce a within-scale and cross-scale approach to create positive and negative moment samples for contrastive learning. In the within-scale approach, we pull together representations of semantically close video moments on the same scale, i.e. of similar temporal range, and push apart representations of video moments that are unrelated to the textual query. In the cross-scale approach, we compel the model to relate global long-range video moments to local short-range moments, while simultaneously repelling semantically distant cross-scale representations in an analogous manner. This cross-scale approach enables long-range moment representations to capture nuanced details of short-range moments, thereby mitigating informational degradation within long-range representations.
To sum up, our contributions are the following:
• We propose a multi-scale contrastive framework that focuses on moment-moment relations to mitigate informational degradation in video moment representations.
• We propose a within- and cross-scale strategy that supports semantic consistency not only between similar-range but also between cross-range video moment representations emanating across layers of the video encoder.
• Our framework achieves superior results across major benchmark datasets concerning both short-form and long-form video grounding.

Related Work
Temporal Grounding. Temporal grounding research can be categorized into two groups: two-stage and single-stage. In the two-stage group, methods generate temporal segments as proposals, then score the segments' probabilities of being target moments and predict the refined boundary timestamps. Early approaches (Anne Hendricks et al. 2017; Gao et al. 2017) densely sample proposals leveraging sliding windows and score the proposals independently. Instead of independent sampling, Liu et al. (2021); Xiao et al. (2021a,b) subsequently condition proposal generation upon sentence queries and/or video context to avoid dense sampling. In contrast, Gao et al. (2021); Soldan et al. (2021); Wang et al. (2021a); Zhang et al. (2021) enumerate all segments and organize them into a 2D adjacency map for relation prediction. In the single-stage group, methods localize moments in a single shot without utilizing proposals, and are thus more efficient than the two-stage group. Several works decode moment boundaries from a pooled representation (Li et al. 2022) or learnable queries (Nguyen et al. 2023b).

Contrastive Learning. Notable improvements have been made by contrastive losses applied to the final encoded outputs (Nguyen and Luu 2021; Nguyen et al. 2025; Hu, Cui, and Wang 2021; Nguyen et al. 2022, 2024b; Wang et al. 2021b; Nguyen et al. 2023a, 2024a; Wu et al. 2023, 2024). Wang et al. (2021b); Hu, Cui, and Wang (2021) employ a memory bank to maintain an extended set of positive and negative samples. Instead of utilizing outputs at the single final layer, contrastive learning with local and global representations across different layers has been widely studied (Zhang et al. 2020c; Bachman, Hjelm, and Buchwalter 2019; Chaitanya et al. 2020). Zhang et al. (2020c) maximize the mutual information between representations of different local windows of a sentence and the representation of the global sentence.

[Figure 3: the video input passes through the video encoder (a multi-scale Transformer with local self-attention) and the textual query (e.g. "How many frying pans can I see on the shelf?") through the text encoder (Transformers); cross-modal fusion (cross-attention) and moment decoding produce moment predictions, while multi-scale contrastive learning (within-scale and cross-scale) attracts related and repels unrelated moment representations.]

Figure 3: Overall illustration of the proposed framework.

Methodology

In this section, we delineate our proposed contrastive framework for multi-scale temporal grounding, particularly focusing on a sampling procedure to draw video moment representations across temporal scales.

Preliminary - Video Temporal Grounding

We denote an input video V as a sequence of video clips {v_t}_{t=1}^T = {v_1, v_2, ..., v_T}, where v_t denotes a video moment (clip) centered at time t. We use a pre-trained feature extractor to embed each clip v_t into a moment embedding \mathbf{v}_t. Given the video V, our task is to localize a moment y = (s, e) based on a sentence query Q = {q_1, q_2, ..., q_K}. Similar to the input video, we also embed the query Q into a sequence of word embeddings {\mathbf{q}_1, \mathbf{q}_2, ..., \mathbf{q}_K}.

Video encoder. After embedding the video clips, we use a convolution-based projection function to encode the local context of video clips:

Z^0 = \{z^0_t\}_{t=1}^{T} = \mathrm{Conv}(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_T).   (1)

Subsequently, we designate L Transformer layers to encode temporal context among video clips. In detail, each Transformer layer consists of a local multi-head self-attention (LocalMSA) with a window size of W and an MLP block, in which we restrict the attention to be within a local window:

\bar{Z}^l = \alpha^l \cdot \mathrm{LocalMSA}(\mathrm{LN}(Z^{l-1})) + Z^{l-1},   (2)
\hat{Z}^l = \bar{\alpha}^l \cdot \mathrm{MLP}(\mathrm{LN}(\bar{Z}^l)) + \bar{Z}^l,   (3)
Z^l = \downarrow(\hat{Z}^l), \quad l \in \{1, 2, \ldots, L\},   (4)

where Z^{l-1}, \bar{Z}^l, \hat{Z}^l \in \mathbb{R}^{T^{l-1} \times D} and Z^l \in \mathbb{R}^{T^l \times D}. Here T^{l-1}/T^l is the downsampling ratio, \alpha^l and \bar{\alpha}^l are learnable per-channel scaling factors (Touvron et al. 2021), D is the hidden dimension, and LN is layer normalization.

Inspired by (Mu, Mo, and Li 2024), we implement the downsampling operator \downarrow as a strided depthwise 1D convolution. The downsampling operation engenders the multi-scale property of the encoder, generating representations for longer video moments.
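To make the encoder concrete, the following PyTorch-style sketch implements one pyramid level along the lines of Eqs. (2)-(4). It is a minimal illustration under stated assumptions rather than the authors' implementation: the windowed attention is realized with a plain attention mask over nn.MultiheadAttention, and the window size, stride, head count, and initialization of the per-channel scales are chosen for clarity only.

```python
import torch
import torch.nn as nn

class PyramidLevel(nn.Module):
    """One pyramid level: windowed self-attention and an MLP (pre-norm, residual,
    learnable per-channel scales), then strided depthwise-conv downsampling."""
    def __init__(self, dim, num_heads=4, window=9, stride=2, mlp_ratio=4):
        super().__init__()
        self.window = window                       # half-width of the attention window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # learnable per-channel scales playing the role of alpha / alpha-bar
        self.alpha = nn.Parameter(1e-4 * torch.ones(dim))
        self.alpha_bar = nn.Parameter(1e-4 * torch.ones(dim))
        # strided depthwise 1D convolution as the downsampling operator
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=stride,
                              padding=1, groups=dim)

    def forward(self, z):                          # z: (B, T^{l-1}, D)
        t = z.size(1)
        idx = torch.arange(t, device=z.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window  # True = blocked
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        z = z + self.alpha * attn_out                              # cf. Eq. (2)
        z = z + self.alpha_bar * self.mlp(self.norm2(z))           # cf. Eq. (3)
        return self.down(z.transpose(1, 2)).transpose(1, 2)        # cf. Eq. (4)

# Stacking L such levels and keeping each output yields the pyramid {Z^1, ..., Z^L}.
```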
Text encoder. We use Transformer layers, where each layer includes a vanilla self-attention followed by an MLP. The textual encoder thus produces textual representations E = {e_1, e_2, ..., e_K} for the query embeddings {\mathbf{q}_1, \mathbf{q}_2, ..., \mathbf{q}_K}.

Cross-modal fusion. Our architecture uses cross-attention to fuse video clip and query word representations. Technically, we modulate the video clip representations {Z^l}_{l=1}^L with the word representations E as follows:
\tilde{Z}^l = \mathrm{LN}(Z^l), \quad \tilde{E} = \mathrm{LN}(E),   (5)
O^l = \sigma\!\left( \frac{\tilde{Z}^l \cdot \tilde{E}^\top}{\sqrt{D}} \right) \cdot \tilde{E} + \tilde{Z}^l,   (6)
X^l = \beta^l \cdot \mathrm{MLP}(\mathrm{LN}(O^l)) + O^l,   (7)

where \beta^l denotes a learnable per-channel scale and \sigma the Softmax activation function.
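For illustration, a minimal PyTorch-style sketch of Eqs. (5)-(7) is given below. It assumes single-head cross-attention in which each clip attends to the query words and the attended text is added back to the clip feature; the projections, head count, and exact residual placement in the actual model may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Clips attend to query words; an MLP with a learnable per-channel scale
    then refines the fused features (cf. Eqs. 5-7)."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_o = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.beta = nn.Parameter(1e-4 * torch.ones(dim))   # per-channel scale

    def forward(self, z, e):              # z: (B, T^l, D) clips, e: (B, K, D) words
        z_n, e_n = self.norm_v(z), self.norm_t(e)                        # Eq. (5)
        attn = F.softmax(z_n @ e_n.transpose(1, 2) / z.size(-1) ** 0.5, dim=-1)
        o = attn @ e_n + z_n                                             # Eq. (6)
        return o + self.beta * self.mlp(self.norm_o(o))                  # Eq. (7)
```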
Moment decoding. After cross-modal fusion, our model converts each time step t into a moment candidate. Specifically, given x^l_t, we use a convolutional network comprising 1D convolutional layers as the classification head to predict a score p^l_t. In a similar vein, we use another 1D convolutional network, attached with a ReLU activation function, to regress the normalized distances (d^s_t, d^e_t) from t to the moment boundaries if x^l_t is classified as positive. Formally, the decoded moment is computed as:

(t, l) = \arg\max_{t, l} \; p^l_t,   (8)
\hat{s} = 2^{l-1} (t - d^s_t), \quad \hat{e} = 2^{l-1} (t + d^e_t).   (9)

During testing, we employ Soft-NMS (Bodla et al. 2017) to merge overlapping moment predictions.
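A hedged sketch of the decoding step follows: two small 1D-convolutional heads produce the classification score and the boundary distances, and Eq. (9) maps the best time step of a level back to the input frame rate using the level's stride 2^(l-1). The head depths and widths are assumptions, the arg max of Eq. (8) is shown per level only, and Soft-NMS is omitted.

```python
import torch
import torch.nn as nn

class MomentHeads(nn.Module):
    """A 1D-conv classification head for the score p_t^l and a 1D-conv regression
    head (ReLU-terminated) for the normalized boundary distances (d_t^s, d_t^e)."""
    def __init__(self, dim):
        super().__init__()
        self.cls_head = nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
                                      nn.Conv1d(dim, 1, 1))
        self.reg_head = nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
                                      nn.Conv1d(dim, 2, 1), nn.ReLU())

    def forward(self, x):                         # x: (B, T^l, D) fused features
        f = x.transpose(1, 2)
        scores = self.cls_head(f).squeeze(1)      # (B, T^l)    -> p_t^l
        dists = self.reg_head(f).transpose(1, 2)  # (B, T^l, 2) -> (d^s, d^e)
        return scores, dists

def decode_level(scores, dists, level):
    """Pick the best time step on one level and map it back to the input frame
    rate with the level's stride 2^(level-1), cf. Eqs. (8)-(9)."""
    t = scores.argmax(dim=-1)                              # best step per sample
    d = dists[torch.arange(dists.size(0)), t]              # (B, 2)
    stride = 2 ** (level - 1)
    return stride * (t - d[:, 0]), stride * (t + d[:, 1])  # (s_hat, e_hat)
```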
Cross-scale Contrastive Learning

Query-centric sampling. Randomly sampling moment-query pairs for contrastive learning might lead the model to representation conflict if the groundings of two queries overlap with each other. We therefore introduce a sampling approach that draws a text query Q and its temporally separate video moments associated with a common video V:

\left( Q_{j'}, \{y^l_{j'}\}_{l=1}^{L} \right) \sim U\!\left( \left\{ \left( Q_j, \{y^l_j\}_{l=1}^{L} \right) \right\}_{j=1}^{N_Q} \right),   (10)

where U denotes a discrete uniform distribution, \{y^l_{j'}\}_{l=1}^{L} the set of target video moments in each layer l, and N_Q the number of textual queries related to video V. We generate the target set P(l) via center sampling (Zhang, Wu, and Li 2022; Mu, Mo, and Li 2024), i.e. given any moment centered at t, any time step c \in [t - \alpha \frac{T}{T^l}, \; t + \alpha \frac{T}{T^l}] in layer l is considered a target. After sampling the query and target moments, we directly utilize the representations \{z^l_{j'}\}_{l=1}^{L} of the target moments \{y^l_{j'}\}_{l=1}^{L} extracted by the aforementioned multi-scale video encoder.
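The sampling step can be sketched as follows. This is an illustrative interpretation rather than the released code: it assumes moment centers are given in input time steps, that level l has roughly T/2^l steps, and that the center-sampling radius is α·T/T^l as described above; the data layout (a list of query records per video, each carrying its moment centers) is hypothetical.

```python
import random
import torch

def center_sampling(centers, t_level, t_input, alpha=1.5):
    """Index set P(l) for one level: time steps whose centers lie within an
    alpha-scaled radius (alpha * T / T^l) of any target moment center."""
    stride = t_input / t_level                  # input steps covered by one level-l step
    steps = torch.arange(t_level, dtype=torch.float32) * stride
    positives = set()
    for c in centers:                           # moment centers, in input time steps
        hits = ((steps - c).abs() <= alpha * stride).nonzero(as_tuple=True)[0]
        positives.update(hits.tolist())
    return sorted(positives)

def sample_query_and_targets(video_queries, num_levels, t_input):
    """Query-centric sampling (cf. Eq. 10): draw one query of the video uniformly
    and collect its positive time-step sets P(l) for levels 0..L."""
    q = random.choice(video_queries)            # q["centers"]: centers of its moments
    return q, [center_sampling(q["centers"], max(1, t_input // 2 ** l), t_input)
               for l in range(num_levels + 1)]
```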
Within-scale contrastive learning. Having obtained the representations of the target moment samples, we directly utilize moments within each scale as positive and negative samples. Particularly, we iterate over every layer l of the video encoder, and for each anchor video moment y^l_{j'}, we consider all video moments of layer l corresponding to the query Q_{j'} as the positive moment set P(l), and randomly draw moments not corresponding to Q_{j'} as the negative set N(l). We then formulate a multi-scale contrastive objective over all layers l \in \{1, 2, \ldots, L\}, which pushes positive moments closer while pushing negative moments further apart:

\mathcal{L}_{\mathrm{within}} = - \sum_{l=1}^{L} \sum_{i \in P(l)} \sum_{j \in P(l), j \neq i} \log \frac{e^{z^l_i \cdot z^l_j}}{e^{z^l_i \cdot z^l_j} + \sum_{n \in N(l)} e^{z^l_i \cdot z^l_n}}.   (11)

Cross-scale contrastive learning. We further associate semantically close moment representations from across different scales. Specifically, we push short-range moment representations closer to semantically close long-range moment representations. This enables short-range moments to relate to longer video context, while long-range features capture nuanced details of short-range moments.

As the video moment features of layer 0, \{z^0_{j'}\}, are the most likely to preserve salient video information compared to other levels, we employ the features of the target moments from the lowest level as the anchor set for cross-scale contrastive learning. To construct the positive and negative moment sets, we utilize features of the higher levels l \in \{1, 2, \ldots, L\} of the feature pyramid corresponding to video moments that do and do not involve the textual query, respectively. Denoting the set of moment indices in level l that are related to the query as P(l) and the set of those that are unrelated as N(l), we define the cross-scale contrastive learning objective as:

\mathcal{L}_{\mathrm{cross}} = - \sum_{i \in P(0)} \sum_{l=1}^{L} \sum_{j \in P(l)} \log \frac{e^{z^0_i \cdot z^l_j}}{e^{z^0_i \cdot z^l_j} + \sum_{n \in N(l)} e^{z^0_i \cdot z^l_n}}.   (12)
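A compact sketch of Eqs. (11)-(12) is given below, using dot-product similarities over the sampled index sets. It is a simplified single-video illustration under stated assumptions (no temperature, non-empty positive and negative sets, and per-level feature matrices passed as a list); the actual training code may batch, normalize, and sample differently.

```python
import torch

def nce_term(anchor, positives, negatives):
    """-sum_j log( exp(a.p_j) / (exp(a.p_j) + sum_n exp(a.n)) ) for one anchor."""
    pos = positives @ anchor                            # (|P|,) similarities
    neg = torch.logsumexp(negatives @ anchor, dim=0)    # log sum_n exp(a.n)
    return (torch.logaddexp(pos, neg) - pos).sum()

def within_scale_loss(feats, pos_idx, neg_idx):
    """Eq. (11): feats[l] is (T^l, D); pos_idx[l] / neg_idx[l] hold P(l) / N(l)."""
    loss = feats[0].new_zeros(())
    for l in range(1, len(feats)):                      # levels 1..L
        P, N = feats[l][pos_idx[l]], feats[l][neg_idx[l]]
        for i in range(P.size(0)):
            others = torch.cat([P[:i], P[i + 1:]])      # positives other than the anchor
            if others.numel() and N.numel():
                loss = loss + nce_term(P[i], others, N)
    return loss

def cross_scale_loss(feats, pos_idx, neg_idx):
    """Eq. (12): level-0 target features act as anchors; positives and negatives
    are drawn from the higher pyramid levels."""
    loss = feats[0].new_zeros(())
    for anchor in feats[0][pos_idx[0]]:
        for l in range(1, len(feats)):
            P, N = feats[l][pos_idx[l]], feats[l][neg_idx[l]]
            if P.numel() and N.numel():
                loss = loss + nce_term(anchor, P, N)
    return loss
```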
Training Objective

For temporal grounding training, we adopt a focal loss L_cls for target moment classification and a Distance-IoU loss L_reg for distance regression from a positive time step t to the target moment. We then combine these losses with our within- and cross-scale contrastive losses:

\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \rho_{\mathrm{reg}} \cdot \mathcal{L}_{\mathrm{reg}} + \rho_{\mathrm{within}} \cdot \mathcal{L}_{\mathrm{within}} + \rho_{\mathrm{cross}} \cdot \mathcal{L}_{\mathrm{cross}},   (13)

where \rho_{\mathrm{reg}}, \rho_{\mathrm{within}}, and \rho_{\mathrm{cross}} denote hyperparameters that balance the regression, within-scale, and cross-scale contrastive losses, respectively.

Experiments

To validate the effectiveness of our framework, we conduct extensive experiments against recent methods for temporal grounding. We also perform an ablation study to investigate each component.

Datasets

Following previous works, we experiment on five challenging temporal grounding datasets, which belong to two main categories, i.e. 1) long videos, many queries (Ego4D-NLQ (Grauman et al. 2022), MAD (Soldan et al. 2022), and TACoS (Regneri et al. 2013)) and 2) short videos, few queries (ActivityNet-Captions (Krishna et al. 2017) and Charades-STA (Sigurdsson et al. 2016)).

Ego4D-NLQ (Grauman et al. 2022) consists of egocentric videos recording daily human activities. Each video is 3.5 to 20 minutes long and is associated with 11.6 queries on average.

MAD (Soldan et al. 2022) comprises 1.2K hours of movies with 384K queries transcribed from audio descriptions. Since each video is a movie, each is 47 to 202 minutes long.

TACoS (Regneri et al. 2013) focuses on cooking topics. The total video length is 10.1 hours and each video is tasked with 143.5 queries for the temporal grounding operation.

ActivityNet-Captions (Krishna et al. 2017) targets dense video captioning and was subsequently adapted to temporal grounding. Its videos are two minutes long on average, with approximately 3.65 queries per video.

Charades-STA (Sigurdsson et al. 2016) is an action recognition dataset transformed into a temporal grounding one. Each video lasts approximately 30 seconds and possesses 2.4 queries.

Evaluation Metrics

We report Recall@K at different temporal intersection-over-union thresholds θ (R@K, tIoU = θ) for all datasets. The metric measures the percentage of textual queries for which at least one of the top-K moment predictions temporally overlaps with the groundtruth moment by more than θ.
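For reference, the metric can be computed with a few lines of Python. This is a generic sketch of R@K at a single tIoU threshold, not the official evaluation script of any benchmark, and it assumes the predicted moments are already sorted by confidence.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, groundtruths, k=5, thresh=0.5):
    """R@K, tIoU = thresh: fraction of queries whose top-K predictions contain at
    least one moment overlapping the ground truth by more than thresh."""
    hits = sum(
        any(tiou(p, gt) > thresh for p in preds[:k])
        for preds, gt in zip(predictions, groundtruths)  # preds sorted by score
    )
    return hits / len(groundtruths)
```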
with video-query and query-video contrastive learning; (vi)
Implementation Details SSRN (Zhu et al. 2023) enriching anchor frames with addi-
tional consecutive frames; (vii) G2L (Li et al. 2023) mea-
To fairly compare with previous works and satisfy the scal- suring moment-query similarities using geodesic distance
ability of temporal grounding operation for long videos, we and quantifies cross-modal interactions with game-theoretic
adopt video-centric sampling approach (Mu, Mo, and Li interactions; (viii) SOONet (Pan et al. 2023), an anchor-
2024). For Ego4D-NLQ, we use pre-trained 1) SlowFast based framework that conducts grounding by pre-ranking,
video features (Feichtenhofer et al. 2019) with BERT tex- re-ranking, and regression; (ix) MESM (Liu et al. 2024),
tual features (Devlin et al. 2018), and 2) EgoVLP video a fine-grained moment-query contrastive approach mod-
and textual features (Lin et al. 2022). For testing, we re- eled for query word and video moment representations; (x)
port R@{1, 5}, tIoU = {0.3, 0.5}. For MAD dataset, we Contrastive-MSAT (Panta et al. 2024), applying moment-
use CLIP features (Radford et al. 2021) for both videos and query contrastive loss supported by a momentum-based
texts, and report R@{1, 5, 10, 50}, tIoU = {0.1, 0.3, 0.5}. memory bank; (xi) UVCOM (Xiao et al. 2024), a moment-
For the TACoS dataset, we use C3D video features (Tran query contrastive approach for a unified video comprehen-
et al. 2015) and GloVe textual features (Pennington, Socher, sion framework; (xii) SnAG (Mu, Mo, and Li 2024) achiev-
and Manning 2014). We report results in terms of R@{1, 5}, ing scalable grounding with cross-modal late fusion.
tIoU = {0.5, 0.7}. In addition, we utilize I3D features (Car-
reira and Zisserman 2017) pre-trained on Kinetics (Kay et al.
2017) for Charades-STA and C3D features (Tran et al. 2015)
Experimental Results
for ActivityNet-Captions experiments. For both datasets, Main Results
similar to TACoS, we take advantage of GloVe textual fea- Results on Ego4D-NLQ (Table 1). Our framework signifi-
tures (Pennington, Socher, and Manning 2014). We report cantly outperforms recent temporal grounding methods. For
R@{1, 5}, tIoU = {0.5, 0.7} for testing on Charades-STA, example, using SlowFast+BERT features, we outperform
and R@{1, 5}, tIoU = {0.3, 0.5} for testing on ActivityNet- previous best method, i.e. SnAG, by mean improvements of
Captions. For more details regarding model architecture, we 1.16% and 1.46% in terms of R@1 and R@5 metrics, re-
direct interested readers to the appendix. For both within- spectively. In addition, we accomplish more significant per-
scale and cross-scale contrastive learning implementation, formance gains on the more stringent tIoU threshold of 0.5,
we keep the size of the negative sample set N (l) in every denoting more precise moment localization.
level l to be equal to the size of the positive video clips P(l) Results on MAD (Table 2). Similar to results on Ego4D-
that correspond to the target video moments. Based upon NLQ, our framework obtains an outstanding improvement
Model          R@1 (0.1/0.3/0.5)      R@5 (0.1/0.3/0.5)       R@10 (0.1/0.3/0.5)      R@50 (0.1/0.3/0.5)
VLG-Net        3.64 / 2.76 / 1.65     11.66 / 9.31 / 5.99     17.39 / 14.56 / 9.77    39.78 / 34.27 / 24.93
Moment-DETR    0.31 / 0.24 / 0.16     1.52 / 1.14 / 0.28      2.79 / 2.06 / 1.20      11.08 / 7.97 / 4.71
CONE           8.90 / 6.87 / 4.10     20.51 / 16.11 / 9.59    27.20 / 21.53 / 12.82   43.36 / 34.73 / 20.56
SOONet         11.26 / 9.00 / 5.32    23.21 / 19.64 / 13.14   30.36 / 26.00 / 17.84   50.32 / 44.78 / 32.59
SnAG           10.28 / 8.46 / 5.55    24.42 / 20.60 / 13.75   32.23 / 27.50 / 19.00   52.28 / 46.68 / 35.24
Our model      12.76 / 10.94 / 6.92   26.43 / 22.60 / 15.43   34.08 / 29.41 / 20.70   54.84 / 48.26 / 37.77

Table 2: Results on MAD.

                    TACoS                            ActivityNet-Captions             Charades-STA
Model               R@1 (0.3/0.5)   R@5 (0.3/0.5)    R@1 (0.5/0.7)   R@5 (0.5/0.7)    R@1 (0.5/0.7)   R@5 (0.5/0.7)
VLG-NET             45.46 / 34.19   70.38 / 56.56    46.32 / 29.82   77.15 / 63.33    -               -
MGSL-Net            42.54 / 32.27   63.39 / 50.13    51.87 / 31.42   82.60 / 66.71    63.98 / 41.03   93.21 / 63.85
MMN                 39.24 / 26.17   62.03 / 47.39    48.59 / 29.26   79.50 / 64.76    -               -
SSRN                45.10 / 34.33   65.26 / 51.85    54.49 / 33.15   84.72 / 68.48    65.59 / 42.65   94.76 / 65.48
G2L                 42.74 / 30.95   65.83 / 49.86    51.68 / 33.35   81.32 / 67.60    -               -
MESM                52.69 / 39.52   -                -               -                61.24 / 38.04   -
Contrastive-MSAT    49.77 / 37.99   68.31 / 58.31    47.73 / 31.21   78.06 / 63.63    -               -
UVCOM               36.39 / 23.32   -                -               -                59.25 / 36.64   -
SnAG                56.44 / 44.86   81.15 / 70.66    48.55 / 30.56   81.71 / 63.41    64.62 / 46.26   92.55 / 71.94
Ours                58.17 / 47.04   84.84 / 73.55    54.83 / 33.56   84.78 / 68.91    66.64 / 47.03   93.66 / 72.53

Table 3: Results on TACoS, ActivityNet-Captions, and Charades-STA.

Results on MAD (Table 2). Similar to the results on Ego4D-NLQ, our framework obtains an outstanding improvement over previous temporal grounding methods. Specifically, we outperform SOONet by 1.68 and 2.82 points of R@1 and R@5 on average. Moreover, our model outperforms CONE and SnAG in terms of mean R@1 / R@5 by 3.58 / 6.08 and 2.11 / 1.90 points, respectively, especially at the more stringent tIoU thresholds.

Results on TACoS (Table 3 (left)). Our model achieves R@1 / R@5 of 47.04% / 73.55% at tIoU = 0.5, outperforming the strongest baseline, i.e. SnAG, by a substantial margin of +2.18% R@1 and +2.89% R@5. Combined with the results on Ego4D-NLQ and MAD, these results demonstrate that our contrastive framework provides beneficial signals to counter informational degradation in the feature pyramid for long-form video grounding.

Results on ActivityNet-Captions (Table 3 (middle)). We achieve R@1 / R@5 scores of 33.56% / 68.91% at tIoU = 0.7. These results indicate that we outperform SSRN by 0.41% and 0.43% with regard to R@1 and R@5, respectively, even though we use the SnAG backbone, which is significantly weaker than SSRN.

Results on Charades-STA (Table 3 (right)). Our model outperforms previous methods by a wide margin. Particularly, we accomplish 47.03% R@1 and 72.53% R@5 at tIoU = 0.7, exceeding SSRN by 4.38% R@1 and 7.04% R@5. These outcomes on Charades-STA and ActivityNet-Captions show that the mutual information signals among video moments contributed by our contrastive framework can polish video moment representations and thereby help temporal grounding on short-form videos.

Ablation Study

We conduct extensive experiments on TACoS to study the influence of our design choices.

Positive-negative sampling approach   R@1 (0.3/0.5)   R@5 (0.3/0.5)
Data augmentation                     57.00 / 45.46   83.13 / 72.06
Memory bank                           57.69 / 46.62   84.13 / 72.94
Ours                                  58.17 / 47.04   84.84 / 73.55

Table 4: Ablation results on TACoS with various positive and negative sampling approaches.

Contrastive component   R@1 (0.3/0.5)   R@5 (0.3/0.5)
w/o within-scale        57.40 / 46.00   83.46 / 72.39
w/o cross-scale         57.00 / 45.85   82.34 / 71.58
Ours                    58.17 / 47.04   84.84 / 73.55

Table 5: Ablation results on TACoS with multi-scale contrastive components.

Effect of contrastive components. We explore to what extent each component of our contrastive framework, i.e. the within- or cross-scale objective, contributes to the overall performance improvement. As shown in Table 5, the cross-scale objective plays a more fundamental role in polishing video moment representations than the within-scale counterpart. Since the cross-scale contrastive objective concentrates more upon long-range moment representations by relating them with the short-range ones, these results validate our hypothesis that informational degradation is a fundamental problem to resolve in multi-scale temporal grounding.

Effect of moment-moment association. In addition to our proposed moment-moment association, we experiment with various alternatives, i.e. moment-query association, query-query association, and an approach that associates
video moments based on the semantic closeness of their corresponding textual queries. For the last approach, we consider two textual queries to be semantically similar if their CLIP-based cosine similarity score is greater than or equal to 0.8 (for positive sampling) and semantically distant if the similarity score is smaller than or equal to 0.2 (for negative sampling). As can be observed in Table 6, query-query association performs the worst, as this approach does not polish moment representations. The CLIP-based moment-moment approach outperforms moment-query contrastive learning, but underperforms our method. We hypothesize that there might exist representation conflict between two video moments that temporally overlap with each other.

Association approach        R@1 (0.3/0.5)   R@5 (0.3/0.5)
Query-query                 55.61 / 45.06   81.25 / 71.75
Moment-query                57.00 / 46.24   82.44 / 72.37
CLIP-based moment-moment    57.13 / 46.94   83.28 / 72.96
Ours                        58.17 / 47.04   84.84 / 73.55

Table 6: Ablation results on TACoS with various association approaches.

Effect of direct utilization of moment representations. We study the impact of our direct utilization of moment representations for positive and negative sample generation, comparing against Tube TokenMix (Xing et al. 2023) as the data augmentation approach and the momentum-based memory bank approach (Panta et al. 2024). Table 4 shows that we significantly surpass the other methods: on average by 1.38 / 1.60 points of R@1 / R@5 over the augmentation approach, and by 0.45 / 0.66 points of R@1 / R@5 over the memory bank approach. We hypothesize that while a memory bank may maintain a high number of samples for contrastive learning, expensive hyperparameter tuning is essential to achieve effective performance.

Qualitative Analysis

In Figure 2, we observe that our model does not encounter degraded performance when the lengths of the target moments increase. Moreover, we visualize the moment predictions of the recent method SnAG (Mu, Mo, and Li 2024) and of our model in Figure 1. Even though SnAG precisely detects the shorter moment, it misses the longer moment due to the degraded information issue. In contrast, our framework is able to localize both the short and the long moments. We hypothesize that our contrastive framework can hold salient semantics in video moment representations to resolve the degraded signals in the grounding model, thus enhancing the grounding operation for long video moments.

Conclusion

In this paper, we propose a multi-scale contrastive framework for multi-scale temporal grounding. Essentially, our framework utilizes a query-centric approach to associate temporally separate video moments which correspond to a common textual query, in order to avoid representation conflict. Accordingly, we define a within-scale contrastive objective to model relations among similar-range video moments, and a cross-scale objective to model relations among cross-range moments. Comprehensive experiments validate the effectiveness of our framework for both short-term and long-term temporal grounding.

Acknowledgements

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG3-PhD-2023-08-051T). Thong Nguyen is supported by a Google Ph.D. Fellowship in Natural Language Processing.

References

An, X.; Deng, J.; Yang, K.; Li, J.; Feng, Z.; Guo, J.; Yang, J.; and Liu, T. 2023. Unicom: Universal and compact representation learning for image retrieval. arXiv preprint arXiv:2304.05884.

Anne Hendricks, L.; Wang, O.; Shechtman, E.; Sivic, J.; Darrell, T.; and Russell, B. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, 5803–5812.

Bachman, P.; Hjelm, R. D.; and Buchwalter, W. 2019. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32.

Bodla, N.; Singh, B.; Chellappa, R.; and Davis, L. S. 2017. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, 5561–5569.

Burgner-Kahrs, J.; Rucker, D. C.; and Choset, H. 2015. Continuum robots for medical applications: A survey. IEEE Transactions on Robotics, 31(6): 1261–1280.

Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.

Chaitanya, K.; Erdil, E.; Karani, N.; and Konukoglu, E. 2020. Contrastive learning of global and local features for medical image segmentation with limited annotations. Advances in neural information processing systems, 33: 12546–12558.

Claussmann, L.; Revilloud, M.; Gruyer, D.; and Glaser, S. 2019. A review of motion planning for highway autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(5): 1826–1848.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Feichtenhofer, C.; Fan, H.; Malik, J.; and He, K. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, 6202–6211.
Gao, J.; Sun, C.; Yang, Z.; and Nevatia, R. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, 5267–5275.

Gao, J.; Sun, X.; Xu, M.; Zhou, X.; and Ghanem, B. 2021. Relation-aware video reading comprehension for temporal language grounding. arXiv preprint arXiv:2110.05717.

Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995–19012.

Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; and Pan, C. 2020. Augfpn: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12595–12604.

Han, D.; Cheng, X.; Guo, N.; Ye, X.; Rainer, B.; and Priller, P. 2023. Momentum cross-modal contrastive learning for video moment retrieval. IEEE Transactions on Circuits and Systems for Video Technology.

Hou, Z.; Zhong, W.; Ji, L.; Gao, D.; Yan, K.; Chan, W.-K.; Ngo, C.-W.; Shou, Z.; and Duan, N. 2022. Cone: An efficient coarse-to-fine alignment framework for long video temporal grounding. arXiv preprint arXiv:2209.10918.

Hu, H.; Cui, J.; and Wang, L. 2021. Region-aware contrastive learning for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 16291–16301.

Ji, W.; Shi, R.; Wei, Y.; Zhao, S.; and Zimmermann, R. 2024. Weakly Supervised Video Moment Retrieval via Location-irrelevant Proposal Learning. In Companion Proceedings of the ACM on Web Conference 2024, 1595–1603.

Jung, M.; Jang, Y.; Choi, S.; Kim, J.; Kim, J.-H.; and Zhang, B.-T. 2023. Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval. arXiv preprint arXiv:2306.02728.

Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

Kim, T.; Kim, J.; Shim, M.; Yun, S.; Kang, M.; Wee, D.; and Lee, S. 2022. Exploring temporally dynamic data augmentation for video recognition. arXiv preprint arXiv:2206.15015.

Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; and Carlos Niebles, J. 2017. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, 706–715.

Lei, J.; Berg, T. L.; and Bansal, M. 2021. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34: 11846–11858.

Li, H.; Cao, M.; Cheng, X.; Li, Y.; Zhu, Z.; and Zou, Y. 2023. G2l: Semantically aligned and uniform video grounding via geodesic and game theory. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12032–12042.

Li, J.; Xie, J.; Qian, L.; Zhu, L.; Tang, S.; Wu, F.; Yang, Y.; Zhuang, Y.; and Wang, X. E. 2022. Compositional temporal grounding with structured variational cross-graph correspondence learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3032–3041.

Lin, K. Q.; Wang, J.; Soldan, M.; Wray, M.; Yan, R.; Xu, E. Z.; Gao, D.; Tu, R.-C.; Zhao, W.; Kong, W.; et al. 2022. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35: 7575–7586.

Liu, D.; Qu, X.; Dong, J.; and Zhou, P. 2021. Adaptive proposal generation network for temporal sentence localization in videos. arXiv preprint arXiv:2109.06398.

Liu, Z.; Li, J.; Xie, H.; Li, P.; Ge, J.; Liu, S.-A.; and Jin, G. 2024. Towards balanced alignment: Modal-enhanced semantic modeling for video moment retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 3855–3863.

Mu, F.; Mo, S.; and Li, Y. 2024. SnAG: Scalable and Accurate Video Grounding. arXiv preprint arXiv:2404.02257.

Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023a. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227.

Nguyen, C.-D.; Nguyen, T.; Wu, X.; and Luu, A. T. 2024a. Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486.

Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2025. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. In European Conference on Computer Vision, 77–98. Springer.

Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34: 11974–11986.

Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023b. Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549.

Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024b. Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577.

Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524.

Pan, Y.; He, X.; Gong, B.; Lv, Y.; Shen, Y.; Peng, Y.; and Zhao, D. 2023. Scanning only once: An end-to-end framework for fast temporal grounding in long videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13767–13777.

Panta, L.; Shrestha, P.; Sapkota, B.; Bhattarai, A.; Manandhar, S.; and Sah, A. K. 2024. Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 607–614.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543.

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.

Regneri, M.; Rohrbach, M.; Wetzel, D.; Thater, S.; Schiele, B.; and Pinkal, M. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1: 25–36.

Sigurdsson, G. A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; and Gupta, A. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, 510–526. Springer.

Soldan, M.; Pardo, A.; Alcázar, J. L.; Caba, F.; Zhao, C.; Giancola, S.; and Ghanem, B. 2022. Mad: A scalable dataset for language grounding in videos from movie audio descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5026–5035.

Soldan, M.; Xu, M.; Qu, S.; Tegner, J.; and Ghanem, B. 2021. Vlg-net: Video-language graph matching network for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3224–3234.

Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; and Jégou, H. 2021. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 32–42.

Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, 4489–4497.

Wang, H.; Zha, Z.-J.; Li, L.; Liu, D.; and Luo, J. 2021a. Structured multi-level interaction network for video moment localization via language query. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7026–7035.

Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; and Van Gool, L. 2021b. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, 7303–7313.

Wang, Z.; Wang, L.; Wu, T.; Li, T.; and Wu, G. 2022. Negative sample matters: A renaissance of metric learning for temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2613–2623.

Wu, X.; Dong, X.; Nguyen, T.; Liu, C.; Pan, L.-M.; and Luu, A. T. 2023. Infoctm: A mutual information maximization perspective of cross-lingual topic modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13763–13771.

Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 3088–3105. Bangkok, Thailand and virtual meeting: Association for Computational Linguistics.

Xiao, S.; Chen, L.; Shao, J.; Zhuang, Y.; and Xiao, J. 2021a. Natural language video localization with learnable moment proposals. arXiv preprint arXiv:2109.10678.

Xiao, S.; Chen, L.; Zhang, S.; Ji, W.; Shao, J.; Ye, L.; and Xiao, J. 2021b. Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2986–2994.

Xiao, Y.; Luo, Z.; Liu, Y.; Ma, Y.; Bian, H.; Ji, Y.; Yang, Y.; and Li, X. 2024. Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18709–18719.

Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; and Jiang, Y.-G. 2023. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 18816–18826.

Xu, M.; Soldan, M.; Gao, J.; Liu, S.; Pérez-Rúa, J.-M.; and Ghanem, B. 2023. Boundary-denoising for video activity localization. arXiv preprint arXiv:2304.02934.

Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; and Liang, R. 2023. AFPN: asymptotic feature pyramid network for object detection. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2184–2189. IEEE.

Zhang, C.-L.; Wu, J.; and Li, Y. 2022. Actionformer: Localizing moments of actions with transformers. In European Conference on Computer Vision, 492–510. Springer.

Zhang, H.; Sun, A.; Jing, W.; and Zhou, J. T. 2020a. Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931.

Zhang, M.; Yang, Y.; Chen, X.; Ji, Y.; Xu, X.; Li, J.; and Shen, H. T. 2021. Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12669–12678.

Zhang, S.; Peng, H.; Fu, J.; and Luo, J. 2020b. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12870–12877.

Zhang, S.; Zhu, Y.; and Roy-Chowdhury, A. K. 2016. Context-aware surveillance video summarization. IEEE Transactions on Image Processing, 25(11): 5469–5478.

Zhang, Y.; He, R.; Liu, Z.; Lim, K. H.; and Bing, L. 2020c. An unsupervised sentence embedding method by mutual information maximization. arXiv preprint arXiv:2009.12061.

Zhu, J.; Liu, D.; Zhou, P.; Di, X.; Cheng, Y.; Yang, S.; Xu, W.; Xu, Z.; Wan, Y.; Sun, L.; et al. 2023. Rethinking the video sampling and reasoning strategies for temporal sentence grounding. arXiv preprint arXiv:2301.00514.
