Multi Scale Contrastive Learning For Video Temporal Grounding
[Figure 1 graphic: a video is passed through a feature extractor and local self-attention, then repeatedly pooled (downsampled) over time to form a feature pyramid; example query: "How long did I smoothen the wood?", groundtruth moment (64s): 88.28s to 152.28s.]
Figure 1: (Left) Illustration of the feature pyramid used to encode video moments of different lengths; (Right) An example where the recent method SnAG (Mu, Mo, and Li 2024) accurately localizes a short video moment but fails on a long moment.
[Figure 2 graphic: four panels plotting IoU (0.0 to 1.0) against target moment length in seconds.]
Figure 2: First and Second: IoU results with respect to target video moment length on Ego4D-NLQ (Grauman et al. 2022) for the baseline SnAG (Mu, Mo, and Li 2024) and our model. Third and Fourth: IoU results with respect to target video moment length on TACoS (Regneri et al. 2013) for the baseline SnAG (Mu, Mo, and Li 2024) and our model.
layers. For multi-scale temporal grounding, such cross-scale representations should be fully used since they express semantics in video moments of various lengths.

To resolve the above issues, in this paper, we propose a multi-scale contrastive learning framework for multi-scale temporal grounding. In our framework, instead of leveraging moment-query relationships, we utilize the association among video moments. Particularly, to avoid representation conflict among video moments, we introduce a query-centric contrastive approach that draws temporally separate video moments corresponding to a common textual query. A central component of our framework is the creation of positive and negative video moment samples, for which previous works primarily apply data augmentation (Kim et al. 2022; Xing et al. 2023). However, because most long-form videos consist of a high volume of video moments, choosing an appropriate augmentation strategy that suits every moment is a non-trivial and lengthy tuning step. Another common approach is to introduce a memory bank that stores positive or negative samples' representations, which are created by aggregating input representations iteratively during training (Panta et al. 2024; Han et al. 2023). Nevertheless, a memory bank introduces additional hyperparameters such as the bank size and update frequency, which demand laborious tuning effort (Wang et al. 2021b).

To prevent these problems, we directly draw samples from the feature space of the video moment encoder. Specifically, we take advantage of internal, intermediate representations of video moments from the encoder that are readily available through the feed-forward step of the network, without the need to rely upon external steps such as data augmentation or online storing of samples in memory banks. Accordingly, we introduce a within-scale and a cross-scale approach to create positive and negative moment samples for contrastive learning. In the within-scale approach, we pull together representations of semantically close video moments on the same scale, i.e. of similar temporal range, and push apart representations of video moments that are unrelated to the textual query. In the cross-scale approach, we compel the model to relate global long-range video moments to local short-range moments, while simultaneously repelling semantically distant cross-scale representations in an analogous manner. This cross-scale approach enables long-range moment representations to capture nuanced details of short-range moments, thereby mitigating informational degradation within long-range representations.
To sum up, our contributions are the following:
• We propose a multi-scale contrastive framework that focuses on moment-moment relations to mitigate informational degradation in video moment representations.

[Framework overview figure: multi-scale contrastive learning over moment representations with within-scale and cross-scale objectives (attracting positive and repelling negative pairs), followed by moment decoding to produce moment predictions.]
Instead of utilizing outputs at the single final layer, contrastive learning with local and global representations across different layers has been widely studied (Zhang et al. 2020c; Bachman, Hjelm, and Buchwalter 2019; Chaitanya et al. 2020). Zhang et al. (2020c) maximize the mutual information between representations of different local windows of a sentence and the representation of the global sentence.
Methodology
In this section, we delineate our proposed contrastive framework for multi-scale temporal grounding, particularly focusing on a sampling procedure to draw video moment representations across temporal scales.
Preliminary - Video Temporal Grounding
We denote an input video V as a sequence of video clips {v_t}_{t=1}^T = {v_1, v_2, ..., v_T}, where v_t denotes a video moment.
Video encoder. Our multi-scale video encoder builds a feature pyramid of L levels, where the representation Z^l at level l is obtained from Z^{l-1} through local self-attention followed by pooling (downsampling); here Z^{l-1}, \bar{Z}^l, \hat{Z}^l \in R^{T^{l-1} \times D}, Z^l \in R^{T^l \times D}, T^{l-1}/T^l is the downsampling ratio, \alpha^l and \bar{\alpha}^l are learnable per-channel scaling factors (Touvron et al. 2021), D is the hidden dimension, and LN is layer normalization. Inspired by (Mu, Mo, and Li 2024), we implement the downsampling operator ↓ as a strided depthwise 1D convolution. The downsampling operation engenders the multi-scale property of the encoder, generating representations for longer video moments.
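To make the downsampling step concrete, below is a minimal PyTorch sketch of one pyramid level; the kernel size, the stride of 2, and the placement of layer normalization are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PyramidDownsample(nn.Module):
    """One pyramid step: a strided depthwise 1D convolution (groups == channels)
    reduces the temporal resolution, followed by layer normalization.
    Kernel size and normalization placement are assumptions for illustration."""
    def __init__(self, dim: int, stride: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, stride=stride,
                              padding=1, groups=dim)   # strided depthwise conv
        self.norm = nn.LayerNorm(dim)

    def forward(self, z):                               # z: (B, T_prev, D)
        z = self.conv(z.transpose(1, 2)).transpose(1, 2)  # (B, ~T_prev/stride, D)
        return self.norm(z)

# Stacking L such steps yields representations whose time steps cover
# progressively longer video moments (levels 0, 1, ..., L).
```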
Text encoder. We use Transformer layers, where each layer includes a vanilla self-attention followed by an MLP. Thus, the textual encoder produces textual representations E = {e_1, e_2, ..., e_K} for query embeddings {q_1, q_2, ..., q_K}.

Cross-modal fusion. Our architecture uses cross-attention to fuse video clip and query word representations. Technically, we modulate video clip representations {Z^l}_{l=1}^L with word representations E as follows:

\tilde{Z}^l = \mathrm{LN}(Z^l), \quad \tilde{E} = \mathrm{LN}(E),   (5)

O^l = \sigma\left( \frac{\tilde{Z}^l \cdot \tilde{E}^\top}{\sqrt{D}} \right) \cdot \tilde{E},   (6)

X^l = \beta^l \cdot \mathrm{MLP}\left( \mathrm{LN}(O^l) \right) + O^l,   (7)

where \beta^l denotes a learnable per-channel scale and \sigma the Softmax activation function.
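The following PyTorch sketch illustrates Eqs. (5)-(7), assuming the standard cross-attention reading of Eq. (6) (video clips as queries, words as keys and values); the MLP width and activation are arbitrary choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch of Eqs. (5)-(7): video clip features attend to query word features,
    then pass through a scaled MLP with a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm_z = nn.LayerNorm(dim)
        self.norm_e = nn.LayerNorm(dim)
        self.norm_o = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.beta = nn.Parameter(torch.ones(dim))       # per-channel scale beta^l

    def forward(self, z, e):
        # z: (B, T_l, D) clip features at level l, e: (B, K, D) word features
        z_t, e_t = self.norm_z(z), self.norm_e(e)                     # Eq. (5)
        attn = F.softmax(z_t @ e_t.transpose(1, 2) / z.size(-1) ** 0.5, dim=-1)
        o = attn @ e_t                                                # Eq. (6)
        return self.beta * self.mlp(self.norm_o(o)) + o               # Eq. (7)
```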
Moment decoding. After cross-modal fusion, our model converts each time step t into a moment candidate. Specifically, given x_t^l, we use a convolutional network comprising 1D convolutional layers as the classification head to predict a score p_t^l. In a similar vein, we use another 1D convolutional network, attached with a ReLU activation function, to regress the normalized distances from t to the moment boundaries (d_t^s, d_t^e) if x_t^l is classified as positive. Formally, the decoded moment is computed as:

(t, l) = \arg\max_{t, l} \; p_t^l,   (8)

\hat{s} = 2^{l-1} \left( t - d_t^s \right), \quad \hat{e} = 2^{l-1} \left( t + d_t^e \right).   (9)

During testing, we employ Soft-NMS (Bodla et al. 2017) to merge overlapping moment predictions.
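A possible PyTorch realization of the decoding heads and Eq. (9) is sketched below; the head depths, kernel sizes, and level-indexing convention are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MomentDecoder(nn.Module):
    """Sketch of the decoding heads: a 1D-convolutional classification head for
    the score p_t^l and a ReLU-terminated 1D-convolutional regression head for
    the boundary distances (d_t^s, d_t^e)."""
    def __init__(self, dim: int):
        super().__init__()
        self.cls_head = nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
                                      nn.Conv1d(dim, 1, 3, padding=1))
        self.reg_head = nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
                                      nn.Conv1d(dim, 2, 3, padding=1), nn.ReLU())

    def forward(self, x):                        # x: (B, T_l, D) fused features
        x = x.transpose(1, 2)                    # (B, D, T_l) for Conv1d
        scores = self.cls_head(x).squeeze(1)     # (B, T_l)    moment scores p_t^l
        dists = self.reg_head(x).transpose(1, 2) # (B, T_l, 2) distances (d_s, d_e)
        return scores, dists

def decode_moment(t: int, level: int, d_s: float, d_e: float):
    """Eq. (9): map the selected time step at pyramid level `level`
    back to start / end positions in clip units via the 2**(level-1) stride."""
    stride = 2 ** (level - 1)
    return stride * (t - d_s), stride * (t + d_e)
```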
Multi-Scale Contrastive Learning
Query-centric sampling. As randomly sampling moment-query pairs for contrastive learning might lead the model to representation conflict if the groundings of two queries overlap with each other, we instead introduce a sampling approach that draws a text query Q and its temporally separate video moments associated with a common video V:

\left( Q_{j'}, \{y_{j'}^l\}_{l=1}^L \right) \sim U\left( \{ Q_j, \{y_j^l\}_{l=1}^L \}_{j=1}^{N_Q} \right),   (10)

where U denotes a discrete uniform distribution, {y_{j'}^l}_{l=1}^L the set of target video moments in each layer l, and N_Q the number of textual queries related to video V. We generate the target set P(l) via center sampling (Zhang, Wu, and Li 2022; Mu, Mo, and Li 2024), i.e. given any moment centered at t, any time step c ∈ [t − α·T/T^l, t + α·T/T^l] in layer l is considered as a target. After sampling the query and target moments, we directly utilize the representations {z_{j'}^l}_{l=1}^L of the target moments {y_{j'}^l}_{l=1}^L extracted by the aforementioned multi-scale video encoder.
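A minimal Python sketch of query-centric sampling (Eq. 10) and center sampling follows; the record layout, the value of α, and the coordinate convention are assumptions for illustration, not the exact implementation.

```python
import random

def query_centric_sample(video_queries):
    """Eq. (10): uniformly draw one textual query of the video together with its
    target moments at every pyramid level. `video_queries` is assumed to be a
    list of records {"query": str, "targets": {level: [time steps]}}."""
    return random.choice(video_queries)

def center_sampling(moment_center, level_stride, num_steps, alpha=1.5):
    """Center sampling at one pyramid level: a time step is a positive target if
    its position (in input-clip units) lies within alpha * stride of the moment
    center, where stride = T / T^l. Coordinate conventions are assumptions."""
    radius = alpha * level_stride
    return [c for c in range(num_steps)
            if abs(c * level_stride - moment_center) <= radius]
```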
Within-scale contrastive learning. Having obtained the representations of target moment samples, we directly utilize moments within each scale as positive and negative samples. Particularly, we iterate over every layer l of the video encoder, and for each anchor video moment y_{j'}^l, we consider all video moments of layer l corresponding to query Q_{j'} as the positive moment set P(l), and randomly draw those not corresponding to query Q_{j'} as the negative set N(l). Then, we formulate a multi-scale contrastive objective over all layers l ∈ {1, 2, ..., L}, which pushes positive moments closer while pushing negative moments further apart:

\mathcal{L}_{\text{within}} = - \sum_{l=1}^{L} \sum_{i \in P(l)} \sum_{j \in P(l), i \neq j} \log \frac{e^{(z_i^l \cdot z_j^l)}}{e^{(z_i^l \cdot z_j^l)} + \sum_{n \in N(l)} e^{(z_i^l \cdot z_n^l)}}.   (11)

Cross-scale contrastive learning. We further associate semantically close moment representations from across different scales. Specifically, we push short-range moment representations closer to semantically close long-range moment representations. This enables short-range moments to relate to a longer video context while allowing long-range features to capture nuanced details of short-range moments.

As the video moment features of layer 0, {z_{j'}^0}, are the most likely to preserve salient video information compared to other levels, we employ features of the target moments from the lowest level as the anchor set for cross-scale contrastive learning. To construct the positive and negative moment sets, we utilize features of higher levels l ∈ {1, 2, ..., L} in the feature pyramid corresponding to video moments that involve and do not involve the textual query, respectively. Denoting the set of moment indices in level l that are related to the query as P(l) and the set of those that are unrelated as N(l), we define the cross-scale contrastive learning objective as:

\mathcal{L}_{\text{cross}} = - \sum_{i \in P(0)} \sum_{l=1}^{L} \sum_{j \in P(l)} \log \frac{e^{(z_i^0 \cdot z_j^l)}}{e^{(z_i^0 \cdot z_j^l)} + \sum_{n \in N(l)} e^{(z_i^0 \cdot z_n^l)}}.   (12)
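The two contrastive objectives can be sketched directly from Eqs. (11) and (12); the per-level feature and index layout follows the notation above, while the function names and data structures are illustrative assumptions.

```python
import torch

def nce_term(anchor, positives, negatives):
    """One inner term of Eqs. (11)/(12): for a single anchor, sum over positives
    of -log( exp(a.z_j) / (exp(a.z_j) + sum_n exp(a.z_n)) )."""
    pos = positives @ anchor                       # (|P|,) dot-product similarities
    neg = torch.logsumexp(negatives @ anchor, 0)   # log sum_n exp(a . z_n)
    return -(pos - torch.logaddexp(pos, neg)).sum()

def within_scale_loss(feats, P, N):
    """Eq. (11): `feats[l]` holds moment features of level l (shape (T_l, D));
    P[l] / N[l] index the positive / negative moments of that level."""
    loss = feats[0].new_zeros(())
    for l in range(1, len(feats)):
        pos, neg = feats[l][P[l]], feats[l][N[l]]
        for i in range(pos.shape[0]):
            others = torch.cat([pos[:i], pos[i + 1:]])   # j in P(l), j != i
            if others.numel() and neg.numel():
                loss = loss + nce_term(pos[i], others, neg)
    return loss

def cross_scale_loss(feats, P, N):
    """Eq. (12): level-0 target moments are anchors; positives and negatives
    come from the higher pyramid levels."""
    anchors = feats[0][P[0]]
    loss = anchors.new_zeros(())
    for l in range(1, len(feats)):
        pos, neg = feats[l][P[l]], feats[l][N[l]]
        for a in anchors:
            if pos.numel() and neg.numel():
                loss = loss + nce_term(a, pos, neg)
    return loss
```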
Training Objective
For temporal grounding training, we adopt a focal loss L_cls for target moment classification and a Distance-IoU loss L_reg for distance regression from a positive time step t to the target moment. Then, we combine these losses with our within-scale and cross-scale contrastive losses:

\mathcal{L} = \mathcal{L}_{\text{cls}} + \rho_{\text{reg}} \cdot \mathcal{L}_{\text{reg}} + \rho_{\text{within}} \cdot \mathcal{L}_{\text{within}} + \rho_{\text{cross}} \cdot \mathcal{L}_{\text{cross}},   (13)

where \rho_{\text{reg}}, \rho_{\text{within}}, and \rho_{\text{cross}} denote hyperparameters to balance the regression, within-scale, and cross-scale contrastive losses, respectively.
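As a small illustration of Eq. (13), assuming the four individual loss terms have already been computed:

```python
def total_loss(l_cls, l_reg, l_within, l_cross,
               rho_reg=1.0, rho_within=1.0, rho_cross=1.0):
    """Eq. (13): weighted sum of the classification, regression, and the two
    contrastive terms; the experiments set all weights to 1.0."""
    return l_cls + rho_reg * l_reg + rho_within * l_within + rho_cross * l_cross
```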
Experiments
To validate the effectiveness of our framework, we conduct extensive experiments against recent methods for temporal grounding. We also perform an ablation study to investigate each component.

Datasets
Following previous works, we work on five challenging datasets for temporal grounding, which belong to two main categories, i.e. 1) Long videos, many queries (Ego4D-NLQ (Grauman et al. 2022), MAD (Soldan et al. 2022), and TACoS (Regneri et al. 2013)) and 2) Short videos, few queries (ActivityNet-Captions (Krishna et al. 2017) and Charades-STA (Sigurdsson et al. 2016)).
Ego4D-NLQ (Grauman et al. 2022) consists of egocentric videos recording daily human activities. Each video is 3.5 to 20 minutes long and is associated with 11.6 queries on average.
MAD (Soldan et al. 2022) comprises 1.2K hours of movies with 384K queries transcribed from audio descriptions. Since each video is a full movie, each is 47 to 202 minutes long.
TACoS (Regneri et al. 2013) focuses on cooking topics. The total video length is 10.1 hours and each video is associated with 143.5 queries for temporal grounding.
ActivityNet-Captions (Krishna et al. 2017) targets dense video captioning and was subsequently adapted to temporal grounding. Its videos are two minutes long on average and the average number of queries per video is approximately 3.65.
Charades-STA (Sigurdsson et al. 2016) is an action recognition dataset transformed into a temporal grounding one. Each video lasts approximately 30 seconds and has 2.4 queries.

Features  Model      R@1(0.3)  R@1(0.5)  R@1(Avg)  R@5(0.3)  R@5(0.5)  R@5(Avg)
SF+BERT   VSL-Net      5.45      3.12      4.29     10.74      6.63      8.69
SF+BERT   CONE        10.40      5.03      7.72     22.74     11.87     17.31
SF+BERT   SOONet       8.00      3.76      5.88     22.40     11.09     16.75
SF+BERT   SnAG         9.83      6.83      8.33     27.93     19.27     23.60
SF+BERT   Our model   10.80      7.22      9.49     28.54     20.38     25.06
EgoVLP    VSL-Net     10.84      6.81      8.83     18.84     13.45     16.15
EgoVLP    CONE        14.15      8.18     11.17     30.33     18.02     24.18
EgoVLP    SnAG        15.72     10.78     13.25     38.39     27.44     32.92
EgoVLP    Ours        16.37     11.27     13.96     39.97     28.70     34.43

Table 1: Results on Ego4D-NLQ.
Evaluation Metrics
We report Recall@K at different temporal intersection-over-union thresholds θ (R@K, tIoU = θ) for all datasets. The metric measures the percentage of textual queries for which at least one of the top-K moment predictions temporally overlaps with the groundtruth moment by more than θ.
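A short Python sketch of this metric follows, assuming per-query lists of [start, end] predictions already sorted by score; the data layout is an assumption.

```python
def tiou(pred, gt):
    """Temporal IoU between two [start, end] segments (in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, groundtruths, k=1, threshold=0.5):
    """R@K, tIoU = threshold: fraction of queries for which at least one of the
    top-K predicted moments overlaps the groundtruth above the threshold."""
    hits = sum(any(tiou(p, gt) >= threshold for p in preds[:k])
               for preds, gt in zip(predictions, groundtruths))
    return hits / max(len(groundtruths), 1)
```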
Implementation Details
To compare fairly with previous works and to satisfy the scalability requirement of temporal grounding on long videos, we adopt the video-centric sampling approach (Mu, Mo, and Li 2024). For Ego4D-NLQ, we use pre-trained 1) SlowFast video features (Feichtenhofer et al. 2019) with BERT textual features (Devlin et al. 2018), and 2) EgoVLP video and textual features (Lin et al. 2022). For testing, we report R@{1, 5}, tIoU = {0.3, 0.5}. For the MAD dataset, we use CLIP features (Radford et al. 2021) for both videos and texts, and report R@{1, 5, 10, 50}, tIoU = {0.1, 0.3, 0.5}. For the TACoS dataset, we use C3D video features (Tran et al. 2015) and GloVe textual features (Pennington, Socher, and Manning 2014), and report results in terms of R@{1, 5}, tIoU = {0.5, 0.7}. In addition, we utilize I3D features (Carreira and Zisserman 2017) pre-trained on Kinetics (Kay et al. 2017) for Charades-STA and C3D features (Tran et al. 2015) for ActivityNet-Captions experiments. For both datasets, similar to TACoS, we take advantage of GloVe textual features (Pennington, Socher, and Manning 2014). We report R@{1, 5}, tIoU = {0.5, 0.7} for testing on Charades-STA, and R@{1, 5}, tIoU = {0.3, 0.5} for testing on ActivityNet-Captions. For more details regarding the model architecture, we direct interested readers to the appendix. For both within-scale and cross-scale contrastive learning, we keep the size of the negative sample set N(l) at every level l equal to the size of the positive set P(l) that corresponds to the target video moments. Based upon validation and fair comparison with previous methods, we use ρ_reg = ρ_within = ρ_cross = 1.0.
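For illustration only, the size-matched negative sampling could be implemented as below; the construction of the candidate pool is an assumption.

```python
import random

def sample_negatives(negative_pool, num_positives):
    """Keep |N(l)| equal to |P(l)| at each pyramid level by subsampling the
    pool of moments unrelated to the sampled query."""
    k = min(num_positives, len(negative_pool))
    return random.sample(negative_pool, k)
```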
Experimental Results
for ActivityNet-Captions experiments. For both datasets, Main Results
similar to TACoS, we take advantage of GloVe textual fea- Results on Ego4D-NLQ (Table 1). Our framework signifi-
tures (Pennington, Socher, and Manning 2014). We report cantly outperforms recent temporal grounding methods. For
R@{1, 5}, tIoU = {0.5, 0.7} for testing on Charades-STA, example, using SlowFast+BERT features, we outperform
and R@{1, 5}, tIoU = {0.3, 0.5} for testing on ActivityNet- previous best method, i.e. SnAG, by mean improvements of
Captions. For more details regarding model architecture, we 1.16% and 1.46% in terms of R@1 and R@5 metrics, re-
direct interested readers to the appendix. For both within- spectively. In addition, we accomplish more significant per-
scale and cross-scale contrastive learning implementation, formance gains on the more stringent tIoU threshold of 0.5,
we keep the size of the negative sample set N (l) in every denoting more precise moment localization.
level l to be equal to the size of the positive video clips P(l) Results on MAD (Table 2). Similar to results on Ego4D-
that correspond to the target video moments. Based upon NLQ, our framework obtains an outstanding improvement
Model          R@1 (0.1/0.3/0.5)     R@5 (0.1/0.3/0.5)      R@10 (0.1/0.3/0.5)     R@50 (0.1/0.3/0.5)
VLG-Net        3.64 / 2.76 / 1.65    11.66 / 9.31 / 5.99    17.39 / 14.56 / 9.77   39.78 / 34.27 / 24.93
Moment-DETR    0.31 / 0.24 / 0.16     1.52 / 1.14 / 0.28     2.79 / 2.06 / 1.20    11.08 / 7.97 / 4.71
CONE           8.90 / 6.87 / 4.10    20.51 / 16.11 / 9.59   27.20 / 21.53 / 12.82  43.36 / 34.73 / 20.56
SOONet        11.26 / 9.00 / 5.32    23.21 / 19.64 / 13.14  30.36 / 26.00 / 17.84  50.32 / 44.78 / 32.59
SnAG          10.28 / 8.46 / 5.55    24.42 / 20.60 / 13.75  32.23 / 27.50 / 19.00  52.28 / 46.68 / 35.24
Our model     12.76 / 10.94 / 6.92   26.43 / 22.60 / 15.43  34.08 / 29.41 / 20.70  54.84 / 48.26 / 37.77

Table 2: Results on MAD.