OST: Refining Text Knowledge with Optimal Spatio-Temporal
Descriptor for General Video Recognition

Tongjia Chen¹, Hongshan Yu¹ ✉, Zhengeng Yang², Zechuan Li¹, Wei Sun¹, Chen Chen³
¹Hunan University, ²Hunan Normal University
³Center for Research in Computer Vision, University of Central Florida
✉ Corresponding author. Project page: https://tomchen-ctj.github.io/OST

Abstract

Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy between web-scale descriptive narratives and concise action category names, leading to a less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors, thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors to different video instances, we propose the Optimal Descriptor Solver, formulating the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.

1 Introduction

Figure 1: Motivation of our method. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the textual discrepancy between descriptive narratives and concise category names. This oversight results in a less separable latent space, which may hinder video recognition.

Large-scale contrastive language-image pre-training [46, 25, 65] has shown remarkable performance in various computer vision tasks. The visual-semantic joint space not only provides powerful visual representations but also enables few/zero-shot transfer to downstream tasks with natural language as the reference. However, training a similar model for video recognition can be costly, since large-scale video-language datasets are far more massive [57] due to the extra temporal dimension. Hence, a feasible solution is to adapt pre-trained image-text models to the task of video recognition. As depicted in Fig. 1, current methods devise a range of temporal learners to address the visual discrepancy while preserving text-domain knowledge in the semantic space of action category names, often by merging the category name with CLIP-style hard prompts (e.g., “a video of a person {ski jumping}”) [56, 53, 45, 60, 41]. Although category names provide essential inter-class correlations that can benefit general recognition, we speculate that this paradigm overlooks the textual discrepancy between web-scale descriptive narratives in CLIP pre-training and concise category names in downstream video recognition. Category names in video datasets generally consist of verbs and nouns, and the nouns vary while the verbs tend to remain consistent. For instance, playing cello, playing organ, and playing violin are distinct actions related to playing instruments. The sole difference between these category names lies in the noun, resulting in text embeddings with low discriminability. This may lead to a less separable semantic space, potentially introducing ambiguity in recognition tasks [5].

To validate our hypothesis, we perform a sanity check on the semantic distribution of category embeddings across ImageNet [17], Kinetics-400 [7], and Something-Something v2 [20]. Initially, we employ a CLIP-B/16 text encoder to extract semantic embeddings of category names and leverage t-SNE visualization [54] to illustrate embedding clusters across the three datasets. As depicted in Fig. 2 (Left), features from K400 and Sthv2 datasets exhibit denser clustering compared to those from ImageNet, qualitatively indicating the low semantic distinction of video category names. To quantify this distinction and provide further support for our hypothesis, we compute pair-wise cosine similarity within each dataset and determine the average similarity, serving as a measure of semantic density. A higher similarity implies a denser distribution of category embeddings and less separable semantics in the latent space. Fig. 2 (Right) visually demonstrates consistently higher mean cosine similarity of category names on video datasets compared to image datasets. This observation suggests that the intrinsic semantic space associated with video category names is less distinct. Since the category embedding serves as a decision plane [60] in cross-modal matching (i.e. compute the cosine similarity between category embeddings and visual features), such reduced distinctiveness may potentially diminish its efficacy in recognition tasks.
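The semantic-density measure used in this sanity check can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `mean_pairwise_cosine` is hypothetical, and the toy arrays stand in for category embeddings that would in practice come from a CLIP text encoder.

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of rows."""
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T                                # (n, n) cosine matrix
    rows, cols = np.triu_indices(len(unit), k=1)       # pairs with i < j only
    return float(sim[rows, cols].mean())

# Toy stand-ins for category embeddings (not real CLIP features).
spread = np.eye(3)                                     # mutually orthogonal
packed = np.array([[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]])  # nearly parallel

print(mean_pairwise_cosine(spread), mean_pairwise_cosine(packed))
```

A higher value for the second set corresponds to the denser, less separable category embeddings described above.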

To mitigate this issue, one could manually craft textual narratives, but this process is labor-intensive. Alternatively, Large Language Models (LLMs) serve as a viable solution, acting as expansive knowledge bases that can generate detailed descriptors efficiently. As shown in Fig. 1, we can substantially refine our comprehension of ski jumping by integrating external contextual information such as the forest, the snow slope, and the different action steps performed by the ski jumper. Hence, we propose to prompt LLMs to expand category names into what we define as Spatio-Temporal Descriptors, enriching the semantic space with external knowledge. Spatio Descriptors should capture static appearances, for instance, the environment and the distinct objects involved, while Temporal Descriptors should describe the temporal evolution of actions. This disentangles the category name into two complementary semantic spaces, thereby enhancing the semantic distinction and providing external knowledge for general recognition.

Based on the obtained descriptors, an intuitive solution is to aggregate them into a global category embedding via pooling and match this embedding with the corresponding visual features [28, 38]. However, this approach might be suboptimal for the following reasons: 1) Since the descriptors for an action class may not all appear in every video instance of that class, directly matching the pooled descriptor-level representation with each video is potentially ineffective. 2) The propensity of LLMs to hallucinate [69] may introduce noise into the descriptors. To address this, we need to consider the adaptability of descriptors to individual video instances. In this vein, we propose the Optimal Descriptor Solver to obtain an optimal transport plan that adaptively aligns features across frame-level tokens and descriptors.

In light of the above explorations, we propose Optimal Spatio-Temporal Descriptor (OST), a general pipeline for video recognition. Our OST comprises two components: We first disentangle the category name into Spatio-Temporal Descriptors, which not only bridges the semantic gap between narratives and category names but also serves as a knowledge base for general recognition. Then, we propose Optimal Descriptor Solver that adaptively aligns frame-level representations with Spatio-Temporal Descriptors to enhance video recognition. To demonstrate the effectiveness of our OST, we conduct comprehensive experiments on six benchmarks, including Kinetics-400 [7] & 600 [8], UCF-101 [50], HMDB-51 [31], Something-Something V2 [20], and ActivityNet [6]. The results indicate that our method achieves state-of-the-art performance in open-vocabulary tasks, e.g. zero-shot, few-shot, and also consistently improves the performance when combined with existing pipelines in fully-supervised settings. The main contributions of this work are as follows:

• We provide new insights that prior pipelines for adapting vision-language pre-trained models to video recognition are constrained by the semantic space of category names.

• We propose Spatio-Temporal Descriptors derived from LLMs to enhance the distinction of semantic space and provide external knowledge for general recognition.

• We introduce Optimal Descriptor Solver that forms the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors to fully refine the semantic knowledge.

• Our OST presents a new way to utilize external knowledge to adapt pre-trained visual-language models for general video recognition. Experimental results in zero-shot, few-shot, and fully-supervised settings demonstrate the superior performance and generalizability of our method.

2 Related Work

Video Recognition. Video recognition is a fundamental component of computer vision. Mainstream pipelines have typically explored traditional 2D and 3D CNNs [55, 33, 63, 7, 52, 23] and Transformer-based methods [64, 3, 37, 41, 53, 18, 35, 12]. Additionally, methods modeling action phases [51, 68, 2, 70] have shown promise in video recognition, especially for long-form videos. Recently, cross-modal video recognition [45, 53, 56, 62, 26, 41, 61, 60] has benefited greatly from the powerful visual-text joint semantic space of CLIP. This cross-modal paradigm not only fosters strong representations with rich semantics but also achieves strong open-vocabulary capacities. However, dominant pipelines [60, 53, 45, 56] focus on the temporal discrepancies between images and videos while keeping text-domain knowledge unchanged. In contrast, our method prioritizes the refinement of text knowledge.

Language for Visual Recognition. Differing from visual signals, natural language contains dense semantic information. Thus, language can serve as a rich source of inter-class correlations to benefit visual recognition. CuPL [43] and the pipeline proposed by Menon et al. [38] utilize category descriptions from GPT-3 as global category embeddings for improved zero-shot image classification. Kaul et al. [28] propose to utilize LLM descriptions and visual prototypes to construct a multi-modal classifier for enhanced open-vocabulary object detection. MAXI [34] proposes to construct text bags generated via multiple sources (e.g., captions and descriptions) to perform unsupervised finetuning for robust zero-shot action recognition. ASU [13] utilizes semantic units manually derived from WordNet and Wikipedia for video recognition. In this work, we aim to refine text knowledge by finding the optimal Spatio-Temporal Descriptors automatically generated by LLMs to bridge the semantic discrepancy and provide external knowledge to benefit general video recognition.

Optimal Transport. Optimal Transport (OT), also known as the Monge problem [40], is an essential mathematical framework that facilitates the establishment of correspondences between two distinct distributions. Its favorable properties for distribution matching have benefited a variety of machine learning tasks [29], including domain adaptation [14, 16], generative models [1, 21, 48], graph matching [10, 42], image matching [36, 66], and prompt learning [9, 30]. In this work, we propose to utilize OT distance to solve the cross-modal matching problem. To the best of our knowledge, this is the first work to formulate the video-text matching problem as solving the OT problem between frame-level representations and textual embeddings.

3 Method

In this section, we first review the preliminaries of optimal transport in Sec. 3.1, then discuss our proposed Spatio-Temporal Descriptor and Optimal Descriptor Solver scheme in Sec. 3.2 and Sec. 3.3, respectively. Finally, we introduce the training objectives in Sec. 3.4.

3.1 Preliminaries

Optimal transport seeks the minimal-cost transport plan between two distributions. In this work, we only consider the discrete case, which is closely related to our framework. Assume we have two discrete empirical distributions:

$$\boldsymbol{\mu}=\sum^{M}_{i=1}p_{i}\delta_{x_{i}},\quad\boldsymbol{\nu}=\sum^{N}_{j=1}q_{j}\delta_{y_{j}}, \tag{1}$$

where $p_{i}$ and $q_{j}$ are probability weights that each sum to 1, $M$ and $N$ are the numbers of samples in each empirical distribution, and $\delta$ denotes the Dirac function. Since both distributions are discrete, the optimal transport plan $\boldsymbol{P}$ matching them is also discrete. In this setting, we adopt the Kantorovich OT formulation [27] and pose the optimal transport problem as:

$$\begin{aligned}
\boldsymbol{P}^{\ast}&=\underset{\boldsymbol{P}\in\mathbb{R}^{M\times N}}{\arg\min}\sum^{M}_{i=1}\sum^{N}_{j=1}\boldsymbol{P}_{ij}\boldsymbol{C}_{ij}\\
\textrm{s.t.}\quad&\boldsymbol{P}\boldsymbol{e}=\boldsymbol{\mu},\quad\boldsymbol{P}^{\top}\boldsymbol{e}=\boldsymbol{\nu}.
\end{aligned} \tag{2}$$

$\boldsymbol{C}\in\mathbb{R}^{M\times{N}}$ is the cost matrix that represents the distance between the support points $x_{i}$ and $y_{j}$, such as $\boldsymbol{C}_{ij}=1-sim(x_{i},y_{j})$. $\boldsymbol{P}^{\ast}$ is the optimal transport plan between the two empirical distributions that minimizes the total distance, and $\boldsymbol{e}$ is the vector of ones. In the constraints, with a slight abuse of notation, $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$ stand for the weight vectors $(p_{i})$ and $(q_{j})$. Considering the computational and statistical limitations of this original OT formulation, we adopt the Sinkhorn-Knopp algorithm [15] to solve the entropy-regularized OT problem. The regularized OT problem is defined as:

$$\begin{aligned}
\boldsymbol{P}^{\ast}&=\underset{\boldsymbol{P}\in\mathbb{R}^{M\times N}}{\arg\min}\sum^{M}_{i=1}\sum^{N}_{j=1}\boldsymbol{P}_{ij}\boldsymbol{C}_{ij}-\lambda\boldsymbol{H}(\boldsymbol{P})\\
\textrm{s.t.}\quad&\boldsymbol{P}\boldsymbol{e}=\boldsymbol{\mu},\quad\boldsymbol{P}^{\top}\boldsymbol{e}=\boldsymbol{\nu},
\end{aligned} \tag{3}$$

where $\boldsymbol{H}(\cdot)$ is the entropy regularizer and $\lambda$ is a regularization coefficient. Eq. 3 is a convex problem and can thus be solved using the Sinkhorn algorithm. With $\boldsymbol{K}=\exp(-\boldsymbol{C}/\lambda)$, the regularized optimal transport plan can be computed by:

$$\boldsymbol{P}^{\ast}=\text{diag}(\boldsymbol{a})\boldsymbol{K}\text{diag}(\boldsymbol{b}), \tag{4}$$

where $\boldsymbol{a}$ and $\boldsymbol{b}$ are scaling vectors that are updated alternately (with element-wise division) to enforce the marginal constraints:

$$\boldsymbol{a}\leftarrow\boldsymbol{\mu}/(\boldsymbol{K}\boldsymbol{b}),\quad\boldsymbol{b}\leftarrow\boldsymbol{\nu}/(\boldsymbol{K}^{\top}\boldsymbol{a}). \tag{5}$$
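The iteration in Eq. 4 and Eq. 5 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name `sinkhorn`, the fixed iteration count, and the default $\lambda$ are choices made for this example and are not taken from the paper.

```python
import numpy as np

def sinkhorn(C, mu, nu, lam=0.5, n_iters=200):
    """Entropy-regularized OT plan via Sinkhorn scaling (Eq. 4 and Eq. 5).

    C  : (M, N) cost matrix
    mu : (M,) source weights summing to 1
    nu : (N,) target weights summing to 1
    lam, n_iters : illustrative defaults, not values from the paper
    """
    K = np.exp(-C / lam)              # Gibbs kernel
    a = np.ones_like(mu)
    b = np.ones_like(nu)
    for _ in range(n_iters):
        a = mu / (K @ b)              # element-wise division, Eq. 5 (left)
        b = nu / (K.T @ a)            # element-wise division, Eq. 5 (right)
    return a[:, None] * K * b[None, :]    # diag(a) K diag(b), Eq. 4

# Toy problem with uniform marginals.
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
mu = np.array([0.5, 0.5])
nu = np.array([0.5, 0.5])
P = sinkhorn(C, mu, nu)
print(P.sum(axis=1), P.sum(axis=0))   # row and column sums of the plan
```

In this toy problem the plan places most of its mass on the zero-cost diagonal while its row and column sums match the prescribed marginals.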

3.2 Spatio-Temporal Descriptor

In addressing the low semantic distinction of video categories, our objective is to disentangle category names into Spatio-Temporal Descriptors. We posit that each type of descriptor yields information that is complementary to the other. Spatio Descriptors are intended to capture static visual elements that can be discerned from a single image—such as settings and common objects. For Temporal Descriptors, we aim to decompose the action classes in a step-by-step manner to describe the temporal evolution of an action. We use OpenAI’s API for GPT-3.5 [4] with a temperature of 0.7 to generate corresponding descriptors.

To generate Spatio Descriptors, inspired by [19], we use the following prompt $\mathcal{P}^{s}(\cdot)$ with category name $\boldsymbol{cls}$ to query the LLM: “Please give me a long list of descriptors for action: {$\boldsymbol{cls}$}, ${N_{s}}$ descriptors in total.” (For a detailed demonstration of the prompts we used, please refer to the Supplementary Material.) This prompt enables the LLM to always return a list with ${N_{s}}$ descriptors. This process can be formulated as:

$$\boldsymbol{Des^{s}}=\boldsymbol{LLM}[(\mathcal{P}^{s}(\boldsymbol{cls}))]. \tag{6}$$

For Temporal Descriptors, we utilize the temporal prompt $\mathcal{P}^{t}(\cdot)$, “Please give me a long list of decompositions of steps for action: {$\boldsymbol{cls}$}, ${N_{t}}$ steps in total”, and obtain ${N_{t}}$ descriptors:

$$\boldsymbol{Des^{t}}=\boldsymbol{LLM}[(\mathcal{P}^{t}(\boldsymbol{cls}))]. \tag{7}$$

Nonetheless, our empirical study (please refer to Sec. 4.2) indicates that the direct application of temporal descriptors $\boldsymbol{Des^{t}}$ yields only marginal enhancements. As discussed in [39, 34, 22], image-text pre-trained models are less sensitive to verbs, so the initial semantic space that CLIP produces for the temporal descriptors might be limited. Thus, we adopt a hard prompt, “A video of {$\boldsymbol{cls}$} usually includes {$\boldsymbol{Des^{t}}$}”, to condition temporal descriptors on the category names. We find this operation brings consistent improvements in recognition tasks.
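The prompt templates quoted above can be assembled with plain string formatting. The sketch below only builds the prompt strings and does not call any LLM API. The function names are hypothetical, and the example temporal descriptors are made up for illustration rather than actual LLM output.

```python
def spatio_prompt(cls_name: str, n_s: int) -> str:
    """Spatio prompt P^s(cls) as quoted in the text."""
    return (f"Please give me a long list of descriptors for action: "
            f"{cls_name}, {n_s} descriptors in total.")

def temporal_prompt(cls_name: str, n_t: int) -> str:
    """Temporal prompt P^t(cls) as quoted in the text."""
    return (f"Please give me a long list of decompositions of steps for action: "
            f"{cls_name}, {n_t} steps in total")

def condition_on_category(cls_name: str, temporal_descriptors: list) -> list:
    """Wrap each temporal descriptor in the category-conditioned hard prompt."""
    return [f"A video of {cls_name} usually includes {d}"
            for d in temporal_descriptors]

# Hypothetical descriptors, invented for this example only.
example_steps = ["gliding down the ramp", "taking off into the air"]

print(spatio_prompt("ski jumping", 4))
print(temporal_prompt("ski jumping", 4))
for text in condition_on_category("ski jumping", example_steps):
    print(text)
```

Only the temporal descriptors are wrapped here, which mirrors the conditioning described in the paragraph above.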

Through this approach, we can disentangle the category name into two complementary semantic spaces. This disentanglement significantly mitigates the semantic similarity among class names and also provides sufficient knowledge for general recognition.

Table 1: Comparisons with state-of-the-art methods for zero-shot video recognition on HMDB51, UCF101 and Kinetics-600. We report Top-1 and Top-5 accuracy using single-view inference.

Method	Venue	Encoder	Frames	HMDB-51	UCF-101	K600 (Top-1)	K600 (Top-5)
Uni-modal zero-shot video recognition models
ER-ZSAR [11]	ICCV’21	TSM	16	35.3 $\pm$ 4.6	51.8 $\pm$ 2.9	42.1 $\pm$ 1.4	73.1 $\pm$ 0.3
JigsawNet [44]	ECCV’22	R(2+1)D	16	38.7 $\pm$ 3.7	56.0 $\pm$ 3.1	-	-
Adapting pre-trained CLIP
Vanilla CLIP [46]	ICML’21	ViT-B/16	32	40.8 $\pm$ 0.3	63.2 $\pm$ 0.2	59.8 $\pm$ 0.3	83.5 $\pm$ 0.2
ActionCLIP [56]	arXiv’21	ViT-B/16	32	40.8 $\pm$ 5.4	58.3 $\pm$ 3.4	66.7 $\pm$ 1.1	91.6 $\pm$ 0.3
Vita-CLIP [58]	CVPR’23	ViT-B/16	8 / 32	48.6 $\pm$ 0.6	75.0 $\pm$ 0.6	67.4 $\pm$ 0.5	-
A5 [26]	ECCV’22	ViT-B/16	32	44.3 $\pm$ 2.2	69.3 $\pm$ 4.2	55.8 $\pm$ 0.7	81.4 $\pm$ 0.3
XCLIP [41]	ECCV’22	ViT-B/16	32	44.6 $\pm$ 5.2	72.0 $\pm$ 2.3	65.2 $\pm$ 0.4	86.1 $\pm$ 0.8
DiST [45]	ICCV’23	ViT-B/16	32	55.4 $\pm$ 1.2	72.3 $\pm$ 0.6	-	-
Tuning pre-trained CLIP
ViFi-CLIP [47]	CVPR’23	ViT-B/16	32	51.3 $\pm$ 0.7	76.8 $\pm$ 0.8	71.2 $\pm$ 1.0	92.2 $\pm$ 0.3
MAXI [34]	ICCV’23	ViT-B/16	16 / 32	52.3 $\pm$ 0.6	78.2 $\pm$ 0.7	71.5 $\pm$ 0.8	92.5 $\pm$ 0.4
OST	CVPR’24	ViT-B/16	8	54.9 $\pm$ 1.1	77.9 $\pm$ 1.3	73.9 $\pm$ 0.8	94.1 $\pm$ 0.3
OST	CVPR’24	ViT-B/16	32	55.9 $\pm$ 1.2	79.7 $\pm$ 1.1	75.1 $\pm$ 0.6	94.6 $\pm$ 0.2

3.3 Optimal Descriptor Solver

A considerable number of transformer-based video recognition pipelines obtain a video-level representation by pooling over image-level [CLS] tokens and then classify the video by computing a matching score, using cosine similarity, against category embeddings [47, 60, 56, 53]. This pipeline can be formulated as:

$$\boldsymbol{S}_{k}=cos(\boldsymbol{\overline{V}},\boldsymbol{Cat_{k}}), \tag{8}$$

where $cos(\cdot,\cdot)$ is the cosine similarity, $\boldsymbol{V}\in\mathbb{R}^{T\times{d}}$ is a set of local representations with $T$ frames in total, $\boldsymbol{\overline{V}}$ is the video-level representation obtained by pooling $\boldsymbol{V}$ over frames, and $\boldsymbol{Cat_{k}}\in\mathbb{R}^{d}$ is the category embedding for class $k$. As discussed before, relying only on category names may lead to a less distinctive semantic space. After obtaining the Spatio-Temporal Descriptors introduced in Sec. 3.2, an intuitive operation is to form a global-level descriptor embedding to benefit visual recognition:

$$\boldsymbol{S_{k}^{s}}_{pool}=cos(\boldsymbol{\overline{V}},\boldsymbol{\overline{D_{k}^{s}}}),\quad\boldsymbol{S_{k}^{t}}_{pool}=cos(\boldsymbol{\overline{V}},\boldsymbol{\overline{D_{k}^{t}}}), \tag{9}$$

where $\boldsymbol{D_{k}}\in\mathbb{R}^{N\times{d}}$ is the embedding of the Spatio-Temporal Descriptors for class $k$. By pooling along the $N$ dimension, we obtain the discriminative global descriptor embedding. However, we find this formulation can lead to sub-optimal performance: 1) By averaging the descriptor-level representations, the model treats all of the attributes equally. Since the descriptors are generated by an autoregressive language model without instance-level knowledge, they may not all appear in every video. 2) The hallucination problem of LLMs may introduce noise into the descriptors.

Hence, a natural question arises: how can we assign optimal descriptors to each video instance? To this end, we introduce the Optimal Descriptor Solver (OD Solver): by adapting optimal transport theory, we formulate the video-text matching problem as an optimal matching flow. Given a set of frame-level features $\boldsymbol{V}\in\mathbb{R}^{T\times{d}}$ and descriptor-level embeddings for each class, $\boldsymbol{D_{k}^{s}}\in\mathbb{R}^{N_{s}\times{d}}$ and $\boldsymbol{D_{k}^{t}}\in\mathbb{R}^{N_{t}\times{d}}$, the cost matrix for each class can be defined as:

$$\boldsymbol{C_{k}^{s}}=1-cos(\boldsymbol{V},\boldsymbol{D_{k}^{s}}),\quad\boldsymbol{C_{k}^{t}}=1-cos(\boldsymbol{V},\boldsymbol{D_{k}^{t}}). \tag{10}$$

According to Eq. 3, the entropy-regularized OT problem can be defined as:

$$\begin{aligned}
\boldsymbol{P}^{\ast}&=\underset{\boldsymbol{P}\in\mathbb{R}^{T\times N}}{\arg\min}\sum^{T}_{i=1}\sum^{N}_{j=1}\boldsymbol{P}_{ij}\boldsymbol{C}_{ij}-\lambda\boldsymbol{H}(\boldsymbol{P})\\
\textrm{s.t.}\quad&\boldsymbol{P}\boldsymbol{e}=\boldsymbol{\mu},\quad\boldsymbol{P}^{\top}\boldsymbol{e}=\boldsymbol{\nu}.
\end{aligned} \tag{11}$$

We can obtain the optimal transport plan $\boldsymbol{P_{k}^{s}}^{\ast}$ and $\boldsymbol{P_{k}^{t}}^{\ast}$ for Spatio-Temporal Descriptors respectively by solving the convex problem in Eq. 11 via the Sinkhorn algorithm as defined in Eq. 4. Here $\boldsymbol{P_{k}}^{\ast}\in\mathbb{R}^{T\times N}$ denotes the optimal matching flow between the video and descriptors. The matching score based on the optimal matching flow can be obtained via Frobenius inner product:

$$\begin{aligned}
\boldsymbol{S_{k}^{s}}_{OT}&=\sum^{T}_{i=1}\sum^{N}_{j=1}\boldsymbol{P_{k}^{s}}_{ij}^{\ast}cos(\boldsymbol{V}_{i},\boldsymbol{D_{k}^{s}}_{j}),\\
\boldsymbol{S_{k}^{t}}_{OT}&=\sum^{T}_{i=1}\sum^{N}_{j=1}\boldsymbol{P_{k}^{t}}_{ij}^{\ast}cos(\boldsymbol{V}_{i},\boldsymbol{D_{k}^{t}}_{j}).
\end{aligned} \tag{12}$$

Table 2: Comparisons with state-of-the-art methods for few-shot video recognition on HMDB51, UCF101 and Something-Something V2. We scale up the task to categorize all categories in the dataset with only a few samples per category for training. Here $K$ denotes the number of training samples for each class. We report Top-1 accuracy using single-view inference.

Method	HMDB-51				UCF-101				SSv2			
	$K=2$	$K=4$	$K=8$	$K=16$	$K=2$	$K=4$	$K=8$	$K=16$	$K=2$	$K=4$	$K=8$	$K=16$
Directly tuning on CLIP
Vanilla CLIP [46]	41.9	41.9	41.9	41.9	63.6	63.6	63.6	63.6	2.7	2.7	2.7	2.7
ActionCLIP [56]	47.5	57.9	57.3	59.1	70.6	71.5	73.0	91.4	4.1	5.8	8.4	11.1
XCLIP [41]	53.0	57.3	62.8	64.0	48.5	75.6	83.7	91.4	3.9	4.5	6.8	10.0
A5 [26]	39.7	50.7	56.0	62.4	71.4	79.9	85.7	89.9	4.4	5.1	6.1	9.7
ViFi-CLIP [47]	57.2	62.7	64.5	66.8	80.7	85.1	90.0	92.7	6.2	7.4	8.5	12.4
OST	59.1 (+1.9)	62.9 (+0.2)	64.9 (+0.4)	68.2 (+1.4)	82.5 (+1.8)	87.5 (+2.4)	91.7 (+1.7)	93.9 (+1.2)	7.0 (+0.8)	7.7 (+0.3)	8.9 (+0.4)	12.2
Fine-tuned on K400
ViFi-CLIP [47]	55.8	60.5	64.3	65.4	84.0	86.5	90.3	92.8	6.6	6.8	8.6	11.0
MAXI [34]	58.0	60.1	65.0	66.5	86.8	89.3	92.4	93.5	7.1	8.4	9.3	12.4
OST	64.8 (+6.8)	66.7 (+6.2)	69.2 (+4.2)	71.6 (+5.1)	90.3 (+3.5)	92.6 (+3.3)	94.4 (+2.0)	96.2 (+2.7)	8.0 (+0.9)	8.9 (+0.5)	10.5 (+1.2)	12.6 (+0.2)

By fusing the overall matching score in the Euclidean space and Wasserstein space described in Eq. 9 and Eq. 12 respectively, the overall logits can be expressed as:

$$\boldsymbol{S_{k}}_{\textbf{OD}}=\frac{1}{4}(\boldsymbol{S_{k}^{s}}_{pool}+\boldsymbol{S_{k}^{t}}_{pool}+\boldsymbol{S_{k}^{s}}_{OT}+\boldsymbol{S_{k}^{t}}_{OT}). \tag{13}$$

Please refer to the Supplementary Material for pseudo-code.
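As a rough illustration of how Eq. 9, Eq. 10, Eq. 12, and Eq. 13 fit together for a single class, consider the NumPy sketch below. It is not the authors' pseudo-code. The uniform marginals $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$, the value of $\lambda$, the iteration count, mean pooling of unnormalized features, and all function names are assumptions made for this example, and random arrays stand in for CLIP features.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sinkhorn(C, mu, nu, lam=0.1, n_iters=100):
    """Sinkhorn scaling for the entropy-regularized OT plan."""
    K = np.exp(-C / lam)
    a, b = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iters):
        a = mu / (K @ b)
        b = nu / (K.T @ a)
    return a[:, None] * K * b[None, :]

def pooled_score(V, D):
    """Eq. 9: cosine similarity between mean-pooled frames and descriptors."""
    return float(l2_normalize(V.mean(axis=0)) @ l2_normalize(D.mean(axis=0)))

def ot_score(V, D, lam=0.1, n_iters=100):
    """Eq. 10 and Eq. 12: transport-weighted sum of frame-descriptor similarities."""
    sim = l2_normalize(V) @ l2_normalize(D).T      # (T, N) cosine similarities
    T, N = sim.shape
    mu = np.full(T, 1.0 / T)                       # assumption: uniform marginals
    nu = np.full(N, 1.0 / N)
    P = sinkhorn(1.0 - sim, mu, nu, lam, n_iters)  # cost = 1 - cos
    return float((P * sim).sum())                  # Frobenius inner product

def od_logit(V, D_s, D_t):
    """Eq. 13: average of pooled and OT scores for spatio and temporal descriptors."""
    return 0.25 * (pooled_score(V, D_s) + pooled_score(V, D_t)
                   + ot_score(V, D_s) + ot_score(V, D_t))

# Random stand-ins: T = 8 frames, N_s = N_t = 4 descriptors, d = 16.
rng = np.random.default_rng(0)
V = rng.normal(size=(8, 16))
D_s = rng.normal(size=(4, 16))
D_t = rng.normal(size=(4, 16))
print(od_logit(V, D_s, D_t))
```

Repeating `od_logit` over the descriptor sets of every class would give one logit per class, and the predicted class would be the one with the largest value.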

Table 3: Fully-supervised video recognition on Kinetics-400, Something-Something V2 and ActivityNet. We report Top-1 accuracy using single-view inference.

Method	Encoder - Frames
	B/32 - 8	B/32 - 16	B/16 - 8	B/16 - 16
Kinetics-400
Text4Vis [60]	78.5	79.3	81.4	82.6
OST	78.7(+0.2)	79.8(+0.5)	82.0(+0.6)	83.2(+0.6)
Something-Something V2
Text4Vis [60]	54.3	56.1	57.9	59.9
OST	54.4(+0.1)	56.4(+0.3)	58.4(+0.5)	60.3(+0.4)
ActivityNet
Text4Vis [60]	83.4	85.0	86.4	88.4
OST	84.0(+0.6)	85.8(+0.8)	87.1(+0.7)	88.7(+0.3)

3.4 Training Objectives

The overall logits calculated by the OD Solver in Eq. 13 can be viewed as video-to-text logits $\boldsymbol{S_{k}^{v2t}}_{\textbf{OD}}=\boldsymbol{OD}(\boldsymbol{V},\boldsymbol{D^{s,t}_{k}})$. Symmetric text-to-video logits can be obtained in a similar way: $\boldsymbol{S_{k}^{t2v}}_{\textbf{OD}}=\boldsymbol{OD}(\boldsymbol{D^{s,t}_{k}},\boldsymbol{V})$. Then, the softmax-normalized similarity scores can be expressed as:

$$\begin{aligned}
\boldsymbol{p_{i}^{v2t}}_{\textbf{OD}}&=\frac{1}{K}\sum^{K}_{k=1}\frac{exp(\boldsymbol{S_{ki}^{v2t}}_{\textbf{OD}}/\tau)}{\sum^{B}_{j=1}exp(\boldsymbol{S_{kj}^{v2t}}_{\textbf{OD}}/\tau)},\\
\boldsymbol{p_{i}^{t2v}}_{\textbf{OD}}&=\frac{1}{K}\sum^{K}_{k=1}\frac{exp(\boldsymbol{S_{ki}^{t2v}}_{\textbf{OD}}/\tau)}{\sum^{B}_{j=1}exp(\boldsymbol{S_{kj}^{t2v}}_{\textbf{OD}}/\tau)},
\end{aligned} \tag{14}$$

where $\tau$ refers to the temperature hyperparameter for scaling, $B$ is the number of samples in the current mini-batch, and $K$ is the number of classes. Let $\boldsymbol{q^{v2t}},\boldsymbol{q^{t2v}}$ denote the ground-truth similarity scores. We define the Kullback-Leibler (KL) divergence [32] as the overall contrastive loss to optimize the model:

$$\mathcal{L}_{\textbf{OD}}=\frac{1}{2}[KL(\boldsymbol{p^{v2t}}_{\textbf{OD}},\boldsymbol{q^{v2t}})+KL(\boldsymbol{p^{t2v}}_{\textbf{OD}},\boldsymbol{q^{t2v}})]. \tag{15}$$
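A symmetric KL objective of this kind can be sketched as below. This is one plausible reading of Eq. 14 and Eq. 15, not the authors' implementation. It assumes the logits are arranged as a $B\times K$ matrix, applies the softmax over classes for the video-to-text direction and over the mini-batch for the text-to-video direction, uses one-hot ground truth, and takes the ground truth as the reference distribution in the KL term. The function names and the default temperature are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)        # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(q, p, eps=1e-12):
    """Mean over rows of KL(q || p); entries with q = 0 contribute 0."""
    return float(np.mean(np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=-1)))

def symmetric_kl_loss(logits, labels, tau=0.01):
    """logits: (B, K) video-class scores, labels: (B,) class indices."""
    B, K = logits.shape
    p_v2t = softmax(logits / tau, axis=1)          # each video over classes
    p_t2v = softmax(logits.T / tau, axis=1)        # each class over batch videos
    q_v2t = np.eye(K)[labels]                      # (B, K) one-hot targets
    q_t2v = q_v2t.T                                # (K, B)
    counts = q_t2v.sum(axis=1, keepdims=True)
    present = counts[:, 0] > 0                     # skip classes absent from batch
    q_t2v = q_t2v[present] / counts[present]
    return 0.5 * (kl_div(q_v2t, p_v2t) + kl_div(q_t2v, p_t2v[present]))

logits = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.8, 0.1]])
labels = np.array([0, 1])
print(symmetric_kl_loss(logits, labels))
```

With one-hot targets, each KL term reduces to a cross-entropy, so the loss is small when every video scores highest on its own class and every class scores highest on its own videos.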

4 Experiments

Datasets. We conduct experiments across 6 video benchmarks: Kinetics-400 [7] & 600 [8], UCF-101 [50], HMDB-51 [31], Something-Something V2 [20], and ActivityNet [6]. Our investigation encompasses various settings, including zero-shot, few-shot, and fully-supervised video recognition. See Supplementary Material for details.

Implementation Details. We employ a CLIP ViT-B/16 to conduct both zero-shot and few-shot experiments. We generate $N_{s,t}=4$ descriptors for each category. Following [34, 24, 59], we perform a linear weight-space ensembling between the original CLIP and the finetuned model with a ratio of $0.2$ . See Supplementary Material for details.
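Linear weight-space ensembling amounts to interpolating the two models' parameters name by name. The sketch below uses plain dictionaries of NumPy arrays in place of real checkpoints. The function name is hypothetical, and treating the ratio as the weight on the fine-tuned model is an assumption, since the text does not state which endpoint the ratio of $0.2$ applies to.

```python
import numpy as np

def interpolate_weights(theta_clip, theta_finetuned, ratio):
    """Return (1 - ratio) * original + ratio * fine-tuned, parameter by parameter.

    Assumption: `ratio` weights the fine-tuned model.
    """
    return {name: (1.0 - ratio) * theta_clip[name] + ratio * theta_finetuned[name]
            for name in theta_clip}

# Toy "checkpoints" with a single parameter tensor each.
theta_clip = {"w": np.array([0.0, 10.0])}
theta_finetuned = {"w": np.array([10.0, 0.0])}

merged = interpolate_weights(theta_clip, theta_finetuned, ratio=0.2)
print(merged["w"])
```

Setting `ratio` to 0 or 1 recovers the original or the fine-tuned weights, respectively.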

4.1 Main Results

Zero-shot video recognition. We present our zero-shot video recognition results and compare our approach with state-of-the-art methods in Table 1. The model is first fine-tuned on the Kinetics-400 dataset and evaluated directly on downstream datasets to ascertain its generalization capacity with respect to unseen classes. Our approach outperforms regular uni-modal zero-shot video recognition pipelines by a large margin, as shown in the upper part of Table 1. Moreover, we draw comparisons with methods that use K400 to adapt CLIP models for zero-shot recognition. Noteworthy among these are methods that integrate additional temporal learners [26, 45, 41] or employ VL prompting techniques [26, 58]. In contrast to these approaches, our pipeline leverages refined textual knowledge to boost video recognition without altering the underlying architecture. We observe consistent improvements on all datasets with respect to these methods.

Table 4: Ablation studies. We utilize ViT-B/16 as the backbone and use 8 frames for training/validation unless otherwise specified. All results are top-1 accuracy (%) in the zero-shot setting using single-view inference and a spatial size of $224\times 224$.

Method	HMDB	UCF	K600
Category Name [47]	50.9	75.5	70.8
Descriptors*	53.3 (+2.4)	76.6 (+1.1)	69.3
OD Solver	54.5 (+3.6)	77.9 (+2.4)	72.3 (+1.5)

(a) Study on cross-modal matching mechanisms. Here we set the number of descriptors to $N_{s,t}=4$. * denotes pooling descriptors along with category names.

Spatio	Temporal	HMDB	UCF	K600
✓	✗	46.7	65.3	56.3
✗	✓	53.1	77.5	71.6
✓	✓	54.5	77.9	72.3

(b) The impact of different descriptors. Here ✓ means applying corresponding Spatio/Temporal descriptors.

$N$	HMDB	UCF	K600
2	53.8	77.3	72.1
4	54.5	77.9	72.3
8	53.0	77.5	72.6

(c) Study on the number of descriptors $N$.

Spatio	Temporal	HMDB	UCF	K600
✗	✗	49.8	74.1	64.2
✓	✗	53.5	79.0	71.8
✓	✓	53.5	78.9	72.1
✗	✓	54.5	77.9	72.3

(d) Study on category conditioning operation. ✓ means conditioning corresponding descriptors on category names.

Ensemble	HMDB	UCF	K600
✗	55.4	80.1	72.9
✓	55.9	79.7	75.1

(e) The effects of weight-space ensembling. ✓ means performing ensembling with a ratio of $0.2$. 32 frames are used during training/validation.

We further compare our method with other fully finetuning paradigms [47, 34]. Serving as a baseline to our method, ViFi-CLIP [47] relies on the direct utilization of category names to fine-tune the CLIP model. Notably, utilizing only 8 frames for training and validation, our method demonstrates competitive performance, surpassing our baseline by a large margin. Upon scaling up the input frames to 32, our method consistently exhibits improvements across all datasets in comparison to prior SOTAs. Even against MAXI [34] which leverages more diverse textual knowledge, such as frame-level captions, our approach showcases superior accuracy with a 3.6% improvement on HMDB, 1.5% on UCF, and 3.6% on K600.

Few-shot video recognition. We demonstrate our method’s learning capacity and generalizability under the challenging all-way few-shot regime. The Top-1 accuracy on three datasets is reported in Table 2. We conduct experiments in two different settings. We first directly tune CLIP for few-shot recognition. Our method improves over our baseline [47] on HMDB-51, UCF101, and even the temporal-heavy dataset SSv2, in all settings except $K=16$ on SSv2.

Table 5: Additional cost analysis of our method. We report step latency during training and throughput (TP) during inference. Top-1 refers to zero-shot accuracy on Kinetics-600.

Method	Top-1 (%)	Latency (s)	TP (video/s)
ViFi-CLIP [47]	71.2	0.40 (1.0$\times$)	40.9 (1.00$\times$)
OST	75.1	0.44 (1.1$\times$)	40.0 (0.98$\times$)

Following [34], we also start from our best model in the zero-shot setting to further verify our method’s generalization capacity. As a comparison, ViFi-CLIP shows degraded performance in this fashion (e.g., $K=4$ on UCF, $K=16$ on SSv2). In this regime, our method outperforms the unsupervised contrastive training framework MAXI [34] across shot settings by an average of $\sim$5% on HMDB, $\sim$3% on UCF, and $\sim$1% on SSv2. This indicates the generalizability of our pipeline in extremely low-shot settings.

Fully-supervised video recognition. We also conduct fully-supervised experiments on three large-scale video benchmarks, Kinetics-400, Something-Something V2, and ActivityNet, to validate the effectiveness of our method in supervised settings. We choose Text4Vis [60] as our baseline, since it serves as a standard pipeline for adapting pre-trained vision-language models to supervised video recognition, and we vary the encoder (ViT-B/32 and ViT-B/16) and the number of frames (8 and 16). As shown in Table 3, our method improves upon the corresponding baseline for all architectures on all datasets. The performance on K400 and SSv2 is up to about 0.5% higher than Text4Vis [60]. For ActivityNet, the accuracy is up to 0.8% higher than the baseline.

4.2 Ablation Studies

We conduct ablation studies on zero-shot settings in Table 4 to investigate our OST’s learning capacity and generalizability in different instantiations.

Different cross-modal matching mechanisms. Table 4(a) shows the effects of different cross-modal matching mechanisms. For a fair comparison, we start with a baseline that uses only the category name during matching, as in [47]. By simply aggregating the descriptors along with the category name via mean pooling, the accuracy on HMDB and UCF improves by 2.4% and 1.1%, respectively. However, on the K600 dataset, we observe a 1.5% performance drop. This supports our hypothesis that the enhanced distinction brought by the pooling operation can benefit downstream recognition, but might not be optimal. We then introduce our OD Solver to solve the optimal matching flow and find that it further boosts the performance on HMDB and UCF, and achieves an improvement of 1.5% over the baseline on the large-scale K600 dataset. Notably, the categories in the K600 validation set are more complicated than those in HMDB and UCF. This validates our OD Solver’s effectiveness, especially in complicated open-vocabulary settings.

The impact of different descriptors. We investigate the impact of Spatio-Temporal Descriptors on the performance of our proposed method. The results in Table 4(b) demonstrate that each descriptor is complementary to the others, indicating that both Spatio and Temporal Descriptors provide crucial information for recognition tasks. We also observe that Temporal Descriptors have a more pronounced effect than Spatio Descriptors.

Numbers of descriptors. We investigate the influence of varying the number of descriptors $N$ in Table 4(c). We conduct experiments with 2, 4, and 8 Spatio-Temporal Descriptors and observe that the performance peaks at $N_{s,t}=4$. We further checked the quality of the descriptors when varying $N$ (please refer to the Supplementary Material for examples of descriptors). We find that 2 descriptors cannot provide enough information for cross-modal matching. When the number of descriptors reaches 8, the hallucination problem of the LLM becomes more severe, resulting in a significant number of noisy descriptors. We therefore set $N$ to 4 in our basic settings.

The impact of conditioning descriptors on category names. We study the effect of conditioning descriptors on category names on the final zero-shot accuracy. Table 4(d) shows that conditioning Temporal Descriptors on category names achieves the best performance, while conditioning both descriptors may lead to performance degradation. This further supports the point made in [39, 34, 22] that visual-language pre-trained models are less sensitive to verbs. As a result, the category conditioning technique helps the semantic distribution of the Temporal Descriptors cluster well, making the optimization process smoother.

The effects of weight-space ensembling. We investigate the effects of the linear weight-space ensembling technique. As shown in Table 4(e), the ensembling technique greatly mitigates the catastrophic forgetting problem, especially on the large-scale Kinetics-600 dataset, where the zero-shot accuracy is improved by 2.2%.
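The linear weight-space ensembling above can be sketched as a per-parameter interpolation between two checkpoints. This is a minimal sketch under stated assumptions: weights are plain lists keyed by hypothetical parameter names, and `ratio` is taken to weight the fine-tuned model, a direction the text does not specify.

```python
def interpolate_weights(pretrained, finetuned, ratio):
    """Linearly interpolate two checkpoints with matching parameter names.

    ratio = 0 returns the pretrained weights, ratio = 1 the fine-tuned ones.
    (Assumption: the ratio is applied to the fine-tuned model.)
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name in pretrained:
        merged[name] = [
            (1.0 - ratio) * p + ratio * f
            for p, f in zip(pretrained[name], finetuned[name])
        ]
    return merged


# Hypothetical two-parameter "models" for illustration only.
pretrained = {"proj": [1.0, -1.0], "bias": [0.0]}
finetuned = {"proj": [2.0, 1.0], "bias": [0.5]}
ensembled = interpolate_weights(pretrained, finetuned, ratio=0.2)
```

With real models, the same interpolation would be applied tensor by tensor to the two state dictionaries before evaluation.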

4.3 Cost Analysis

We analyze the additional cost of our method during training and inference in Table 5. Latency is measured in our basic training setting and throughput is measured using the largest possible batch size before running out of memory with a single NVIDIA 4090-24G. Notably, the original implementation of ViFi-CLIP [47] utilizes cross-entropy loss and maintains the logits for all categories in every mini-batch during training, leading to a larger latency. For a fair comparison, we re-implement ViFi-CLIP with local infoNCE-styled loss [56] to analyze the training cost. Our pipeline only requires an extra 0.1 $\times$ training time and reduces the throughput by about 2%, which is acceptable given the improvement in performance.

4.4 Visualizations

We conduct a qualitative study on the attention maps of our OST in the zero-shot setting. As depicted in Fig. 4, compared to our baseline ViFi-CLIP [47], our method not only focuses on varied spatial cues but also consistently attends to temporally salient elements (e.g. the player’s feet) for videos that include more scene dynamics. Additionally, we investigate the attention maps of our method on extreme outlier samples in Fig. 5. Our empirical findings indicate that our OST upholds robust generalization capabilities, even on extreme out-of-distribution examples. Please refer to the Supplementary Material for more qualitative results.

5 Conclusion

In this work, we introduce a novel general video recognition pipeline OST. We prompt an LLM to augment category names into Spatio-Temporal Descriptors and refine the semantic knowledge via Optimal Descriptor Solver. Comprehensive evaluations in six datasets and three different tasks demonstrate the effectiveness of our approach.

Acknowledgement

The work was done while Tongjia was a research intern mentored by Chen Chen. We thank Ming Li and Yong He for proofreading and discussion.

References

Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
Becattini et al. [2020] Federico Becattini, Tiberio Uricchio, Lorenzo Seidenari, Lamberto Ballan, and Alberto Del Bimbo. Am i done? predicting action progress in videos. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(4):1–24, 2020.
Bertasius et al. [2021] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, 2021.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020.
Bulat and Tzimiropoulos [2023] Adrian Bulat and Georgios Tzimiropoulos. Lasp: Text-to-text optimization for language-aware soft prompting of vision & language models. In CVPR, 2023.
Caba Heilbron et al. [2015] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
Carreira and Zisserman [2017] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
Carreira et al. [2018] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
Chen et al. [2023a] Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. Prompt learning with optimal transport for vision-language models. In ICLR, 2023a.
Chen et al. [2020] Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, and Jingjing Liu. Graph optimal transport for cross-domain alignment. In ICML, 2020.
Chen and Huang [2021] Shizhe Chen and Dong Huang. Elaborative rehearsal for zero-shot action recognition. In ICCV, 2021.
Chen et al. [2023b] Tom Tongjia Chen, Hongshan Yu, Zhengeng Yang, Ming Li, Zechuan Li, Jingwen Wang, Wei Miao, Wei Sun, and Chen Chen. First place solution to the cvpr’2023 aqtc challenge: A function-interaction centric approach with spatiotemporal visual-language alignment. arXiv preprint arXiv:2306.13380, 2023b.
Chen et al. [2023c] Yifei Chen, Dapeng Chen, Ruijin Liu, Hao Li, and Wei Peng. Video action recognition with attentive semantic units. In ICCV, 2023c.
Courty et al. [2017] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. NeurIPS, 2017.
Cuturi [2013] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 2013.
Damodaran et al. [2018] Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and Nicolas Courty. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In ECCV, 2018.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
Fan et al. [2021] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In ICCV, 2021.
Feng et al. [2023] Zhili Feng, Anna Bair, and J Zico Kolter. Leveraging multiple descriptive features for robust few-shot image learning. arXiv preprint arXiv:2307.04317, 2023.
Goyal et al. [2017] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. NeurIPS, 2017.
Hendricks and Nematzadeh [2021] Lisa Anne Hendricks and Aida Nematzadeh. Probing image-language transformers for verb understanding. arXiv preprint arXiv:2106.09141, 2021.
Hou et al. [2017] Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (t-cnn) for action detection in videos. In ICCV, 2017.
Ilharco et al. [2022] Gabriel Ilharco, Mitchell Wortsman, Samir Yitzhak Gadre, Shuran Song, Hannaneh Hajishirzi, Simon Kornblith, Ali Farhadi, and Ludwig Schmidt. Patching open-vocabulary models by interpolating weights. NeurIPS, 2022.
Jia et al. [2021] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
Ju et al. [2022] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In ECCV, 2022.
Kantorovich [2006] Leonid V Kantorovich. On the translocation of masses. Journal of mathematical sciences, 133(4):1381–1382, 2006.
Kaul et al. [2023] Prannay Kaul, Weidi Xie, and Andrew Zisserman. Multi-modal classifiers for open-vocabulary object detection. arXiv preprint arXiv:2306.05493, 2023.
Khamis et al. [2023] Abdelwahed Khamis, Russell Tsuchida, Mohamed Tarek, Vivien Rolland, and Lars Petersson. Earth movers in the big data era: A review of optimal transport in machine learning. arXiv preprint arXiv:2305.05080, 2023.
Kim et al. [2023] Kwanyoung Kim, Yujin Oh, and Jong Chul Ye. Zegot: Zero-shot segmentation through optimal transport of text prompts. arXiv preprint arXiv:2301.12171, 2023.
Kuehne et al. [2011] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In ICCV, 2011.
Kullback and Leibler [1951] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
Lin et al. [2019] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In ICCV, 2019.
Lin et al. [2023] Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, and Horst Bischof. Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge. arXiv preprint arXiv:2303.08914, 2023.
Lin et al. [2022] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. In ECCV, 2022.
Liu et al. [2021] Benlin Liu, Yongming Rao, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Multi-proxy wasserstein classifier for image classification. In AAAI, 2021.
Liu et al. [2022] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In CVPR, 2022.
Menon and Vondrick [2023] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. ICLR, 2023.
Momeni et al. [2023] Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, and Cordelia Schmid. Verbs in action: Improving verb understanding in video-language models. In ICCV, 2023.
Monge [1781] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci., pages 666–704, 1781.
Ni et al. [2022] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In ECCV, 2022.
Petric Maretic et al. [2019] Hermina Petric Maretic, Mireille El Gheche, Giovanni Chierchia, and Pascal Frossard. Got: an optimal transport framework for graph comparison. NeurIPS, 2019.
Pratt et al. [2023] Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. In CVPR, 2023.
Qian et al. [2022] Yijun Qian, Lijun Yu, Wenhe Liu, and Alexander G Hauptmann. Rethinking zero-shot action recognition: Learning from latent atomic actions. In ECCV, 2022.
Qing et al. [2023] Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, and Nong Sang. Disentangling spatial and temporal learning for efficient image-to-video transfer learning. In ICCV, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
Rasheed et al. [2023] Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. In CVPR, 2023.
Salimans et al. [2018] Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving gans using optimal transport. arXiv preprint arXiv:1803.05573, 2018.
Sinkhorn [1967] Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.
Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
Strafforello et al. [2023] Ombretta Strafforello, Xin Liu, Klamer Schutte, and Jan van Gemert. Video bagnet: short temporal receptive fields increase robustness in long-term action recognition. In ICCV, 2023.
Tran et al. [2018] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
Tu et al. [2023] Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, and Yu-Gang Jiang. Implicit temporal modeling with learnable alignment for video recognition. arXiv preprint arXiv:2304.10465, 2023.
Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008.
Wang et al. [2016] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
Wang et al. [2021] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.
Wang et al. [2023] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
Wasim et al. [2023] Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In CVPR, 2023.
Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In CVPR, 2022.
Wu et al. [2023a] Wenhao Wu, Zhun Sun, and Wanli Ouyang. Revisiting classifier: Transferring vision-language models for video recognition. In AAAI, 2023a.
Wu et al. [2023b] Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, and Wanli Ouyang. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In CVPR, 2023b.
Xue et al. [2022] Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, and Jiebo Luo. Clip-vip: Adapting pre-trained image-text model to video-language representation alignment. arXiv preprint arXiv:2209.06430, 2022.
Yang et al. [2021] Taojiannan Yang, Sijie Zhu, Matias Mendieta, Pu Wang, Ravikumar Balakrishnan, Minwoo Lee, Tao Han, Mubarak Shah, and Chen Chen. Mutualnet: Adaptive convnet via mutual learning from different model configurations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
Yang et al. [2023] Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video action recognition. ICLR, 2023.
Yuan et al. [2021] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
Zhang et al. [2020] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In CVPR, 2020.
Zhang et al. [2023a] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a.
Zhang et al. [2017] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
Zhang et al. [2023b] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023b.
Zhou et al. [2023] Jiaming Zhou, Kun-Yu Lin, Yu-Kun Qiu, and Wei-Shi Zheng. Twinformer: Fine-to-coarse temporal modeling for long-term action recognition. IEEE Transactions on Multimedia, 2023.

Supplementary Material

6 Overview of Supplementary Material

In the supplementary material, we provide additional details in the following sections:

• Section 7: Further Analysis and Experiments
• Section 8: Details of Optimal Descriptor Solver
• Section 9: Dataset and Implementation Details
• Section 10: Demonstration of Prompts and Descriptors
• Section 11: Broader Impact and Limitation

7 Further Analysis and Experiments

7.1 Visualizations of Adaptive Transport Plan

We analyze the adaptive transport plan in our proposed OD Solver. Qualitative visualizations of the transport plan are illustrated in Fig. 6 and Fig. 7, with detailed explanations provided in the captions. We find that our proposed OD Solver can adaptively assign each descriptor to the video instance.

7.2 Visualizations of Attention Maps

We provide additional visualizations of the attention maps of our proposed OST in Fig. 8.

7.3 The Robustness of OST

We present case studies to illustrate the robustness of our proposed OST, specifically focusing on the transport plan depicted in Fig. 11 for scenarios where certain action steps are missing, and the attention maps in Fig. 12 where our OST effectively resolves category mismatches. Detailed analysis is provided within the captions of these figures.

7.4 Variant of Global Similarity

Besides the global similarity score computation illustrated in Eq. 9 in the main paper, an alternative global similarity score can be computed by initially determining the similarity between video representations and descriptor-level embeddings separately, and subsequently averaging these scores to derive the overall global video-descriptor similarity score. Although this approach may appear mathematically analogous to Eq. 9, the modified gradient flow during the training process could yield divergent outcomes. As demonstrated in Table 6, this implementation still exhibits sub-optimal performance in comparison to OST, thereby underscoring the superiority of our proposed method.

Table 6: Study on variants of global similarity score

Method	HMDB-51	UCF-101	K600
Variant 1	53.3	76.6	69.3
Variant 2	52.0	76.4	69.3
OST	54.5	77.9	72.3

8 Details of Optimal Descriptor Solver

8.1 Theoretical Analysis

In this section, we provide a theoretical analysis of the existence and uniqueness of the optimal transport plan $\boldsymbol{P}^{\ast}$ in our proposed OD Solver.

As discussed in Eq. 2 in the main paper, we obtain a set of frame-level features $\boldsymbol{V}\in\mathbb{R}^{T\times{d}}$ and descriptor-level embeddings for each class, $\boldsymbol{D_{k}^{s}}\in\mathbb{R}^{N_{s}\times{d}}$ and $\boldsymbol{D_{k}^{t}}\in\mathbb{R}^{N_{t}\times{d}}$. The cost matrix for each class can then be defined as:

$$\boldsymbol{C_{k}^{s}}=1-\cos(\boldsymbol{V},\boldsymbol{D_{k}^{s}}),\quad\boldsymbol{C_{k}^{t}}=1-\cos(\boldsymbol{V},\boldsymbol{D_{k}^{t}}). \tag{16}$$
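To make the cost matrix in Eq. 16 concrete, the sketch below computes $1-\cos(\cdot,\cdot)$ between every frame feature and every descriptor embedding, giving a $T\times N$ matrix. The function names and the toy 2-D vectors are assumptions for illustration; real features would come from the visual and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cost_matrix(frames, descriptors):
    """Return the T x N matrix with entries 1 - cos(frame_i, descriptor_j)."""
    return [[1.0 - cosine(f, d) for d in descriptors] for f in frames]


# T = 2 frames, N = 3 descriptors (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = cost_matrix(frames, descriptors)
```

A frame that aligns perfectly with a descriptor has zero cost, while orthogonal pairs have cost 1, so lower cost means a better match.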

We can define the OT problem in Kantorovich formulation as:

$$\boldsymbol{P}^{\ast}=\underset{\boldsymbol{P}\in\mathbb{R}^{T\times N}}{\arg\min}\sum^{T}_{i=1}\sum^{N}_{j=1}\boldsymbol{P}_{ij}\boldsymbol{C}_{ij}\quad\textrm{s.t.}\quad\boldsymbol{P}\boldsymbol{e}=\boldsymbol{\mu},\quad\boldsymbol{P}^{\top}\boldsymbol{e}=\boldsymbol{\nu}. \tag{17}$$

However, solving the problem in Eq. 17 has $O(n^{3}\log n)$ complexity, which is time-consuming. Following the Sinkhorn algorithm [15], we instead define the entropy-regularized OT problem as:

$$\boldsymbol{P}^{\ast}=\underset{\boldsymbol{P}\in\mathbb{R}^{T\times N}}{\arg\min}\sum^{T}_{i=1}\sum^{N}_{j=1}\boldsymbol{P}_{ij}\boldsymbol{C}_{ij}-\lambda\boldsymbol{H}(\boldsymbol{P})\quad\textrm{s.t.}\quad\boldsymbol{P}\boldsymbol{e}=\boldsymbol{\mu},\quad\boldsymbol{P}^{\top}\boldsymbol{e}=\boldsymbol{\nu}. \tag{18}$$

Adding an entropy regularization term to the original OT problem makes the objective strictly convex, so the regularized optimal transport plan has a simple closed form. This allows us to calculate the optimal transport distance via matrix scaling algorithms [49].

Lemma 1. For $\lambda>0$, the optimal transport plan $\boldsymbol{P}^{\ast}$ is unique and has the form $\boldsymbol{P}^{\ast}=\mathrm{diag}(\boldsymbol{a})\boldsymbol{K}\mathrm{diag}(\boldsymbol{b})$, where $\boldsymbol{a}\in\mathbb{R}^{T}$ and $\boldsymbol{b}\in\mathbb{R}^{N}$ are two non-negative vectors uniquely defined up to a multiplicative factor and $\boldsymbol{K}=\exp(-\boldsymbol{C}/\lambda)$.

Proof. The existence and uniqueness of $\boldsymbol{P}^{\ast}$ follow from the boundedness of the feasible set defined by $\boldsymbol{\mu},\boldsymbol{\nu}$ and the strict convexity of the negative entropy. Let $p_{ij}$ and $c_{ij}$ denote the entries of $\boldsymbol{P}$ and $\boldsymbol{C}$, and take the entropy as $\boldsymbol{H}(\boldsymbol{P})=-\sum_{ij}p_{ij}\log p_{ij}$. Consider $\mathcal{L}(\boldsymbol{P},\alpha,\beta)$, the Lagrangian of Eq. 18, where $\alpha,\beta$ are the dual variables corresponding to the two equality constraints on $\boldsymbol{\mu},\boldsymbol{\nu}$:

$$\mathcal{L}(\boldsymbol{P},\alpha,\beta)=\sum_{ij}\left(\lambda p_{ij}\log p_{ij}+p_{ij}c_{ij}\right)+\alpha^{\top}(\boldsymbol{P}\boldsymbol{e}-\boldsymbol{\mu})+\beta^{\top}(\boldsymbol{P}^{\top}\boldsymbol{e}-\boldsymbol{\nu}). \tag{19}$$

For any pair $(i,j)$, setting $\partial\mathcal{L}/\partial p_{ij}=0$ gives $\lambda(\log p_{ij}+1)+c_{ij}+\alpha_{i}+\beta_{j}=0$, and hence $p_{ij}=e^{-1/2-\alpha_{i}/\lambda}\,e^{-c_{ij}/\lambda}\,e^{-1/2-\beta_{j}/\lambda}$. Given that all entries of the matrix $\boldsymbol{K}$ are strictly positive, we know from Sinkhorn’s work [49] that there is a unique matrix of the form $\mathrm{diag}(\boldsymbol{a})\boldsymbol{K}\mathrm{diag}(\boldsymbol{b})$ that satisfies the constraints given by $\boldsymbol{\mu},\boldsymbol{\nu}$. Therefore, this matrix is necessarily $\boldsymbol{P}^{\ast}$, and we can calculate it using the Sinkhorn fixed-point iteration:

$$\boldsymbol{a}\leftarrow\boldsymbol{\mu}/(\boldsymbol{K}\boldsymbol{b}),\quad\boldsymbol{b}\leftarrow\boldsymbol{\nu}/(\boldsymbol{K}^{\top}\boldsymbol{a}). \tag{20}$$

∎
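The fixed-point iteration in Eq. 20 can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the function names, the fixed iteration count, and the toy cost matrix are assumptions, while $\lambda=0.1$ and the uniform marginals mirror the settings described in the paper.

```python
import math


def sinkhorn(cost, mu, nu, lam=0.1, n_iters=200):
    """Entropy-regularized OT plan P = diag(a) K diag(b) with K = exp(-C / lam)."""
    T, N = len(cost), len(cost[0])
    K = [[math.exp(-cost[i][j] / lam) for j in range(N)] for i in range(T)]
    a = [1.0] * T
    b = [1.0] * N
    for _ in range(n_iters):
        # a <- mu / (K b), element-wise
        a = [mu[i] / sum(K[i][j] * b[j] for j in range(N)) for i in range(T)]
        # b <- nu / (K^T a), element-wise
        b = [nu[j] / sum(K[i][j] * a[i] for i in range(T)) for j in range(N)]
    return [[a[i] * K[i][j] * b[j] for j in range(N)] for i in range(T)]


def transport_score(plan, cost):
    """Frobenius inner product between the plan and the similarity 1 - C."""
    return sum(
        plan[i][j] * (1.0 - cost[i][j])
        for i in range(len(plan))
        for j in range(len(plan[0]))
    )


# T = 2 frames, N = 3 descriptors, uniform marginals (toy values).
cost = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.5]]
mu = [0.5, 0.5]
nu = [1.0 / 3.0] * 3
plan = sinkhorn(cost, mu, nu)
score = transport_score(plan, cost)
```

In this toy case each frame sends most of its mass to the descriptor it matches best and splits the remainder with the shared third descriptor, while the row and column sums of `plan` reproduce `mu` and `nu`.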

8.2 Pseudo-Code on OD Solver

As explained in the paper, our OD Solver is effective and simple to implement. In Algorithm 1, we show the PyTorch style pseudo-code on the implementation of our proposed Optimal Descriptor Solver.

Algorithm 1 PyTorch style pseudo-code on Optimal Descriptor Solver

import torch


def OptimalDescriptorSolver(video_emb, descriptor_emb, eps=0.1):
    # video_emb: (B, T, D) frame-level features; descriptor_emb: (A, N, D) descriptor embeddings.
    # Both are assumed to be L2-normalized, so the dot product equals the cosine similarity.
    # eps is the entropic regularization weight (lambda in Eq. 18).
    A, N, D = descriptor_emb.shape  # Get the shape of descriptor embeddings
    B, T, D = video_emb.shape  # Get the shape of video embeddings
    sim = torch.einsum('btd,and->batn', video_emb, descriptor_emb)  # Compute the similarity
    sim = sim.reshape(B * A, T, N)  # One (T, N) matrix per video-class pair
    cost_mat = 1 - sim  # Calculate the cost matrix
    # Uniform marginals over frames and over descriptors
    pp_x = torch.full((B * A, T), 1. / T, dtype=sim.dtype, device=sim.device)
    pp_y = torch.full((B * A, N), 1. / N, dtype=sim.dtype, device=sim.device)
    with torch.no_grad():  # The transport plan is treated as a constant during backpropagation
        KK = torch.exp(-cost_mat / eps)  # Gibbs kernel K = exp(-C / eps)
        P = Sinkhorn(KK, pp_x, pp_y)  # Apply Sinkhorn algorithm to obtain the optimal transport plan P
    # Using optimal transport plan P to obtain logits
    score_ot = torch.sum(P * sim, dim=(1, 2))  # Frobenius inner product
    logits = score_ot.view(B, A)  # Classification logits
    return logits


def Sinkhorn(K, u, v):
    r = torch.ones_like(u)  # Row scaling vector, same shape as u
    c = torch.ones_like(v)  # Column scaling vector, same shape as v
    thresh = 1e-2  # Threshold to determine convergence in Sinkhorn iterations
    max_iter = 100  # Maximum number of Sinkhorn iterations
    # Sinkhorn iteration
    for i in range(max_iter):
        r0 = r  # Save the previous iteration's r
        r = u / torch.matmul(K, c.unsqueeze(-1)).squeeze(-1)  # Update r
        c = v / torch.matmul(K.permute(0, 2, 1), r.unsqueeze(-1)).squeeze(-1)  # Update c
        err = (r - r0).abs().mean()  # Mean absolute change between iterations
        if err.item() < thresh:  # If the change is below the threshold, stop iterating
            break
    P = torch.matmul(r.unsqueeze(-1), c.unsqueeze(-2)) * K  # Final plan diag(r) K diag(c)
    return P

9 Implementation Details

9.1 Dataset Details

We use six video benchmarks in our empirical studies:

Kinetics-400 [7] is a large-scale video dataset consisting of 10-second video clips collected from YouTube. It contains 240,000 training videos and 20,000 validation videos in 400 different action categories.

Kinetics-600 [8] is an extension of Kinetics-400, consisting of approximately 480,000 videos from 600 action categories. The videos are divided into 390,000 for training, 30,000 for validation, and 60,000 for testing. We mainly use its validation set for zero-shot evaluation.

UCF-101 [50] is a video recognition dataset for realistic actions, collected from YouTube, including 13,320 video clips with 101 action categories in total. There are three splits of the training and testing data.

HMDB-51 [31] is a relatively small video dataset compared to Kinetics and UCF-101. It has around 7,000 videos with 51 classes. HMDB-51 has three splits of the training and testing data.

Something-Something V2 [20] is a challenging temporal-heavy dataset which contains 220,000 video clips across 174 fine-grained classes.

ActivityNet [6] is a large-scale untrimmed video benchmark. We use ActivityNet-v1.3 in our experiments, which contains 19,994 untrimmed videos of 5 to 10 minutes from 200 activity categories.

9.2 Implementation Details

Zero-shot Experiments. We mainly follow the zero-shot setting in [47, 41]. We tune both the visual and textual encoders of a CLIP ViT-B/16 with 32 frames on Kinetics-400 for 10 epochs. The batch size is set to 256, and single-view inference is adopted during validation. We set the hyperparameter of the Sinkhorn algorithm [15] to $\lambda=0.1$. We adopt the AdamW optimizer with an initial learning rate of $8\times 10^{-6}$ and the CosineAnnealing learning rate schedule. Following [34, 24, 59], we perform linear weight-space ensembling between the original CLIP model and the finetuned model with a ratio of $0.2$.
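For readers unfamiliar with cosine annealing, the schedule can be written in closed form, as sketched below. The base learning rate of $8\times 10^{-6}$ is from the text; the zero minimum learning rate, the absence of warmup, and the step-wise (rather than epoch-wise) update are assumptions, and `cosine_annealing_lr` is a hypothetical helper name.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=8e-6, min_lr=0.0):
    """Learning rate after `step` of `total_steps` under cosine annealing."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at the start, middle, and end of a hypothetical run.
total = 1000
lr_start = cosine_annealing_lr(0, total)
lr_mid = cosine_annealing_lr(total // 2, total)
lr_end = cosine_annealing_lr(total, total)
```

The rate starts at the base value, passes through half of it at the midpoint, and decays smoothly to the minimum at the end of training.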

We apply the following evaluation protocols in our zero-shot experiments: For UCF-101 and HMDB-51, the prediction is conducted on three official splits of the test data. We report average Top-1 accuracy and standard deviation. For Kinetics-600, following [11], the 220 new categories outside Kinetics-400 are used for evaluation. We use the three splits provided by [11] and sample 160 categories for evaluation from the 220 categories in Kinetics-600 for each split. We report average Top-1 and Top-5 accuracy and standard deviation.
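The reporting protocol above amounts to averaging per-split accuracies and computing their spread. The sketch below shows one way to do this; whether the population or the sample standard deviation is reported is not stated, so the use of `statistics.pstdev` is an assumption, and the accuracy values are placeholders rather than results from the paper.

```python
import statistics


def summarize_splits(accuracies):
    """Return (mean, population standard deviation) of per-split accuracies."""
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


# Placeholder Top-1 accuracies for three splits (not results from the paper).
split_top1 = [54.0, 55.0, 56.0]
mean_top1, std_top1 = summarize_splits(split_top1)
```

Swapping in `statistics.stdev` would give the sample standard deviation instead, which is slightly larger for three splits.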

Few-shot Experiments. For the few-shot setting, we utilize CLIP ViT-B/16. We adopt the few-shot split from [47, 41], which randomly samples 2, 4, 8, and 16 videos from each class on UCF-101, HMDB-51, and Something-Something V2 to construct the training set. For evaluation, we use the first split of the test set on UCF-101, HMDB-51, and Something-Something V2. We utilize 32 frames during training and validation. Top-1 accuracy with single-view inference is reported. We set the batch size to 64 and train for 50 epochs in few-shot experiments.
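The K-shot construction described above (sampling K videos per class) can be illustrated as follows. The experiments reuse the fixed splits from [47, 41], so this is only a sketch of the general idea; the function name, the seed, and the toy video list are hypothetical.

```python
import random
from collections import defaultdict


def sample_k_shot(video_labels, k, seed=0):
    """Pick k videos per class from a list of (video_id, label) pairs."""
    by_class = defaultdict(list)
    for video_id, label in video_labels:
        by_class[label].append(video_id)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_class):
        videos = sorted(by_class[label])
        if len(videos) < k:
            raise ValueError(f"class {label!r} has fewer than {k} videos")
        subset.extend((video_id, label) for video_id in rng.sample(videos, k))
    return subset


# Toy dataset: 3 classes with 5 videos each (hypothetical identifiers).
toy = [(f"{label}_{i}", label) for label in ("run", "jump", "swim") for i in range(5)]
two_shot = sample_k_shot(toy, k=2)
```

Sorting the classes and video identifiers before sampling keeps the result independent of the input order, which helps when comparing methods on the same subset.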

Fully-supervised Experiments. For fully-supervised studies, we base our approach on Text4Vis [60] to conduct experiments in frozen text settings and keep the hyperparameters and data augmentations consistent with the baseline. We use CLIP ViT-B/32 and ViT-B/16 as encoders and train with 8 and 16 frames, respectively. We report Top-1 accuracy using single-view inference.

Data Augmentation Recipe. For a fair comparison, we largely follow the data augmentations in ViFi-CLIP [47] for zero-shot and few-shot experiments and follow the recipe in Text4Vis [60] for fully-supervised experiments. The details for our data augmentation recipe are shown in Table 7.

Table 7: Data augmentation recipe for video recognition.

Augmentation Setting	Zero/Few-shot	Fully-supervised
RandomFlip	0.5	0.5
Crop	MultiScaleCrop	RandomSizedCrop
ColorJitter	0.8	0
GrayScale	0.2	0.2
Label smoothing	0	0
Mixup	0	0
Cutmix	0	0

Training and Testing. We employ the identical alignment mechanism throughout both the training and testing phases. The only difference lies in the application of contrastive-style operations during training, where logits are obtained exclusively from descriptors within the current mini-batch. During testing, classification scores are calculated against descriptors from all classes.

10 Demonstration of Prompts and Descriptors

10.1 Prompting the Language Model

We provide our prompts for generating Spatio-Temporal Descriptors in Fig. 9 and Fig. 10, respectively. We provide details in the figure captions.

10.2 Additional Examples of Spatio-Temporal Descriptors

In this section, we provide additional examples of the Spatio-Temporal Descriptors.

Descriptors for action category “Adjusting Glasses”:

Spatio Descriptor:
1. person wearing glasses
2. hand adjusting glasses
3. glasses sliding on face
4. fingers pushing up glasses

Temporal Descriptor:
1. Push the glasses up the bridge of your nose
2. Align the temples with your ears
3. Adjust the nose pads for comfort
4. Ensure that the glasses rest comfortably on your face

Descriptors for action category “Assembling Bicycle”:

Spatio Descriptor:
1. Bicycle frame
2. Handlebars
3. Wheels
4. Pedals

Temporal Descriptor:
1. Attach the front wheel to the bicycle frame using a wrench and follow the specified torque setting.
2. Secure the handlebars onto the front fork by tightening the stem bolts with an Allen wrench.
3. Install the pedals onto the crank arms by screwing them in clockwise.
4. Adjust the seat height to the desired position and tighten the seat clamp to secure it.

Descriptors for action category “Building Sandcastle”:

Spatio Descriptor:
1. beach
2. sand
3. castle
4. bucket

Temporal Descriptor:
1. Dig a shallow hole in the sand for the base
2. Fill the hole with wet sand and pack it down firmly
3. Create a large mound of sand on top of the base
4. Use your hands or tools to shape the sand into walls and towers

Descriptors for action category “Opening Wine Bottle”:

Spatio Descriptor:
1. wine bottle
2. corkscrew
3. uncorking
4. pouring

Temporal Descriptor:
1. Hold the wine bottle firmly
2. Remove the foil or plastic covering from the top of the bottle
3. Insert the corkscrew into the center of the cork
4. Twist the corkscrew counterclockwise to remove the cork

Descriptors for action category “Planing Wood”:

Spatio Descriptor:

1. wood
2. sawdust
3. saw
4. workbench

Temporal Descriptor:

1. Measure and mark the dimensions of the wood piece
2. Cut the wood according to the marked measurements
3. Smooth the edges of the cut wood using sandpaper
4. Apply a coat of varnish or paint to protect and enhance the appearance of the wood
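
For readers who wish to reuse these examples, the following is a minimal sketch of how one category's descriptors could be stored as a small knowledge base keyed by category name. The `CategoryDescriptors` class, its field names, and the equal-length check are illustrative assumptions and are not taken from the OST implementation; the descriptor strings are copied from the “Building Sandcastle” example above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CategoryDescriptors:
    """Hypothetical container; not part of the OST codebase."""

    category: str
    spatio: Tuple[str, ...]    # Spatio Descriptor entries
    temporal: Tuple[str, ...]  # Temporal Descriptor entries, in step order

    def __post_init__(self) -> None:
        # Assumption: both sets have the same size (four in the examples above).
        if len(self.spatio) != len(self.temporal):
            raise ValueError("spatio and temporal sets must have equal length")


sandcastle = CategoryDescriptors(
    category="Building Sandcastle",
    spatio=("beach", "sand", "castle", "bucket"),
    temporal=(
        "Dig a shallow hole in the sand for the base",
        "Fill the hole with wet sand and pack it down firmly",
        "Create a large mound of sand on top of the base",
        "Use your hands or tools to shape the sand into walls and towers",
    ),
)

# Knowledge base keyed by category name (layout is an assumption).
knowledge_base: Dict[str, CategoryDescriptors] = {sandcastle.category: sandcastle}
```

Keeping the two descriptor sets in separate fields mirrors the way they are listed here; how they are encoded and matched to frames is described in the main text and is not reproduced in this sketch.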

11 Broader Impact and Limitation

OST is an effective way to use external knowledge when adapting pre-trained vision-language models to general video recognition. Our approach benefits zero-shot, few-shot, and fully supervised video recognition with no modification to the model architecture and only minor additional computational cost. Furthermore, the proposed Spatio-Temporal Descriptors can greatly reduce the semantic similarity between action categories. Because the descriptors are generated by an LLM, the approach extends readily to unseen action categories, enabling open-vocabulary understanding of actions in the wild.

However, the quality of the descriptors directly affects the final performance. Descriptor generation depends heavily on the knowledge learned by the LLM, which is only partially controllable by varying the prompts. Additionally, our findings suggest that the amount of information needed to describe an action differs across categories, so relying on exactly four Spatio-Temporal Descriptors might not be ideal for every category. An adaptive approach, in which the number of descriptors is tailored to each category, would likely be more effective.
