
Modality-Aware Contrastive Instance Learning with Self-Distillation
for Weakly-Supervised Audio-Visual Violence Detection

Jiashuo Yu1,∗, Jinyu Liu1,∗, Ying Cheng2, Rui Feng1,2,3,†, Yuejie Zhang1,3,†
1 School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, China
2 Academy for Engineering and Technology, Fudan University, China
3 Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University, China
{jsyu19,jinyuliu20,chengy18,fengrui,yjzhang}@fudan.edu.cn

arXiv:2207.05500v1 [cs.CV] 12 Jul 2022

∗ Equal contribution. † Corresponding authors.

ABSTRACT
Weakly-supervised audio-visual violence detection aims to distinguish snippets containing multimodal violence events with video-level labels. Many prior works perform audio-visual integration and interaction in an early or intermediate manner, yet overlook the modality heterogeneity under the weakly-supervised setting. In this paper, we analyze the modality asynchrony and undifferentiated instances phenomena of the multiple instance learning (MIL) procedure, and further investigate their negative impact on weakly-supervised audio-visual learning. To address these issues, we propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy. Specifically, we leverage a lightweight two-stream network to generate audio and visual bags, in which unimodal background, violent, and normal instances are clustered into semi-bags in an unsupervised way. Then audio and visual violent semi-bag representations are assembled as positive pairs, and violent semi-bags are combined with background and normal instances in the opposite modality as contrastive negative pairs. Furthermore, a self-distillation module is applied to transfer unimodal visual knowledge to the audio-visual model, which alleviates noise and closes the semantic gap between unimodal and multimodal features. Experiments show that our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset. Results also demonstrate that our proposed approach can be used as plug-in modules to enhance other networks. Codes are available at https://github.com/JustinYuu/MACIL_SD.

CCS CONCEPTS
• Computing methodologies → Scene anomaly detection.

KEYWORDS
Multi-Modality, Contrastive Learning, Violence Detection.

ACM Reference Format:
Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, Yuejie Zhang. 2022. Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection. In Proceedings of ACM MULTIMEDIA CONFERENCE 2022 (MM'2022). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3503161.3547868

Figure 1: a) An example of the modality asynchrony. During the violent event abuse, the abuser first hits the victim, where the violent message is reflected in the visual modality. Then the scream of the victim occurs, indicating the auditory violence information. b) An illustration of the undifferentiated instances. In each bag, violent cues are distributed in some instances while others contain background noises, and the discrepancy between normal segments and background noises also exists. We argue that adding additional constraints could enhance model discrimination.

1 INTRODUCTION
Recent years have witnessed the extension of violence detection from a pure vision task [4, 18, 20, 30, 33, 45, 47, 54, 61, 67, 68] to an audio-visual multimodal problem [43, 44, 62], for which the corresponding auditory content supplements fine-grained violent
cues. Although numerous modality fusion and interaction methods have shown promising results, the modality discrepancy of the multiple instance learning (MIL) [38] framework under the weakly-supervised setting remains to be explored.

To alleviate the appetite for fine-labeled data, MIL is widely adopted for weakly-supervised violence detection, where the output of each video sequence is formed into a bag containing multiple snippet-level instances. In audio-visual scenarios, all prior works share a general scheme that regards each audio-visual snippet as an integral instance and averages the top-K audio-visual logits as the final video-level scores. However, we analyze that this formulation suffers from two defects: modality asynchrony and undifferentiated instances. Modality asynchrony indicates the temporal inconsistency between auditory and visual violence cues. Taking the typical violent event abuse in Figure 1(a) as an example, when the abuser hits the victim, the scream occurs afterward, and the entire procedure is regarded as a violent event. In this situation, only part of the visual modality (2nd-3rd snippets) and part of the audio modality (4th-5th snippets) contain violent clues. We argue that directly leveraging an audio-visual pair as an instance could introduce data noise to the video-level optimization. The other defect we discovered is undifferentiated instances, that is, picking only the top-K instances for optimization leaves numerous disengaged instances. As shown in Figure 1(b), in a violent video, the violent event is reflected in some audio/visual instances, while others contain irrelevant elements such as background noises. Likewise, in videos of normal events, a few snippets contain elements of normal events, while others include background information. In this case, the K-max activation abandons the instances containing background elements, and the discrepancy between violent and normal instances is not explicitly revealed. To this end, we argue that adding contrastive constraints among the violent, normal, and background instances could contribute to the discrimination toward violent content.

Driven by this preliminary analysis, we propose a simple yet effective framework constructed from a modality-aware contrastive instance learning (MA-CIL) module and a self-distillation (SD) module. To address the modality asynchrony, we form unimodal bags apart from the original audio-visual bags, compute unimodal logits, and cluster embeddings of top-K and bottom-K unimodal instances as semi-bags. To differentiate instances, we propose a modality-aware contrastive-based method. In detail, the audio and visual violent semi-bags are constructed as positive pairs, while the violent semi-bags are assembled with embeddings of instances in the background and normal semi-bags as negative pairs. Furthermore, a self-distillation module is applied to distill unimodal knowledge to the audio-visual model, which closes the semantic gap between modalities and alleviates the data noise introduced by the abundant cross-modality interactions. In summary, our contributions are as follows:

• We analyze the modality asynchrony and undifferentiated instances phenomena of the widely-used MIL framework in audio-visual scenarios, further elaborating their disadvantages for weakly-supervised audio-visual violence detection.
• We propose a modality-aware contrastive instance learning with self-distillation framework to introduce feature discrimination and alleviate modality noise.
• Equipped with a lightweight network, our framework outperforms the state-of-the-art methods on the XD-Violence dataset, and our model also shows generalizability as plug-in modules.

2 RELATED WORKS
2.1 Weakly-Supervised Violence Detection
Weakly-supervised violence detection requires identifying violent snippets under video-level labels, where the MIL [38] framework is widely used for denoising irrelevant information. Some previous works [4, 18, 20, 30, 45, 47, 61, 67, 68] regard violence detection as a pure vision task and leverage CNN-based networks to encode visual features. Among these methods, various feature integration and amelioration methods are proposed to enhance the robustness of MIL. Tian et al. [54] propose RTFM, a robust temporal feature magnitude learning method, to refine the capacity of recognizing positive instances. Li et al. [33] design a Transformer [57]-based multi-sequence learning network to reduce the probability of instance selection errors. However, these models neglect the corresponding auditory information as well as the cross-modality interactions, thereby restricting the performance of violence prediction.

Recently, Wu et al. [62] curate a large-scale audio-visual dataset, XD-Violence, and establish an audio-visual benchmark. However, they integrate audio and visual features in an early fusion way, thereby limiting further inter-modality interactions. To facilitate multimodal fusion, Pang et al. [43] propose an attention-based network to adaptively integrate audio and visual features with a mutual learning module in an intermediate manner. Different from prior methods, we perform inter-modality interactions via a lightweight two-stream network and conduct discriminative multimodal learning via modality-aware contrast and self-distillation.

2.2 Contrastive Learning
Contrastive learning is formulated by contrasting positive pairs against negative pairs without data supervision. In the unimodal field, several visual methods [10, 23, 25, 35] leverage the augmentation of visual data as a contrast to increase model discrimination. Furthermore, some natural language processing methods utilize token- and sentence-level contrasts to enhance the performance of pre-trained models [15, 50] and supervised tasks [17, 46]. In the multimodal field, some works introduce modality-aware contrasts to vision-language tasks, such as image captioning [16, 58], visual question answering [9, 60], and representation learning [34, 49, 59, 66]. Moreover, recent literature [1, 2, 14, 32, 37, 39, 40, 42] utilizes the temporal consistency of audio-visual streams as contrastive pretext tasks to learn robust audio-visual representations. Based on existing instance-level contrastive frameworks [12, 63], we put forward the concept of semi-bags and leverage the cross-modality contrast to obtain model discrimination.

2.3 Cross-Modality Knowledge Distillation
Knowledge distillation is first proposed to transfer knowledge from large-scale architectures to lightweight models [5, 28]. In contrast, cross-modality distillation aims to transfer unimodal knowledge to multimodal models to alleviate the semantic gap between modalities. Several methods [21, 29] distill depth features to the RGB representations via hallucination networks to address the
modality missing and noisy phenomena. Chen et al. [13] propose an audio-visual distillation strategy, which learns compositional embeddings and transfers knowledge across semantically uncorrelated modalities. Recently, Multimodal Knowledge Expansion [65] is proposed as a two-stage distillation strategy, which transfers knowledge from unimodal teacher networks to the multimodal student network by generating pseudo labels. Inspired by the methodology of self-distillation [6, 8, 11, 19, 52, 64], we propose a parameter integration paradigm to transfer visual knowledge to our audio-visual model via two similar lightweight networks, which reduces the modality noise and benefits robust audio-visual representation.

3 PRELIMINARIES
Given an audio-visual video sequence S = (S^A, S^V), where S^A is the audio channel and S^V denotes the visual channel, the entire sequence is divided into T non-overlapping segments {s_t^A, s_t^V}_{t=1}^{T}. For an audio-visual pair (s_t^A, s_t^V), the weakly-supervised violence detection task requires distinguishing whether it contains violent events via an event relevance label y_t ∈ {0, 1}, where y_t = 1 means at least one modality in the current segment includes violent cues. In the training phase, only video-level labels y are available for optimization. Hence, a general scheme is to utilize the multiple instance learning (MIL) procedure to satisfy the weak supervision. In the MIL framework, each video sequence S is regarded as a bag, and video segments {s_t^A, s_t^V}_{t=1}^{T} are taken as instances. Then instances are aggregated via a specific feature-level/score-level pooling method to generate video-level predictions p. In this paper, we utilize the K-max activation with average pooling rather than attention-based methods [41, 53] or global pooling [51, 67] as the aggregation function. To be specific, given the audio and visual features f_a, f_v extracted by CNN networks, we use a multimodal network to generate unimodal logits l_a, l_v and audio-visual logits l_av. The embeddings of audio and visual instances are symbolized as h_a and h_v. Then we average the K maximum logits and use the sigmoid activation to generate the video-level prediction p. Due to the additional constraint of our proposed contrastive learning method, we define the unimodal bags B_a, B_v. In each unimodal bag, instances are clustered into several semi-bags B_m, m ∈ {a, v}, based on their intrinsic characteristics, and the corresponding semi-bag representations are denoted as B̄_m, m ∈ {a, v}.
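As a concrete illustration of this aggregation function, here is a minimal PyTorch sketch of the K-max activation with average pooling. The function name and the toy usage are ours, and the ordering (average the K maximum logits, then sigmoid) follows the description above rather than any released code.

```python
import torch

def kmax_video_score(snippet_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Aggregate the snippet-level logits of one bag (video) into a video-level prediction p."""
    k = min(k, snippet_logits.numel())
    topk_logits = torch.topk(snippet_logits, k=k).values   # K-max activation
    return torch.sigmoid(topk_logits.mean())               # average pooling, then sigmoid

# Toy usage: a bag of T = 40 snippet logits, with K = floor(T / 16) + 1 as reported in Section 5.2.
p = kmax_video_score(torch.randn(40), k=40 // 16 + 1)
```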
4 METHODOLOGY
Our proposed framework consists of three parts: a lightweight two-stream network, modality-aware contrastive instance learning (MA-CIL), and the self-distillation (SD) module. An illustration of our framework, shown in Figure 2, is detailed as follows.

Figure 2: An illustration of our proposed Modality-Aware Contrastive Instance Learning with Self-Distillation framework. Our approach consists of three parts: the lightweight two-stream network, modality-aware contrastive instance learning (MA-CIL), and the self-distillation (SD) module. Taking audio and visual features extracted from pretrained networks as inputs, we design a simple yet effective attention-based network to perform audio-visual interaction. Then a modality-aware contrasting-based method is used to cluster instances of different types into several semi-bags and further obtain model discrimination. Finally, a self-distillation module is deployed to transfer visual knowledge to our audio-visual network, aiming to alleviate modality noise and close the semantic gap between unimodal and multimodal features. The entire framework is trained jointly in a weakly supervised manner, and we adopt the multiple instance learning (MIL) strategy for optimization.

4.1 Two-Stream Network
Considering that prior methods suffer from the parameter redundancy of large-scale networks, we design an encoder-agnostic lightweight architecture to achieve feature aggregation and modality interaction. Taking the visual and auditory features f_v, f_a extracted by pre-trained networks (e.g., I3D and VGGish for visual and audio features, respectively) as input, our proposed network consists of three parts: linear layers to keep the dimensions of input features identical, a cross-modality attention layer to perform inter-modality interactions, and a MIL module for the weakly-supervised training. Among these modules, the cross-modality attention layer is ameliorated from the encoder part of the Transformer [57], which includes multi-head self-attention, a feed-forward layer, residual connections [26], and layer normalization [3]. In the raw self-attention block, features are projected by three different parameter matrices as query, key,
and value vectors, respectively. Then the scaled dot-product attention score is computed by att(q, k, v) = \mathrm{softmax}(\frac{qk^T}{\sqrt{d_m}})v, where q, k, v denote the query, key, and value vectors, d_m is the dimension of the query vectors, and T denotes the matrix transpose operation. To enforce cross-modality interactions, we change the key and value vectors of the self-attention block to features of the other modality:

h_a = att(f_a W_Q, f_v W_K, f_v W_V),  (1)
h_v = att(f_v W_Q, f_a W_K, f_a W_V),  (2)

where h_a, h_v are the updated audio and visual features, and W_Q, W_K, W_V are learnable parameters. We adopt a parameter-sharing strategy for feature projection to reduce computation.

We adopt the MIL procedure under the weakly-supervised setting to obtain video-level scores. Unlike prior works, we process unimodal features individually to alleviate modality asynchrony. To be specific, fully-connected layers are used in each modality to generate unimodal logits. Then we take the summation of the unimodal logits as the fused audio-visual logits while reserving the unimodal logits for the following contrastive learning. Finally, the top-K audio-visual logits are average-pooled and put into a sigmoid activation to generate video-level scores for optimization. The entire procedure is formulated as:

l_a, l_v = W_a f_a^{out} + b_a, \; W_v f_v^{out} + b_v,  (3)
p = \Theta(\Omega(\sigma(l_a \oplus l_v))),  (4)

where W_a, W_v, b_a, b_v are learnable parameters, Ω is the K-max activation, σ denotes the sigmoid function, ⊕ is the summation operation, Θ denotes average pooling, and p is the video-level prediction.
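To make the two-stream interaction concrete, the following PyTorch sketch implements a single-head version of the cross-modality attention in Eqs. (1)-(2) and the logit fusion of Eq. (3). Class and variable names are our assumptions, the multi-head, feed-forward, residual, and normalization details of the actual block are omitted, and the final aggregation into p reuses the K-max pooling sketched in Section 3.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Single-head sketch of Eqs. (1)-(2): queries from one modality, keys/values from the other.

    Projection weights are shared between the two directions, as stated above.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def att(self, q, k, v):
        # att(q, k, v) = softmax(q k^T / sqrt(d_m)) v
        return torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1) @ v

    def forward(self, f_a, f_v):
        # f_a, f_v: (T, dim) snippet features of one video
        h_a = self.att(self.w_q(f_a), self.w_k(f_v), self.w_v(f_v))  # Eq. (1)
        h_v = self.att(self.w_q(f_v), self.w_k(f_a), self.w_v(f_a))  # Eq. (2)
        return h_a, h_v

# Per-modality FC heads produce the unimodal logits (Eq. (3)); their sum is the fused
# audio-visual logit, which is aggregated into p via the earlier K-max sketch (Eq. (4)).
dim = 128
attn = CrossModalAttention(dim)
fc_a, fc_v = nn.Linear(dim, 1), nn.Linear(dim, 1)
f_a, f_v = torch.randn(32, dim), torch.randn(32, dim)   # toy inputs: T = 32 snippets
h_a, h_v = attn(f_a, f_v)
l_a, l_v = fc_a(h_a).squeeze(-1), fc_v(h_v).squeeze(-1)
l_av = l_a + l_v                                        # summation fusion (⊕)
```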
4.2 MA-CIL
To utilize more of the disengaged instances, we propose the MA-CIL module, which is shown on the right side of Figure 2. Given the embeddings h_a, h_v, we perform unsupervised clustering to divide them into violent, normal, and background semi-bag representations based on the visual and audio logits. We argue that the discrepancy between semantically irrelevant instances can be exploited to enrich the model's capacity for discrimination.

To be specific, we first leverage the video-level probabilities p to distinguish whether a given video contains violent events. In each mini-batch, for a video sequence S_i with p_i > 0.5, the top-K instances with the highest logits are clustered as the violent semi-bag B_m^{vio}(i) = {h_m(n)}_{n=1}^{K_vio}, m ∈ {a, v}. For a sequence S_j with p_j ≤ 0.5, the top-K instances are selected as the normal semi-bag B_m^{nor}(j) = {h_m(n)}_{n=1}^{K_nor}, m ∈ {a, v}. We hope that adding contrast between the normal and violent events could help the model distinguish the violent extent of the perceived signals.

Moreover, we argue that both normal and violent videos contain background snippets, and learning the difference between event-related segments and background noises could benefit the localization. Therefore, we select the bottom-K instances of the whole mini-batch as the background semi-bag B_m^{bgd} = {h_m(n)}_{n=1}^{K_bgd}, m ∈ {a, v}. In each mini-batch, the model should contrast violent audio-visual instances against negative pairs constructed from violent instances and other instances (background and normal).
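A minimal sketch of how these semi-bags could be gathered within a mini-batch is given below; the 0.5 threshold and the top-K/bottom-K selection follow the text, while the tensor shapes and the helper name are our assumptions rather than the released implementation.

```python
import torch

def build_semi_bags(h: torch.Tensor, logits: torch.Tensor, p: torch.Tensor,
                    k_top: int, k_bottom: int):
    """Cluster instances of one modality into violent / normal / background semi-bags.

    h:      (B, T, D) instance embeddings of a mini-batch (one modality).
    logits: (B, T)    instance-level logits of the same modality.
    p:      (B,)      video-level probabilities.
    """
    violent, normal = [], []
    for i in range(h.size(0)):
        idx = torch.topk(logits[i], k=k_top).indices
        # videos predicted violent (p > 0.5) contribute violent semi-bags,
        # the others contribute normal semi-bags
        (violent if p[i] > 0.5 else normal).append(h[i, idx])
    # bottom-K instances of the whole mini-batch form the background semi-bag
    flat_h, flat_logits = h.reshape(-1, h.size(-1)), logits.reshape(-1)
    bgd_idx = torch.topk(flat_logits, k=k_bottom, largest=False).indices
    background = flat_h[bgd_idx]
    return violent, normal, background
```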
An intuitive way is to randomly pick intra- and inter-semi-bag instances in the opposite modality as positive and negative pairs. However, we argue that audio and visual violent instances at diverse positions could be semantically mismatched, e.g., expressing the beginning and the ending of a violent event, respectively. Therefore, it is unnatural to assume that they share the same implication. Instead, we apply average pooling to the embeddings of all violent instances in each bag and form a semi-bag-level representation B_m^{vio}, m ∈ {a, v}. By doing so, the audio and visual representations both express event-level semantics, thereby alleviating the noise issue. To this end, we construct semi-bag-level positive pairs, which are assembled from the audio and visual violent semi-bag representations B_a^{vio}, B_v^{vio}. We also construct semi-bag-to-instance negative pairs to maintain numerous contrastive samples, where violent semi-bag representations are combined with background and normal instance embeddings h_m^{nor}, h_m^{bgd}, m ∈ {a, v}, in the opposite modality as negative pairs.

We use InfoNCE [55] as the training objective of this part, which closes the distance between positive pairs and enlarges the distance between negatives. The objective for the audio violent semi-bag representation B_a^{vio}(i) against the visual normal instance embeddings {h_v^{nor}(n)}_{n=1}^{K_nor} is formulated as:

\mathcal{L}_{ct}^{v2n}(B_a^{vio}(i)) = -\log \frac{e^{\phi(B_a^{vio}(i), B_v^{vio}(i))/\tau}}{e^{\phi(B_a^{vio}(i), B_v^{vio}(i))/\tau} + \sum_{n=1}^{K_{nor}} e^{\phi(B_a^{vio}(i), h_v^{nor}(n))/\tau}},  (5)

where φ denotes the cosine similarity function, τ is the temperature hyperparameter, and K_nor denotes the number of normal instances in the whole mini-batch. Similarly, the objective for the audio violent semi-bag representation B_a^{vio}(i) against the visual background instance embeddings {h_v^{bgd}(n)}_{n=1}^{K_bgd} is formulated as:

\mathcal{L}_{ct}^{v2b}(B_a^{vio}(i)) = -\log \frac{e^{\phi(B_a^{vio}(i), B_v^{vio}(i))/\tau}}{e^{\phi(B_a^{vio}(i), B_v^{vio}(i))/\tau} + \sum_{n=1}^{K_{bgd}} e^{\phi(B_a^{vio}(i), h_v^{bgd}(n))/\tau}},  (6)

where K_bgd denotes the number of background instances in the whole mini-batch. The visual-against-audio counterparts are highly similar, thus we omit them for concise writing.
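The following sketch computes one such InfoNCE term (the form shared by Eqs. (5) and (6)) for an audio violent semi-bag representation against its visual positive and a set of visual negatives. It assumes cosine similarity and temperature τ as stated above; all names are ours and this is not the released loss implementation.

```python
import torch
import torch.nn.functional as F

def semi_bag_infonce(b_a_vio: torch.Tensor, b_v_vio: torch.Tensor,
                     h_v_neg: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """One term of Eqs. (5)-(6) for a single violent semi-bag pair.

    b_a_vio: (D,)   audio violent semi-bag representation B_a^vio(i).
    b_v_vio: (D,)   visual violent semi-bag representation B_v^vio(i) (positive).
    h_v_neg: (N, D) visual normal or background instance embeddings (negatives).
    """
    pos = F.cosine_similarity(b_a_vio, b_v_vio, dim=0) / tau
    neg = F.cosine_similarity(b_a_vio.unsqueeze(0), h_v_neg, dim=1) / tau
    # -log( exp(pos) / (exp(pos) + sum_n exp(neg_n)) )
    return -(pos - torch.logsumexp(torch.cat([pos.unsqueeze(0), neg]), dim=0))
```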
4.3 Self-Distillation
The audio-visual interactions provided by the former parts could introduce abundant modality noise, and modality asynchrony also results in a semantic mismatch between multimodal and unimodal features at the same temporal position. To address these issues, we argue that training a similar visual network simultaneously enables the model to ensemble unimodal and multimodal knowledge. With a controllable co-distillation strategy, our proposed module warrants modality noise reduction and robust modality-agnostic knowledge.

Specifically, we propose an analogous unimodal network with an architecture comparable to our two-stream network. The cross-modality attention block is substituted by the standard Transformer encoder block including self-attention. During training, the unimodal network is trained with a relatively small learning rate, and parameters of the same layers are infused into the audio-visual network with an exponential moving average strategy:

\theta_{av} \leftarrow m\,\theta_{av} + (1 - m)\,\theta_v,  (7)
where θ_av and θ_v denote the parameters of the audio-visual model and the visual model, respectively, and m denotes the control hyperparameter following a cosine scheduler that increases from the initial value m̂ to 1 during training.
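A hedged sketch of this parameter infusion is shown below. Only Eq. (7) itself and the m̂-to-1 behaviour come from the text; how layers are matched between the two networks and the exact shape of the cosine schedule are our assumptions.

```python
import math
import torch

@torch.no_grad()
def distill_parameters(av_model: torch.nn.Module, visual_model: torch.nn.Module, m: float):
    """Eq. (7): infuse visual-network weights into the audio-visual network via EMA.

    Only layers that exist in both networks (same name and shape) are updated here;
    the matching rule in the original implementation may differ.
    """
    v_params = dict(visual_model.named_parameters())
    for name, p_av in av_model.named_parameters():
        p_v = v_params.get(name)
        if p_v is not None and p_v.shape == p_av.shape:
            p_av.mul_(m).add_((1.0 - m) * p_v)

def cosine_m(step: int, total_steps: int, m_hat: float = 0.91) -> float:
    """Control weight m, increased from the initial value m_hat to 1 with a cosine schedule."""
    return 1.0 - (1.0 - m_hat) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```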
4.4 Learning Objective
The entire framework is optimized in a joint-training manner. For the video-level prediction p, we leverage the binary cross-entropy loss L_B as the training objective and use a linearly growing strategy to control the weight of the contrastive loss. The total objective is:

\mathcal{L}_{av} = \frac{\lambda_{v2n}(t)}{K_{vio}} \sum_i \left( \mathcal{L}_{ct}^{v2n}(B_a^{vio}(i)) + \mathcal{L}_{ct}^{v2n}(B_v^{vio}(i)) \right) + \frac{\lambda_{v2b}(t)}{K_{vio}} \sum_i \left( \mathcal{L}_{ct}^{v2b}(B_a^{vio}(i)) + \mathcal{L}_{ct}^{v2b}(B_v^{vio}(i)) \right) + \mathcal{L}_B,  (8)

\lambda(t) = \min(r \cdot t, \Lambda),  (9)

where K_vio denotes the number of violent semi-bags in the whole mini-batch, λ(t) is a controller that linearly increases the weight within a few epochs, r denotes the growing ratio, t is the current epoch, and Λ denotes the maximum weight.

The visual network is optimized via the BCE loss with video-level labels to distill unimodal knowledge. The two objectives are optimized simultaneously during training, while in the inference phase only the audio-visual network is used for prediction.
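The sketch below combines Eqs. (8)-(9) into a single training objective, assuming the two contrastive terms have already been summed over the violent semi-bags of the mini-batch and over both modalities; the default values follow Section 5.2, and everything else is our naming.

```python
import torch

def contrast_weight(epoch: int, r: float, lam_max: float) -> float:
    """Eq. (9): lambda(t) = min(r * t, Lambda), a linearly growing loss weight."""
    return min(r * epoch, lam_max)

def total_objective(bce_loss: torch.Tensor,
                    loss_v2n: torch.Tensor, loss_v2b: torch.Tensor,
                    k_vio: int, epoch: int,
                    r: float = 0.1, lam_v2n: float = 1.5, lam_v2b: float = 1.5) -> torch.Tensor:
    """Eq. (8): weighted contrastive terms plus the video-level BCE loss L_B."""
    return (contrast_weight(epoch, r, lam_v2n) / k_vio * loss_v2n
            + contrast_weight(epoch, r, lam_v2b) / k_vio * loss_v2b
            + bce_loss)
```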
5 EXPERIMENT
We design experiments to verify our model from two perspectives: the end-to-end framework compared with state-of-the-art methods, and assembling with other networks as plug-in modules. Experimental details and analyses are introduced as follows.

5.1 Dataset and Evaluation Metric
The XD-Violence [62] dataset is by far the only available large-scale audio-visual dataset for violence detection, and it is also the largest dataset compared with other unimodal datasets. XD-Violence consists of 4,757 untrimmed videos (217 hours) and six types of violent events, which are curated from real-life movies and in-the-wild scenes on YouTube. Although previous methods adopt some popular datasets [36, 51] as benchmarks, we argue that these datasets only contain unimodal visual content, which cannot support cross-modality interactions and thus cannot verify our proposed multimodal framework. Hence, following [43, 62], we select the large-scale audio-visual dataset XD-Violence as the benchmark. During inference, we utilize the frame-level average precision (AP) as the evaluation metric, following previous works [43, 54, 62].
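For clarity, frame-level AP can be computed as in the following sketch, assuming per-frame ground-truth labels and predicted violence scores are available as arrays; this uses scikit-learn and is not the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical per-frame labels (1 = violent) and scores, concatenated over all test videos.
frame_labels = np.array([0, 0, 1, 1, 0, 1])
frame_scores = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9])
ap = average_precision_score(frame_labels, frame_scores)
print(f"frame-level AP: {ap:.4f}")
```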
5.2 Implementation Details
To make a fair comparison, we adopt the same feature extraction procedure as prior methods [43, 54, 61, 62]. Concretely, we use the I3D [7] network pretrained on the Kinetics-400 dataset to extract visual features. Audio features are extracted via the VGGish [22, 27] network pretrained on a large YouTube dataset. The visual sample rate is set to 24 fps, and visual features are extracted by a sliding window with a size of 16 frames. For the auditory data, we first divide each audio track into 960-ms overlapped segments and compute the log-mel spectrogram with 96 × 64 bins.

The entire network is trained on an NVIDIA Tesla V100 GPU for 50 epochs. We set the batch size to 128 and the initial learning rate to 4e-4, which is dynamically adjusted by a cosine annealing scheduler. For the visual distillation network, the learning rate is set to 8e-5. We use Adam [31] as the optimizer without weight decay. During optimization, the weighting hyperparameters r, Λ_v2b, Λ_v2n are 0.1, 1.5, and 1.5, respectively. The initial distillation weight m̂ is set to 0.91. The temperature τ of InfoNCE [55] is set to 0.1. The hidden dimension of our two-stream network is 128, and the dropout rate is 0.1. For the MIL, we set the value K of the K-max activation as ⌊T/16⌋ + 1, where T denotes the length of the input feature.
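The settings above can be summarized in a small configuration sketch; the values are taken from the text, while the dictionary layout and helper function are ours and not part of the released configuration.

```python
# Training settings reported in Section 5.2 (values from the text; layout is ours).
config = {
    "epochs": 50,
    "batch_size": 128,
    "lr_audio_visual": 4e-4,   # adjusted by a cosine annealing scheduler
    "lr_visual_distill": 8e-5,
    "optimizer": "Adam",       # no weight decay
    "growing_ratio_r": 0.1,
    "lambda_v2b": 1.5,
    "lambda_v2n": 1.5,
    "m_hat": 0.91,             # initial distillation weight
    "tau": 0.1,                # InfoNCE temperature
    "hidden_dim": 128,
    "dropout": 0.1,
}

def k_for_video(T: int) -> int:
    """K of the K-max activation: floor(T / 16) + 1, with T the input feature length."""
    return T // 16 + 1
```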
Table 1: Comparison of the frame-level AP performance with unsupervised and weakly-supervised baselines. † denotes results re-implemented by integrating logits of two identical networks with audio and visual inputs, and * indicates results re-implemented by fusing audio and visual features as inputs.

Manner  | Method              | Modality | AP (%) | Param.
Unsup.  | SVM baseline        | V        | 50.78  | /
Unsup.  | OCSVM [48]          | V        | 27.25  | /
Unsup.  | Hasan et al. [24]   | V        | 30.77  | /
W. Sup. | Sultani et al. [51] | V        | 73.20  | /
W. Sup. | Wu et al. [61]      | V        | 75.90  | /
W. Sup. | RTFM [54]           | V        | 77.81  | 12.067M
W. Sup. | RTFM* [54]          | A+V      | 78.10  | 13.510M
W. Sup. | RTFM† [54]          | A+V      | 78.54  | 13.190M
W. Sup. | Li et al. [33]      | V        | 78.28  | /
W. Sup. | Wu et al. [62]      | A+V      | 78.64  | 0.843M
W. Sup. | Wu et al.† [62]     | A+V      | 78.66  | 1.539M
W. Sup. | Pang et al. [43]    | A+V      | 81.69  | 1.876M
W. Sup. | Ours (light)        | A+V      | 82.17  | 0.347M
W. Sup. | Ours (full)         | A+V      | 83.40  | 0.678M

5.3 Comparisons with State-of-the-Arts
We compare our proposed approach with state-of-the-art models, including (1) unsupervised methods: the SVM baseline, OCSVM [48], and Hasan et al. [24]; (2) unimodal weakly-supervised methods: Sultani et al. [51], RTFM [54], Li et al. [33], and Wu et al. [61]; (3) audio-visual weakly-supervised methods: Wu et al. [62] and Pang et al. [43]. We report the AP results on the XD-Violence dataset in Table 1.

With video-level supervisory signals, our method outperforms all previous unsupervised approaches by a large margin. Moreover, compared with previous unimodal weakly-supervised methods, our model surpasses prior results by a minimum of 5.12%, showing the necessity of utilizing multimodal cues for violence detection.

To further demonstrate the efficacy of our modality-aware contrastive instance learning and cross-modality distillation, we select state-of-the-art methods [43, 62] as audio-visual baselines and re-implement the SOTA unimodal MIL method [54] with two modality-expansion strategies. First, following [62], we fuse the audio and
visual features in an early way as model inputs. This approach forbids intermediate modality interaction in the network, aiming to show the performance of simply integrating multimodal data. Considering that some networks may be unsuitable for multimodal inputs, we put forward another strategy to train two unimodal networks simultaneously and generate audio and visual logits, respectively. The audio-visual predictions are generated by fusing the unimodal logits. Results show that our framework achieves 1.71% higher performance than the state-of-the-art method Pang et al. [43], which verifies that our MA-CIL and SD modules are practical for violence detection. Our method outperforms RTFM* and Wu et al. by 5.30% and 4.76% for multimodal variants using audio-visual inputs. For variants using the two-stream architecture, we observe that our model surpasses RTFM† and Wu et al.† by 4.86% and 4.74%, respectively, which suggests that modality-aware interactions are indispensable for multimodal scenarios. To conclude, using the same input features, our method achieves superior performance compared with all audio-visual methods, showing the effectiveness of our entire proposed audio-visual framework.

5.4 Plug-in Module
We also argue that our proposed modules have satisfying generalizability and are capable of enhancing other networks. To this end, we combine our framework with state-of-the-art methods and evaluate the performance. First, we re-implement the state-of-the-art audio-visual method [43] using the official implementation provided by the original paper. Then we select the unimodal method with publicly available code, RTFM [54], as the unimodal baseline, which is ameliorated into multimodal networks by the two means we mentioned above (* and †). For the multimodal method Wu et al. [62], we use the two-stream variant to examine the performance of our MA-CIL module and use the native version for combining with SD. Since the unimodal network RTFM [54] and the audio-visual method Wu et al. [62] can only be amalgamated with MA-CIL in the two-stream network manner (†), while SD should be assembled in an early modality fusion way (*), we can only combine these frameworks with our modules separately. For the multimodal approach [43], we test both the joint and the independent enhancement performance of our MA-CIL and SD modules.

We report the results on the XD-Violence dataset in Table 2.

Table 2: Results on the proposed MA-CIL and SD modules as plug-in modules. * indicates results re-implemented by fusing audio and visual features as inputs. † denotes results re-implemented by integrating logits of two identical networks with audio and visual inputs, respectively. ‡ is the ablated model that removes the fusion module and mutual loss.

Method            | MA-CIL | SD | AP (%)        | Param.
Wu et al. [62]    | ✗      | ✗  | 78.64         | 0.843M
Wu et al.† [62]   | ✗      | ✗  | 78.66         | 1.539M
Wu et al. [62]    | ✗      | ✓  | 80.07 (1.43↑) | 1.612M
Wu et al.† [62]   | ✓      | ✗  | 79.98 (1.32↑) | 1.539M
RTFM [54]         | ✗      | ✗  | 77.81         | 12.067M
RTFM* [54]        | ✗      | ✗  | 78.10         | 13.510M
RTFM† [54]        | ✗      | ✗  | 78.54         | 13.190M
RTFM* [54]        | ✗      | ✓  | 80.40 (2.30↑) | 25.577M
RTFM† [54]        | ✓      | ✗  | 80.00 (1.46↑) | 13.190M
Pang et al. [43]  | ✗      | ✗  | 81.69         | 1.876M
Pang et al.‡ [43] | ✗      | ✗  | 80.03         | 1.086M
Pang et al. [43]  | ✗      | ✓  | 81.21 (1.18↑) | 2.138M
Pang et al. [43]  | ✓      | ✗  | 80.90 (0.87↑) | 1.086M
Pang et al. [43]  | ✓      | ✓  | 82.21 (2.18↑) | 1.613M

First, we observe that MA-CIL boosts the unimodal baselines Wu et al. [62] and RTFM [54] by 1.32% and 1.46%, respectively, showing that our contrastive learning method improves the discrimination of models. We also note that, equipped with the SD module, the performance of [54, 62] gains an increase of 1.43% and 2.30%, respectively. For the multimodal baseline [43], we remove the mutual loss and multimodal fusion modules and leverage the vanilla attention-based variant (‡) for comparison. Results show that models enhanced with MA-CIL and SD, either separately or jointly, achieve accuracy boosts. In summary, we conclude that integrating our MA-CIL and SD modules is beneficial to numerous networks, and our modules can be utilized flexibly depending on specific usages.

5.5 Complexity Analysis
As mentioned before, we propose a computation-friendly framework that does not introduce many parameters. To support our claims, we compare parameter amounts with previous methods, shown in the Param. column of Tables 1 and 2. In Table 1, we report the parameter amounts of previous works that we re-implement and of our proposed framework, where Ours (light) denotes the ablated model without self-distillation, and Ours (full) indicates the full model with MA-CIL and SD. In Table 2, we provide the parameter amounts of the raw methods and our enhancement variants.

From the comparison with other methods, we observe that Ours (light) holds the smallest model size (0.347M) while outperforming all previous methods. Combined with the SD module, our full model still has fewer parameters and achieves the best performance. This result demonstrates the efficiency of our framework, which leverages a much simpler network yet gains better performance. As shown in Table 2, we note that the MA-CIL method does not introduce any parameters, as it exploits the intrinsic prior of multimodal instances and obtains model discrimination with no extra computation cost. When boosting the multimodal model [43], the enhanced model has a size comparable to the raw model due to the analogous model structure. This suggests that our proposed modules can be flexibly adapted to multimodal networks.
5.6 Ablation Studies
To further investigate the contribution of our proposed modules, we conduct ablation experiments to demonstrate how each aspect of our framework affects the overall performance.

We first conduct experiments on the effectiveness of each component, and the results are shown in Table 3. The vanilla two-stream network without MA-CIL and SD achieves a performance of 71.37%. We argue that this limited performance is driven by the small-scale model architecture. Equipped with MA-CIL, we observe a remarkable performance boost from 71.37% to 82.17%, proving that our proposed contrastive method benefits model discrimination and further improves the detection performance. We then investigate the role of our SD module. Adding the SD module to the raw two-stream network and to the network with MA-CIL, the ablated models achieve AP increases of 2.64% and 1.23%, respectively. This indicates that the SD module is effective both with and without contrastive learning, and the two modules complement each other for better violence detection performance.

Table 3: Ablation studies on different components of our proposed framework.

Index | Two-Stream | MA-CIL | SD | AP (%)
1     | ✓          | ✗      | ✗  | 71.37
2     | ✓          | ✗      | ✓  | 74.01
3     | ✓          | ✓      | ✗  | 82.17
4     | ✓          | ✓      | ✓  | 83.40

Then we perform ablation studies on the loss control strategy of our modality-aware contrastive instance learning. As shown in Table 4, Λ_v2b and Λ_v2n denote the maximum weights of L_ct^{v2b} and L_ct^{v2n}, respectively, and r is the linearly increasing ratio. Table 4 shows the results of different settings of Λ_v2b, Λ_v2n, and r. We observe that the optimal setting is Λ_v2b = 1.5, Λ_v2n = 1.5, r = 0.1, while training with the full weights from the very beginning (r = 3.0) brings worse performance. This suggests that gently raising the proportion of the contrastive loss is a plausible training strategy, where the model focuses more on the quality of the audio and visual embeddings in the early stage and learns feature discrimination afterwards.

Table 4: Ablation study for the hyperparameters in the proposed modality-aware contrastive instance learning.

Index | Λ_v2n | Λ_v2b | ratio (r) | AP (%)
1     | 1.0   | 1.0   | 0.1       | 82.62
2     | 1.0   | 1.0   | 0.3       | 82.67
3     | 1.0   | 1.0   | 3.0       | 82.09
4     | 1.5   | 1.0   | 0.1       | 82.95
5     | 1.5   | 1.0   | 0.3       | 81.37
6     | 1.5   | 1.0   | 3.0       | 82.15
7     | 1.0   | 1.5   | 0.1       | 83.21
8     | 1.0   | 1.5   | 0.3       | 82.62
9     | 1.0   | 1.5   | 3.0       | 81.68
10    | 1.5   | 1.5   | 0.1       | 83.40
11    | 1.5   | 1.5   | 0.3       | 81.61
12    | 1.5   | 1.5   | 3.0       | 82.14

Finally, we investigate the control hyperparameter m of the self-distillation block in our proposed method, as shown in Figure 3. Results show that the best performance is achieved at m = 0.91.

Figure 3: Ablation studies of different settings of the control hyperparameter m in our self-distillation module.

5.7 Qualitative Analysis
We first visualize the variation of the training loss and video-level accuracy on the XD-Violence dataset. Results are shown in Figure 4, where the red curve denotes the video-level accuracy, and the blue and green curves denote the BCE loss and contrastive loss, respectively. For the prediction accuracy, we observe a sudden decrease in the
first 10 training epochs, where the contrastive learning constraints are gradually applied with increasing weights. After learning the discrimination for a few epochs, the training accuracy begins to increase and finally outperforms the earlier results. A similar conclusion also appears in the loss curves. The reduction of the BCE loss comes in the first few epochs, where the model is required to generate high-quality embeddings. The contrastive loss has a lasting decline over dozens of epochs, which means the constraints force the model to differentiate instances over a long training period. These curves also indicate that the two objectives are co-optimized without interfering with each other. We argue that contrastive learning plays a complementary role to traditional MIL learning, and this insight further demonstrates the generalizability of our methods.

Figure 4: Illustration of the accuracy and loss curves over 50 epochs during training. The red curve denotes the video-level prediction accuracy. The ranges of the BCE loss and contrastive loss are shown in blue and green curves, respectively.

We also provide t-SNE [56] visualizations of the distributions of audio and visual features on the XD-Violence test set. Results are shown in Figure 5, where yellow dots denote background segments and purple dots are violent features. We can find that the violent and non-violent features are clearly clustered, and the distance between uncorrelated features is enlarged after the training procedure. This reveals that, aided by our proposed network, instances are successfully differentiated in both audio and visual modalities, further indicating the effectiveness of our proposed framework.

Figure 5: Feature space visualizations of the vanilla features and the output of our model on XD-Violence testing videos.
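A minimal sketch of the kind of t-SNE projection behind Figure 5 is given below, assuming snippet embeddings and labels have been exported to NumPy files; the file names and plotting choices are hypothetical and not the authors' visualization script.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.load("snippet_embeddings.npy")   # (N, D) audio or visual test-set embeddings (hypothetical file)
labels = np.load("snippet_labels.npy")         # 1 = violent, 0 = background/normal (hypothetical file)

points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="viridis", s=4)
plt.title("t-SNE of snippet embeddings")
plt.savefig("tsne_features.png", dpi=200)
```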
Finally, we provide visualizations of prediction results in Figure 6. Our model accurately localizes the anomalous events and even identifies normal events of very short duration between two violent events. In non-violent videos, the differences in magnitude between normal and background segments are also evident. Scores of normal events are slightly higher than those of the background segments yet far lower than those of the violent segments, and our method generates nearly zero predictions for the background snippets. These results show that our proposed approach enables the model to perceive the discrepancy between segments of different types (violent, normal, and background), and further contributes to violence detection.

Figure 6: Visualization of results on the XD-Violence test set. Red regions are the temporal ground-truths of violent events.

6 CONCLUSION
In this paper, we investigate the modality asynchrony and undifferentiated instances phenomena of MIL under audio-visual scenarios, and further show their impact on weakly-supervised audio-visual learning. Then a modality-aware contrastive instance learning with self-distillation framework is proposed to address these issues. To be specific, we design a lightweight two-stream network to generate audio and visual embeddings and logits. Furthermore, a cross-modality contrast is applied to audio and visual instances of different semantics, which involves more unused instances for better discrimination and alleviates the modality inconsistency. To diminish training noise, a self-distillation module is leveraged to transfer visual knowledge to the audio-visual network, by which the semantic gaps between unimodal and multimodal features are narrowed. Our framework outperforms previous methods on the XD-Violence dataset at minor expense. Besides, assembled with our contrastive learning and self-distillation modules, several prior methods achieve higher detection accuracy, showing the capability of our modules as plug-ins to ameliorate other networks.

ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (No. 62172101, No. 61976057). This work was supported (in part) by the Science and Technology Commission of Shanghai Municipality (No. 21511101000, No. 21511100602), and the SPMI Innovation and Technology Fund Projects (SAST2020-110).
REFERENCES
[1] Relja Arandjelovic and Andrew Zisserman. 2017. Look, listen and learn. In ICCV. 609–617.
[2] Relja Arandjelovic and Andrew Zisserman. 2018. Objects that sound. In ECCV. 435–451.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
[4] Enrique Bermejo Nievas, Oscar Deniz Suarez, Gloria Bueno García, and Rahul Sukthankar. 2011. Violence detection in video using computer vision techniques. In International Conference on Computer Analysis of Images and Patterns. Springer, 332–339.
[5] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 535–541.
[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9650–9660.
[7] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[8] Liqun Chen, Dong Wang, Zhe Gan, Jingjing Liu, Ricardo Henao, and Lawrence Carin. 2021. Wasserstein contrastive representation distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16296–16305.
[9] Long Chen, Yuhang Zheng, Yulei Niu, Hanwang Zhang, and Jun Xiao. 2021. Counterfactual samples synthesizing and training for robust visual question answering. arXiv preprint arXiv:2110.01013.
[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning. PMLR, 1597–1607.
[11] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems 33, 22243–22255.
[12] Tao Chen, Haizhou Shi, Siliang Tang, Zhigang Chen, Fei Wu, and Yueting Zhuang. 2021. CIL: Contrastive Instance Learning Framework for Distantly Supervised Relation Extraction. arXiv preprint arXiv:2106.10855.
[13] Yanbei Chen, Yongqin Xian, A Koepke, Ying Shan, and Zeynep Akata. 2021. Distilling audio-visual knowledge by compositional contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7016–7025.
[14] Ying Cheng, Ruize Wang, Zhihao Pan, Rui Feng, and Yuejie Zhang. 2020. Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning. In ACM MM. 3884–3892.
[15] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
[16] Bo Dai and Dahua Lin. 2017. Contrastive learning for image captioning. Advances in Neural Information Processing Systems 30.
[17] Sarkar Snigdha Sarathi Das, Arzoo Katiyar, Rebecca J Passonneau, and Rui Zhang. 2021. CONTaiNER: Few-shot named entity recognition via contrastive learning. arXiv preprint arXiv:2109.07589.
[18] Oscar Deniz, Ismael Serrano, Gloria Bueno, and Tae-Kyun Kim. 2014. Fast violence detection in video. In 2014 International Conference on Computer Vision Theory and Applications (VISAPP), Vol. 2. IEEE, 478–485.
[19] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. 2021. SEED: Self-supervised distillation for visual representation. arXiv preprint arXiv:2101.04731.
[20] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. 2021. MIST: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14009–14018.
[21] Nuno C Garcia, Pietro Morerio, and Vittorio Murino. 2018. Modality distillation with multiple stream networks for action recognition. In Proceedings of the European Conference on Computer Vision (ECCV). 103–118.
[22] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 776–780.
[23] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284.
[24] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. 2016. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 733–742.
[25] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9729–9738.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[27] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In ICASSP. 131–135.
[28] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[29] Judy Hoffman, Saurabh Gupta, and Trevor Darrell. 2016. Learning with side information through modality hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 826–834.
[30] Samee Ullah Khan, Ijaz Ul Haq, Seungmin Rho, Sung Wook Baik, and Mi Young Lee. 2019. Cover the violence: A novel deep-learning-based approach towards violence-detection in movies. Applied Sciences 9, 22, 4963.
[31] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[32] Bruno Korbar, Du Tran, and Lorenzo Torresani. 2018. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS. 7774–7785.
[33] Shuo Li, Fang Liu, and Licheng Jiao. 2022. Self-training multi-sequence learning with Transformer for weakly supervised video anomaly detection. (2022).
[34] Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2020. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409.
[35] Jinyu Liu, Ying Cheng, Yuejie Zhang, Rui-Wei Zhao, and Rui Feng. 2022. Self-supervised video representation learning with motion-contrastive perception. arXiv preprint arXiv:2204.04607.
[36] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. 2018. Future frame prediction for anomaly detection – a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6536–6545.
[37] Shuang Ma, Zhaoyang Zeng, Daniel McDuff, and Yale Song. 2021. Active contrastive learning of audio-visual video representations. In ICLR. https://openreview.net/forum?id=OMizHuea_HB
[38] Oded Maron and Tomás Lozano-Pérez. 1997. A framework for multiple-instance learning. Advances in Neural Information Processing Systems 10.
[39] Pedro Morgado, Ishan Misra, and Nuno Vasconcelos. 2021. Robust audio-visual instance discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12934–12945.
[40] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. 2021. Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12475–12486.
[41] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. 2018. Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6752–6761.
[42] Andrew Owens and Alexei A Efros. 2018. Audio-visual scene analysis with self-supervised multisensory features. In ECCV. 631–648.
[43] Wen-Feng Pang, Qian-Hua He, Yong-jian Hu, and Yan-Xiong Li. 2021. Violence detection in videos based on fusing visual and audio information. In ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2260–2264.
[44] Bruno Peixoto, Bahram Lavi, Paolo Bestagini, Zanoni Dias, and Anderson Rocha. 2020. Multimodal violence detection in videos. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2957–2961.
[45] Bruno Peixoto, Bahram Lavi, João Paulo Pereira Martin, Sandra Avila, Zanoni Dias, and Anderson Rocha. 2019. Toward subjective violence detection in videos. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8276–8280.
[46] Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. arXiv preprint arXiv:2010.01923.
[47] Nicolae-Catalin Ristea, Neelu Madan, Radu Tudor Ionescu, Kamal Nasrollahi, Fahad Shahbaz Khan, Thomas B Moeslund, and Mubarak Shah. 2021. Self-supervised predictive convolutional attentive block for anomaly detection. arXiv preprint arXiv:2111.09099.
[48] Bernhard Schölkopf, Robert C Williamson, Alex Smola, John Shawe-Taylor, and John Platt. 1999. Support vector method for novelty detection. Advances in Neural Information Processing Systems 12.
[49] Lei Shi, Kai Shuang, Shijie Geng, Peng Su, Zhengkai Jiang, Peng Gao, Zuohui Fu, Gerard de Melo, and Sen Su. 2020. Contrastive visual-linguistic pretraining. arXiv preprint arXiv:2007.13135.
[50] Yixuan Su, Fangyu Liu, Zaiqiao Meng, Lei Shu, Ehsan Shareghi, and Nigel Collier. 2021. TaCL: Improving BERT pre-training with token-aware contrastive learning. arXiv preprint arXiv:2111.04198.
[51] Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6479–6488.
[52] Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive representation distillation. arXiv preprint arXiv:1910.10699.
[53] Yapeng Tian, Dingzeyu Li, and Chenliang Xu. 2020. Unified multisensory perception: Weakly-supervised audio-visual video parsing. In European Conference on Computer Vision. Springer, 436–454.
[54] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. 2021. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4975–4986.
[55] Aaron Van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[56] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11.
[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30.
[58] Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B Chan. 2022. On distinctive image captioning via comparing and reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[59] Keyu Wen, Jin Xia, Yuanyuan Huang, Linyang Li, Jiayan Xu, and Jie Shao. 2021. COOKIE: Contrastive cross-modal knowledge sharing pre-training for vision-language representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2208–2217.
[60] Spencer Whitehead, Hui Wu, Heng Ji, Rogerio Feris, and Kate Saenko. 2021. Separating skills and concepts for novel visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5632–5641.
[61] Peng Wu and Jing Liu. 2021. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing 30, 3513–3527.
[62] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. 2020. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In European Conference on Computer Vision. Springer, 322–339.
[63] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3733–3742.
[64] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. 2020. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10687–10698.
[65] Zihui Xue, Sucheng Ren, Zhengqi Gao, and Hang Zhao. 2021. Multimodal knowledge expansion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 854–863.
[66] Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, and Junzhou Huang. 2022. Vision-language pre-training with triple contrastive learning. arXiv preprint arXiv:2202.10401.
[67] Jiangong Zhang, Laiyun Qing, and Jun Miao. 2019. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 4030–4034.
[68] Tao Zhang, Zhijie Yang, Wenjing Jia, Baoqing Yang, Jie Yang, and Xiangjian He. 2016. A new method for violence detection in surveillance scenes. Multimedia Tools and Applications 75, 12, 7327–7349.
