Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Jiashuo Yu1,*, Jinyu Liu1,*, Ying Cheng2, Rui Feng1,2,3,†, Yuejie Zhang1,3,†
{jsyu19,jinyuliu20,chengy18,fengrui,yjzhang}@fudan.edu.cn
* Equal contribution. † Corresponding authors.
MM’2022, October 10–14, 2022, Lisbon, Portugal. https://doi.org/10.1145/3503161.3547868

ABSTRACT
... pairs, and violent semi-bags are combined with background and normal instances as negative pairs. Furthermore, a self-distillation module transfers unimodal visual knowledge to the audio-visual model, which alleviates noises and closes the semantic gap between unimodal and multimodal features. Experiments show that our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset. Results also demonstrate that our proposed approach can be used as plug-in modules to enhance other networks. Codes are available at https://github.com/JustinYuu/MACIL_SD.

CCS CONCEPTS
• Computing methodologies → Scene anomaly detection.

KEYWORDS
Multi-Modality, Contrastive Learning, Violence Detection

Figure 1: a) An example of the modality asynchrony. During the violent event abuse, the abuser first hits the victim, where the violent message is reflected in the visual modality. Then the scream of the victim occurs, indicating the auditory violence information. b) The illustration of the undifferentiated instances. In each bag, violent cues are distributed in some instances while others contain background noises, and the discrepancy between normal segments and background noises also exists. We argue that adding additional constraints could enhance model discrimination.

1 INTRODUCTION
Recent years have witnessed the extension of violence detection from a pure vision task [4, 18, 20, 30, 33, 45, 47, 54, 61, 67, 68] to an audio-visual multimodal problem [43, 44, 62], for which the corresponding auditory content supplements fine-grained violent cues. Although numerous modality fusion and interaction methods have shown promising results, the modality discrepancy of the multiple instance learning (MIL) [38] framework under the weakly-supervised setting remains to be explored.
To alleviate the appetite for finely labeled data, MIL is widely adopted for weakly-supervised violence detection, where each video sequence is formed into a bag containing multiple snippet-level instances. In the audio-visual scenario, all prior works share a general scheme that regards each audio-visual snippet as an integral instance and averages the top-K audio-visual logits as the final video-level scores. However, we find that this formulation suffers from two defects: modality asynchrony and undifferentiated instances. Modality asynchrony indicates the temporal inconsistency between auditory and visual violence cues. Taking the typical violent event abuse in Figure 1(a) as an example, when the abuser hits the victim, the scream occurs afterward, and the entire procedure is regarded as a violent event. In this situation, scenes in part of the visual modality (2nd-3rd snippets) and the audio modality (4th-5th snippets) contain violent clues. We argue that directly leveraging an audio-visual pair as an instance could introduce data noise to the video-level optimization. The other defect we discovered is undifferentiated instances, that is, picking only the top-K instances for optimization leaves numerous disengaged instances. As shown in Figure 1(b), in a violent video the violent event is reflected in some audio/visual instances, while others contain irrelevant elements such as background noises. Similarly, in videos of normal events, a few snippets contain elements of normal events, while others include background information. In this case, the K-max activation abandons the instances containing background elements, and the discrepancy between violent and normal instances is not explicitly revealed. To this end, we argue that adding contrastive constraints among the violent, normal, and background instances could contribute to the discrimination toward violent content.

Driven by the preliminary analysis, we propose a simple yet effective framework constructed from a modality-aware contrastive instance learning (MA-CIL) module and a self-distillation (SD) module. To address the modality asynchrony, we form unimodal bags apart from the original audio-visual bags, compute unimodal logits, and cluster embeddings of top-K and bottom-K unimodal instances as semi-bags. To differentiate instances, we propose a modality-aware contrastive-based method. In detail, the audio and visual violent semi-bags are constructed as positive pairs, while the violent semi-bags are assembled with embeddings of instances in the background and normal semi-bags as negative pairs. Furthermore, a self-distillation module is applied to distill unimodal knowledge to the audio-visual model, which closes the semantic gap between modalities and alleviates the data noise introduced by the abundant cross-modality interactions. In summary, our contributions are as follows:

• We analyze the modality asynchrony and undifferentiated instances phenomena of the widely-used MIL framework in audio-visual scenarios, further elaborating their disadvantages for weakly-supervised audio-visual violence detection.
• We propose a modality-aware contrastive instance learning with self-distillation framework to introduce feature discrimination and alleviate modality noise.
• Equipped with a lightweight network, our framework outperforms the state-of-the-art methods on the XD-Violence dataset, and our model also shows generalizability as plug-in modules.

2 RELATED WORKS
2.1 Weakly-Supervised Violence Detection
Weakly-supervised violence detection requires identifying violent snippets under video-level labels, where the MIL [38] framework is widely used for denoising irrelevant information. Some previous works [4, 18, 20, 30, 45, 47, 61, 67, 68] regard violence detection as a pure vision task and leverage CNN-based networks to encode visual features. Among these methods, various feature integration and amelioration methods are proposed to enhance the robustness of MIL. Tian et al. [54] propose RTFM, a robust temporal feature magnitude learning method to refine the capacity of recognizing positive instances. Li et al. [33] design a Transformer [57]-based multi-sequence learning network to reduce the probability of instance selection errors. However, these models neglect the corresponding auditory information as well as the cross-modality interactions, thereby restricting the performance of violence prediction.

Recently, Wu et al. [62] curate a large-scale audio-visual dataset, XD-Violence, and establish an audio-visual benchmark. However, they integrate audio and visual features in an early fusion way, thereby limiting further inter-modality interactions. To facilitate multimodal fusion, Pang et al. [43] propose an attention-based network to adaptively integrate audio and visual features with a mutual learning module in an intermediate manner. Different from prior methods, we perform inter-modality interactions via a lightweight two-stream network and conduct discriminative multimodal learning via modality-aware contrast and self-distillation.

2.2 Contrastive Learning
Contrastive learning is formulated by contrasting positive pairs against negative pairs without data supervision. In the unimodal field, several visual methods [10, 23, 25, 35] leverage augmentations of visual data as contrasts to increase model discrimination. Furthermore, some natural language processing methods utilize token- and sentence-level contrasts to enhance the performance of pre-trained models [15, 50] and supervised tasks [17, 46]. In the multimodal field, some works introduce modality-aware contrasts to vision-language tasks, such as image captioning [16, 58], visual question answering [9, 60], and representation learning [34, 49, 59, 66]. Moreover, recent literature [1, 2, 14, 32, 37, 39, 40, 42] utilizes the temporal consistency of audio-visual streams as contrastive pretext tasks to learn robust audio-visual representations. Based on existing instance-level contrastive frameworks [12, 63], we put forward the concept of semi-bags and leverage cross-modality contrast to obtain model discrimination.

2.3 Cross-Modality Knowledge Distillation
Knowledge distillation was first proposed to transfer knowledge from large-scale architectures to lightweight models [5, 28]. In contrast, cross-modality distillation aims to transfer unimodal knowledge to multimodal models to alleviate the semantic gap between modalities. Several methods [21, 29] distill depth features to RGB representations via hallucination networks to address the modality missing and noisy phenomena. Chen et al. [13] propose an audio-visual distillation strategy, which learns a compositional embedding and transfers knowledge across semantically uncorrelated modalities. Recently, Multimodal Knowledge Expansion [65] was proposed as a two-stage distillation strategy, which transfers knowledge from unimodal teacher networks to the multimodal student network by generating pseudo labels. Inspired by the methodology of self-distillation [6, 8, 11, 19, 52, 64], we propose a parameter integration paradigm to transfer visual knowledge to our audio-visual model via two similar lightweight networks, which reduces the modality noise and benefits robust audio-visual representation.
Figure 2: An illustration of our proposed Modality-Aware Contrastive Instance Learning with Self-Distillation framework.
Our approach consists of three parts: the lightweight two-stream network, modality-aware contrastive learning (MA-CIL),
and self-distillation (SD) module. Taking audio and visual features extracted from pretrained networks as inputs, we design a
simple yet effective attention-based network to perform audio-visual interaction. Then a modality-aware contrastive
method is used to cluster instances of different types into several semi-bags and further obtain model discrimination. Finally,
a self-distillation module is deployed to transfer visual knowledge to our audio-visual network, aiming to alleviate modality
noise and close the semantic gap between unimodal and multimodal features. The entire framework is trained jointly in a
weakly supervised manner, and we adopt the multiple instance learning (MIL) strategy for optimization.
3 PRELIMINARIES
Given an audio-visual video sequence S = (S^A, S^V), where S^A is the audio channel and S^V denotes the visual channel, the entire sequence is divided into T non-overlapping segments {(s_t^A, s_t^V)}_{t=1}^T. For an audio-visual pair (s_t^A, s_t^V), the weakly-supervised violence detection task requires distinguishing whether it contains violent events via an event relevance label y_t ∈ {0, 1}, where y_t = 1 means at least one modality in the current segment includes violent cues. In the training phase, only video-level labels y are available for optimization. Hence, a general scheme is to utilize the multiple instance learning (MIL) procedure to satisfy the weak supervision.

In the MIL framework, each video sequence S is regarded as a bag, and the video segments {(s_t^A, s_t^V)}_{t=1}^T are taken as instances. Instances are then aggregated via a specific feature-level or score-level pooling method to generate the video-level prediction p. In this paper, we utilize the K-max activation with average pooling rather than attention-based methods [41, 53] or global pooling [51, 67] as the aggregation function. To be specific, given the audio and visual features f_a, f_v extracted by CNN networks, we use a multimodal network to generate unimodal logits l_a, l_v and audio-visual logits l_av. The embeddings of audio and visual instances are symbolized as h_a and h_v. Then we average the K maximum logits and use the sigmoid activation to generate the video-level prediction p. Due to the additional constraint of our proposed contrastive learning method, we define the unimodal bags B_a, B_v. In each unimodal bag, instances are clustered into several semi-bags based on their intrinsic characteristics, and the corresponding semi-bag representations are denoted as B_m, m ∈ {a, v}.

4 METHODOLOGY
Our proposed framework consists of three parts: a lightweight two-stream network, modality-aware contrastive instance learning (MA-CIL), and a self-distillation (SD) module. An illustration of our framework is shown in Figure 2 and detailed as follows.

4.1 Two-Stream Network
Considering that prior methods suffer from the parameter redundancy of large-scale networks, we design an encoder-agnostic lightweight architecture to achieve feature aggregation and modality interaction. Taking the visual and auditory features f_v, f_a extracted by pre-trained networks (e.g., I3D and VGGish for visual and audio features, respectively) as input, our proposed network consists of three parts: linear layers to keep the dimensions of the input features identical, a cross-modality attention layer to perform inter-modality interactions, and a MIL module for the weakly-supervised training. Among these modules, the cross-modality attention layer is ameliorated from the encoder part of the Transformer [57], which includes multi-head self-attention, a feed-forward layer, residual connections [26], and layer normalization [3]. In the raw self-attention block, features are projected by three different parameter matrices into query, key,
and value vectors, respectively. Then the scaled dot-product attention score is computed by

$$ att(q, k, v) = \mathrm{softmax}\left(\frac{q k^{T}}{\sqrt{d_m}}\right) v, $$

where q, k, v denote the query, key, and value vectors, d_m is the dimension of the query vectors, and T denotes the matrix transpose operation. To enforce cross-modality interactions, we change the key and value vectors of the self-attention block to the features of the other modality:

$$ h_a = att(f_a W_Q,\, f_v W_K,\, f_v W_V), $$ (1)
$$ h_v = att(f_v W_Q,\, f_a W_K,\, f_a W_V), $$ (2)

where h_a, h_v are the updated audio and visual features, and W_Q, W_K, and W_V are learnable parameters. We adopt a parameter-sharing strategy for feature projection to reduce computation.
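To make the interaction concrete, the following is a minimal PyTorch sketch of the cross-modality attention in Eqs. (1)-(2), reduced to a single head with shared W_Q, W_K, W_V projections; the module name, tensor shapes, and the exact placement of the residual, layer-normalization, and feed-forward steps are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Single-head sketch of Eqs. (1)-(2): queries come from one modality,
    keys/values from the other; W_Q, W_K, W_V are shared across modalities."""
    def __init__(self, dim):
        super().__init__()
        # Shared projections (the paper's parameter-sharing strategy).
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def att(self, q, k, v):
        # Scaled dot-product attention: softmax(q k^T / sqrt(d_m)) v.
        d_m = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / d_m ** 0.5
        return torch.matmul(scores.softmax(dim=-1), v)

    def forward(self, f_a, f_v):
        # f_a, f_v: (batch, T, dim) audio / visual features after the linear layers.
        h_a = self.att(self.w_q(f_a), self.w_k(f_v), self.w_v(f_v))  # Eq. (1)
        h_v = self.att(self.w_q(f_v), self.w_k(f_a), self.w_v(f_a))  # Eq. (2)
        # Residual connection, layer norm, and feed-forward, as in the Transformer encoder
        # (shared modules here only for brevity of the sketch).
        h_a = self.norm(h_a + f_a); h_a = self.norm(h_a + self.ffn(h_a))
        h_v = self.norm(h_v + f_v); h_v = self.norm(h_v + self.ffn(h_v))
        return h_a, h_v
```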
We adopt the MIL procedure under the weakly-supervised setting to obtain video-level scores. Unlike prior works, we process unimodal features individually to alleviate modality asynchrony. To be specific, fully-connected layers are used in each modality to generate unimodal logits. Then we take the summation of the unimodal logits as the fused audio-visual logits while reserving the unimodal logits for the following contrastive learning. Finally, the top-K audio-visual logits are average-pooled and put into a sigmoid function to obtain the video-level prediction:

$$ l_a = h_a W_a + b_a, \quad l_v = h_v W_v + b_v, $$ (3)
$$ p = \sigma\big(\Theta(\Omega(l_a \oplus l_v))\big), $$ (4)

where W_a, W_v, b_a, b_v are learnable parameters, Ω is the K-max activation, σ denotes the sigmoid function, ⊕ is the summation operation, Θ denotes average pooling, and p is the video-level prediction.
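A compact sketch of the MIL head in Eqs. (3)-(4) under the definitions above; the helper name and toy shapes are hypothetical, and the per-modality fully-connected layers are assumed to output one logit per snippet.

```python
import torch
import torch.nn as nn

def video_level_prediction(h_a, h_v, fc_a, fc_v, k):
    """Sketch of the MIL head: per-modality FC layers produce unimodal logits,
    their sum gives the audio-visual logits, and the top-K logits are averaged
    and squashed by a sigmoid into the video-level prediction p."""
    l_a = fc_a(h_a).squeeze(-1)                 # (batch, T) audio logits
    l_v = fc_v(h_v).squeeze(-1)                 # (batch, T) visual logits
    l_av = l_a + l_v                            # summation fusion, the ⊕ in Eq. (4)
    topk = torch.topk(l_av, k, dim=-1).values   # K-max activation Ω
    p = torch.sigmoid(topk.mean(dim=-1))        # average pooling Θ and sigmoid σ
    return p, l_a, l_v

# Usage with toy tensors: T = 48 snippets, hidden dim 128, K = T // 16 + 1.
h_a, h_v = torch.randn(2, 48, 128), torch.randn(2, 48, 128)
fc_a, fc_v = nn.Linear(128, 1), nn.Linear(128, 1)
p, l_a, l_v = video_level_prediction(h_a, h_v, fc_a, fc_v, k=48 // 16 + 1)
```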
4.2 MA-CIL
To utilize more disengaged instances, we propose the MA-CIL module, which is shown on the right side of Figure 2. Given the embeddings h_a, h_v, we perform unsupervised clustering to divide them into several semi-bags (violent, normal, and background) within each modality. A straightforward choice would be to contrast individual audio and visual violent instances directly. However, we argue that audio and visual violent instances at diverse positions could be semantically mismatched, such as expressing the beginning and the ending of a violent event, respectively. Therefore, it is unnatural to assume that they share the same implication. Instead, we conduct average pooling over the embeddings of all violent instances in each bag and form a semi-bag-level representation B_m^{vio}, m ∈ {a, v}. By doing so, the audio and visual representations both express event-level semantics, thereby alleviating the noise issue. To this end, we construct semi-bag-level positive pairs, which are assembled from the audio and visual violent semi-bag representations B_a^{vio}, B_v^{vio}. We also construct semi-bag-to-instance negative pairs to maintain numerous contrastive samples, where violent semi-bag representations are combined with the background and normal instance embeddings h_m^{bgd}, h_m^{nor}, m ∈ {a, v}, of the opposite modality as negative pairs.

We use InfoNCE [55] as the training objective of this part, which closes the distance between positive pairs and enlarges the distance between negatives. The objective for the audio violent semi-bag representation B_a^{vio}(i) against the visual normal instance embeddings {h_v^{nor}(n)}_{n=1}^{K_{nor}} is formulated as:

$$ \mathcal{L}_{ct}^{v2n}(B_a^{vio}(i)) = -\log \frac{e^{\phi(B_a^{vio}(i),\, B_v^{vio}(i))/\tau}}{e^{\phi(B_a^{vio}(i),\, B_v^{vio}(i))/\tau} + \sum_{n=1}^{K_{nor}} e^{\phi(B_a^{vio}(i),\, h_v^{nor}(n))/\tau}}, $$ (5)

where φ denotes the cosine similarity function, τ is the temperature hyperparameter, and K_nor denotes the number of normal instances in the whole mini-batch. Similarly, the objective for the audio violent semi-bag representation B_a^{vio}(i) against the visual background instance embeddings {h_v^{bgd}(n)}_{n=1}^{K_{bgd}} is formulated as:

$$ \mathcal{L}_{ct}^{v2b}(B_a^{vio}(i)) = -\log \frac{e^{\phi(B_a^{vio}(i),\, B_v^{vio}(i))/\tau}}{e^{\phi(B_a^{vio}(i),\, B_v^{vio}(i))/\tau} + \sum_{n=1}^{K_{bgd}} e^{\phi(B_a^{vio}(i),\, h_v^{bgd}(n))/\tau}}. $$ (6)

The objectives in the visual-to-audio direction are defined symmetrically for B_v^{vio}(i).
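The following sketch illustrates how the semi-bag contrast of Eqs. (5)-(6) can be computed, assuming (as described in the Introduction and Figure 2) that violent and background semi-bag material is taken from the top-K and bottom-K scoring unimodal instances of a violent bag; normal instance embeddings from normal videos would enter the same objective in place of the background ones. Function names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def semi_bags(h, logits, k):
    """For a violent bag: treat the top-k scoring instances as violent candidates
    and the bottom-k as background; the violent semi-bag representation is the
    average-pooled embedding of its instances."""
    top = torch.topk(logits, k, dim=-1).indices        # violent candidates
    bot = torch.topk(-logits, k, dim=-1).indices       # background candidates
    gather = lambda idx: torch.gather(h, 1, idx.unsqueeze(-1).expand(-1, -1, h.size(-1)))
    b_vio = gather(top).mean(dim=1)                    # semi-bag representation B^vio
    h_bgd = gather(bot)                                # background instance embeddings
    return b_vio, h_bgd

def info_nce(b_vio_a, b_vio_v, h_neg_v, tau=0.1):
    """Eqs. (5)/(6): pull the audio violent semi-bag towards the visual violent
    semi-bag, push it away from visual normal/background instance embeddings."""
    pos = F.cosine_similarity(b_vio_a, b_vio_v, dim=-1) / tau                 # (B,)
    neg = F.cosine_similarity(b_vio_a.unsqueeze(1), h_neg_v, dim=-1) / tau    # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    return -F.log_softmax(logits, dim=1)[:, 0].mean()
```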
4.3 Self-Distillation
As discussed in Section 2.3, we transfer unimodal visual knowledge to the audio-visual model through a parameter integration paradigm between two similar lightweight networks. The parameters of the audio-visual model are updated as:

$$ \theta_{av} \leftarrow m\,\theta_{av} + (1 - m)\,\theta_{v}, $$ (7)

where θ_av and θ_v denote the parameters of the audio-visual model and the visual model, respectively, and m denotes the control hyperparameter following a cosine scheduler that increases from the initial value m̂ to 1 during training.
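A minimal sketch of the parameter integration in Eq. (7), assuming the audio-visual and visual networks are architecturally identical so their parameters can be mixed element-wise, and that m follows a cosine schedule from m̂ to 1 as stated; function names are illustrative.

```python
import math
import torch

@torch.no_grad()
def integrate_parameters(av_model, v_model, m):
    """Eq. (7) sketch: theta_av <- m * theta_av + (1 - m) * theta_v,
    applied parameter-wise between two architecturally identical networks."""
    for p_av, p_v in zip(av_model.parameters(), v_model.parameters()):
        p_av.mul_(m).add_(p_v, alpha=1.0 - m)

def cosine_m(step, total_steps, m_hat=0.91):
    """Control weight m following a cosine scheduler from m_hat up to 1."""
    return 1.0 - (1.0 - m_hat) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```

With this schedule the contribution of the visual network gradually fades, so the audio-visual model keeps more of its own parameters toward the end of training.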
4.4 Learning Objective
The entire framework is optimized in a joint-training manner. For the video-level prediction p, we leverage the binary cross-entropy loss L_B as the training objective and use a linearly growing strategy to control the weight of the contrastive loss. The total objective is:

$$ \mathcal{L}_{av} = \frac{\lambda_{v2n}(t)}{K_{vio}} \sum_{i} \Big( \mathcal{L}_{ct}^{v2n}(B_a^{vio}(i)) + \mathcal{L}_{ct}^{v2n}(B_v^{vio}(i)) \Big) + \frac{\lambda_{v2b}(t)}{K_{vio}} \sum_{i} \Big( \mathcal{L}_{ct}^{v2b}(B_a^{vio}(i)) + \mathcal{L}_{ct}^{v2b}(B_v^{vio}(i)) \Big) + \mathcal{L}_{B}, $$ (8)

$$ \lambda(t) = \min(r \cdot t, \Lambda), $$ (9)

where K_vio denotes the number of violent semi-bags in the whole mini-batch, λ(t) is a controller that linearly increases the weight within a few epochs, r denotes the growing ratio, t is the current epoch, and Λ denotes the maximum weight.

The visual network is optimized via the BCE loss with video-level labels to distill unimodal knowledge. The two objectives are optimized simultaneously during training, while in the inference phase, only the audio-visual network is used for prediction.
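A short sketch of the joint objective in Eqs. (8)-(9); the contrastive terms are assumed to be collected per violent semi-bag as in the previous sketch, and the weights follow the linearly growing schedule clipped at Λ. Names and defaults are illustrative.

```python
def contrastive_weight(epoch, r=0.1, max_weight=1.5):
    """Eq. (9): lambda(t) = min(r * t, Lambda), a linearly growing weight."""
    return min(r * epoch, max_weight)

def total_loss(l_bce, v2n_terms, v2b_terms, epoch, r=0.1, lam_v2n=1.5, lam_v2b=1.5):
    """Eq. (8) sketch: BCE on the video-level prediction plus the averaged
    v2n and v2b contrastive terms, each scaled by its growing weight."""
    k_vio = max(len(v2n_terms), 1)   # number of violent semi-bags in the mini-batch
    loss = l_bce
    loss = loss + contrastive_weight(epoch, r, lam_v2n) / k_vio * sum(v2n_terms)
    loss = loss + contrastive_weight(epoch, r, lam_v2b) / k_vio * sum(v2b_terms)
    return loss
```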
5 EXPERIMENT
We design experiments to verify our model from two perspectives: the end-to-end framework compared with state-of-the-art methods, and assembling with other networks as plug-in modules. Experimental details and analyses are introduced as follows.

5.1 Dataset and Evaluation Metric
The XD-Violence [62] dataset is by far the only available large-scale audio-visual dataset for violence detection, and it is also the largest dataset compared with other unimodal datasets. XD-Violence consists of 4,757 untrimmed videos (217 hours) and six types of violent events, which are curated from real-life movies and in-the-wild scenes on YouTube. Although previous methods adopt some popular datasets [36, 51] as benchmarks, we argue that these datasets only contain unimodal visual content, which cannot support cross-modality interactions or further verify our proposed multimodal framework. Hence, following [43, 62], we select the large-scale audio-visual dataset XD-Violence as the benchmark. During inference, we utilize the frame-level average precision (AP) as the evaluation metric, following previous works [43, 54, 62].

5.2 Implementation Details
To make a fair comparison, we adopt the same feature extraction procedure as prior methods [43, 54, 61, 62]. Concretely, we use the I3D [7] network pretrained on the Kinetics-400 dataset to extract visual features. Audio features are extracted via the VGGish [22, 27] network pretrained on a large YouTube dataset. The visual sample rate is set to 24 fps, and visual features are extracted by a sliding window with a size of 16 frames. For the auditory data, we first divide each audio track into 960-ms overlapped segments and compute the log-mel spectrogram with 96 × 64 bins.

The entire network is trained on an NVIDIA Tesla V100 GPU for 50 epochs. We set the batch size to 128 and the initial learning rate to 4e-4, which is dynamically adjusted by a cosine annealing scheduler. For the visual distillation network, the learning rate is set to 8e-5. We use Adam [31] as the optimizer without weight decay. During optimization, the weight hyperparameters r, Λ_v2b, and Λ_v2n are 0.1, 1.5, and 1.5, respectively. The initial distillation weight m̂ is set to 0.91. The temperature τ of InfoNCE [55] is set to 0.1. The hidden dimension of our two-stream network is 128, and the dropout rate is 0.1. For the MIL, we set the value K of the K-max activation to ⌊T/16⌋ + 1, where T denotes the length of the input feature.
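For reference, a small sketch mirroring the stated training configuration (Adam without weight decay, learning rates 4e-4 and 8e-5, cosine annealing over 50 epochs, and K = ⌊T/16⌋ + 1); the model objects below are placeholders, not the actual networks.

```python
import torch

def kmax_value(T):
    """K of the K-max activation for a feature sequence of length T."""
    return T // 16 + 1

# Placeholder modules standing in for the audio-visual and visual networks.
av_model = torch.nn.Linear(128, 1)
v_model = torch.nn.Linear(128, 1)
opt_av = torch.optim.Adam(av_model.parameters(), lr=4e-4, weight_decay=0.0)
opt_v = torch.optim.Adam(v_model.parameters(), lr=8e-5, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt_av, T_max=50)
```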
5.3 Comparisons with State-of-the-Arts
We compare our proposed approach with state-of-the-art models, including (1) unsupervised methods: the SVM baseline, OCSVM [48], and Hasan et al. [24]; (2) unimodal weakly-supervised methods: Sultani et al. [51], RTFM [54], Li et al. [33], and Wu et al. [61]; and (3) audio-visual weakly-supervised methods: Wu et al. [62] and Pang et al. [43]. We report the AP results on the XD-Violence dataset in Table 1.

Table 1: Comparison of the frame-level AP performance with unsupervised and weakly-supervised baselines. † denotes results re-implemented by integrating logits of two identical networks with audio and visual inputs, and * indicates results re-implemented by fusing audio and visual features as inputs.

Manner | Method | Modality | AP (%) | Param.
Unsup. | SVM baseline | V | 50.78 | /
Unsup. | OCSVM [48] | V | 27.25 | /
Unsup. | Hasan et al. [24] | V | 30.77 | /
W. Sup. | Sultani et al. [51] | V | 73.20 | /
W. Sup. | Wu et al. [61] | V | 75.90 | /
W. Sup. | RTFM [54] | V | 77.81 | 12.067M
W. Sup. | RTFM* [54] | A+V | 78.10 | 13.510M
W. Sup. | RTFM† [54] | A+V | 78.54 | 13.190M
W. Sup. | Li et al. [33] | V | 78.28 | /
W. Sup. | Wu et al. [62] | A+V | 78.64 | 0.843M
W. Sup. | Wu et al.† [62] | A+V | 78.66 | 1.539M
W. Sup. | Pang et al. [43] | A+V | 81.69 | 1.876M
W. Sup. | Ours (light) | A+V | 82.17 | 0.347M
W. Sup. | Ours (full) | A+V | 83.40 | 0.678M

With video-level supervisory signals, our method outperforms all previous unsupervised approaches by a large margin. Moreover, compared with previous unimodal weakly-supervised methods, our model surpasses prior results by a minimum of 5.12%, showing the necessity of utilizing multimodal cues for violence detection.

To further demonstrate the efficacy of our modality-aware contrastive instance learning and cross-modality distillation, we select state-of-the-art methods [43, 62] as audio-visual baselines and re-implement the SOTA unimodal MIL method [54] with two modality-expansion strategies. First, following [62], we fuse the audio and visual features in an early way as model inputs. This approach forbids the intermediate modality interaction in the network, aiming to show the performance of simply integrating multimodal data. Considering that some networks may be unsuitable for multimodal inputs, we put forward another strategy that trains two unimodal networks simultaneously and generates audio and visual logits, respectively. The audio-visual predictions are generated by fusing the unimodal logits. Results show that our framework achieves 1.71% higher performance than the state-of-the-art method Pang et al. [43], which verifies that our MA-CIL and SD modules are practical for violence detection. Our method outperforms RTFM* and Wu et al. by 5.30% and 4.76% for multimodal variants using audio-visual inputs. For variants using the two-stream architecture, we observe that our model surpasses RTFM† and Wu et al.† by 4.86% and 4.74%, respectively, which suggests that modality-aware interactions are indispensable for multimodal scenarios. To conclude, using the same input features, our method achieves superior performance compared with all audio-visual methods, showing the effectiveness of our entire proposed audio-visual framework.

5.4 Plug-in Module
We also argue that our proposed modules have satisfying generalizability and are capable of enhancing other networks. To this end, we combine our framework with state-of-the-art methods and evaluate the performance. First, we re-implement the state-of-the-art audio-visual method [43] using the official implementation provided by the original paper. Then we select the unimodal method with publicly available code, RTFM [54], as the unimodal baseline, which is ameliorated into multimodal networks by the two means we mentioned above (* and †). For the multimodal method Wu et al. [62], we use the two-stream variant to examine the performance of our MA-CIL module and use the native version for combining with SD. Since the unimodal network RTFM [54] and the audio-visual method Wu et al. [62] can only be amalgamated with MA-CIL in the two-stream network manner (†), while SD should be assembled in an early modality fusion way (*), we can only combine these frameworks with our modules separately. For the multimodal approach [43], we evaluate both the joint and independent enhancement performance of our MA-CIL and SD modules.

We report the results on the XD-Violence dataset in Table 2. First, we observe that MA-CIL boosts the unimodal baselines Wu et al. [62] and RTFM [54] by 1.32% and 1.46%, respectively, showing that our contrastive learning method improves the discrimination of the models. We also note that, equipped with the SD module, [62] and [54] gain increases of 1.43% and 2.30%, respectively. For the multimodal baseline [43], we remove the mutual loss and multimodal fusion modules and leverage the vanilla attention-based variant (‡) for comparison. Results show that the variants enhanced with MA-CIL and SD, either separately or jointly, all achieve accuracy boosts. In summary, we conclude that integrating our MA-CIL and SD modules is beneficial to numerous networks, and our modules can be utilized flexibly depending on specific usages.

Table 2: Results on the proposed MA-CIL and SD modules as plug-in modules. * indicates results re-implemented by fusing audio and visual features as inputs. † denotes results re-implemented by integrating logits of two identical networks with audio and visual inputs, respectively. ‡ is the ablated model that removes the fusion module and mutual loss.

Method | MA-CIL | SD | AP (%) | Param.
Wu et al. [62] | ✗ | ✗ | 78.64 | 0.843M
Wu et al.† [62] | ✗ | ✗ | 78.66 | 1.539M
Wu et al. [62] | ✗ | ✓ | 80.07 (1.43↑) | 1.612M
Wu et al.† [62] | ✓ | ✗ | 79.98 (1.32↑) | 1.539M
RTFM [54] | ✗ | ✗ | 77.81 | 12.067M
RTFM* [54] | ✗ | ✗ | 78.10 | 13.510M
RTFM† [54] | ✗ | ✗ | 78.54 | 13.190M
RTFM* [54] | ✗ | ✓ | 80.40 (2.30↑) | 25.577M
RTFM† [54] | ✓ | ✗ | 80.00 (1.46↑) | 13.190M
Pang et al. [43] | ✗ | ✗ | 81.69 | 1.876M
Pang et al.‡ [43] | ✗ | ✗ | 80.03 | 1.086M
Pang et al. [43] | ✗ | ✓ | 81.21 (1.18↑) | 2.138M
Pang et al. [43] | ✓ | ✗ | 80.90 (0.87↑) | 1.086M
Pang et al. [43] | ✓ | ✓ | 82.21 (2.18↑) | 1.613M

Table 3: Ablation studies on different components of our proposed framework.

Index | Two-Stream | MA-CIL | SD | AP (%)
1 | ✓ | ✗ | ✗ | 71.37
2 | ✓ | ✗ | ✓ | 74.01
3 | ✓ | ✓ | ✗ | 82.17
4 | ✓ | ✓ | ✓ | 83.40

5.5 Complexity Analysis
As we mentioned before, we propose a computation-friendly framework that does not introduce too many parameters. To support our claims, we compare parameter amounts with previous methods, shown in the Param. column of Tables 1 and 2. In Table 1, we report the parameter amounts of the previous works we re-implement and of our proposed framework, where Ours (light) denotes the ablated model without self-distillation, and Ours (full) indicates the full model with MA-CIL and SD. In Table 2, we provide the parameter amounts of the raw methods and of our enhancement variants.

From the comparison with other methods, we observe that Ours (light) holds the smallest model size (0.347M) while outperforming all previous methods. Combined with the SD module, our full model still has fewer parameters and achieves the best performance. This result demonstrates the efficiency of our framework, which leverages a much simpler network yet gains better performance. As shown in Table 2, we note that the MA-CIL method does not introduce any parameters; it exploits the intrinsic prior of multimodal instances and obtains model discrimination with no extra computation cost. When boosting the multimodal model [43], the enhanced model has a comparable size to the raw model due to the analogous model structure. This suggests that our proposed modules are flexible enough to be adapted to multimodal networks.
Table 4: Ablation study for the hyperparameters in the proposed modality-aware contrastive instance learning.

Index | Λ_v2n | Λ_v2b | ratio (r) | AP (%)
1 | 1.0 | 1.0 | 0.1 | 82.62
2 | 1.0 | 1.0 | 0.3 | 82.67
3 | 1.0 | 1.0 | 3.0 | 82.09
4 | 1.5 | 1.0 | 0.1 | 82.95
5 | 1.5 | 1.0 | 0.3 | 81.37
6 | 1.5 | 1.0 | 3.0 | 82.15
7 | 1.0 | 1.5 | 0.1 | 83.21
8 | 1.0 | 1.5 | 0.3 | 82.62
9 | 1.0 | 1.5 | 3.0 | 81.68
10 | 1.5 | 1.5 | 0.1 | 83.40
11 | 1.5 | 1.5 | 0.3 | 81.61
12 | 1.5 | 1.5 | 3.0 | 82.14

Figure 4: Illustration of the accuracy and loss curves over 50 epochs during training. The red curve denotes the video-level prediction accuracy. The ranges of the BCE loss and the contrastive loss are shown in blue and green curves, respectively.
[Figure 5: frame-level AP (%) under different values of the control hyperparameter m (0.85–0.94); additional panels compare the vanilla and trained audio embeddings h_a.]
Figure 6: Visualization of results on the XD-Violence test set. Red regions are the temporal ground-truths of violent events.
REFERENCES
[1] Relja Arandjelovic and Andrew Zisserman. 2017. Look, listen and learn. In ICCV. 609–617.
[2] Relja Arandjelovic and Andrew Zisserman. 2018. Objects that sound. In ECCV. 435–451.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
[4] Enrique Bermejo Nievas, Oscar Deniz Suarez, Gloria Bueno García, and Rahul Sukthankar. 2011. Violence detection in video using computer vision techniques. In International conference on Computer analysis of images and patterns. Springer, 332–339.
[5] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 535–541.
[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9650–9660.
[7] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[8] Liqun Chen, Dong Wang, Zhe Gan, Jingjing Liu, Ricardo Henao, and Lawrence Carin. 2021. Wasserstein contrastive representation distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16296–16305.
[9] Long Chen, Yuhang Zheng, Yulei Niu, Hanwang Zhang, and Jun Xiao. 2021. Counterfactual samples synthesizing and training for robust visual question answering. arXiv preprint arXiv:2110.01013 (2021).
[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
[11] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020. Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems 33 (2020), 22243–22255.
[12] Tao Chen, Haizhou Shi, Siliang Tang, Zhigang Chen, Fei Wu, and Yueting Zhuang. 2021. CIL: Contrastive Instance Learning Framework for Distantly Supervised Relation Extraction. arXiv preprint arXiv:2106.10855.
[13] Yanbei Chen, Yongqin Xian, A Koepke, Ying Shan, and Zeynep Akata. 2021. Distilling audio-visual knowledge by compositional contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7016–7025.
[14] Ying Cheng, Ruize Wang, Zhihao Pan, Rui Feng, and Yuejie Zhang. 2020. Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning. In ACM MM. 3884–3892.
[15] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020).
[16] Bo Dai and Dahua Lin. 2017. Contrastive learning for image captioning. Advances in Neural Information Processing Systems 30 (2017).
[17] Sarkar Snigdha Sarathi Das, Arzoo Katiyar, Rebecca J Passonneau, and Rui Zhang. 2021. CONTaiNER: Few-Shot Named Entity Recognition via Contrastive Learning. arXiv preprint arXiv:2109.07589 (2021).
[18] Oscar Deniz, Ismael Serrano, Gloria Bueno, and Tae-Kyun Kim. 2014. Fast violence detection in video. In 2014 international conference on computer vision theory and applications (VISAPP), Vol. 2. IEEE, 478–485.
[19] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. 2021. Seed: Self-supervised distillation for visual representation. arXiv preprint arXiv:2101.04731.
[20] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. 2021. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14009–14018.
[21] Nuno C Garcia, Pietro Morerio, and Vittorio Murino. 2018. Modality distillation with multiple stream networks for action recognition. In Proceedings of the European Conference on Computer Vision (ECCV). 103–118.
[22] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 776–780.
[23] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33 (2020), 21271–21284.
[24] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. 2016. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition. 733–742.
[25] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[27] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In ICASSP. 131–135.
[28] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 2, 7 (2015).
[29] Judy Hoffman, Saurabh Gupta, and Trevor Darrell. 2016. Learning with side information through modality hallucination. In Proceedings of the IEEE conference on computer vision and pattern recognition. 826–834.
[30] Samee Ullah Khan, Ijaz Ul Haq, Seungmin Rho, Sung Wook Baik, and Mi Young Lee. 2019. Cover the violence: A novel deep-learning-based approach towards violence-detection in movies. Applied Sciences 9, 22 (2019), 4963.
[31] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[32] Bruno Korbar, Du Tran, and Lorenzo Torresani. 2018. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization. In NeurIPS. 7774–7785.
[33] Shuo Li, Fang Liu, and Licheng Jiao. 2022. Self-Training Multi-Sequence Learning with Transformer for Weakly Supervised Video Anomaly Detection. (2022).
[34] Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2020. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409 (2020).
[35] Jinyu Liu, Ying Cheng, Yuejie Zhang, Rui-Wei Zhao, and Rui Feng. 2022. Self-Supervised Video Representation Learning with Motion-Contrastive Perception. arXiv preprint arXiv:2204.04607 (2022).
[36] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. 2018. Future frame prediction for anomaly detection – a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6536–6545.
[37] Shuang Ma, Zhaoyang Zeng, Daniel McDuff, and Yale Song. 2021. Active Contrastive Learning of Audio-Visual Video Representations. In ICLR. https://openreview.net/forum?id=OMizHuea_HB
[38] Oded Maron and Tomás Lozano-Pérez. 1997. A framework for multiple-instance learning. Advances in neural information processing systems 10.
[39] Pedro Morgado, Ishan Misra, and Nuno Vasconcelos. 2021. Robust audio-visual instance discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12934–12945.
[40] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. 2021. Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12475–12486.
[41] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. 2018. Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6752–6761.
[42] Andrew Owens and Alexei A Efros. 2018. Audio-visual scene analysis with self-supervised multisensory features. In ECCV. 631–648.
[43] Wen-Feng Pang, Qian-Hua He, Yong-jian Hu, and Yan-Xiong Li. 2021. Violence Detection in Videos Based on Fusing Visual and Audio Information. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2260–2264.
[44] Bruno Peixoto, Bahram Lavi, Paolo Bestagini, Zanoni Dias, and Anderson Rocha. 2020. Multimodal violence detection in videos. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2957–2961.
[45] Bruno Peixoto, Bahram Lavi, João Paulo Pereira Martin, Sandra Avila, Zanoni Dias, and Anderson Rocha. 2019. Toward subjective violence detection in videos. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8276–8280.
[46] Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. arXiv preprint arXiv:2010.01923 (2020).
[47] Nicolae-Catalin Ristea, Neelu Madan, Radu Tudor Ionescu, Kamal Nasrollahi, Fahad Shahbaz Khan, Thomas B Moeslund, and Mubarak Shah. 2021. Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection. arXiv preprint arXiv:2111.09099.
[48] Bernhard Schölkopf, Robert C Williamson, Alex Smola, John Shawe-Taylor, and John Platt. 1999. Support vector method for novelty detection. Advances in neural information processing systems 12 (1999).
[49] Lei Shi, Kai Shuang, Shijie Geng, Peng Su, Zhengkai Jiang, Peng Gao, Zuohui Fu, Gerard de Melo, and Sen Su. 2020. Contrastive visual-linguistic pretraining. arXiv preprint arXiv:2007.13135 (2020).
[50] Yixuan Su, Fangyu Liu, Zaiqiao Meng, Lei Shu, Ehsan Shareghi, and Nigel Collier. 2021. TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning. arXiv preprint arXiv:2111.04198 (2021).
[51] Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6479–6488.
[52] Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive representation distillation. arXiv preprint arXiv:1910.10699.
[53] Yapeng Tian, Dingzeyu Li, and Chenliang Xu. 2020. Unified multisensory perception: Weakly-supervised audio-visual video parsing. In European Conference on Computer Vision. Springer, 436–454.
[54] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. 2021. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4975–4986.
[55] Aaron Van den Oord, Yazhe Li, Oriol Vinyals, et al. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 2, 3 (2018), 4.
[56] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30.
[58] Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B Chan. 2022. On Distinctive Image Captioning via Comparing and Reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[59] Keyu Wen, Jin Xia, Yuanyuan Huang, Linyang Li, Jiayan Xu, and Jie Shao. 2021. COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2208–2217.
[60] Spencer Whitehead, Hui Wu, Heng Ji, Rogerio Feris, and Kate Saenko. 2021. Separating skills and concepts for novel visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5632–5641.
[61] Peng Wu and Jing Liu. 2021. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing 30, 3513–3527.
[62] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. 2020. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In European Conference on Computer Vision. Springer, 322–339.
[63] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3733–3742.
[64] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. 2020. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10687–10698.
[65] Zihui Xue, Sucheng Ren, Zhengqi Gao, and Hang Zhao. 2021. Multimodal knowledge expansion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 854–863.
[66] Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, and Junzhou Huang. 2022. Vision-Language Pre-Training with Triple Contrastive Learning. arXiv preprint arXiv:2202.10401 (2022).
[67] Jiangong Zhang, Laiyun Qing, and Jun Miao. 2019. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 4030–4034.
[68] Tao Zhang, Zhijie Yang, Wenjing Jia, Baoqing Yang, Jie Yang, and Xiangjian He. 2016. A new method for violence detection in surveillance scenes. Multimedia Tools and Applications 75, 12 (2016), 7327–7349.