Audio-Visual Grouping Network For Sound Localization From Mixtures
representations with learnable audio-visual class tokens as guidance for sound localization.

Since multiple sound sources are mixed in the original space, recent researchers have tried to explore diverse pipelines to localize multiple sources on frames from a sound mixture. This multi-source task requires the model to associate individual sources separated from the mixture with each frame. Qian et al. [38] leveraged a two-stage framework to capture cross-modal feature alignment between sound and vision representations in a coarse-to-fine manner. DSOL [22] introduced a two-stage training framework to tackle silence in category-aware sound source localization. More recently, Mix-and-Localize [23] proposed to use a contrastive random walk on a graph with images and separated sounds as nodes, where a random walker is trained to walk from each audio node to an image node with the audio-visual similarity as the transition probability. Despite their promising performance, these methods can only handle a fixed number of sources, and they cannot learn compact class-aware representations for individual sources. In contrast, we can support a flexible number of sources as input and learn class-aware representations for each source.

The main challenge is that sounds are naturally mixed in the audio space. This inspires us to disentangle the individual semantics for each source from the mixture to guide source localization. To address the problem, our key idea is to disentangle individual source representations using audio-visual grouping for source localization, which is different from existing single-source and multi-source methods. During training, we aim to learn audio-visual category tokens to aggregate category-aware source features from the sound mixture and the image, where separated high-level semantics for individual sources are learned.

To this end, we propose a novel audio-visual grouping network, namely AVGN, that can directly learn category-wise semantic features for each source from the input audio mixture and frame to localize multiple sources simultaneously. Specifically, our AVGN leverages learnable audio-visual class tokens to aggregate class-aware source features. Then, the aggregated semantic features for each source serve as guidance to localize the corresponding visual regions. Compared to previous multi-source baselines, our new framework can support a flexible number of sources and shows the effectiveness of learning compact audio-visual representations with category-aware semantics.

Empirical experiments on the MUSIC and VGGSound-Instruments benchmarks comprehensively demonstrate state-of-the-art performance against previous single-source and multi-source baselines. In addition, qualitative visualizations of localization results vividly showcase the effectiveness of our AVGN in localizing individual sources from mixtures. Extensive ablation studies also validate the importance of category-aware grouping and learnable audio-visual class tokens in learning compact representations for sound source localization.

Our main contributions can be summarized as follows:

• We present a novel Audio-Visual Grouping Network, namely AVGN, to disentangle the individual semantics from sound mixtures and images to guide source localization.

• We introduce learnable audio-visual class tokens and category-aware grouping in sound localization to aggregate category-wise source features with explicit high-level semantics.

• Extensive experiments comprehensively demonstrate the state-of-the-art superiority of our AVGN over previous baselines on both single-source and multi-source sounding object localization.

2. Related Work

Audio-Visual Joint Learning. Audio-visual joint learning has been addressed in many previous works [2, 4, 12, 14, 19, 21, 26, 33, 34, 36, 37, 39, 50, 51] to learn the audio-visual correlation between two distinct modalities from videos. Such cross-modal alignments are beneficial for many audio-visual tasks, such as audio-visual event localization [27, 29, 43, 47], audio-visual spatialization [6, 17, 33, 35], audio-visual navigation [6–8], and audio-visual parsing [28, 32, 42, 46]. In this work, our main focus is to learn compact audio-visual representations for localizing individual sources on images from sound mixtures, which is more demanding than the aforementioned tasks.

Audio-Visual Source Separation. Audio-visual source separation aims to separate individual sound sources from the audio mixture given the image with the sources on it. In recent years, researchers [12, 16, 19, 41, 44, 49, 51] have tried to explore diverse pipelines to learn discriminative visual representations from images for source separation. Zhao et al. [51] first proposed a "Mix-and-Separate" network to capture the alignment between pixels and the spectral components of audio for the reconstruction of each input source spectrogram. With the benefit of visual cues, MP-Net [49] utilized a recursive MinusPlus Net to separate all salient sounds from the mixture. Tian et al. [41] used a cyclic co-learning framework with sounding object visual grounding to separate visual sound sources. More recently, additional modalities have been explored to boost the performance of audio-visual source separation, such as motion in SoM [50], gestures composed of pose and keypoints in MG [15], and spatio-temporal visual scene graphs in AVSGS [5]. Different from them, we do not need to recover the audio spectrogram of individual sources from the mixture. Instead, we leverage the category-aware representations of individual sources to localize the corresponding regions for each source, where learnable audio-visual class tokens serve as the guidance.
Figure 2. Illustration of the proposed Audio-Visual Grouping Network (AVGN). The Audio-Visual Grouping module takes as inputs the global audio feature F^a = f^a of the mixture spectrogram, the spatial-level visual features F^v = {f_p^v}_{p=1}^P of the video frame from the respective encoders, and learnable audio-visual class tokens {c_i}_{i=1}^C for C categories in the semantic space, and generates disentangled class-aware audio-visual representations {g_n^a}_{n=1}^N, {g_n^v}_{n=1}^N for N sources. Note that the N source embeddings are chosen from the C categories according to the ground-truth class. Finally, two classification layers, each composed of an FC layer and a sigmoid function, are used separately to predict the audio and video categories, and localization maps are generated from the cosine similarity between the audio-visual class-aware embeddings.
Visual Sound Source Localization. Visual sound source localization is a typical and challenging problem that predicts the location of individual sound sources in a video. Early works [13, 20, 24] applied traditional machine learning approaches, such as statistical models [13] and canonical correlation analysis [24], to learn low-level alignment between audio and visual representations. With the success of deep neural nets, recent researchers [1, 9, 21, 30, 31, 38–40] explored many architectures to learn the audio-visual correspondence for localizing single-source sounds. Attention10k [39] localized a sound source in the image using a two-stream architecture with an attention mechanism. Hard sample mining was introduced in LVS [9] to optimize a differentiable threshold-based contrastive loss for predicting discriminative audio-visual correspondence maps. More recently, a multiple-instance contrastive learning framework was proposed in EZ-VSL [31] to align regions with the most corresponding audio without involving negative regions.

Due to the naturally mixed property of sounds in our environment, recent works [22, 23, 38] have also explored different frameworks to localize multiple sources on frames from a sound mixture simultaneously. DSOL [22] utilized a two-stage training framework to deal with silence in category-aware sound source localization. More recently, a contrastive random walk model was trained in Mix-and-Localize [23] to link each audio node with an image node using a transition probability of audio-visual similarity. While those single-source and multi-source approaches achieve promising performance in sound localization, they can only handle a fixed number of sources and they cannot learn discriminative class-aware representations for individual sources. In contrast, we develop a fully novel framework to aggregate compact category-wise audio and visual source representations with explicit learnable source class tokens. To the best of our knowledge, we are the first to leverage an explicit grouping mechanism for sound source localization. Our experiments in Section 4.2 also demonstrate the effectiveness of AVGN in both single-source and multi-source localization.

3. Method

Given an image and a mixture of audio, our target is to localize individual sound sources on the image. We propose a novel Audio-Visual Grouping Network named AVGN for disentangling individual semantics from the mixture and image, which mainly consists of two modules: Audio-Visual Class Tokens in Section 3.2 and Audio-Visual Grouping in Section 3.3.

3.1. Preliminaries

In this section, we first describe the problem setup and notations, and then revisit the multiple-instance contrastive learning in EZ-VSL [31] for single-source localization.

Problem Setup and Notations. Given a mixture spectrogram and an image, our goal is to localize N individual sound sources in the image spatially. For a video with C source event categories, we have an audio-visual label denoted as {y_i}_{i=1}^C, where y_i = 1 for the ground-truth category entry i. During training, we do not have bounding-box or mask-level annotations. Therefore, we can only use the video-level label for the mixture spectrogram and image to perform weakly-supervised learning.
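As a small illustration of this weak supervision, the sketch below (the class count and class indices are illustrative, not taken from the paper) builds the video-level multi-hot label {y_i}_{i=1}^C for a mixture; no bounding boxes or masks are involved.

```python
import torch

NUM_CLASSES = 37  # illustrative number of source categories


def video_level_label(present_classes, num_classes=NUM_CLASSES):
    """Multi-hot label y with y_i = 1 for every category present in the mixture."""
    y = torch.zeros(num_classes)
    y[list(present_classes)] = 1.0
    return y


# A duet mixture containing, say, category 3 ("cello") and category 11 ("trumpet").
y = video_level_label({3, 11})
```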
Revisit Single-source Localization. To address the single-source localization problem, EZ-VSL [31] introduced a multiple-instance contrastive learning framework to align the audio and visual features at locations corresponding to sound sources. Given a global audio feature F^a = f^a ∈ R^{1×D} and spatial-level visual features spanning all locations in an image, F^v = {f_p^v}_{p=1}^P with f_p^v ∈ R^{1×D}, EZ-VSL applies a multiple-instance contrastive objective to align at least one location in the corresponding bag of visual features with the audio representation in the same mini-batch, which is defined as:

\mathcal{L}_{\mathrm{baseline}} = -\frac{1}{B}\sum_{b=1}^{B}\log\frac{\exp\left(\frac{1}{\tau}\,\mathrm{sim}(\mathbf{F}^a_b,\mathbf{F}^v_b)\right)}{\sum_{m=1}^{B}\exp\left(\frac{1}{\tau}\,\mathrm{sim}(\mathbf{F}^a_b,\mathbf{F}^v_m)\right)}   (1)

where the similarity sim(F^a, F^v) denotes the max-pooled audio-visual cosine similarity of F^a and F^v = {f_p^v}_{p=1}^P across all P spatial locations, B is the batch size, D is the embedding dimension, and τ is a temperature hyper-parameter.
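A minimal PyTorch sketch of the objective in Eq. (1), assuming L2-normalized features so that dot products are cosine similarities; the max over the P spatial locations implements sim(F^a, F^v), and the temperature value is illustrative rather than the authors' setting.

```python
import torch
import torch.nn.functional as F


def micl_loss(f_a, f_v, tau=0.03):
    """Multiple-instance contrastive loss of Eq. (1) (sketch).

    f_a: (B, D) global audio features, one per mixture.
    f_v: (B, P, D) spatial visual features, P locations per image.
    """
    f_a = F.normalize(f_a, dim=-1)
    f_v = F.normalize(f_v, dim=-1)
    # Cosine similarity of every audio b against every location of every image m: (B, B, P).
    sim_all = torch.einsum("bd,mpd->bmp", f_a, f_v)
    # Max-pool over spatial locations -> sim(F^a_b, F^v_m): (B, B).
    sim = sim_all.max(dim=-1).values / tau
    # InfoNCE over the batch: the matching image is the positive for each audio.
    targets = torch.arange(f_a.size(0), device=f_a.device)
    return F.cross_entropy(sim, targets)


loss = micl_loss(torch.randn(8, 512), torch.randn(8, 49, 512))
```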
However, such a training objective poses the main challenge for multi-source localization: the global audio representation extracted from the mixture is entangled, and it therefore cannot associate the individual sources separated from the mixture with their corresponding regions. To address this challenge, we take inspiration from [48] and propose a novel Audio-Visual Grouping Network that learns to disentangle the individual semantics from the mixture and image to guide multi-source localization, as illustrated in Figure 2.

3.2. Audio-Visual Class Tokens

In order to explicitly disentangle individual semantics from the mixed sound space and image, we introduce novel learnable audio-visual class tokens {c_i}_{i=1}^C to help group semantic-aware information from the audio-visual representations f^a, {f_p^v}_{p=1}^P, where c_i ∈ R^{1×D}, C is the total number of source classes, and P denotes the total number of locations in the spatial map.

With the categorical audio-visual tokens and raw representations, we first apply self-attention transformers φ^a(·), φ^v(·) to aggregate global audio and spatial visual features from the raw input and align the features with the categorical token embeddings as:

\hat{\mathbf{f}}^a, \{\hat{\mathbf{c}}_i^a\}_{i=1}^C = \{\phi^a(\mathbf{x}^a_j, \mathbf{X}^a, \mathbf{X}^a)\}_{j=1}^{1+C}, \quad \mathbf{X}^a = \{\mathbf{x}^a_j\}_{j=1}^{1+C} = [\mathbf{f}^a; \{\mathbf{c}_i\}_{i=1}^C]   (2)

\{\hat{\mathbf{f}}_p^v\}_{p=1}^P, \{\hat{\mathbf{c}}_i^v\}_{i=1}^C = \{\phi^v(\mathbf{x}^v_j, \mathbf{X}^v, \mathbf{X}^v)\}_{j=1}^{P+C}, \quad \mathbf{X}^v = \{\mathbf{x}^v_j\}_{j=1}^{P+C} = [\{\mathbf{f}_p^v\}_{p=1}^P; \{\mathbf{c}_i\}_{i=1}^C]   (3)

where [ ; ] denotes the concatenation operator, \hat{f}^a, \hat{f}_p^v, \hat{c}_i^a, \hat{c}_i^v ∈ R^{1×D}, and D is the embedding dimension. The self-attention operators φ^a(·), φ^v(·) are formulated as:

\phi^a(\mathbf{x}_j^a, \mathbf{X}^a, \mathbf{X}^a) = \mathrm{Softmax}\left(\frac{\mathbf{x}_j^a {\mathbf{X}^a}^\top}{\sqrt{D}}\right)\mathbf{X}^a   (4)

\phi^v(\mathbf{x}_j^v, \mathbf{X}^v, \mathbf{X}^v) = \mathrm{Softmax}\left(\frac{\mathbf{x}_j^v {\mathbf{X}^v}^\top}{\sqrt{D}}\right)\mathbf{X}^v   (5)

Then, in order to constrain the independence of each class token c_i in the audio-visual semantic space, we apply a fully-connected (FC) layer followed by a softmax operator to predict the individual source class probability e_i = Softmax(FC(c_i)). Each audio-visual category probability is optimized with a cross-entropy loss \sum_{i=1}^{C} \mathrm{CE}(h_i, e_i), where CE(·) is the cross-entropy loss and h_i denotes a one-hot encoding whose element for the target category entry i is 1. Optimizing this loss pushes the learned token embeddings to be category-aware and discriminative.
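The sketch below illustrates Eqs. (2)-(5) together with the class-token constraint: the class tokens are concatenated with the modality features, updated by self-attention, and each updated token is classified by an FC layer against its own category. A single unprojected attention head is shown for brevity, whereas the paper stacks three transformer layers per modality; module and variable names are ours, and the class count is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenAggregator(nn.Module):
    """Single-head self-attention over [features; class tokens], Eqs. (2)-(5) (sketch)."""

    def __init__(self, dim=512, num_classes=37):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)  # FC predicting e_i

    def forward(self, feats, class_tokens):
        # feats: (T, D) with T = 1 (global audio) or T = P (spatial visual).
        # class_tokens: (C, D), shared between the audio and visual branches.
        x = torch.cat([feats, class_tokens], dim=0)               # X = [f; {c_i}]
        attn = F.softmax(x @ x.t() / x.size(-1) ** 0.5, dim=-1)   # Eqs. (4)-(5)
        x = attn @ x
        feats_hat, tokens_hat = x[: feats.size(0)], x[feats.size(0):]
        # Category constraint: e_i = Softmax(FC(c_i)) with one-hot target h_i
        # (applied here to the updated tokens; cross_entropy includes the softmax).
        logits = self.classifier(tokens_hat)
        token_loss = F.cross_entropy(
            logits, torch.arange(logits.size(0), device=logits.device))
        return feats_hat, tokens_hat, token_loss


class_tokens = nn.Parameter(torch.randn(37, 512))    # learnable {c_i}, C = 37 (illustrative)
phi_a, phi_v = TokenAggregator(), TokenAggregator()  # separate audio / visual transformers
f_a_hat, c_a_hat, loss_a = phi_a(torch.randn(1, 512), class_tokens)   # global audio feature
f_v_hat, c_v_hat, loss_v = phi_v(torch.randn(49, 512), class_tokens)  # P = 49 visual locations
```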
3.3. Audio-Visual Grouping

With the benefit of the aforementioned category-constraint objective, we propose a novel and explicit audio-visual grouping module composed of grouping blocks g^a(·), g^v(·), which takes the learned audio-visual source class tokens and the aggregated features as inputs to generate category-aware audio-visual embeddings as:

\{\mathbf{g}_i^a\}_{i=1}^C = g^a(\hat{\mathbf{f}}^a, \{\hat{\mathbf{c}}_i^a\}_{i=1}^C), \quad \{\mathbf{g}_i^v\}_{i=1}^C = g^v(\{\hat{\mathbf{f}}_p^v\}_{p=1}^P, \{\hat{\mathbf{c}}_i^v\}_{i=1}^C)   (6)

During the grouping phase, we merge all audio-visual features assigned to the same class token into a new class-aware audio-visual feature, by computing a global audio similarity vector A^a ∈ R^{1×C} and a spatial visual similarity matrix A^v ∈ R^{P×C} between the audio-visual features and the audio-visual class tokens via a softmax operation, formulated as:

\mathbf{A}^a_i = \mathrm{Softmax}(W_q^a \hat{\mathbf{f}}^a \cdot W_k^a \hat{\mathbf{c}}_i^a), \quad \mathbf{A}^v_{p,i} = \mathrm{Softmax}(W_q^v \hat{\mathbf{f}}_p^v \cdot W_k^v \hat{\mathbf{c}}_i^v)   (7)

where W_q^a, W_k^a ∈ R^{D×D} and W_q^v, W_k^v ∈ R^{D×D} denote the learnable weights of the linear projections for the features and class tokens of the audio and visual modalities, respectively. With this global audio similarity vector and spatial visual similarity matrix, we compute the weighted sum of all global audio and spatial visual features assigned to each token to generate the category-aware representations as:

\mathbf{g}_i^a = g^a(\hat{\mathbf{f}}^a, \hat{\mathbf{c}}_i^a) = \hat{\mathbf{c}}_i^a + W_o^a \frac{\mathbf{A}^a_i W_v^a \hat{\mathbf{f}}^a}{\mathbf{A}^a_i}, \quad \mathbf{g}_i^v = g^v(\{\hat{\mathbf{f}}_p^v\}_{p=1}^P, \hat{\mathbf{c}}_i^v) = \hat{\mathbf{c}}_i^v + W_o^v \frac{\sum_{p=1}^{P}\mathbf{A}^v_{p,i} W_v^v \hat{\mathbf{f}}_p^v}{\sum_{p=1}^{P}\mathbf{A}^v_{p,i}}   (8)

where W_o^a, W_v^a ∈ R^{D×D} and W_o^v, W_v^v ∈ R^{D×D} denote the learnable weights of the linear projections for the output and value of the audio and visual modalities, respectively.
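A sketch of the grouping block of Eqs. (6)-(8) for one modality: each feature is softly assigned to the class tokens through learned query/key projections (Eq. (7)), and the value-projected features are averaged with those assignment weights and added back to the token (Eq. (8)). The softmax is taken over tokens for each feature, which is one reading of Eq. (7); normalization details may differ in the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupingBlock(nn.Module):
    """Category-aware grouping of Eqs. (6)-(8) (sketch, one modality)."""

    def __init__(self, dim=512):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W_q: projects features
        self.w_k = nn.Linear(dim, dim, bias=False)  # W_k: projects class tokens
        self.w_v = nn.Linear(dim, dim, bias=False)  # W_v: value projection
        self.w_o = nn.Linear(dim, dim, bias=False)  # W_o: output projection

    def forward(self, feats_hat, tokens_hat):
        # feats_hat: (T, D) aggregated features (T = 1 for audio, T = P for visual).
        # tokens_hat: (C, D) aggregated class tokens.
        q = self.w_q(feats_hat)                     # (T, D)
        k = self.w_k(tokens_hat)                    # (C, D)
        assign = F.softmax(q @ k.t(), dim=-1)       # A: (T, C), Eq. (7)
        v = self.w_v(feats_hat)                     # (T, D)
        # Weighted average of the features assigned to each token, Eq. (8).
        pooled = (assign.t() @ v) / assign.sum(dim=0, keepdim=True).t().clamp(min=1e-6)
        return tokens_hat + self.w_o(pooled)        # g_i = c_i_hat + W_o(...), shape (C, D)


group_v = GroupingBlock()
g_v = group_v(torch.randn(49, 512), torch.randn(37, 512))  # class-aware visual embeddings
```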
Method | MUSIC-Solo: AP / [email protected] / AUC (%) | VGGSound-Instruments: AP / [email protected] / AUC (%) | VGGSound-Single: AP / [email protected] / AUC (%)
Attention10k [39] | – / 37.2 / 38.7 | – / 28.3 / 26.1 | – / 19.2 / 30.6
OTS [3] | 69.3 / 26.1 / 35.8 | 47.5 / 25.7 / 24.6 | 29.8 / 32.8 / 35.7
DMC [21] | – / 29.1 / 38.0 | – / 26.5 / 25.7 | – / 23.9 / 27.6
CoarsetoFine [38] | 70.7 / 33.6 / 39.8 | 40.2 / 27.2 / 26.5 | 28.2 / 29.1 / 34.8
LVS [9] | 70.6 / 41.9 / 40.3 | 42.3 / 32.6 / 28.3 | 29.6 / 34.4 / 38.2
EZ-VSL [31] | 71.5 / 45.8 / 41.2 | 43.8 / 38.5 / 30.6 | 31.3 / 38.9 / 39.5
Mix-and-Localize [23] | 68.6 / 30.5 / 40.8 | 44.9 / 49.7 / 32.3 | 32.5 / 36.3 / 38.9
DSOL [22] | – / 51.4 / 43.7 | – / 50.2 / 32.9 | – / 35.7 / 37.2
AVGN (ours) | 77.2 / 58.1 / 48.5 | 50.5 / 55.3 / 36.7 | 35.3 / 40.8 / 42.3
Table 1. Quantitative results of single-source localization on MUSIC-Solo, VGGSound-Instruments, and VGGSound-Single datasets.
Following previous works [9, 30, 31], the final localization map is generated through bilinear interpolation of the similarity map.
Note that the training data available to us is smaller than the raw MUSIC dataset; for a fair comparison, we trained all models on the same training data.
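A minimal sketch of this step, assuming a class-aware audio embedding for one source and a 7 × 7 grid of visual features: the cosine-similarity map is computed per location and bilinearly upsampled to the image resolution. Tensor names and sizes here are illustrative.

```python
import torch
import torch.nn.functional as F


def localization_map(g_a, f_v, out_size=(224, 224)):
    """Cosine-similarity localization map (sketch).

    g_a: (D,) class-aware audio embedding for one source.
    f_v: (D, H, W) spatial visual features (e.g., H = W = 7).
    Returns an (out_H, out_W) map with values in [-1, 1].
    """
    g_a = F.normalize(g_a, dim=0)              # unit-norm audio embedding
    f_v = F.normalize(f_v, dim=0)              # unit-norm per spatial location
    sim = torch.einsum("d,dhw->hw", g_a, f_v)  # cosine similarity per location
    sim = sim[None, None]                      # (1, 1, H, W) for interpolation
    return F.interpolate(sim, size=out_size, mode="bilinear",
                         align_corners=False)[0, 0]


heat = localization_map(torch.randn(512), torch.randn(512, 7, 7))
print(heat.shape)  # torch.Size([224, 224])
```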
Method | MUSIC-Duet: CAP / PIAP / [email protected] / AUC (%) | VGGSound-Instruments: CAP / PIAP / [email protected] / AUC (%) | VGGSound-Duet: CAP / PIAP / [email protected] / AUC (%)
Attention10k [39] | – / – / 21.6 / 19.6 | – / – / 52.3 / 11.7 | – / – / 11.5 / 15.2
OTS [3] | 11.6 / 17.7 / 13.3 / 18.5 | 23.3 / 37.8 / 51.2 / 11.2 | 10.5 / 12.7 / 12.2 / 15.8
DMC [21] | – / – / 17.5 / 21.1 | – / – / 53.7 / 12.5 | – / – / 13.8 / 17.1
CoarsetoFine [38] | – / – / 17.6 / 20.6 | – / – / 54.2 / 12.9 | – / – / 14.7 / 18.5
LVS [9] | – / – / 22.5 / 20.9 | – / – / 57.3 / 13.3 | – / – / 17.3 / 19.5
EZ-VSL [31] | – / – / 24.3 / 21.3 | – / – / 60.2 / 14.2 | – / – / 20.5 / 20.2
Mix-and-Localize [23] | 47.5 / 54.1 / 26.5 / 21.5 | 21.5 / 37.5 / 73.2 / 15.6 | 16.3 / 22.6 / 21.1 / 20.5
DSOL [22] | – / – / 30.1 / 22.3 | – / – / 74.3 / 15.9 | – / – / 22.3 / 21.1
AVGN (ours) | 50.6 / 57.2 / 32.5 / 24.6 | 27.3 / 42.8 / 77.5 / 18.2 | 21.9 / 28.1 / 26.2 / 23.8
Table 2. Quantitative results of multi-source localization on MUSIC-Duet, VGGSound-Instruments, and VGGSound-Duet datasets.
Following [23] for a fair comparison, we use [email protected] and [email protected] as the IoU and CIoU thresholds for MUSIC-Solo and MUSIC-Duet, [email protected] and [email protected] for single-source and multi-source localization on VGGSound-Instruments, and [email protected] and [email protected] for single-source and multi-source localization on VGGSound-Single and VGGSound-Duet.

Implementation. For input images, the resolution is resized to 224 × 224. For input audio, we take log spectrograms extracted from 3 s of audio at a sample rate of 22050 Hz. We follow the prior work [31] and apply the STFT to generate an input tensor of size 257 × 300 (257 frequency bands over 300 timesteps) using 50 ms windows with a hop size of 25 ms. Following previous work [9, 21, 30, 31, 38], we use the lightweight ResNet18 [18] as the audio and visual encoder, and initialize the visual model with weights pre-trained on ImageNet [11]. We set D = 512 and P = 49 for the 7 × 7 spatial map from the visual encoder. The depth of the self-attention transformers φ^a(·), φ^v(·) is 3. The model is trained for 100 epochs using the Adam optimizer [25] with a learning rate of 1e-4 and a batch size of 128.
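As a rough sketch of the audio front end, the snippet below produces a log-magnitude spectrogram of the stated 257 × 300 size; the n_fft and hop_length values are assumptions chosen to match that shape and may differ from the authors' exact STFT settings.

```python
import torch


def audio_to_logspec(waveform, n_fft=512, hop_length=221):
    """3 s mono waveform at 22050 Hz -> log-magnitude spectrogram of shape (257, 300).

    n_fft=512 yields 257 frequency bands; hop_length=221 yields 300 frames for
    3 s of audio at 22050 Hz. Both values are assumptions for this sketch.
    """
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft), return_complex=True)
    return torch.log(spec.abs() + 1e-7)  # (257, 300)


logspec = audio_to_logspec(torch.randn(3 * 22050))
print(logspec.shape)  # torch.Size([257, 300])

# Optimization as stated above: Adam, lr = 1e-4, batch size 128, 100 epochs, e.g.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```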
4.2. Comparison to prior work

In this work, we propose a novel and effective framework for sound source localization. In order to validate the effectiveness of the proposed AVGN, we comprehensively compare it to previous single-source and multi-source baselines: 1) Attention10k [39] (CVPR 2018): the first work on single-source localization with a two-stream architecture and an attention mechanism; 2) OTS [3] (ECCV 2018): a simple baseline with audio-visual correspondence as the training objective; 3) DMC [21] (CVPR 2019): a deep multi-modal clustering network with audio-visual co-occurrences to learn convolutional maps for each modality in different embedding spaces; 4) CoarsetoFine [38] (ECCV 2020): a two-stage baseline with coarse-to-fine alignment of cross-modal features; 5) DSOL [22] (NeurIPS 2020): a two-stage training framework with classes as weak supervision for category-aware sound source localization; 6) LVS [9] (CVPR 2021): a contrastive network that learns audio-visual correspondence maps with hard negative mining; 7) EZ-VSL [31] (ECCV 2022): a recent strong baseline with multiple-instance contrastive learning for single-source localization; 8) Mix-and-Localize [23] (CVPR 2022): a strong multi-source baseline using a contrastive random walk algorithm on a graph composed of images and separated sounds as nodes.

For single-source localization, we report the quantitative comparison results in Table 1. As can be seen, we achieve the best performance in terms of all metrics on the three benchmarks, compared to previous self-supervised and weakly-supervised baselines. In particular, the proposed AVGN significantly outperforms DSOL [22], the current state-of-the-art weakly-supervised baseline, by 6.7 [email protected] & 4.8 AUC, 5.1 [email protected] & 3.8 AUC, and 5.1 [email protected] & 5.1 AUC on the three datasets. Moreover, we achieve superior performance gains compared to EZ-VSL [31], the current state-of-the-art self-supervised baseline, which implies the importance of extracting category-aware semantics from audio-visual inputs as guidance for learning discriminative audio-visual alignment. Meanwhile, our AVGN outperforms Mix-and-Localize [23] by a large margin, with performance gains of 8.6 AP on MUSIC-Solo, 5.6 AP on VGGSound-Instruments, and 2.7 AP on VGGSound-Single. These significant improvements demonstrate the superiority of our method in single-source localization.

In addition, significant gains in multi-source sound localization can be observed in Table 2. Compared to Mix-and-Localize [23], the current state-of-the-art multi-source localization baseline, we achieve gains of 5.8 CAP, 5.3 PIAP, 4.3 [email protected], and 2.6 AUC on VGGSound-Instruments. Furthermore, when evaluated on the challenging VGGSound-Duet benchmark, the proposed approach still outperforms Mix-and-Localize [23] by 5.6 CAP, 5.5 PIAP, 5.1 [email protected], and 3.3 AUC. We also achieve clearly better results than DSOL [22], the weakly-supervised baseline with two training stages. These results validate the effectiveness of our approach in learning disentangled individual source semantics from mixtures and images for multi-source localization.

In order to qualitatively evaluate the localization maps, we compare the proposed AVGN with EZ-VSL [31], Mix-and-Localize [23], and DSOL [22] on both single-source and multi-source localization in Figure 3. From the comparisons, three main observations can be derived: 1) Without explicit separation objectives, EZ-VSL [31], the strong single-source baseline, performs worse on multi-source localization.
Figure 3. Qualitative comparisons with single-source and multi-source baselines on multi-source localization. Note that blue indicates high attention values and red indicates low attention values. The proposed AVGN produces much more accurate and higher-quality localization maps for each source.
AVCT | AVG | MUSIC-Solo: AP / [email protected] / AUC (%) | MUSIC-Duet: CAP / PIAP / [email protected] / AUC (%)
✗ | ✗ | 71.5 / 45.8 / 41.2 | 39.7 / 43.1 / 24.3 / 21.3
✓ | ✗ | 75.2 / 52.3 / 45.1 | 46.9 / 51.8 / 27.6 / 22.5
✗ | ✓ | 73.6 / 48.2 / 43.5 | 42.8 / 49.5 / 25.3 / 21.8
✓ | ✓ | 77.2 / 58.1 / 48.5 | 50.6 / 57.2 / 32.5 / 24.6
Table 3. Ablation studies on Audio-Visual Class Tokens (AVCT) and Audio-Visual Grouping (AVG).
2) The quality of the localization maps generated by our method is much better than that of the self-supervised multi-source baseline, Mix-and-Localize [23]. 3) By using category labels during training, the proposed AVGN achieves competitive or even better predicted maps than the weakly-supervised multi-source baseline [22]. These visualizations further showcase the superiority of our simple AVGN in learning category-aware audio-visual representations to guide localization for each source.

4.3. Experimental analysis

In this section, we perform ablation studies to demonstrate the benefit of introducing the Audio-Visual Class Tokens and Audio-Visual Grouping modules. We also conduct extensive experiments on localizing a flexible number of sound sources and on the learned disentangled category-aware audio-visual representations.

Audio-Visual Class Tokens & Audio-Visual Grouping. In order to validate the effectiveness of the introduced audio-visual class tokens (AVCT) and audio-visual grouping (AVG), we ablate the necessity of each module and report the quantitative results in Table 3. We can observe that adding the learnable AVCT to the vanilla baseline substantially increases the results of single-source localization (by 3.7 AP, 6.5 [email protected], and 3.9 AUC) and multi-source localization (by 7.2 CAP, 8.7 PIAP, and 3.3 [email protected]), which demonstrates the benefit of category tokens in extracting disentangled high-level semantics for source localization. Meanwhile, introducing only AVG into the baseline also increases the source localization performance in terms of all metrics. More importantly, incorporating AVCT and AVG together into the baseline significantly raises the performance by 5.7 AP, 12.3 [email protected], and 7.3 AUC on single-source localization, and by 10.9 CAP, 14.1 PIAP, 8.2 [email protected], and 3.3 AUC on multi-source localization. These improvements validate the importance of audio-visual class tokens and audio-visual grouping in extracting category-aware semantics from the mixture and image for sound localization.

Generalizing to a Flexible Number of Sources. In order to show the generalizability of the proposed AVGN to a flexible number of sources, we directly transfer the model, without additional training, to test on mixtures of 3 sources. We still achieve competitive results of 18.5 CAP, 23.7 PIAP, 22.7 [email protected], and 21.8 AUC on the challenging VGGSound-Duet dataset. These results indicate that our AVGN can support localizing a flexible number of sources from the mixture, unlike Mix-and-Localize [23], whose number of sources is fixed as the number of nodes defined in the trained contrastive random walker.
Figure 4. Qualitative comparisons of representations learned by Mix-and-Localize, DSOL, and the proposed AVGN. Note that each spot
denotes the feature of one source sound, and each color refers to one source category, such as “trumpet” in orange and “cello” in pink.
Learned Category-aware Audio-Visual Representations. Learning disentangled audio-visual representations with category-aware semantics is critical for localizing sound sources from a mixture. To better evaluate the quality of the learned category-aware features, we visualize the learned visual and audio representations of 6 categories in MUSIC-Duet by t-SNE [45], as shown in Figure 4. Note that each color refers to one category of source sound, such as "trumpet" in orange and "cello" in pink. As can be observed in the last column, the audio-visual representations extracted by the proposed AVGN are both intra-category compact and inter-category separable. In contrast to our disentangled embeddings in the audio-visual semantic space, there still exist mixtures of multiple audio-visual categories among the features learned by Mix-and-Localize [23]. With the benefit of weakly-supervised classes, DSOL [22] can extract clustered audio-visual features for some classes, such as "cello" in pink. However, most categories remain mixed together, as these methods do not incorporate the explicit audio-visual grouping mechanism of our AVGN. These meaningful visualization results further showcase the success of our AVGN in extracting compact audio-visual representations with class-aware semantics for sound source localization from the mixture.
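One possible recipe for the visualization described above, using scikit-learn's t-SNE; the perplexity, initialization, and random inputs are assumptions for illustration only.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def plot_tsne(embeddings, labels, title="Class-aware embeddings"):
    """Project (N, D) source embeddings to 2-D and color by source category."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    for c in np.unique(labels):
        mask = labels == c
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=str(c))
    plt.title(title)
    plt.legend()
    plt.show()


# e.g., 600 class-aware audio-visual embeddings from 6 MUSIC-Duet categories.
plot_tsne(np.random.randn(600, 512), np.random.randint(0, 6, size=600))
```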
4.4. Limitation

Although the proposed AVGN achieves superior results on both single-source and multi-source localization, the performance gains of our approach on the MUSIC-Duet benchmark, which has a small number of categories, are not significant. One possible reason is that our model easily overfits during training; a possible solution is to incorporate dropout and momentum encoders for multi-source localization. Meanwhile, we notice that if we transfer our model to open-set source localization without additional training, it is hard to localize unseen categories, since we need to pre-define a set of categories during training and do not learn unseen category tokens to guide multi-source localization. Future work could add enough learnable class tokens or apply continual learning to new classes.

5. Conclusion

In this work, we present AVGN, a novel audio-visual grouping network that can directly learn category-wise semantic features for each source from audio and visual inputs for localizing multiple sources in videos. We introduce learnable audio-visual category tokens to aggregate class-aware source features. Then, we leverage the aggregated semantic features for each source to guide localization of the corresponding regions. Compared to existing multi-source methods, our new framework can handle a flexible number of sources and learns compact audio-visual semantic representations. Empirical experiments on the MUSIC, VGGSound-Instruments, and VGGSound-Sources benchmarks demonstrate the state-of-the-art performance of our AVGN on both single-source and multi-source localization.

Broader Impact. The proposed method learns to localize sound sources from user-uploaded web videos, which might cause the model to learn internal biases in the data. For example, the model could fail to localize certain rare but crucial sound sources. These issues should be carefully addressed before deployment in real-world scenarios.
References

[1] Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 208–224, 2020.
[2] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017.
[3] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.
[4] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2016.
[5] Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, and Anoop Cherian. Visual scene graphs for audio source separation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1204–1213, 2021.
[6] Changan Chen, Unnat Jain, Carl Schissler, S. V. A. Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. SoundSpaces: Audio-visual navigation in 3D environments. In Proceedings of the European Conference on Computer Vision (ECCV), pages 17–36, 2020.
[7] Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Learning to set waypoints for audio-visual navigation. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[8] Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robinson, and Kristen Grauman. SoundSpaces 2.0: A simulation platform for visual-acoustic learning. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2022.
[9] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16867–16876, 2021.
[10] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[12] Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619, 2018.
[13] John W Fisher III, Trevor Darrell, William Freeman, and Paul Viola. Learning joint statistical models for audio-visual fusion and segregation. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2000.
[14] Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, and Antonio Torralba. Music gesture for visual sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10478–10487, 2020.
[15] Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, and Antonio Torralba. Music gesture for visual sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[16] Ruohan Gao, Rogerio Feris, and Kristen Grauman. Learning to separate object sounds by watching unlabeled video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 35–53, 2018.
[17] Ruohan Gao and Kristen Grauman. 2.5D visual sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 324–333, 2019.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[19] John Hershey and Michael Casey. Audio-visual sound separation via hidden Markov models. In Advances in Neural Information Processing Systems, volume 14, 2001.
[20] John Hershey and Javier Movellan. Audio vision: Using audio-visual synchrony to locate sounds. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 1999.
[21] Di Hu, Feiping Nie, and Xuelong Li. Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9248–9257, 2019.
[22] Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, and Dejing Dou. Discriminative sounding objects localization via self-supervised audiovisual matching. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 10077–10087, 2020.
[23] Xixi Hu, Ziyang Chen, and Andrew Owens. Mix and localize: Localizing sound sources in mixtures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10483–10492, 2022.
[24] Einat Kidron, Yoav Y Schechner, and Michael Elad. Pixels that sound. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2018.
[27] Yan-Bo Lin, Yu-Jhe Li, and Yu-Chiang Frank Wang. Dual-modality seq2seq network for audio-visual event localization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2002–2006, 2019.
[28] Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, and Ming-Hsuan Yang. Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2021.
[29] Yan-Bo Lin and Yu-Chiang Frank Wang. Audiovisual transformer with instance attention for audio-visual event localization. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2020.
[30] Shentong Mo and Pedro Morgado. A closer look at weakly-supervised audio-visual source localization. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2022.
[31] Shentong Mo and Pedro Morgado. Localizing visual sounds the easy way. In Proceedings of the European Conference on Computer Vision (ECCV), pages 218–234, 2022.
[32] Shentong Mo and Yapeng Tian. Multi-modal grouping network for weakly-supervised audio-visual video parsing. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2022.
[33] Pedro Morgado, Yi Li, and Nuno Vasconcelos. Learning representations from audio-visual spatial alignment. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 4733–4744, 2020.
[34] Pedro Morgado, Ishan Misra, and Nuno Vasconcelos. Robust audio-visual instance discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12934–12945, 2021.
[35] Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, and Oliver Wang. Self-supervised generation of spatial audio for 360° video. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2018.
[36] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12475–12486, 2021.
[37] Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–816, 2016.
[38] Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, and Weiyao Lin. Multiple sound sources localization from coarse to fine. In Proceedings of the European Conference on Computer Vision (ECCV), pages 292–308, 2020.
[39] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4358–4366, 2018.
[40] Arda Senocak, Hyeonggon Ryu, Junsik Kim, and In So Kweon. Learning sound localization better from semantically similar samples. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[41] Yapeng Tian, Di Hu, and Chenliang Xu. Cyclic co-learning of sounding object visual grounding and sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2745–2754, 2021.
[42] Yapeng Tian, Dingzeyu Li, and Chenliang Xu. Unified multisensory perception: Weakly-supervised audio-visual video parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 436–454, 2020.
[43] Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[44] Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel PW Ellis, and John R Hershey. Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds. arXiv preprint arXiv:2011.01143, 2020.
[45] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008.
[46] Yu Wu and Yi Yang. Exploring heterogeneous clues for weakly-supervised audio-visual video parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1326–1335, 2021.
[47] Yu Wu, Linchao Zhu, Yan Yan, and Yi Yang. Dual attention matching for audio-visual event localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6291–6299, 2019.
[48] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. GroupViT: Semantic segmentation emerges from text supervision. arXiv preprint arXiv:2202.11094, 2022.
[49] Xudong Xu, Bo Dai, and Dahua Lin. Recursive visual sound separation using minus-plus net. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[50] Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba. The sound of motions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1735–1744, 2019.
[51] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pages 570–586, 2018.