Audio-Visual Grouping Network For Sound Localization From Mixtures
representations with learnable audio-visual class tokens as guidance for sound localization.

Since multiple sound sources are mixed in the original space, recent researchers have tried to explore diverse pipelines to localize multiple sources on frames from a sound mixture. This multi-source task requires the model to associate individual sources separated from the mixture with each frame. Qian et al. [38] leveraged a two-stage framework to capture cross-modal feature alignment between sound and vision representations in a coarse-to-fine manner. DSOL [22] introduced a two-stage training framework to tackle silence in category-aware sound source localization. More recently, Mix-and-Localize [23] proposed to use a contrastive random walk on a graph with images and separated sounds as nodes, where a random walker is trained to walk from each audio node to an image node with the audio-visual similarity as the transition probability. Despite their promising performance, these methods can only handle a fixed number of sources, and they cannot learn compact class-aware representations for individual sources. In contrast, we can support a flexible number of sources as input and learn class-aware representations for each source.

The main challenge is that sounds are naturally mixed in the audio space. This inspires us to disentangle the individual semantics for each source from the mixture to guide source localization. To address the problem, our key idea is to disentangle individual source representations using audio-visual grouping for source localization, which is different from existing single-source and multi-source methods. During training, we aim to learn audio-visual category tokens to aggregate category-aware source features from the sound mixture and the image, where separated high-level semantics for individual sources are learned.

To this end, we propose a novel audio-visual grouping network, namely AVGN, that can directly learn category-wise semantic features for each source from the input audio mixture and frame to localize multiple sources simultaneously. Specifically, our AVGN leverages learnable audio-visual class tokens to aggregate class-aware source features. Then, the aggregated semantic features for each source serve as guidance to localize the corresponding visual regions. Compared to previous multi-source baselines, our new framework can support a flexible number of sources and shows the effectiveness of learning compact audio-visual representations with category-aware semantics.

Empirical experiments on the MUSIC and VGGSound-Instruments benchmarks comprehensively demonstrate state-of-the-art performance against previous single-source and multi-source baselines. In addition, qualitative visualizations of localization results vividly showcase the effectiveness of our AVGN in localizing individual sources from mixtures. Extensive ablation studies also validate the importance of category-aware grouping and learnable audio-visual class tokens in learning compact representations for sound source localization.

Our main contributions can be summarized as follows:

• We present a novel Audio-Visual Grouping Network, namely AVGN, to disentangle the individual semantics from sound mixtures and images to guide source localization.

• We introduce learnable audio-visual class tokens and category-aware grouping in sound localization to aggregate category-wise source features with explicit high-level semantics.

• Extensive experiments comprehensively demonstrate the state-of-the-art superiority of our AVGN over previous baselines on both single-source and multi-source sounding object localization.

2. Related Work

Audio-Visual Joint Learning. Audio-visual joint learning has been addressed in many previous works [2, 4, 12, 14, 19, 21, 26, 33, 34, 36, 37, 39, 50, 51] to learn the audio-visual correlation between two distinct modalities from videos. Such cross-modal alignments are beneficial for many audio-visual tasks, such as audio-visual event localization [27, 29, 43, 47], audio-visual spatialization [6, 17, 33, 35], audio-visual navigation [6–8], and audio-visual parsing [28, 32, 42, 46]. In this work, our main focus is to learn compact audio-visual representations for localizing individual sources on images from sound mixtures, which is more demanding than the aforementioned tasks.

Audio-Visual Source Separation. Audio-visual source separation aims to separate individual sound sources from the audio mixture given the image with the sources on it. In recent years, researchers [12, 16, 19, 41, 44, 49, 51] have tried to explore diverse pipelines to learn discriminative visual representations from images for source separation. Zhao et al. [51] first proposed a "Mix-and-Separate" network to capture the alignment between pixels and the spectral components of audio for the reconstruction of each input source spectrogram. With the benefit of visual cues, MP-Net [49] utilized a recursive MinusPlus Net to separate all salient sounds from the mixture. Tian et al. [41] used a cyclic co-learning framework with sounding object visual grounding to separate visual sound sources. More recently, additional modalities have been explored to boost the performance of audio-visual source separation, such as motion in SoM [50], gestures composed of pose and keypoints in MG [15], and spatio-temporal visual scene graphs in AVSGS [5]. Different from them, we do not need to recover the audio spectrogram of individual sources from the mixture. Instead, we leverage the category-aware representations of individual sources to localize the corresponding regions for each source, where learnable audio-visual class tokens serve as the guidance.
Figure 2. Illustration of the proposed Audio-Visual Grouping Network (AVGN). The Audio-Visual Grouping module takes as inputs the global audio feature F^a = f^a of the mixture spectrogram, the spatial-level visual features F^v = {f_p^v}_{p=1}^P of the video frame from the respective encoders, and learnable audio-visual class tokens {c_i}_{i=1}^C for C categories in the semantic space, and generates disentangled class-aware audio-visual representations {g_n^a}_{n=1}^N, {g_n^v}_{n=1}^N for N sources. Note that the N source embeddings are chosen from the C categories according to the ground-truth class. Finally, two classification layers, each composed of an FC layer and a sigmoid function, are used separately to predict the audio and video categories, and localization maps are generated from the cosine similarity between the audio-visual class-aware embeddings.
Visual Sound Source Localization. Visual sound source localization is a typical and challenging problem that predicts the location of individual sound sources in a video. Early works [13, 20, 24] applied traditional machine learning approaches, such as statistical models [13] and canonical correlation analysis [24], to learn low-level alignment between audio and visual representations. With the success of deep neural nets, recent researchers [1, 9, 21, 30, 31, 38–40] explored many architectures to learn the audio-visual correspondence for localizing single-source sounds. Attention10k [39] localized a sound source in the image using a two-stream architecture with an attention mechanism. Hard sample mining was introduced in LVS [9] to optimize a differentiable threshold-based contrastive loss for predicting discriminative audio-visual correspondence maps. More recently, a multiple-instance contrastive learning framework was proposed in EZ-VSL [31] to align regions with the most corresponding audio without involving negative regions.

Due to the naturally mixed property of sounds in our environment, recent works [22, 23, 38] have also explored different frameworks to localize multiple sources on frames from a sound mixture simultaneously. DSOL [22] utilized a two-stage training framework to deal with silence in category-aware sound source localization. More recently, a contrastive random walk model was trained in Mix-and-Localize [23] to link each audio node with an image node using a transition probability of audio-visual similarity. While those single-source and multi-source approaches achieve promising performance in sound localization, they can only handle a fixed number of sources and they cannot learn discriminative class-aware representations for individual sources. In contrast, we develop a fully novel framework to aggregate compact category-wise audio and visual source representations with explicit learnable source class tokens. To the best of our knowledge, we are the first to leverage an explicit grouping mechanism for sound source localization. Our experiments in Section 4.2 also demonstrate the effectiveness of AVGN in both single-source and multi-source localization.

3. Method

Given an image and a mixture of audio, our target is to localize individual sound sources on the image. We propose a novel Audio-Visual Grouping Network named AVGN for disentangling individual semantics from the mixture and image, which mainly consists of two modules: Audio-Visual Class Tokens in Section 3.2 and Audio-Visual Grouping in Section 3.3.

3.1. Preliminaries

In this section, we first describe the problem setup and notations, and then revisit the multiple-instance contrastive learning in EZ-VSL [31] for single-source localization.

Problem Setup and Notations. Given a mixture spectrogram and an image, our goal is to localize N individual sound sources in the image spatially. For a video with C source event categories, we have an audio-visual label denoted as {y_i}_{i=1}^C, where y_i = 1 for the ground-truth category entry i. During training, we do not have bounding-box or mask-level annotations. Therefore, we can only use the video-level label for the mixture spectrogram and image to perform weakly-supervised learning.
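As a small illustration of this weak supervision, the sketch below (the class count and class indices are illustrative, not taken from the paper) builds the video-level multi-hot label {y_i}_{i=1}^C for a mixture; no bounding boxes or masks are involved.

```python
import torch

NUM_CLASSES = 37  # illustrative number of source categories


def video_level_label(present_classes, num_classes=NUM_CLASSES):
    """Multi-hot label y with y_i = 1 for every category present in the mixture."""
    y = torch.zeros(num_classes)
    y[list(present_classes)] = 1.0
    return y


# A duet mixture containing, say, category 3 ("cello") and category 11 ("trumpet").
y = video_level_label({3, 11})
```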
Revisit Single-source Localization. To address the single-source localization problem, EZ-VSL [31] introduced a multiple-instance contrastive learning framework to align the audio and visual features at locations corresponding to sound sources. Given a global audio feature F^a = f^a ∈ R^{1×D} and spatial-level visual features spanning all locations in an image, F^v = {f_p^v}_{p=1}^P with f_p^v ∈ R^{1×D}, EZ-VSL applies a multiple-instance contrastive objective to align at least one location in the corresponding bag of visual features with the audio representation in the same mini-batch, which is defined as:

\mathcal{L}_{\mathrm{baseline}} = -\frac{1}{B}\sum_{b=1}^{B}\log\frac{\exp\left(\frac{1}{\tau}\,\mathrm{sim}(\mathbf{F}^a_b,\mathbf{F}^v_b)\right)}{\sum_{m=1}^{B}\exp\left(\frac{1}{\tau}\,\mathrm{sim}(\mathbf{F}^a_b,\mathbf{F}^v_m)\right)}   (1)

where the similarity sim(F^a, F^v) denotes the max-pooled audio-visual cosine similarity of F^a and F^v = {f_p^v}_{p=1}^P across all P spatial locations, B is the batch size, D is the embedding dimension, and τ is a temperature hyper-parameter.
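A minimal PyTorch sketch of the objective in Eq. (1), assuming L2-normalized features so that dot products are cosine similarities; the max over the P spatial locations implements sim(F^a, F^v), and the temperature value is illustrative rather than the authors' setting.

```python
import torch
import torch.nn.functional as F


def micl_loss(f_a, f_v, tau=0.03):
    """Multiple-instance contrastive loss of Eq. (1) (sketch).

    f_a: (B, D) global audio features, one per mixture.
    f_v: (B, P, D) spatial visual features, P locations per image.
    """
    f_a = F.normalize(f_a, dim=-1)
    f_v = F.normalize(f_v, dim=-1)
    # Cosine similarity of every audio b against every location of every image m: (B, B, P).
    sim_all = torch.einsum("bd,mpd->bmp", f_a, f_v)
    # Max-pool over spatial locations -> sim(F^a_b, F^v_m): (B, B).
    sim = sim_all.max(dim=-1).values / tau
    # InfoNCE over the batch: the matching image is the positive for each audio.
    targets = torch.arange(f_a.size(0), device=f_a.device)
    return F.cross_entropy(sim, targets)


loss = micl_loss(torch.randn(8, 512), torch.randn(8, 49, 512))
```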
However, such a training objective poses the main challenge for multi-source localization: the global audio representation extracted from the mixture is entangled, and it therefore cannot associate the individual sources separated from the mixture with their corresponding regions. To address this challenge, we take inspiration from [48] and propose a novel Audio-Visual Grouping Network that learns to disentangle the individual semantics from the mixture and image to guide multi-source localization, as illustrated in Figure 2.

3.2. Audio-Visual Class Tokens

In order to explicitly disentangle individual semantics from the mixed sound space and image, we introduce novel learnable audio-visual class tokens {c_i}_{i=1}^C to help group semantic-aware information from the audio-visual representations f^a, {f_p^v}_{p=1}^P, where c_i ∈ R^{1×D}, C is the total number of source classes, and P denotes the total number of locations in the spatial map.

With the categorical audio-visual tokens and raw representations, we first apply self-attention transformers φ^a(·), φ^v(·) to aggregate global audio and spatial visual features from the raw input and align the features with the categorical token embeddings as:

\hat{\mathbf{f}}^a, \{\hat{\mathbf{c}}_i^a\}_{i=1}^C = \{\phi^a(\mathbf{x}^a_j, \mathbf{X}^a, \mathbf{X}^a)\}_{j=1}^{1+C}, \quad \mathbf{X}^a = \{\mathbf{x}^a_j\}_{j=1}^{1+C} = [\mathbf{f}^a; \{\mathbf{c}_i\}_{i=1}^C]   (2)

\{\hat{\mathbf{f}}_p^v\}_{p=1}^P, \{\hat{\mathbf{c}}_i^v\}_{i=1}^C = \{\phi^v(\mathbf{x}^v_j, \mathbf{X}^v, \mathbf{X}^v)\}_{j=1}^{P+C}, \quad \mathbf{X}^v = \{\mathbf{x}^v_j\}_{j=1}^{P+C} = [\{\mathbf{f}_p^v\}_{p=1}^P; \{\mathbf{c}_i\}_{i=1}^C]   (3)

where [ ; ] denotes the concatenation operator, \hat{f}^a, \hat{f}_p^v, \hat{c}_i^a, \hat{c}_i^v ∈ R^{1×D}, and D is the embedding dimension. The self-attention operators φ^a(·), φ^v(·) are formulated as:

\phi^a(\mathbf{x}_j^a, \mathbf{X}^a, \mathbf{X}^a) = \mathrm{Softmax}\left(\frac{\mathbf{x}_j^a {\mathbf{X}^a}^\top}{\sqrt{D}}\right)\mathbf{X}^a   (4)

\phi^v(\mathbf{x}_j^v, \mathbf{X}^v, \mathbf{X}^v) = \mathrm{Softmax}\left(\frac{\mathbf{x}_j^v {\mathbf{X}^v}^\top}{\sqrt{D}}\right)\mathbf{X}^v   (5)

Then, in order to constrain the independence of each class token c_i in the audio-visual semantic space, we apply a fully-connected (FC) layer followed by a softmax operator to predict the individual source class probability e_i = Softmax(FC(c_i)). Each audio-visual category probability is optimized with a cross-entropy loss \sum_{i=1}^{C} \mathrm{CE}(h_i, e_i), where CE(·) is the cross-entropy loss and h_i denotes a one-hot encoding whose element for the target category entry i is 1. Optimizing this loss pushes the learned token embeddings to be category-aware and discriminative.
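The sketch below illustrates Eqs. (2)-(5) together with the class-token constraint: the class tokens are concatenated with the modality features, updated by self-attention, and each updated token is classified by an FC layer against its own category. A single unprojected attention head is shown for brevity, whereas the paper stacks three transformer layers per modality; module and variable names are ours, and the class count is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenAggregator(nn.Module):
    """Single-head self-attention over [features; class tokens], Eqs. (2)-(5) (sketch)."""

    def __init__(self, dim=512, num_classes=37):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)  # FC predicting e_i

    def forward(self, feats, class_tokens):
        # feats: (T, D) with T = 1 (global audio) or T = P (spatial visual).
        # class_tokens: (C, D), shared between the audio and visual branches.
        x = torch.cat([feats, class_tokens], dim=0)               # X = [f; {c_i}]
        attn = F.softmax(x @ x.t() / x.size(-1) ** 0.5, dim=-1)   # Eqs. (4)-(5)
        x = attn @ x
        feats_hat, tokens_hat = x[: feats.size(0)], x[feats.size(0):]
        # Category constraint: e_i = Softmax(FC(c_i)) with one-hot target h_i
        # (applied here to the updated tokens; cross_entropy includes the softmax).
        logits = self.classifier(tokens_hat)
        token_loss = F.cross_entropy(
            logits, torch.arange(logits.size(0), device=logits.device))
        return feats_hat, tokens_hat, token_loss


class_tokens = nn.Parameter(torch.randn(37, 512))    # learnable {c_i}, C = 37 (illustrative)
phi_a, phi_v = TokenAggregator(), TokenAggregator()  # separate audio / visual transformers
f_a_hat, c_a_hat, loss_a = phi_a(torch.randn(1, 512), class_tokens)   # global audio feature
f_v_hat, c_v_hat, loss_v = phi_v(torch.randn(49, 512), class_tokens)  # P = 49 visual locations
```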
3.3. Audio-Visual Grouping

With the benefit of the aforementioned category-constraint objective, we propose a novel and explicit audio-visual grouping module composed of grouping blocks g^a(·), g^v(·), which takes the learned audio-visual source class tokens and the aggregated features as inputs to generate category-aware audio-visual embeddings as:

\{\mathbf{g}_i^a\}_{i=1}^C = g^a(\hat{\mathbf{f}}^a, \{\hat{\mathbf{c}}_i^a\}_{i=1}^C), \quad \{\mathbf{g}_i^v\}_{i=1}^C = g^v(\{\hat{\mathbf{f}}_p^v\}_{p=1}^P, \{\hat{\mathbf{c}}_i^v\}_{i=1}^C)   (6)

During the grouping phase, we merge all audio-visual features assigned to the same class token into a new class-aware audio-visual feature, by computing a global audio similarity vector A^a ∈ R^{1×C} and a spatial visual similarity matrix A^v ∈ R^{P×C} between the audio-visual features and the audio-visual class tokens via a softmax operation, formulated as:

\mathbf{A}^a_i = \mathrm{Softmax}(W_q^a \hat{\mathbf{f}}^a \cdot W_k^a \hat{\mathbf{c}}_i^a), \quad \mathbf{A}^v_{p,i} = \mathrm{Softmax}(W_q^v \hat{\mathbf{f}}_p^v \cdot W_k^v \hat{\mathbf{c}}_i^v)   (7)

where W_q^a, W_k^a ∈ R^{D×D} and W_q^v, W_k^v ∈ R^{D×D} denote the learnable weights of the linear projections for the features and class tokens of the audio and visual modalities, respectively. With this global audio similarity vector and spatial visual similarity matrix, we compute the weighted sum of all global audio and spatial visual features assigned to each token to generate the category-aware representations as:

\mathbf{g}_i^a = g^a(\hat{\mathbf{f}}^a, \hat{\mathbf{c}}_i^a) = \hat{\mathbf{c}}_i^a + W_o^a \frac{\mathbf{A}^a_i W_v^a \hat{\mathbf{f}}^a}{\mathbf{A}^a_i}, \quad \mathbf{g}_i^v = g^v(\{\hat{\mathbf{f}}_p^v\}_{p=1}^P, \hat{\mathbf{c}}_i^v) = \hat{\mathbf{c}}_i^v + W_o^v \frac{\sum_{p=1}^{P}\mathbf{A}^v_{p,i} W_v^v \hat{\mathbf{f}}_p^v}{\sum_{p=1}^{P}\mathbf{A}^v_{p,i}}   (8)

where W_o^a, W_v^a ∈ R^{D×D} and W_o^v, W_v^v ∈ R^{D×D} denote the learnable weights of the linear projections for the output and value of the audio and visual modalities, respectively.
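A sketch of the grouping block of Eqs. (6)-(8) for one modality: each feature is softly assigned to the class tokens through learned query/key projections (Eq. (7)), and the value-projected features are averaged with those assignment weights and added back to the token (Eq. (8)). The softmax is taken over tokens for each feature, which is one reading of Eq. (7); normalization details may differ in the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupingBlock(nn.Module):
    """Category-aware grouping of Eqs. (6)-(8) (sketch, one modality)."""

    def __init__(self, dim=512):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W_q: projects features
        self.w_k = nn.Linear(dim, dim, bias=False)  # W_k: projects class tokens
        self.w_v = nn.Linear(dim, dim, bias=False)  # W_v: value projection
        self.w_o = nn.Linear(dim, dim, bias=False)  # W_o: output projection

    def forward(self, feats_hat, tokens_hat):
        # feats_hat: (T, D) aggregated features (T = 1 for audio, T = P for visual).
        # tokens_hat: (C, D) aggregated class tokens.
        q = self.w_q(feats_hat)                     # (T, D)
        k = self.w_k(tokens_hat)                    # (C, D)
        assign = F.softmax(q @ k.t(), dim=-1)       # A: (T, C), Eq. (7)
        v = self.w_v(feats_hat)                     # (T, D)
        # Weighted average of the features assigned to each token, Eq. (8).
        pooled = (assign.t() @ v) / assign.sum(dim=0, keepdim=True).t().clamp(min=1e-6)
        return tokens_hat + self.w_o(pooled)        # g_i = c_i_hat + W_o(...), shape (C, D)


group_v = GroupingBlock()
g_v = group_v(torch.randn(49, 512), torch.randn(37, 512))  # class-aware visual embeddings
```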
Method | MUSIC-Solo: AP / [email protected] / AUC (%) | VGGSound-Instruments: AP / [email protected] / AUC (%) | VGGSound-Single: AP / [email protected] / AUC (%)
Attention10k [39] | – / 37.2 / 38.7 | – / 28.3 / 26.1 | – / 19.2 / 30.6
OTS [3] | 69.3 / 26.1 / 35.8 | 47.5 / 25.7 / 24.6 | 29.8 / 32.8 / 35.7
DMC [21] | – / 29.1 / 38.0 | – / 26.5 / 25.7 | – / 23.9 / 27.6
CoarsetoFine [38] | 70.7 / 33.6 / 39.8 | 40.2 / 27.2 / 26.5 | 28.2 / 29.1 / 34.8
LVS [9] | 70.6 / 41.9 / 40.3 | 42.3 / 32.6 / 28.3 | 29.6 / 34.4 / 38.2
EZ-VSL [31] | 71.5 / 45.8 / 41.2 | 43.8 / 38.5 / 30.6 | 31.3 / 38.9 / 39.5
Mix-and-Localize [23] | 68.6 / 30.5 / 40.8 | 44.9 / 49.7 / 32.3 | 32.5 / 36.3 / 38.9
DSOL [22] | – / 51.4 / 43.7 | – / 50.2 / 32.9 | – / 35.7 / 37.2
AVGN (ours) | 77.2 / 58.1 / 48.5 | 50.5 / 55.3 / 36.7 | 35.3 / 40.8 / 42.3
Table 1. Quantitative results of single-source localization on MUSIC-Solo, VGGSound-Instruments, and VGGSound-Single datasets.
Following previous works [9, 30, 31], the final localization map is generated through bilinear interpolation of the similarity map.
Note that the training data available to us is smaller than the raw MUSIC dataset; for a fair comparison, we trained all models on the same training data.
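A minimal sketch of this step, assuming a class-aware audio embedding for one source and a 7 × 7 grid of visual features: the cosine-similarity map is computed per location and bilinearly upsampled to the image resolution. Tensor names and sizes here are illustrative.

```python
import torch
import torch.nn.functional as F


def localization_map(g_a, f_v, out_size=(224, 224)):
    """Cosine-similarity localization map (sketch).

    g_a: (D,) class-aware audio embedding for one source.
    f_v: (D, H, W) spatial visual features (e.g., H = W = 7).
    Returns an (out_H, out_W) map with values in [-1, 1].
    """
    g_a = F.normalize(g_a, dim=0)              # unit-norm audio embedding
    f_v = F.normalize(f_v, dim=0)              # unit-norm per spatial location
    sim = torch.einsum("d,dhw->hw", g_a, f_v)  # cosine similarity per location
    sim = sim[None, None]                      # (1, 1, H, W) for interpolation
    return F.interpolate(sim, size=out_size, mode="bilinear",
                         align_corners=False)[0, 0]


heat = localization_map(torch.randn(512), torch.randn(512, 7, 7))
print(heat.shape)  # torch.Size([224, 224])
```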
Method | MUSIC-Duet: CAP / PIAP / [email protected] / AUC (%) | VGGSound-Instruments: CAP / PIAP / [email protected] / AUC (%) | VGGSound-Duet: CAP / PIAP / [email protected] / AUC (%)
Attention10k [39] | – / – / 21.6 / 19.6 | – / – / 52.3 / 11.7 | – / – / 11.5 / 15.2
OTS [3] | 11.6 / 17.7 / 13.3 / 18.5 | 23.3 / 37.8 / 51.2 / 11.2 | 10.5 / 12.7 / 12.2 / 15.8
DMC [21] | – / – / 17.5 / 21.1 | – / – / 53.7 / 12.5 | – / – / 13.8 / 17.1
CoarsetoFine [38] | – / – / 17.6 / 20.6 | – / – / 54.2 / 12.9 | – / – / 14.7 / 18.5
LVS [9] | – / – / 22.5 / 20.9 | – / – / 57.3 / 13.3 | – / – / 17.3 / 19.5
EZ-VSL [31] | – / – / 24.3 / 21.3 | – / – / 60.2 / 14.2 | – / – / 20.5 / 20.2
Mix-and-Localize [23] | 47.5 / 54.1 / 26.5 / 21.5 | 21.5 / 37.5 / 73.2 / 15.6 | 16.3 / 22.6 / 21.1 / 20.5
DSOL [22] | – / – / 30.1 / 22.3 | – / – / 74.3 / 15.9 | – / – / 22.3 / 21.1
AVGN (ours) | 50.6 / 57.2 / 32.5 / 24.6 | 27.3 / 42.8 / 77.5 / 18.2 | 21.9 / 28.1 / 26.2 / 23.8
Table 2. Quantitative results of multi-source localization on MUSIC-Duet, VGGSound-Instruments, and VGGSound-Duet datasets.
Following [23] for a fair comparison, we use [email protected] and [email protected] as the IoU and CIoU thresholds for MUSIC-Solo and MUSIC-Duet, [email protected] and [email protected] for single-source and multi-source localization on VGGSound-Instruments, and [email protected] and [email protected] for single-source and multi-source localization on VGGSound-Single and VGGSound-Duet.

Implementation. For input images, the resolution is resized to 224 × 224. For input audio, we take log spectrograms extracted from 3 s of audio at a sample rate of 22050 Hz. We follow the prior work [31] and apply the STFT to generate an input tensor of size 257 × 300 (257 frequency bands over 300 timesteps) using 50 ms windows with a hop size of 25 ms. Following previous work [9, 21, 30, 31, 38], we use the lightweight ResNet18 [18] as the audio and visual encoder, and initialize the visual model with weights pre-trained on ImageNet [11]. We set D = 512 and P = 49 for the 7 × 7 spatial map from the visual encoder. The depth of the self-attention transformers φ^a(·), φ^v(·) is 3. The model is trained for 100 epochs using the Adam optimizer [25] with a learning rate of 1e-4 and a batch size of 128.
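As a rough sketch of the audio front end, the snippet below produces a log-magnitude spectrogram of the stated 257 × 300 size; the n_fft and hop_length values are assumptions chosen to match that shape and may differ from the authors' exact STFT settings.

```python
import torch


def audio_to_logspec(waveform, n_fft=512, hop_length=221):
    """3 s mono waveform at 22050 Hz -> log-magnitude spectrogram of shape (257, 300).

    n_fft=512 yields 257 frequency bands; hop_length=221 yields 300 frames for
    3 s of audio at 22050 Hz. Both values are assumptions for this sketch.
    """
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft), return_complex=True)
    return torch.log(spec.abs() + 1e-7)  # (257, 300)


logspec = audio_to_logspec(torch.randn(3 * 22050))
print(logspec.shape)  # torch.Size([257, 300])

# Optimization as stated above: Adam, lr = 1e-4, batch size 128, 100 epochs, e.g.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```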
4.2. Comparison to prior work

In this work, we propose a novel and effective framework for sound source localization. In order to validate the effectiveness of the proposed AVGN, we comprehensively compare it to previous single-source and multi-source baselines: 1) Attention10k [39] (CVPR 2018): the first work on single-source localization with a two-stream architecture and an attention mechanism; 2) OTS [3] (ECCV 2018): a simple baseline with audio-visual correspondence as the training objective; 3) DMC [21] (CVPR 2019): a deep multi-modal clustering network with audio-visual co-occurrences to learn convolutional maps for each modality in different embedding spaces; 4) CoarsetoFine [38] (ECCV 2020): a two-stage baseline with coarse-to-fine alignment of cross-modal features; 5) DSOL [22] (NeurIPS 2020): a two-stage training framework with classes as weak supervision for category-aware sound source localization; 6) LVS [9] (CVPR 2021): a contrastive network that learns audio-visual correspondence maps with hard negative mining; 7) EZ-VSL [31] (ECCV 2022): a recent strong baseline with multiple-instance contrastive learning for single-source localization; 8) Mix-and-Localize [23] (CVPR 2022): a strong multi-source baseline using a contrastive random walk algorithm on a graph composed of images and separated sounds as nodes.

For single-source localization, we report the quantitative comparison results in Table 1. As can be seen, we achieve the best performance in terms of all metrics on the three benchmarks, compared to previous self-supervised and weakly-supervised baselines. In particular, the proposed AVGN significantly outperforms DSOL [22], the current state-of-the-art weakly-supervised baseline, by 6.7 [email protected] & 4.8 AUC, 5.1 [email protected] & 3.8 AUC, and 5.1 [email protected] & 5.1 AUC on the three datasets. Moreover, we achieve superior performance gains compared to EZ-VSL [31], the current state-of-the-art self-supervised baseline, which implies the importance of extracting category-aware semantics from audio-visual inputs as guidance for learning discriminative audio-visual alignment. Meanwhile, our AVGN outperforms Mix-and-Localize [23] by a large margin, with performance gains of 8.6 AP on MUSIC-Solo, 5.6 AP on VGGSound-Instruments, and 2.7 AP on VGGSound-Single. These significant improvements demonstrate the superiority of our method in single-source localization.

In addition, significant gains in multi-source sound localization can be observed in Table 2. Compared to Mix-and-Localize [23], the current state-of-the-art multi-source localization baseline, we achieve gains of 5.8 CAP, 5.3 PIAP, 4.3 [email protected], and 2.6 AUC on VGGSound-Instruments. Furthermore, when evaluated on the challenging VGGSound-Duet benchmark, the proposed approach still outperforms Mix-and-Localize [23] by 5.6 CAP, 5.5 PIAP, 5.1 [email protected], and 3.3 AUC. We also achieve clearly better results than DSOL [22], the weakly-supervised baseline with two training stages. These results validate the effectiveness of our approach in learning disentangled individual source semantics from mixtures and images for multi-source localization.

In order to qualitatively evaluate the localization maps, we compare the proposed AVGN with EZ-VSL [31], Mix-and-Localize [23], and DSOL [22] on both single-source and multi-source localization in Figure 3. From the comparisons, three main observations can be derived: 1) Without explicit separation objectives, EZ-VSL [31], the strong single-source baseline, performs worse on multi-source localization.
Figure 3. Qualitative comparisons with single-source and multi-source baselines on multi-source localization. Note that blue indicates high attention values and red indicates low attention values. The proposed AVGN produces much more accurate and higher-quality localization maps for each source.
AVCT | AVG | MUSIC-Solo: AP / [email protected] / AUC (%) | MUSIC-Duet: CAP / PIAP / [email protected] / AUC (%)
✗ | ✗ | 71.5 / 45.8 / 41.2 | 39.7 / 43.1 / 24.3 / 21.3
✓ | ✗ | 75.2 / 52.3 / 45.1 | 46.9 / 51.8 / 27.6 / 22.5
✗ | ✓ | 73.6 / 48.2 / 43.5 | 42.8 / 49.5 / 25.3 / 21.8
✓ | ✓ | 77.2 / 58.1 / 48.5 | 50.6 / 57.2 / 32.5 / 24.6
Table 3. Ablation studies on Audio-Visual Class Tokens (AVCT) and Audio-Visual Grouping (AVG).
2) The quality of the localization maps generated by our method is much better than that of the self-supervised multi-source baseline, Mix-and-Localize [23]. 3) By using category labels during training, the proposed AVGN achieves competitive or even better predicted maps than the weakly-supervised multi-source baseline [22]. These visualizations further showcase the superiority of our simple AVGN in learning category-aware audio-visual representations to guide localization for each source.

4.3. Experimental analysis

In this section, we perform ablation studies to demonstrate the benefit of introducing the Audio-Visual Class Tokens and Audio-Visual Grouping modules. We also conduct extensive experiments on localizing a flexible number of sound sources and on the learned disentangled category-aware audio-visual representations.

Audio-Visual Class Tokens & Audio-Visual Grouping. In order to validate the effectiveness of the introduced audio-visual class tokens (AVCT) and audio-visual grouping (AVG), we ablate the necessity of each module and report the quantitative results in Table 3. We can observe that adding the learnable AVCT to the vanilla baseline substantially increases the results of single-source localization (by 3.7 AP, 6.5 [email protected], and 3.9 AUC) and multi-source localization (by 7.2 CAP, 8.7 PIAP, and 3.3 [email protected]), which demonstrates the benefit of category tokens in extracting disentangled high-level semantics for source localization. Meanwhile, introducing only AVG into the baseline also increases the source localization performance in terms of all metrics. More importantly, incorporating AVCT and AVG together into the baseline significantly raises the performance by 5.7 AP, 12.3 [email protected], and 7.3 AUC on single-source localization, and by 10.9 CAP, 14.1 PIAP, 8.2 [email protected], and 3.3 AUC on multi-source localization. These improvements validate the importance of audio-visual class tokens and audio-visual grouping in extracting category-aware semantics from the mixture and image for sound localization.

Generalizing to a Flexible Number of Sources. In order to show the generalizability of the proposed AVGN to a flexible number of sources, we directly transfer the model, without additional training, to test on mixtures of 3 sources. We still achieve competitive results of 18.5 CAP, 23.7 PIAP, 22.7 [email protected], and 21.8 AUC on the challenging VGGSound-Duet dataset. These results indicate that our AVGN can support localizing a flexible number of sources from the mixture, unlike Mix-and-Localize [23], whose number of sources is fixed as the number of nodes defined in the trained contrastive random walker.
Figure 4. Qualitative comparisons of representations learned by Mix-and-Localize, DSOL, and the proposed AVGN. Note that each spot
denotes the feature of one source sound, and each color refers to one source category, such as “trumpet” in orange and “cello” in pink.
Learned Category-aware Audio-Visual Representations. Learning disentangled audio-visual representations with category-aware semantics is critical for localizing sound sources from a mixture. To better evaluate the quality of the learned category-aware features, we visualize the learned visual and audio representations of 6 categories in MUSIC-Duet by t-SNE [45], as shown in Figure 4. Note that each color refers to one category of source sound, such as "trumpet" in orange and "cello" in pink. As can be observed in the last column, the audio-visual representations extracted by the proposed AVGN are both intra-category compact and inter-category separable. In contrast to our disentangled embeddings in the audio-visual semantic space, there still exist mixtures of multiple audio-visual categories among the features learned by Mix-and-Localize [23]. With the benefit of weakly-supervised classes, DSOL [22] can extract clustered audio-visual features for some classes, such as "cello" in pink. However, most categories remain mixed together, as these methods do not incorporate the explicit audio-visual grouping mechanism of our AVGN. These meaningful visualization results further showcase the success of our AVGN in extracting compact audio-visual representations with class-aware semantics for sound source localization from the mixture.
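One possible recipe for the visualization described above, using scikit-learn's t-SNE; the perplexity, initialization, and random inputs are assumptions for illustration only.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def plot_tsne(embeddings, labels, title="Class-aware embeddings"):
    """Project (N, D) source embeddings to 2-D and color by source category."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    for c in np.unique(labels):
        mask = labels == c
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=str(c))
    plt.title(title)
    plt.legend()
    plt.show()


# e.g., 600 class-aware audio-visual embeddings from 6 MUSIC-Duet categories.
plot_tsne(np.random.randn(600, 512), np.random.randint(0, 6, size=600))
```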
4.4. Limitation

Although the proposed AVGN achieves superior results on both single-source and multi-source localization, the performance gains of our approach on the MUSIC-Duet benchmark, which has a small number of categories, are not significant. One possible reason is that our model easily overfits during training; a possible solution is to incorporate dropout and momentum encoders for multi-source localization. Meanwhile, we notice that if we transfer our model to open-set source localization without additional training, it is hard to localize unseen categories, since we need to pre-define a set of categories during training and do not learn unseen category tokens to guide multi-source localization. Future work could add enough learnable class tokens or apply continual learning to new classes.

5. Conclusion

In this work, we present AVGN, a novel audio-visual grouping network that can directly learn category-wise semantic features for each source from audio and visual inputs for localizing multiple sources in videos. We introduce learnable audio-visual category tokens to aggregate class-aware source features. Then, we leverage the aggregated semantic features for each source to guide localization of the corresponding regions. Compared to existing multi-source methods, our new framework can handle a flexible number of sources and learns compact audio-visual semantic representations. Empirical experiments on the MUSIC, VGGSound-Instruments, and VGGSound-Sources benchmarks demonstrate the state-of-the-art performance of our AVGN on both single-source and multi-source localization.

Broader Impact. The proposed method learns to localize sound sources from user-uploaded web videos, which might cause the model to learn internal biases in the data. For example, the model could fail to localize certain rare but crucial sound sources. These issues should be carefully addressed before deployment in real-world scenarios.
References

[1] Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 208–224, 2020.
[2] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017.
[3] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.
[4] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2016.
[5] Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, and Anoop Cherian. Visual scene graphs for audio source separation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1204–1213, 2021.
[6] Changan Chen, Unnat Jain, Carl Schissler, S. V. A. Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. SoundSpaces: Audio-visual navigation in 3D environments. In Proceedings of the European Conference on Computer Vision (ECCV), pages 17–36, 2020.
[7] Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Learning to set waypoints for audio-visual navigation. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[8] Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robinson, and Kristen Grauman. SoundSpaces 2.0: A simulation platform for visual-acoustic learning. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2022.
[9] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16867–16876, 2021.
[10] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[12] Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619, 2018.
[13] John W Fisher III, Trevor Darrell, William Freeman, and Paul Viola. Learning joint statistical models for audio-visual fusion and segregation. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2000.
[14] Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, and Antonio Torralba. Music gesture for visual sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10478–10487, 2020.
[15] Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, and Antonio Torralba. Music gesture for visual sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[16] Ruohan Gao, Rogerio Feris, and Kristen Grauman. Learning to separate object sounds by watching unlabeled video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 35–53, 2018.
[17] Ruohan Gao and Kristen Grauman. 2.5D visual sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 324–333, 2019.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[19] John Hershey and Michael Casey. Audio-visual sound separation via hidden Markov models. In Advances in Neural Information Processing Systems, volume 14, 2001.
[20] John Hershey and Javier Movellan. Audio vision: Using audio-visual synchrony to locate sounds. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 1999.
[21] Di Hu, Feiping Nie, and Xuelong Li. Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9248–9257, 2019.
[22] Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, and Dejing Dou. Discriminative sounding objects localization via self-supervised audiovisual matching. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 10077–10087, 2020.
[23] Xixi Hu, Ziyang Chen, and Andrew Owens. Mix and localize: Localizing sound sources in mixtures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10483–10492, 2022.
[24] Einat Kidron, Yoav Y Schechner, and Michael Elad. Pixels that sound. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2018.
[27] Yan-Bo Lin, Yu-Jhe Li, and Yu-Chiang Frank Wang. Dual-modality seq2seq network for audio-visual event localization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2002–2006, 2019.
[28] Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, and Ming-Hsuan Yang. Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2021.
[29] Yan-Bo Lin and Yu-Chiang Frank Wang. Audiovisual transformer with instance attention for audio-visual event localization. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2020.
[30] Shentong Mo and Pedro Morgado. A closer look at weakly-supervised audio-visual source localization. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2022.
[31] Shentong Mo and Pedro Morgado. Localizing visual sounds the easy way. In Proceedings of the European Conference on Computer Vision (ECCV), pages 218–234, 2022.
[32] Shentong Mo and Yapeng Tian. Multi-modal grouping network for weakly-supervised audio-visual video parsing. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2022.
[33] Pedro Morgado, Yi Li, and Nuno Vasconcelos. Learning representations from audio-visual spatial alignment. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 4733–4744, 2020.
[34] Pedro Morgado, Ishan Misra, and Nuno Vasconcelos. Robust audio-visual instance discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12934–12945, 2021.
[35] Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, and Oliver Wang. Self-supervised generation of spatial audio for 360° video. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2018.
[36] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12475–12486, 2021.
[37] Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–816, 2016.
[38] Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, and Weiyao Lin. Multiple sound sources localization from coarse to fine. In Proceedings of the European Conference on Computer Vision (ECCV), pages 292–308, 2020.
[39] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4358–4366, 2018.
[40] Arda Senocak, Hyeonggon Ryu, Junsik Kim, and In So Kweon. Learning sound localization better from semantically similar samples. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[41] Yapeng Tian, Di Hu, and Chenliang Xu. Cyclic co-learning of sounding object visual grounding and sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2745–2754, 2021.
[42] Yapeng Tian, Dingzeyu Li, and Chenliang Xu. Unified multisensory perception: Weakly-supervised audio-visual video parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 436–454, 2020.
[43] Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[44] Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel PW Ellis, and John R Hershey. Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds. arXiv preprint arXiv:2011.01143, 2020.
[45] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008.
[46] Yu Wu and Yi Yang. Exploring heterogeneous clues for weakly-supervised audio-visual video parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1326–1335, 2021.
[47] Yu Wu, Linchao Zhu, Yan Yan, and Yi Yang. Dual attention matching for audio-visual event localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6291–6299, 2019.
[48] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. GroupViT: Semantic segmentation emerges from text supervision. arXiv preprint arXiv:2202.11094, 2022.
[49] Xudong Xu, Bo Dai, and Dahua Lin. Recursive visual sound separation using minus-plus net. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[50] Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba. The sound of motions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1735–1744, 2019.
[51] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pages 570–586, 2018.