Audio-Visual Spatial Integration and Recursive Attention For Robust Sound Source Localization
Audio-Visual Spatial Integration and Recursive Attention For Robust Sound Source Localization
ABSTRACT
The objective of the sound source localization task is to enable
Similarity
machines to detect the location of sound-making objects within Comparison
a visual scene. While the audio modality provides spatial cues to Visual Spatial
Feature
locate the sound source, existing approaches only use audio as an GAP
auxiliary role to compare spatial regions of the visual modality. Hu- Harp Image (a) Existing Methods
mans, on the other hand, utilize both audio and visual modalities Visual Modality 𝐌𝑣
as spatial cues to locate sound sources. In this paper, we propose
an audio-visual spatial integration network that integrates spatial
cues from both modalities to mimic human behavior when detect- Similarity 𝐌𝒂𝒗
Comparison
ing sound-making objects. Additionally, we introduce a recursive Audio Spatial
Feature
attention network to mimic human behavior of iterative focusing Harp Sound GAP (b) Proposed Method
on objects, resulting in more accurate attention regions. To effec-
Audio Modality 𝐌𝑎
tively encode spatial information from both modalities, we propose
audio-visual pair matching loss and spatial region alignment loss.
By utilizing the spatial cues of audio-visual modalities and recur- Figure 1: Conceptual comparison between (a) existing meth-
sively focusing objects, our method can perform more robust sound ods (red) and (b) the proposed method (blue). The existing
source localization. Comprehensive experimental results on the methods use the spatial information of visual modality as
Flickr SoundNet and VGG-Sound Source datasets demonstrate the the primary modality to estimate region of sound-making
superiority of our proposed method over existing approaches. Our objects (M𝑣 ). We observe that the audio modality itself also
code is available at: https://github.com/VisualAIKHU/SIRA-SSL. contains spatial information for estimating regions of the
sound-making object (M𝑎 ). In our work, we try to integrate
the spatial knowledge of the audio-visual modalities (M𝑎𝑣 )
CCS CONCEPTS
for more accurate sound source localization.
• Information systems → Multimedia information systems; •
Computing methodologies → Computer vision.
1 INTRODUCTION
KEYWORDS Sound source localization aims to identify the location of a sounding
Sound source localization, audio-visual spatial integration, recur- object within a visual scene [49]. This task is similar to the innate
sive attention, multimodal learning ability of humans to find the location by correlating sounds heard
ACM Reference Format:
with their ears and scenes seen with their eyes. Because of this
Sung Jin Um, Dongjin Kim, and Jung Uk Kim. 2023. Audio-Visual Spatial property, sound source localization has a wide range of applications,
Integration and Recursive Attention for Robust Sound Source Localization. such as multimodal robotics [30, 37], sound source separation [8],
In Proceedings of the 31st ACM International Conference on Multimedia (MM and indoor positioning [3].
’23), October 29-November 3, 2023, Ottawa, ON, Canada. ACM, New York, Since the sound source localization task utilizes multimodal
NY, USA, 10 pages. https://doi.org/10.1145/3581783.3611722 information (i.e., audio-visual), it is essential to consider how to
effectively combine the different two modal information for more
∗ Both authors have contributed equally to this work.
† Corresponding
accurate localization. In addition, while audio-visual data can be
author.
obtained in abundance, manually annotating object locations (e.g.,
bounding boxes or segmentation masks) is time-consuming and
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed labor-intensive. To address the two issues, several self-supervised
for profit or commercial advantage and that copies bear this notice and the full citation approaches [9, 15, 49, 51, 60, 63] have been proposed. Senocak et al.
on the first page. Copyrights for components of this work owned by others than the [49] proposed the attention mechanism with unsupervised learning
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission to match the audio-visual information. Chen et al. [9] introduced
and/or a fee. Request permissions from [email protected]. a network to explicitly mine the hard negative locations from the
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada foreground locations by using sound information. Xuan et al. [60]
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0108-5/23/10. . . $15.00 proposed a proposal-based method that focuses on the region inside
https://doi.org/10.1145/3581783.3611722 the bounding box of each object based on the given sound. In
3507
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Sung Jin Um, Dongjin Kim, and Jung Uk Kim
[15], the optical flow information was additionally incorporated to To sum up, the major contributions of this paper are summarized
effectively combine the audio-visual modalities. as follows:
However, the above-mentioned methods have in common that,
as shown in Figure 1(a), they utilize the audio modality only as an • We introduce audio-visual spatial integration network that
auxiliary role (red) in comparing whether each grid region of the exploits the spatial knowledge of audio-visual modalities. In
visual modality corresponds to the area of the sounding object. In addition, we propose recursive attention network to refine
fact, humans also have the ability to detect the location of an object the localization map in a recursive manner. To the best of
just by hearing the sound. For example, even when our eyes are our knowledge, it is the first work that considers the spatial
closed, we can still perceive the location of a car making a sound knowledge of audio modality for sound source localization.
by paying our attention to the corresponding spatial area. This is
because the spatial information can be inferred by relying on cues • To guide the feature representation of the single modality,
such as differences in arrival time, loudness, and spectral content of we propose audio-visual pair matching loss. Also, to enhance
the sound [18, 42, 50]. As shown in Figure 1(b), we observe that the spatial knowledge of the audio modality, we introduce spatial
audio modal itself also contains valuable spatial cues for inferring region alignment loss to resemble that of the attentive image.
the sound-making objects.
Moreover, according to [5, 26, 44], when humans receive both • Comprehensive quantitative and qualitative experimental
visual and auditory information, they naturally generate a region results on Flickr-SoundNet and VGG-Sound Source datasets
of interest (ROI) in each modality. These ROIs are then integrated validate the effectiveness of the proposed framework.
to form a region of attention, which is an indicator of where to
focus based on the combined audio-visual information. After fo- 2 RELATED WORK
cusing the attention region and eliminating the unnecessary areas,
humans identify the sound-making object by repeatedly engaging 2.1 Sound Source Localization
in a recursive recognition process [25, 41]. By doing so, we can Sound source localization aims to estimate the sound source lo-
make more accurate predictions. This cognitive process enables hu- cation using visual scenes. It requires an effective combination of
mans to effectively utilize visual and auditory information, leading visual and audio data, and various algorithms have been developed
to more accurate and comprehensive understanding of the world over the years to optimize this multimodal integration for accurate
around them. localization [1, 9, 15, 22, 46, 49, 51, 60, 63].
In this paper, based on our aforementioned motivations, we pro- One such approach is the use of attention mechanisms, which
pose a novel sound source localization framework that mimics the allow the network to selectively focus on relevant parts of the input
above-mentioned two cognitive psychological perspectives of hu- data. In [49], Senocak et al. propose a sound localization network
mans (i.e., potential of spatial cues in audio modality and the ability that incorporates an attention mechanism to focus on relevant
to recognize sound-making objects in a recursive manner). Our parts of the visual modality and audio modality, resulting in more
framework consists of two stages. First, we propose an audio-visual accurate sound source localization. In [9], Chen et al. introduce
spatial integration network that integrates spatial knowledge from tri-maps to incorporate background mining techniques for identify-
both audio-visual modalities to produce an integrated localization ing positive correlation region, no correlation region (background),
map. The aim of generating the integrated localization map is to and ignoring region to avoid uncertain areas in the visual scene.
contain rich spatial information about the sound-making objects. They utilize audio-visual pairs to create a tri-map highlighting posi-
Second, we introduce a recursive attention network to mimic the tive/negative regions. In [60], Xuan et al. adopt the selective search
human ability to recognize the objects in a recursive manner. Based [57] to utilize the proposal-based paradigm. Since the proposal re-
on the integrated localization map, the unnecessary regions of the gion contains information of sound-making objects, finding the
input image are eliminated and attentive input image is generated. candidate objects firstly rather than the location of the sound can be
Consequently, with the attentive input image, more precise local- superior. In [15], Fedorishin et al., assumed that most of the sound
ization of the sound-making object is possible in our recursive sources in visual scenes will be moving objects. Therefore, they
attention network. adopt the optical flow algorithm in the visual modality to achieve
In addition, within the recursive attention network, we devise an more effective sound source localization.
audio-visual pair matching loss to guide the feature representation In many studies on sound source localization task, the visual
of each single modality (audio and visual) to resemble that of the modality is usually considered to be a crucial modality (e.g., selective
attentive input image. By doing so, the features of both modalities search, optical flow, etc.). However, the audio modality is only
can embed more precise spatial knowledge. Moreover, although utilized as an auxiliary role, primarily being used for similarity
the spatial knowledge of the audio modality contains valuable in- measurements (e.g., cosine similarity) to generate the attention
formation, it may be relatively less precise than those of the visual region of the visual modality. Thus, we claim that the existing
modality. To address this issue, we introduce a spatial region align- methods tend to give weight to visual modality rather than audio
ment loss to guide the spatial representation of the audio modality modality. However, humans use both eyes and ears as important
to resemble that of the attentive input image. As a result, the feature factors to judge situations in the natural environment. Therefore,
representations of the audio modality are significantly enhanced, we propose a sound source localization framework that uses audio
leading to a more accurate final localization map generation. modality as well as visual modality for acquire more abundant
spatial knowledge of the audio-visual modalities.
3508
Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada
𝒍𝒂𝒊 𝒍𝒂𝒕𝒕
𝒗𝒊 𝒍𝒗𝒊
Visual Attention 𝐅𝒗𝒂𝒕𝒕
𝒊
𝐅𝒗𝒊
Encoder Module pull pull
Audio Attentive
𝒂
𝑮
Audio Feature 𝐅𝒂𝒊 𝒊
Audio Spectrogram 𝑰𝒂𝒊 Feature 𝐅𝒂𝒗𝒊 Audio Attentive 𝐅𝒂𝒊 𝒂
𝑭 𝒊
Follow
Localization Map 𝑴𝒂𝒊
GAP
𝒍𝒂𝒊 ⋯
𝐅𝒗𝒂𝒕𝒕 𝒂𝒕𝒕
𝑭 𝒗𝒊
𝒂𝒕𝒕
𝑮 𝒗𝒊
Recursive Attention Network (Sec 3.3) 𝒊
Visual
Encoder
Attention 𝑴𝒗𝒊 ×𝑤1 𝑴𝒂𝒊 ×𝑤2 𝑴𝒂𝒕𝒕
𝒗𝒊 ×𝑤3
Module
𝐅𝒗𝒂𝒕𝒕
Input Image 𝑰𝒗𝒊 Attentive Input Image 𝑰𝒂𝒕𝒕
𝒗𝒊
𝒊
Localization Map 𝑴𝒂𝒕𝒕
𝒗𝒊 Final
Localization
𝒍𝒂𝒊 Map 𝑴𝒇𝒊𝒏𝒂𝒍𝒊
Figure 2: Network configuration of the proposed sound localization framework. ⊕ and ⊗ indicate element-wise addition and
element-wise multiplication, respectively. Note that final localization map M 𝑓 𝑖𝑛𝑎𝑙 is generated by combining M𝑣 , M𝑎 , and M𝑎𝑡𝑡
𝑣 .
2.2 Recursive Deep Learning Framework in For designing our method, we utilize the recursively refining
Computer Vision idea to mimic the behavior of humans that repeatedly focus sound-
making object for more accurate sound source localization. By
Recursive deep learning frameworks [2, 24, 29, 35, 52, 54] have
recursively refining a model, the proposed method can improve
become increasingly popular for their ability to handle complex
the attention region of the sound-making object, by eliminating
dependencies in sequential or structured data. Many works have
the unnecessary regions. As a result, our method achieves the
adopted a recursive approach and applied it to the various computer
outstanding performance over the state-of-the-art sound source
vision tasks to improve their performance, such as object detection
localization works.
[11, 27, 36] and recognition [6, 7, 53], image super-resolution [28, 56,
58], visual tracking [17, 23], and semantic segmentation [45, 61, 62].
For example, in the object detection, a recursive model with the
3 PROPOSED METHOD
multistage framework is proposed [36]. This approach uses an EM-
like group recursive learning technique to iteratively refine object 3.1 Overall Architecture
proposals and improve the spatial configuration of object detection. The overall architecture of our sound source localization framework
Socher et al. [53] proposed a model that combines convolutional is depicted in Figure 2. Our framework consists of two stages: (1)
and recursive neural networks to detect object in the RGB-D images. audio-visual spatial integration network and (2) recursive attention
In addition, for image super-resolution, Kim et al. [28] proposed network. First, in the audio-visual spatial integration network, in-
the deeply-recursive convolutional network (DRCN) to improve put image set 𝐼 𝑣 ∈ R𝑁 ×𝑊𝑣 ×𝐻 𝑣 ×3 (𝑁 indicates the number of batch,
the feature representation without adding more convolution pa- 𝑊𝑣 and 𝐻 𝑣 denote width and height of 𝐼 𝑣 , respectively) and the
rameters. To overcome the challenges of learning a DRCN, they corresponding audio spectrogram set 𝐼𝑎 ∈ R𝑁 ×𝑊𝑎 ×𝐻𝑎 ×1 (𝑊𝑎 and
introduce recursive supervision and skip connection. 𝐻𝑎 denote width and height of 𝐼𝑎 , respectively) pass through each
In the visual object tracking, Gao et al. [17] utilized recursive modal encoder (i.e., visual encoder and audio encoder) to generate
least-squares estimation (LSE) for online learning. By integrating the spatial features F𝑣 and F𝑎 , respectively. Then, image attentive
fully-connected layers with LSE and employing an enhanced mini- localization map M𝑣 and audio attentive localization map M𝑎 are
batch stochastic gradient descent algorithm, they enhanced the per- generated based on F𝑣 and F𝑎 through the attention module. M𝑣
formance of visual object tracking. For semantic segmentation and and M𝑎 are attention maps that focus on the location of a sound-
depth estimation tasks, Zhang et al. [62] introduced the Joint Task- ing object based on the spatial features encoded in each modality.
Recursive Learning (TRL) framework. It uses a Task-Attentional M𝑣 and M𝑎 are integrated to generate the audio-visual integrated
Module (TAM) to recursively refine the results. localization map M𝑎𝑣 .
3509
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Sung Jin Um, Dongjin Kim, and Jung Uk Kim
Second, the recursive attention network takes the resized M𝑎𝑣 M𝑎𝑣 ∈ R𝑁 ×𝑤×ℎ , which can be obtained as follows:
and multiplies it with 𝐼 𝑣 to generate attentive input image 𝐼 𝑣𝑎𝑡𝑡 . 𝐼 𝑣𝑎𝑡𝑡 M𝑎 + M𝑣
is passed through the visual encoder to generate visual attention M𝑎𝑣 = . (4)
2
feature F𝑎𝑡𝑡
𝑣 . Note that the weight parameters of the visual encoder Since M𝑎𝑣 contains the spatial information of both audio and visual
in the audio-visual spatial integration network and recursive at- modalities, it provides a more precise localization map compared
tention network are shared. With F𝑎𝑡𝑡 𝑣 and 𝐼𝑎 , the localization map to using either modality alone. By combining the spatial cues from
M𝑎𝑡𝑡
𝑣 is generated. More details are in the following subsections. both modalities, the proposed method is able to effectively miti-
gate the limitations of each modality and produce a more accurate
3.2 Audio-Visual Spatial Integration Network localization result.
When humans see a visual scene with their eyes while listening
to a sounding object, they can acquire spatial cue information not 3.3 Recursive Attention Network
only through vision but also through sound [18, 50]. We mimic the Given the visual and audio modal information, humans can in-
behaviors of humans for more accurate localizing sound source tegrate attention regions across different modalities, such as vi-
objects. To this end, we propose an audio-visual spatial integration sual and auditory information, to concentrate on a specific region
network to exploit the spatial cues of both visual modality and [13, 14, 55, 59]. It is called multisensory integration. By doing so,
audio modalities. humans can concentrate their attention on specific regions of the
As shown in Figure 2, our audio-visual spatial integration net- environment that correspond to the presented sensory information.
work consists of two streams: (1) visual stream and (2) audio stream. This allows them to more effectively process and respond to stimuli
In the visual stream, the visual spatial feature F𝑣 ∈ R𝑁 ×𝑤×ℎ×𝑐 (𝑤, from both modalities [16, 34, 43].
ℎ, and 𝑐 are the width, height, and channel number) is mainly used Therefore, we build the recursive attention network to mimic
to localize sound-making object. Specifically, the audio spatial fea- the above-mentioned behaviors of humans. The recursive attention
ture F𝑎 ∈ R𝑁 ×𝑤×ℎ×𝑐 is subject to a global average pooling (GAP) network utilizes the audio-visual integrated localization map M𝑎𝑣
operation to generate 𝑙𝑎 ∈ R𝑁 ×𝑐 . Then, F𝑣 and 𝑙𝑎 are compared derived from the audio-visual spatial integration network to pro-
using a similarity calculation in the attention module to generate duce an attentive input image 𝐼 𝑣𝑎𝑡𝑡 . Specifically, M𝑎𝑣 ∈ R𝑁 ×𝑤×ℎ
S𝑣 = {𝑆 𝑣𝑖 𝑗 }𝑖=1,...,ℎ,𝑗=1,...,𝑤 ∈ R𝑁 ×𝑤×ℎ , which is measured as: is resized to be M𝑟𝑎𝑣 ∈ R𝑁 ×𝑊𝑣 ×𝐻 𝑣 . To the next, M𝑟𝑎𝑣 and 𝐼 𝑣 are
𝑆𝑖𝑚(F𝑣𝑖 𝑗 , 𝑙𝑎 ) F𝑣𝑖 𝑗 · 𝑙𝑎 conducted element-wise multiplication to focus the attention re-
𝑆 𝑣𝑖 𝑗 = Í Í , 𝑆𝑖𝑚(F𝑣𝑖 𝑗 , 𝑙𝑎 ) = . (1) gion of the image, i.e., 𝐼 𝑣𝑎𝑡𝑡 . We feed this attentive input image 𝐼 𝑣𝑎𝑡𝑡
ℎ 𝑤 𝑆𝑖𝑚(F , 𝑙 )
𝑣𝑖 𝑗 𝑎 ||F 𝑣𝑖 𝑗 || ||𝑙𝑎 || into the visual encoder to encode visual attention feature 𝐹 𝑣𝑎𝑡𝑡 . The
𝑖=1 𝑗=1
Then, S𝑣 is normalized by the softmax to generate the image atten- attention module calculates the similarity between 𝐹 𝑣𝑎𝑡𝑡 and 𝑙𝑎 to
generate S𝑎𝑡𝑡 𝑎𝑡𝑡 𝑁 ×𝑤×ℎ . Note that S𝑎𝑡𝑡 is
tive localization map M𝑣 ∈ R𝑁 ×𝑤×ℎ . 𝑣 = {𝑆 𝑣𝑖 𝑗 }𝑖=1,...,ℎ,𝑗=1,...,𝑤 ∈ R 𝑣
In the audio stream, the audio spatial feature F𝑎 ∈ R𝑁 ×𝑤×ℎ×𝑐 is calculated similarly to Eq. (1) and Eq. (3). Also, S𝑎𝑡𝑡 𝑣 is normalized
mainly used to localize sound-making objects. However, while the by the softmax to make the localization map M𝑎𝑡𝑡 𝑣 .
audio modality contains the spatial cues for localizing objects, it Finally, we combine the M𝑣 , M𝑎 , and M𝑎𝑡𝑡 𝑣 to generate the final
generally lacks the levels of detail compared to the visual modality. localization map M 𝑓 𝑖𝑛𝑎𝑙 , which can be represented as:
For example, if we hear an object sound with our eyes closed, we M 𝑓 𝑖𝑛𝑎𝑙 = 𝑤 1 M𝑣 + 𝑤 2 M𝑎 + 𝑤 3 M𝑎𝑡𝑡 (5)
𝑣 ,
can roughly estimate its location, but it is typically less precise than
if we were to open our eyes and visually locate the object. Thus, where 𝑤 1 , 𝑤 2 , and 𝑤 3 are the hyper-parameters that indicate the
we transfer the spatial knowledge of F𝑣 to F𝑎 while maintaining importance of each modality in contributing to the M 𝑓 𝑖𝑛𝑎𝑙 . M𝑎
the area that F𝑎 focuses on by generating F𝑎𝑣 . F𝑎𝑣 is obtained as: and M𝑣 contain the spatial cues of each modality (i.e., audio and
visual modalities), and M𝑎𝑡𝑡 𝑣 contains the spatial cues of the more
F𝑎𝑣 = F𝑣 ◦ F̄𝑎 , (2) attentive region from the audio-visual modalities. Therefore, by
combining M𝑎 and M𝑣 , the spatial cues from both modalities can be
where F̄𝑎 denotes the normalized version of F𝑎 (min-max normal-
obtained. Additionally, by combining M𝑎𝑡𝑡𝑣 , more interested regions
ization is conducted with the value between 0 and 1), and ◦ indicates
can be obtained. The recursive combination of the localization maps
the element-wise multiplication.
can utilize abundant spatial information, leading to more accurate
Next, similar to Eq. (1), the S𝑎𝑣 = {𝑆𝑎𝑣𝑖 𝑗 }𝑖=1,...,ℎ,𝑗=1,...,𝑤 ∈ R𝑁 ×𝑤×ℎ
sound source localization.
is obtained as:
𝑆𝑖𝑚(F𝑎𝑣𝑖 𝑗 , 𝑙𝑎 ) F𝑎𝑣𝑖 𝑗 · 𝑙𝑎 3.4 Audio-Visual Pair Matching Loss
𝑆𝑎𝑣𝑖 𝑗 = Í Í , 𝑆𝑖𝑚(F𝑎𝑣𝑖 𝑗 , 𝑙𝑎 ) = .
ℎ 𝑤 𝑆𝑖𝑚(F
𝑎𝑣𝑖 𝑗 , 𝑙𝑎 ) ||F𝑎𝑣𝑖 𝑗 || ||𝑙𝑎 || Humans can make more accurate predictions by removing unnec-
𝑖=1 𝑗=1
(3) essary areas by focusing attention through their eyes and ears.
S𝑎𝑣 is also normalized by the softmax to make the audio attentive Similarly, in our method, the attentive input image 𝐼 𝑣𝑎𝑡𝑡 concen-
localization map M𝑎 . The two localization maps, M𝑣 and M𝑎 , gen- trates the area that is generated by the audio-visual modality in
erated by the proposed audio-visual spatial integration network, the audio-visual spatial integration network. This enables us to
provide information about the spatial regions in each modality that localize the sounding objects more accurately. This is similar to the
are being focused on to localize the sounding objects. Therefore, fact that the two-stage detectors [19, 31, 32, 38, 48], which first ex-
we integrate the knowledge of the audio-visual modalities to make tract the region of interest (ROI) for more accurate object detection,
3510
Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada
generally outperform the one-stage object detectors [39, 40, 47]. Table 1: Experimental results on Flickr test set when the
Therefore, compared to M𝑣 , M𝑎 and M𝑎𝑡𝑡 𝑎𝑡𝑡 training sets are Flickr10k and Flickr144k, respectively.
𝑣 , M𝑣 usually contains
more meaningful regions than M𝑣 and M𝑎 . As a result, we propose
Method Training Set cIoU0.5 ↑ AUC↑
an audio-visual pair matching loss to guide the feature represen-
tations of the visual modality F𝑣 and the audio modality F𝑎 to be Attention [49] (CVPR’18) 0.436 0.449
similar to that of the visual attention feature F𝑎𝑡𝑡 DMC [22] (CVPR’19) 0.414 0.450
𝑣 .
To this end, we first conduct global average pooling (GAP) of CoarseToFine [46] (ECCV’20) 0.522 0.496
F𝑎𝑡𝑡 𝑎𝑡𝑡 AVObject [1] (ECCV’20) 0.546 0.504
𝑣 , F𝑣 , and F𝑎 and normalize them to generate 𝑙 𝑣 , 𝑙 𝑣 , and 𝑙𝑎 ,
respectively. Next, we adopt the triplet loss [21] for the audio-visual LVS [9] (CVPR’21) Flickr10k 0.582 0.525
pair matching loss L𝑎𝑣𝑝𝑚 , which can be represented as: Zhou et al. [63] (WACV’23) 0.631 0.551
Shi et al. [51] (WACV’22) 0.734 0.576
𝑇 (𝑙 𝑣𝑎𝑡𝑡
𝑖
, 𝑙𝑎𝑖 , 𝑙𝑎 𝑗 ) = 𝐷 (𝑙 𝑣𝑎𝑡𝑡
𝑖
, 𝑙𝑎𝑖 ) + 𝑚𝑎𝑥 (𝛿 − 𝐷 (𝑙 𝑣𝑎𝑡𝑡
𝑖
, 𝑙𝑎 𝑗 ), 0), SSPL [60] (CVPR’22) 0.743 0.587
𝑇 (𝑙 𝑣𝑎𝑡𝑡
𝑖
, 𝑙 𝑣𝑖 , 𝑙 𝑣 𝑗 ) = 𝐷 (𝑙 𝑣𝑎𝑡𝑡
𝑖
, 𝑙 𝑣𝑖 ) + 𝑚𝑎𝑥 (𝛿 − 𝐷 (𝑙 𝑣𝑎𝑡𝑡
𝑖
, 𝑙 𝑣 𝑗 ), 0), HTF [15] (WACV’23) 0.860 0.634
1
𝑁
∑︁ 𝑁
∑︁ Proposed Method 0.876 0.641
L𝑎𝑣𝑝𝑚 = 𝑇 (𝑙 𝑣𝑎𝑡𝑡 , 𝑙𝑎𝑖 , 𝑙𝑎 𝑗 ) + 𝑇 (𝑙 𝑣𝑎𝑡𝑡 , 𝑙 𝑣𝑖 , 𝑙 𝑣 𝑗 ), Attention [49] (CVPR’18) 0.660 0.558
𝑁 (𝑁 − 1) 𝑖=1 𝑖 𝑖
𝑗=1( 𝑗≠𝑖 ) DMC [22] (CVPR’19) 0.671 0.568
(6)
LVS [9] (CVPR’21) Flickr144k 0.699 0.573
where 𝐷 (𝛼, 𝛽) = ||(𝛼 − 𝛽)/𝜏 || 22 denotes the L2 norm to calculate
SSPL [60] (CVPR’22) 0.759 0.610
the distance between two features with temperature parameter 𝜏,
HTF [15] (WACV’23) 0.865 0.639
𝑙 𝑣𝑎𝑡𝑡
𝑖
, 𝑙𝑎𝑖 , and 𝑙𝑎 𝑗 are the features of anchor, positive, and negative
samples, respectively, and 𝛿 is the margin. Proposed Method 0.881 0.652
The aim of 𝑇 (𝑙 𝑣𝑎𝑡𝑡𝑖
, 𝑙𝑎𝑖 , 𝑙𝑎 𝑗 ) and 𝑇 (𝑙 𝑣𝑎𝑡𝑡
𝑖
, 𝑙 𝑣𝑖 , 𝑙 𝑣 𝑗 ) is to make the an-
chor (𝑙 𝑣𝑎𝑡𝑡 ) and the positive pair (𝑙 𝑎𝑖 𝑣𝑖 similar while pushing the
, 𝑙 ) the audio-visual modality and combining all attention maps in a
𝑖
negative pair (𝑙𝑎 𝑗 , 𝑙 𝑣 𝑗 ) apart. By doing so, L𝑎𝑣𝑝𝑚 can guide the recursive manner.
feature representation of F𝑣 and F𝑎 to be similar that of F𝑎𝑡𝑡 𝑣 . As a
result, the feature representation of F𝑣 and F𝑎 improve the perfor- 4 EXPERIMENTS
mance of sound source localization (please see Section 4.5). 4.1 Datasets and Evaluation Metrics
Flickr-SoundNet. Flickr-SoundNet [4] consists more than 2 mil-
3.5 Spatial Region Alignment Loss
lion videos from Flickr. In the training phase, to enable direct com-
Although we can infer spatial information using sound, it is rela- parison with prior research, we train our models with two random
tively less accurate than visual information. Therefore, we introduce subsets of 10k and 144k image-audio pairs. In the inference phase,
a spatial region alignment loss in order to guide the spatial regions we use Flickr-SoundNet test set. It contains 250 annotated pairs with
that audio feature F𝑎 focus on to be similar to that of the F𝑎𝑡𝑡
𝑣 . To labeled bounding box, manually annotated by the annotators [9, 49].
this end, we add all 𝑐 channels of F𝑎 and F𝑎𝑡𝑡
𝑣 to normalize them to
𝑎𝑡𝑡
generate F̂𝑎 ∈ R𝑁 ×𝑤×ℎ and F̂𝑣 ∈ R𝑁 ×𝑤×ℎ . After that, they are VGG-Sound Source. VGG-Sound dataset [10] consists of 200k
flattened to conduct softmax function to generate Ĝ𝑎 ∈ R𝑁 ×𝑤ℎ video clips from 300 different sound categories. Following [15], we
𝑎𝑡𝑡 𝑎𝑡𝑡
and Ĝ𝑣 ∈ R𝑁 ×𝑤ℎ , respectively. Based on the Ĝ𝑣 and Ĝ𝑎 , the use a training dataset with 10k and 144k image-audio pairs. For
spatial region alignment loss L𝑠𝑟𝑎 is represented as follows: evaluation, we use VGG-Sound Source (VGG-SS) dataset [9] with
𝑁 5,000 annotated image-audio pairs from 220 classes. Compared with
1 ∑︁
𝑎𝑡𝑡
Flickr-SoundNet, which contains 50 audio categories, the VGG-SS
L𝑠𝑟𝑎 = 𝐷𝐾𝐿 Ĝ𝑣𝑖 ∥ Ĝ𝑎𝑖 ,
𝑁 𝑖=1 (7) dataset set offers a larger number of sound sources. Therefore, it
| {z }
audio to attentive visual contains more challenging scenario for sound source localization.
where 𝐷𝐾𝐿 (·) indicates the Kullback-Leibler (KL) divergence. L𝑠𝑟𝑎
makes the spatial representation of F𝑎 to be similar to that of F𝑎𝑡𝑡 Evaluation Metrics. To compare our method with the existing
𝑣
in the training phase. By doing so, when generating F𝑎 , our method methods, we adopt consensus Intersection over Union (cIoU) [49]
can effectively estimate the spatial regions by hearing sounds. and Area Under Curve (AUC) as evaluation metrics, which are the
widely adopted metrics for sound source localization task [9, 15, 49].
3.6 Total Loss Function For calculating cIoU, the IoU threshold is fixed to be 0.5 (i.e., cIoU0.5 ),
following [9, 15, 49]. Note that, in our experiments, we additionally
To train our method, the total loss function is composed as follows: introduce a mcIoU metric to measure the performance by varying
L𝑇 𝑜𝑡𝑎𝑙 = L𝑆𝑆𝐿 + 𝜆1 L𝑎𝑣𝑝𝑚 + 𝜆2 L𝑠𝑟𝑎 , (8) the IoU threshold to 0.5:0.05:0.95. More details are in Section 4.6.
where L𝑆𝑆𝐿 is the unsupervised loss function of the sound source
localization that tries to impose the audio-visual feature pairs are 4.2 Implementation Details
close to each other, following [9], 𝜆1 and 𝜆2 denote the balanc- For both datasets, we resize the input image for the visual modality
ing parameter. Through L𝑇 𝑜𝑡𝑎𝑙 , our method can perform effective to be 𝑊𝑣 = 224, 𝐻 𝑣 = 224. It is extracted from the middle frame of
sound source localization by leveraging the spatial knowledge of the 3-seconds video clips. For audio modality input, we resample
3511
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Sung Jin Um, Dongjin Kim, and Jung Uk Kim
Table 2: Experimental results on VGG-SS test set when the Table 3: Effect of the proposed audio-visual pair matching
training sets are VGG-Sound10k and VGG-Sound144k, re- loss L𝑎𝑣𝑝𝑚 and spatial region alignment loss L𝑠𝑟𝑎 on Flickr
spectively. test set, where models are trained on the Flickr144k.
Method Training Set cIoU0.5 ↑ AUC↑ Method L𝑎𝑣𝑝𝑚 L𝑠𝑟𝑎 cIoU0.5 ↑ AUC↑
LVS [9] (CVPR’21) 0.303 0.364 Baseline - - 0.865 0.642
SSPL [60] (CVPR’22) 0.314 0.369 ✓ - 0.876 0.643
VGG-Sound10k Proposed
Zhou et al. [63] (WACV’23) 0.350 0.376 - ✓ 0.871 0.648
HTF [15] (WACV’23) 0.393 0.398 Method
✓ ✓ 0.881 0.652
Proposed Method 0.403 0.403
LVS [9] (CVPR’21) 0.344 0.382 Table 4: Experimental results on Flickr test set according
SSPL [60] (CVPR’22) VGG-Sound144k 0.339 0.380 to the hyper-parameters 𝑤 1 , 𝑤 2 , and 𝑤 3 for M 𝑓 𝑖𝑛𝑎𝑙 in Eq. (5),
HTF [15] (WACV’23) 0.394 0.400 where models are trained on the Flickr144k.
Proposed Method 0.406 0.405 𝑤 1 (M𝑣 ) 𝑤 2 (M𝑎 ) 𝑤 3 (M𝑎𝑡𝑡
𝑣 ) cIoU0.5 ↑ AUC↑
1 1 4 0.866 0.645
1 1 2 0.875 0.649
the 3-seconds raw audio signal to 16kHz and transform it into a log-
1 1 1 0.881 0.652
scale spectrogram, yielding a final shape𝑊𝑎 = 257 and 𝐻𝑎 = 276. At
1 2 1 0.871 0.639
this time, to enable a direct comparison with visual modal features,
2 1 1 0.876 0.650
we resize F𝑎 to be 7 × 7 × 512 (𝑤 = 7, ℎ = 7, and 𝑐 = 512) using
bilinear interpolation. 2 2 1 0.871 0.643
Following [9], we employ ResNet-18 [20] for both the visual and
audio feature backbones to construct our baseline. Since the number
of sound spectrogram channel is 1. we modify the input channel 4.4 Ablation Study
of ResNet-18 [20] conv1 from 3 to 1. We use the ImageNet [12] We conduct various ablation studies to investigate (1) effect of the
pretrained for the visual encoder. When our baseline is HTF [15], we proposed losses (i.e., L𝑎𝑣𝑝𝑚 and L𝑠𝑟𝑎 ), and (2) variation of the
additionally consider the optical flow [15] for the attention module hyper-parameter 𝑤 1 , 𝑤 2 , 𝑤 3 for 𝑀 𝑓 𝑖𝑛𝑎𝑙 . All the ablation studies are
(more details are in Section 4.6). Our sound source localization conducted using Flickr144k training set and Flickr test set.
framework is trained using the Adam optimizer [33] with a learning
rate of 10 −4 and a batch size of 128. Following [15], we train our Effect of the Proposed Losses. We measure the performance
model for 100 epochs for Flickr and VGG-Sound datasets. We use by changing two types of the proposed losses L𝑎𝑣𝑝𝑚 and L𝑠𝑟𝑎 .
4 synchronized RTX 3090 GPUs. The weights for 𝑀 𝑓 𝑖𝑛𝑎𝑙 in Eq. (5) The results are shown in Table 3. When each loss is considered,
are set as 𝑤 1 = 𝑤 2 = 𝑤 3 = 1. Also, we use 𝜆1 = 1, 𝜆2 = 10, 𝜏 = 0.03, our method shows the improved performance agains the baseline
and 𝛿 = 25 for our proposed loss functions (L𝑎𝑣𝑝𝑚 and L𝑠𝑟𝑎 ). in which those losses are not considered. When all the proposed
losses are taken into account, we show the highest performance.
4.3 Performance Comparison By incorporating the proposed losses in the training phase, our
We conduct the experiments to compare the effectiveness of our method is able to learn more robust and discriminative features
proposed method with the state-of-the-art sound source localization that are better suited for the sound source localization task.
works [1, 9, 15, 22, 46, 49, 51, 60, 63]. Table 1 shows the performance
of our method with the existing methods on Flickr-SoundNet. When Variation of w1 , w2 , w3 We conduct additional ablation study to
the training set is Flickr10k, our method achieves 0.876 and 0.641 investigate the effect of our method to the parameters 𝑤 1 , 𝑤 2 , and
for cIoU0.5 and AUC, respectively. Specifically, when compared to 𝑤 3 as described in Section 3.3. The results of Table 4 show that
the HTF [15] which shows the highest performance among the the optimal results are obtained when 𝑤 1 , 𝑤 2 , and 𝑤 3 are set to 1.
existing methods, our method is 1.6% higher for cIoU and 0.7% However, it’s important to note that our method still outperforms
higher for AUC metrices. Similar tendency is observed when the existing methods even with different values for these parameters.
training set is Flickr144k training set. The experimental results on These results suggest that the model is robust to parameter changes,
Table 1 demonstrate that our approach that considering the spatial but there may be an optimal combination that maximizes its effec-
knowledge of the audio-visual modalities and recursively refining tiveness. In our future work, we are planning to build a framework
the localization map leads to better localization performance. that considers weight of the localization maps.
The experimental results on the VGG-Sound dataset are shown
in Table 2. For the VGG-Sound Source test set, our method achieved 4.5 Visualization Results
improvements of 1.0% cIoU and 0.5% AUC in the VGG-Sound10k We compare our method with the current state-of-the-art approach,
dataset, and 1.2% cIoU and 0.5% AUC in the VGG-Sound144k dataset HTF [15], by visualizing their sound source localization results on
over the HTF [15]. The results validate that our method outperforms the Flickr-SoundNet and VGG-SS test set. The results are shown in
existing methods and achieves a state-of-the-art performance over Figure 3. Through the visualization results, our method can accu-
the existing sound source localization works. rately localize the sound-making objects (GT annotation indicates
3512
Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada
Original Image GT Annotation HTF Ours Original Image GT Annotation HTF Ours
(a)
(b)
the region of the sound-making objects). Since our method consid- 4.6 Discussions
ers the spatial information of both audio and visual modalities and Cross Dataset Evaluation. To show the effectiveness of our method
recursively updates the localization map, more precise attention on the cross dataset environment, we train our model on the VGG-
maps are obtained. Sound training set and evaluate it on the Flickr-SoundNet test set.
Furthermore, Figure 4 shows visualization results of the various This cross-dataset evaluation enables us to assess the ability of
localization maps M𝑣 , M𝑎 , M𝑎𝑡𝑡
𝑣 , and M 𝑓 𝑖𝑛𝑎𝑙 of our method. The model to generalize and to check to new and diverse data sources.
visualization results show that considering audio-visual localiza- The results of Table 5 show the results when the training sets
tion map and recursively updating them contributes to 𝑀 𝑓 𝑖𝑛𝑎𝑙 for are VGG-Sound10k and VGG-Sound144k, respectively. The results
concentrating on a more accurate location. By doing so, our method show that the performances of our method are significantly higher
shows the improving performance. than the existing methods. As a result, our model demonstrates
3513
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Sung Jin Um, Dongjin Kim, and Jung Uk Kim
Table 5: Experimental results on the cross-dataset evaluation. Table 8: The comparisons of training time, inference time,
Note that we trained the model on the VGG-Sound10k and and the number of parameters.
VGG-Sound144k and evaluated on the Flickr test set. ‘∗’ de-
notes our faithful reproduction of the method. Training (s) Inference (s)
Method #params
(per iter) (per image)
Method Training Set cIoU0.5 ↑AUC↑
HTF [15] (WACV’23) 0.385 0.039 33.85M
LVS [9] (CVPR’21) 0.618 0.536
Proposed Method 0.398 0.042 34.50M
SSPL [60] (CVPR’22) 0.763 0.591
VGG-Sound10k
Zhou et al. [63] (WACV’23) 0.775 0.596
∗
HTF [15] (WACV’23) 0.842 0.628
Proposed Method 0.875 0.640 consistently improved. The results demonstrate that the effective-
ness of our method considers spatial cues of audio modality and
LVS [9] (CVPR’21) 0.719 0.582
performs sound source localization in a recursive manner.
SSPL [60] (CVPR’22) VGG-Sound144k 0.767 0.605
HTF [15] (WACV’23) 0.848 0.640 Computational Costs. In this section, we compare training time,
Proposed Method 0.881 0.651 inference time, and the number of parameters. It is shown in Table
8. We compare our method with HTF [15] which shows the highest
Table 6: Experimental results on Flickr test set with respect performance among the existing methods. Since our method adopts
to various sound source localization frameworks. the recursive method, the training time, inference time, and the
number of parameters of our method are slightly increased (3.38%,
Method Training Set cIoU0.5 ↑ AUC↑ 7.69% and 1.92% for training, inference, and parameters, respec-
LVS [9] (CVPR’21) 0.699 0.573 tively). Nevertheless, we claim that the increased times of training
Flickr144k
Proposed Method (LVS) 0.718 0.577 and inference time and the number of parameters are marginal
HTF [15] (WACV’23) 0.865 0.639 compared to the HTF [15].
Flickr144k
Proposed Method (HTF) 0.881 0.652
5 CONCLUSION
Table 7: Experimental results on Flickr test set using mcIoU In this paper, we propose a novel sound source localization frame-
metric. work that considers the inherent spatial information of the audio
modality as well as the visual modality for exploiting more abun-
Method Training Set cIoU0.5 ↑ mcIoU↑
dant spatial knowledge. To this end, our framework consists of two
LVS [9] (CVPR’21) 0.699 0.231 stages: (1) audio-visual spatial integration network and (2) recursive
Flickr144k
HTF [15] (WACV’23) 0.865 0.363 attention network. The audio-visual spatial integration network is
Proposed Method 0.881 0.381 designed to incorporate the spatial information of the audio-visual
modalities. By focusing on the attention region generated by the
audio-visual spatial integration network, the recursive attention
the potential to demonstrate sufficient generalization capabilities network aims to perform more precise sound source localization.
essential for real-world applications involving diverse data sources. At this time, we devise audio-visual pair matching loss and spa-
tial region alignment loss to effectively guide the features from
Generalization Ability of Our Method. In this subsection, we the audio-visual modalities to resemble the features of the atten-
conduct experiments to see the generalization ability of our method tive information. We believe that our approach, integrating spatial
by varying the baseline. To this end, we adopt the two baselines: knowledge of audio-visual modalities and recursively refining the
LVS [9] and HTF [15]. The results are shown in Table 6. All the results leads to more improved accuracy and it can be utilized in
methods are trained with Flickr144k and evaluated on Flickr test various practical applications.
set. As shown in the table, when our baseline is LVS [9], we achieve
1.19% cIoU and 0.4% AUC improvement compared to the original
LVS. The results on HTF [15] also show a similar tendency. The ACKNOWLEDGMENTS
results indicate that our method has broader applicability and can This work was supported in part by the National Research Founda-
be integrated with various sound source localization frameworks. tion of Korea (NRF) grant funded by the Korea government (MSIT)
(No. RS-2023-00252391) and by Institute of Information & communi-
Evaluation on the Proposed mcIoU Metric. Note that consensus cations Technology Planning and Evaluation (IITP) grant funded by
intersection over union (cIoU) [49] metric has been widely used for the Korea government (MSIT) (No. 2022-0-00124: Development of
comparing sound source localization methods. In this subsection, Artificial Intelligence Technology for Self-Improving Competency-
we additionally introduce a new metric called mcIoU (mean cIoU) Aware Learning Capabilities, No. RS-2022-00155911: Artificial Intel-
to investigate the performance while varying the IoU threshold. ligence Convergence Innovation Human Resources Development
For calculating mcIoU metric, we take the average cIoU across IoU (Kyung Hee University)) and by the MSIT (Ministry of Science
threshold 0.5:0.05:0.95. The results are shown in Table 7. Compared and ICT), Korea, under the National Program for Excellence in SW
to the existing methods [9, 15], the performances of our method (2023-0-00042) supervised by the IITP in 2023.
3514
Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada
REFERENCES [26] Bill Jones and Boris Kabanoff. 1975. Eye movements in auditory space perception.
[1] Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Perception & Psychophysics 17 (1975), 241–245.
2020. Self-supervised learning of audio-visual objects from video. In European [27] Yun Yi Ke and Takahiro Tsubono. 2022. Recursive contour-saliency blending
Conference on Computer Vision (ECCV). Springer, 208–224. network for accurate salient object detection. In IEEE/CVF Winter Conference on
[2] Ahmad Al-Sallab, Ramy Baly, Hazem Hajj, Khaled Bashir Shaban, Wassim El-Hajj, Applications of Computer Vision (WACV). 2940–2950.
and Gilbert Badaro. 2017. Aroma: A recursive deep learning model for opinion [28] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. 2016. Deeply-recursive convolu-
mining in arabic as a low resource language. ACM Transactions on Asian and tional network for image super-resolution. In IEEE/CVF Conference on Computer
Low-Resource Language Information Processing 16, 4 (2017), 1–20. Vision and Pattern Recognition (CVPR). 1637–1645.
[3] Sebastià V Amengual Garí, Winfried Lachenmayr, and Eckard Mommertz. 2017. [29] Jung Uk Kim, Hak Gu Kim, and Yong Man Ro. 2017. Iterative deep convolutional
Spatial analysis and auralization of room acoustics using a tetrahedral micro- encoder-decoder network for medical image segmentation. In International Con-
phone. The Journal of the Acoustical Society of America 141, 4 (2017), EL369– ference of the IEEE Engineering in Medicine and Biology Society (EMBC). 685–688.
EL374. [30] Jung Uk Kim and Seong Tae Kim. 2023. Towards Robust Audio-Based Vehicle De-
[4] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. Soundnet: Learning tection Via Importance-Aware Audio-Visual Learning. In International Conference
sound representations from unlabeled video. Advances in Neural Information on Acoustics, Speech and Signal Processing (ICASSP). 1–5.
Processing Systems (NeurIPS). [31] Jung Uk Kim, Sungjune Park, and Yong Man Ro. 2021. Robust small-scale pedes-
[5] Robert S Bolia, William R D’Angelo, and Richard L McKinley. 1999. Aurally aided trian detection with cued recall via memory learning. In IEEE/CVF International
visual search in three-dimensional space. Human factors 41, 4 (1999), 664–669. Conference on Computer Vision (ICCV). 3050–3059.
[6] Hieu Minh Bui, Margaret Lech, Eva Cheng, Katrina Neville, and Ian S Burnett. [32] Jung Uk Kim, Sungjune Park, and Yong Man Ro. 2021. Uncertainty-guided cross-
2016. Object recognition using deep convolutional features transformed by a modal learning for robust multispectral pedestrian detection. IEEE Transactions
recursive network structure. IEEE Access 4 (2016), 10059–10066. on Circuits and Systems for Video Technology 32, 3 (2021), 1510–1523.
[7] Hieu Minh Bui, Margaret Lech, Eva Cheng, Katrina Neville, and Ian S Burnett. [33] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic opti-
2016. Using grayscale images for object recognition with convolutional-recursive mization. arXiv preprint arXiv:1412.6980 (2014).
neural network. In International Conference on Communications and Electronics [34] Sangmin Lee, Sungjune Park, and Yong Man Ro. 2022. Audio-Visual Mismatch-
(ICCE). 321–325. Aware Video Retrieval via Association and Adjustment. In European Conference
[8] Shlomo E Chazan, Hodaya Hammer, Gershon Hazan, Jacob Goldberger, and on Computer Vision (ECCV). Springer, 497–514.
Sharon Gannot. 2019. Multi-microphone speaker separation based on deep DOA [35] Changliang Li, Bo Xu, Gaowei Wu, Saike He, Guanhua Tian, and Hongwei
estimation. In European Signal Processing Conference (EUSIPCO). 1–5. Hao. 2014. Recursive deep learning for sentiment analysis over social data. In
[9] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and
and Andrew Zisserman. 2021. Localizing visual sounds the hard way. In IEEE/CVF Intelligent Agent Technologies (IAT). 180–185.
Conference on Computer Vision and Pattern Recognition (CVPR). 16867–16876. [36] Jianan Li, Xiaodan Liang, Jianshu Li, Yunchao Wei, Tingfa Xu, Jiashi Feng, and
[10] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. 2020. Vg- Shuicheng Yan. 2017. Multistage object detection with group recursive learning.
gsound: A large-scale audio-visual dataset. In IEEE International Conference on IEEE Transactions on Multimedia 20, 7 (2017), 1645–1655.
Acoustics, Speech and Signal Processing (ICASSP). 721–725. [37] Xiaofei Li, Laurent Girin, Fabien Badeig, and Radu Horaud. 2016. Reverberant
[11] Zhe Chen, Jing Zhang, and Dacheng Tao. 2021. Recursive context routing for sound localization with a robot head based on direct-path relative transfer func-
object detection. International Journal of Computer Vision 129, 1 (2021), 142–160. tion. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: 2819–2826.
A large-scale hierarchical image database. In IEEE/CVF Conference on Computer [38] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and
Vision and Pattern Recognition (CVPR). 248–255. Serge Belongie. 2017. Feature pyramid networks for object detection. In IEEE/CVF
[13] Jon Driver and Charles Spence. 1998. Cross–modal links in spatial attention. Conference on Computer Vision and Pattern Recognition (CVPR). 2117–2125.
Philosophical Transactions of the Royal Society of London. Series B: Biological [39] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017.
Sciences 353, 1373 (1998), 1319–1331. Focal loss for dense object detection. In IEEE/CVF International Conference on
[14] Martin Eimer and Erich Schröger. 1998. ERP effects of intermodal attention and Computer Vision (ICCV). 2980–2988.
cross-modal links in spatial attention. Psychophysiology 35, 3 (1998), 313–327. [40] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed,
[15] Dennis Fedorishin, Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, and Cheng-Yang Fu, and Alexander C Berg. 2016. Ssd: Single shot multibox detector.
Venu Govindaraju. 2023. Hear The Flow: Optical Flow-Based Self-Supervised In European Conference on Computer Vision (ECCV). Springer, 21–37.
Visual Sound Source Localization. In IEEE/CVF Winter Conference on Applications [41] Ren C Luo and Michael G Kay. 1989. Multisensor integration and fusion in
of Computer Vision (WACV). 2278–2287. intelligent systems. IEEE Transactions on Systems, Man, and Cybernetics 19, 5
[16] John J Foxe, Istvan A Morocz, Micah M Murray, Beth A Higgins, Daniel C Javitt, (1989), 901–931.
and Charles E Schroeder. 2000. Multisensory auditory–somatosensory interac- [42] Piotr Majdak, Matthew J Goupell, and Bernhard Laback. 2010. 3-D localization
tions in early cortical processing revealed by high-density electrical mapping. of virtual sound sources: Effects of visual environment, pointing method, and
Cognitive Brain Research 10, 1-2 (2000), 77–83. training. Attention, Perception, & Psychophysics 72, 2 (2010), 454–469.
[17] Jin Gao, Weiming Hu, and Yan Lu. 2020. Recursive least-squares estimator-aided [43] M-Marchsel Mesulam. 1981. A cortical network for directed attention and unilat-
online learning for visual tracking. In IEEE/CVF Conference on Computer Vision eral neglect. Annals of Neurology: Official Journal of the American Neurological
and Pattern Recognition (CVPR). 7386–7395. Association and the Child Neurology Society 10, 4 (1981), 309–325.
[18] William W Gaver. 1993. What in the world do we hear?: An ecological approach [44] David R Perrott, John Cisneros, Richard L Mckinley, and William R D’Angelo.
to auditory event perception. Ecological Psychology 5, 1 (1993), 1–29. 1996. Aurally aided visual search under virtual and free-field listening conditions.
[19] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. 2019. Nas-fpn: Learning scalable Human Factors 38, 4 (1996), 702–715.
feature pyramid architecture for object detection. In IEEE/CVF Conference on [45] Pedro Pinheiro and Ronan Collobert. 2014. Recurrent convolutional neural
Computer Vision and Pattern Recognition (CVPR). 7036–7045. networks for scene labeling. In International Conference on Machine Learning
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual (ICML). 82–90.
learning for image recognition. In IEEE/CVF Conference on Computer Vision and [46] Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, and Weiyao Lin. 2020.
Pattern Recognition (CVPR). 770–778. Multiple sound sources localization from coarse to fine. In European Conference
[21] Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. on Computer Vision (ECCV). Springer, 292–308.
In International Workshop on Similarity-Based Pattern Recognition (SIMBAD). [47] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You
Springer, 84–92. only look once: Unified, real-time object detection. In IEEE/CVF Conference on
[22] Di Hu, Feiping Nie, and Xuelong Li. 2019. Deep multimodal clustering for Computer Vision and Pattern Recognition (CVPR). 779–788.
unsupervised audiovisual learning. In IEEE/CVF Conference on Computer Vision [48] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn:
and Pattern Recognition (CVPR). 9248–9257. Towards real-time object detection with region proposal networks. Advances in
[23] Zhiyong Huang, Yuanlong Yu, and Miaoxing Xu. 2019. Bidirectional tracking Neural Information Processing Systems (NeurIPS).
scheme for visual object tracking based on recursive orthogonal least squares. [49] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon.
IEEE Access 7 (2019), 159199–159213. 2018. Learning to localize sound source in visual scenes. In IEEE/CVF Conference
[24] Ozan Irsoy and Claire Cardie. 2014. Deep recursive neural networks for com- on Computer Vision and Pattern Recognition (CVPR). 4358–4366.
positionality in language. Advances in Neural Information Processing Systems [50] BR Shelton and CL Searle. 1980. The influence of vision on the absolute identifi-
(NeurIPS). cation of sound-source position. Perception & Psychophysics 28 (1980), 589–596.
[25] Laurent Itti and Christof Koch. 2001. Computational modelling of visual attention. [51] Jiayin Shi and Chao Ma. 2022. Unsupervised Sounding Object Localization
Nature Reviews Neuroscience 2, 3 (2001), 194–203. with Bottom-Up and Top-Down Attention. In IEEE/CVF Winter Conference on
Applications of Computer Vision (WACV). 1737–1746.
3515
MM ’23, October 29-November 3, 2023, Ottawa, ON, Canada Sung Jin Um, Dongjin Kim, and Jung Uk Kim
[52] Richard Socher. 2014. Recursive deep learning for natural language processing and on Computational Imaging 6 (2020), 1233–1244.
computer vision. Stanford University. [59] Eric Zhongcong Xu, Zeyang Song, Satoshi Tsutsui, Chao Feng, Mang Ye, and
[53] Richard Socher, Brody Huval, Bharath Bath, Christopher D Manning, and Andrew Mike Zheng Shou. 2022. Ava-avd: Audio-visual speaker diarization in the wild.
Ng. 2012. Convolutional-recursive deep learning for 3d object classification. In ACM International Conference on Multimedia (ACM MM). 3838–3847.
Advances in Neural Information Processing Systems (NeurIPS). [60] Hanyu Xuan, Zhiliang Wu, Jian Yang, Yan Yan, and Xavier Alameda-Pineda.
[54] Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing nat- 2022. A proposal-based paradigm for self-supervised sound source localization
ural scenes and natural language with recursive neural networks. In International in videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
Conference on Machine Learning (ICML). 129–136. (CVPR). 1029–1038.
[55] Barry E Stein and Terrence R Stanford. 2008. Multisensory integration: current [61] Yue Zhang, Xianrui Li, Mingquan Lin, Bernard Chiu, and Mingbo Zhao. 2020.
issues from the perspective of the single neuron. Nature Reviews Neuroscience 9, Deep-recursive residual network for image semantic segmentation. Neural
4 (2008), 255–266. Computing and Applications 32 (2020), 12935–12947.
[56] Ying Tai, Jian Yang, and Xiaoming Liu. 2017. Image super-resolution via deep [62] Zhenyu Zhang, Zhen Cui, Chunyan Xu, Zequn Jie, Xiang Li, and Jian Yang. 2018.
recursive residual network. In IEEE/CVF Conference on Computer Vision and Joint task-recursive learning for semantic segmentation and depth estimation. In
Pattern Recognition (CVPR). 3147–3155. Proceedings of the European Conference on Computer Vision (ECCV). 235–251.
[57] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeul- [63] Xinchi Zhou, Dongzhan Zhou, Di Hu, Hang Zhou, and Wanli Ouyang. 2023.
ders. 2013. Selective search for object recognition. International Journal of Exploiting Visual Context Semantics for Sound Source Localization. In IEEE/CVF
Computer Vision 104 (2013), 154–171. Winter Conference on Applications of Computer Vision (WACV). 5199–5208.
[58] Wei Wei, Jiangtao Nie, Yong Li, Lei Zhang, and Yanning Zhang. 2020. Deep
recursive network for hyperspectral image super-resolution. IEEE Transactions
3516