Computer Science > Computer Vision and Pattern Recognition
[Submitted on 25 May 2019 (v1), last revised 7 Jan 2020 (this version, v2)]
Title:DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction
View PDFAbstract:This paper studies audio-visual deep saliency prediction. It introduces a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction dubbed ``DAVE" in conjunction with our efforts towards building an Audio-Visual Eye-tracking corpus named ``AVE". Despite existing a strong relation between auditory and visual cues for guiding gaze during perception, video saliency models only consider visual cues and neglect the auditory information that is ubiquitous in dynamic scenes. Here, we investigate the applicability of audio cues in conjunction with visual ones in predicting saliency maps using deep neural networks. To this end, the proposed model is intentionally designed to be simple. Two baseline models are developed on the same architecture which consists of an encoder-decoder. The encoder projects the input into a feature space followed by a decoder that infers saliency. We conduct an extensive analysis on different modalities and various aspects of multi-model dynamic saliency prediction. Our results suggest that (1) audio is a strong contributing cue for saliency prediction, (2) salient visible sound-source is the natural cause of the superiority of our Audio-Visual model, (3) richer feature representations for the input space leads to more powerful predictions even in absence of more sophisticated saliency decoders, and (4) Audio-Visual model improves over 53.54\% of the frames predicted by the best Visual model (our baseline). Our endeavour demonstrates that audio is an important cue that boosts dynamic video saliency prediction and helps models to approach human performance. The code is available at this https URL
Submission history
From: Hamed R. Tavakoli [view email][v1] Sat, 25 May 2019 23:16:57 UTC (1,604 KB)
[v2] Tue, 7 Jan 2020 21:52:07 UTC (1,384 KB)
References & Citations
Bibliographic and Citation Tools
Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)
Code, Data and Media Associated with this Article
alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)
Demos
Recommenders and Search Tools
Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.