Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

Bird, Jordan J.; Faria, Diego R.; Premebida, Cristiano; Ekárt, Anikó; Vogiatzis, George

arXiv:2007.10175 (cs)

[Submitted on 11 Jul 2020]

Abstract: The novelty of this study lies in a multi-modality approach to scene classification, in which image and audio complement each other through deep late fusion. The approach is demonstrated on a difficult classification problem consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and the accompanying audio at one-second intervals. The image and audio datasets are then classified independently, using a fine-tuned VGG16 and an evolutionarily optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% for the multi-modality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are considered as feature generators. We show that cases in which a single modality is confused by anomalous data points are corrected through the emerging higher-order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone, and a densely crowded street misclassified as a forest by the image classifier alone. Both examples are correctly classified by our multi-modality approach.
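The late fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two primary networks (VGG16 and the evolutionarily optimised audio network) are replaced by a hypothetical function, `simulated_primary_outputs`, that produces synthetic class-probability vectors, and the tertiary neural network is replaced by a plain softmax regression. Fusing on the output probabilities rather than on an internal layer is also an assumption, as are all function names and hyperparameters.

```python
import numpy as np

N_CLASSES = 8  # the abstract reports 8 environments


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def simulated_primary_outputs(labels, error_rate, rng):
    """Hypothetical stand-in for a trained single-modality network.

    Returns one probability vector per sample. For roughly `error_rate`
    of the samples the boosted class is drawn at random instead of
    being the true label.
    """
    n = labels.shape[0]
    logits = rng.normal(0.0, 1.0, size=(n, N_CLASSES))
    confused = rng.random(n) < error_rate
    boosted = np.where(confused, rng.integers(0, N_CLASSES, n), labels)
    logits[np.arange(n), boosted] += 4.0
    return softmax(logits)


def fuse_features(image_probs, audio_probs):
    """Late fusion input: concatenate the two networks' outputs."""
    return np.concatenate([image_probs, audio_probs], axis=1)


def train_fusion(features, labels, epochs=300, lr=0.5):
    """Fit a softmax regression on the fused features (assumed stand-in
    for the tertiary network) with full-batch gradient descent."""
    n, d = features.shape
    W = np.zeros((d, N_CLASSES))
    b = np.zeros(N_CLASSES)
    onehot = np.eye(N_CLASSES)[labels]
    for _ in range(epochs):
        grad = (softmax(features @ W + b) - onehot) / n
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(features, W, b):
    """Predicted class index for each fused feature row."""
    return np.argmax(features @ W + b, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, N_CLASSES, 2000)
    img = simulated_primary_outputs(y, 0.10, rng)
    aud = simulated_primary_outputs(y, 0.06, rng)
    W, b = train_fusion(fuse_features(img, aud), y)
    fused = predict(fuse_features(img, aud), W, b)
    print("image only:", (img.argmax(axis=1) == y).mean())
    print("audio only:", (aud.argmax(axis=1) == y).mean())
    print("fused     :", (fused == y).mean())
```

The point of the design is that the fusion model sees both modalities' outputs at once, so a sample on which one modality is confidently wrong can still be classified correctly when the other modality disagrees. The synthetic error rates here are arbitrary and are not meant to reproduce the accuracies reported in the abstract.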

Comments:	6 pages, 10 figures, 3 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2007.10175 [cs.CV]
	(or arXiv:2007.10175v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2007.10175

Submission history

From: Jordan J. Bird
[v1] Sat, 11 Jul 2020 16:47:05 UTC (3,806 KB)
