Audio-visual speech synchronization detection using a bimodal linear prediction model

Kshitiz Kumar; Jiri Navratil; Etienne Marcheret; Vit Libal; Ganesh Ramaswamy; Gerasimos Potamianos

doi:10.1109/CVPRW.2009.5204303

2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops

Year: 2009, Pages: 53-59

Authors

Kshitiz Kumar, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Jiri Navratil, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
Etienne Marcheret, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
Vit Libal, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
Ganesh Ramaswamy, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
Gerasimos Potamianos, Institute of Informatics and Telecommunications, NCSR "Demokritos", 15310 Athens, Greece

Abstract

In this work, we study the problem of detecting audio-visual (AV) synchronization in video segments containing a speaker in frontal head pose. The problem has important applications in biometrics, such as spoofing detection, and it is an important step in the AV segmentation needed to derive AV fingerprints for multimodal speaker recognition. To attack the problem, we propose a time-evolution model for AV features and derive an analytical approach that captures the notion of synchronization between them. We report results on an appropriate AV database, using two types of visual features extracted from the speaker's facial area: geometric features and features based on the discrete cosine image transform. Our results demonstrate that the proposed approach provides substantially better AV synchrony detection than a baseline method that employs mutual information, with the geometric visual features outperforming the image transform ones.
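The abstract does not give the details of either the proposed model or the baseline, so the following is only a toy illustration of the general idea behind a mutual-information synchrony score, not the authors' implementation. It assumes scalar audio and visual features that are jointly Gaussian, in which case the mutual information reduces to -0.5 ln(1 - rho^2), where rho is the correlation coefficient. The function names, the synthetic data, and the 25-frame offset are all hypothetical choices made for this sketch.

```python
import math
import random


def gaussian_mutual_information(x, y):
    """Mutual information in nats between two scalar sequences,
    ASSUMING they are jointly Gaussian: I = -0.5 * ln(1 - rho^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    rho = cov / math.sqrt(var_x * var_y)
    # Clamp so that a perfectly correlated pair does not produce log(0).
    rho_sq = min(rho * rho, 1.0 - 1e-12)
    return -0.5 * math.log(1.0 - rho_sq)


def make_toy_streams(n_frames=500, noise=0.3, seed=0):
    """Synthetic stand-in for AV features (hypothetical, not real data):
    both streams observe the same hidden white-noise signal plus
    independent sensor noise."""
    rng = random.Random(seed)
    hidden = [rng.gauss(0.0, 1.0) for _ in range(n_frames)]
    audio = [h + noise * rng.gauss(0.0, 1.0) for h in hidden]
    visual = [h + noise * rng.gauss(0.0, 1.0) for h in hidden]
    return audio, visual


def score_with_offset(audio, visual, offset):
    """Score the pair after delaying the visual stream by `offset` frames."""
    if offset == 0:
        return gaussian_mutual_information(audio, visual)
    return gaussian_mutual_information(audio[:-offset], visual[offset:])


audio, visual = make_toy_streams()
mi_sync = score_with_offset(audio, visual, 0)      # aligned streams
mi_shifted = score_with_offset(audio, visual, 25)  # misaligned streams
print(f"aligned: {mi_sync:.3f} nats, shifted: {mi_shifted:.3f} nats")
```

In this toy setting the aligned pair scores well above the shifted pair, and thresholding such a score is one simple way to make a synchrony decision. Real AV features are multidimensional and temporally correlated, which this sketch deliberately ignores.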

