Estimating articulatory movements in speech production with transformer networks

Udupa, Sathvik; Roy, Anwesha; Singh, Abhayjeet; Illa, Aravind; Ghosh, Prasanta Kumar


arXiv:2104.05017 (eess)

[Submitted on 11 Apr 2021 (v1), last revised 12 Jun 2021 (this version, v2)]

Abstract: We estimate articulatory movements in speech production from two modalities: acoustics and phonemes. Acoustic-to-articulatory inversion (AAI) is a sequence-to-sequence task. Phoneme-to-articulatory (PTA) motion estimation, on the other hand, faces a key challenge in reliably aligning the text with the articulatory movements. To address this challenge, we explore the use of a transformer architecture, FastSpeech, with explicit duration modelling to learn hard alignments between the phonemes and articulatory movements. We also train a transformer model on AAI. We use correlation coefficient (CC) and root mean squared error (rMSE) to assess the estimation performance in comparison to existing methods on both tasks. We observe 154%, 11.8% and 4.8% relative improvement in CC with subject-dependent, pooled and fine-tuning strategies, respectively, for PTA estimation. Additionally, on the AAI task, we obtain 1.5%, 3% and 3.1% relative gain in CC on the same setups compared to the state-of-the-art baseline. We further present the computational benefits of using transformer architectures as representation blocks.
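To make the evaluation metrics named in the abstract concrete, the following is a minimal sketch of how CC, rMSE and a relative gain in CC could be computed for a single articulatory trajectory. It assumes CC means the Pearson correlation coefficient between the predicted and measured trajectory, which the abstract does not state explicitly. The function names and the sample values are hypothetical and are not taken from the paper or its code.

```python
import math


def pearson_cc(pred, target):
    """Pearson correlation between two equal-length trajectories (assumed meaning of CC)."""
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("trajectories must have the same length (at least 2 frames)")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_p * var_t)


def rmse(pred, target):
    """Root mean squared error between two equal-length trajectories."""
    if len(pred) != len(target) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score


if __name__ == "__main__":
    # Hypothetical frame-level positions of one articulator (illustrative values only).
    measured = [0.0, 0.4, 0.9, 1.3, 1.1, 0.6]
    predicted = [0.1, 0.5, 0.8, 1.2, 1.0, 0.7]
    print("CC:", pearson_cc(predicted, measured))
    print("rMSE:", rmse(predicted, measured))
    # Hypothetical CC scores for a proposed model and a baseline.
    print("relative CC gain (%):", relative_improvement(0.82, 0.80))
```

In practice these metrics would typically be computed per articulator and per utterance and then averaged; that averaging scheme is likewise an assumption here, not something the abstract specifies.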

Comments:	Accepted for oral presentation at INTERSPEECH 2021
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2104.05017 [eess.AS]
	(or arXiv:2104.05017v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2104.05017

Submission history

From: Sathvik Udupa
[v1] Sun, 11 Apr 2021 13:56:10 UTC (376 KB)
[v2] Sat, 12 Jun 2021 08:56:20 UTC (485 KB)
