"Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation

Dong, Qianqian; Ye, Rong; Wang, Mingxuan; Zhou, Hao; Xu, Shuang; Xu, Bo; Li, Lei

arXiv:2009.09704 (cs)

[Submitted on 21 Sep 2020 (v1), last revised 5 Apr 2021 (this version, v3)]

Abstract: An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language. Existing methods are limited by the amount of parallel data available. Can we build a system that fully utilizes the signals in a parallel ST corpus? We are inspired by the human understanding system, which is composed of auditory perception and cognitive processing. In this paper, we propose Listen-Understand-Translate (LUT), a unified framework with triple supervision signals to decouple the end-to-end speech-to-text translation task. LUT guides the acoustic encoder to extract as much information as possible from the auditory input. In addition, LUT uses a pre-trained BERT model to push the upper encoder to produce as much semantic information as possible, without extra data. We perform experiments on a diverse set of speech translation benchmarks, including Librispeech English-French, IWSLT English-German and TED English-Chinese. Our results demonstrate that LUT achieves state-of-the-art performance, outperforming previous methods. The code is available at this https URL.
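
As an illustration of what "triple supervision" could look like as a training objective, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the loss functions or their weights, so everything here is an assumption: the acoustic term is passed in as a precomputed number, the semantic term is taken to be a mean squared error between the upper encoder's sentence vector and a BERT sentence vector, the translation term is a single-token cross-entropy, and the names `triple_loss`, `w_acoustic`, `w_semantic` and `w_translation` are hypothetical.

```python
import math
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Negative log-likelihood of the target token under `probs`."""
    return -math.log(probs[target_index])


def triple_loss(
    acoustic_loss: float,          # assumed: precomputed loss on the acoustic encoder
    encoder_vec: Sequence[float],  # assumed: upper-encoder sentence representation
    bert_vec: Sequence[float],     # assumed: BERT representation of the transcript
    probs: Sequence[float],        # decoder distribution over target tokens
    target_index: int,             # index of the reference target token
    w_acoustic: float = 1.0,       # hypothetical weights, not from the paper
    w_semantic: float = 1.0,
    w_translation: float = 1.0,
) -> float:
    """Weighted sum of three supervision signals (illustrative only)."""
    semantic_loss = mse(encoder_vec, bert_vec)
    translation_loss = cross_entropy(probs, target_index)
    return (
        w_acoustic * acoustic_loss
        + w_semantic * semantic_loss
        + w_translation * translation_loss
    )
```

In this sketch, each term supervises a different stage: the acoustic term the lower encoder, the semantic term the upper encoder, and the translation term the decoder. A real system would compute these over full sequences and batches rather than single vectors and tokens.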

Comments:	Accepted by AAAI 2021
Subjects:	Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2009.09704 [cs.CL]
	(or arXiv:2009.09704v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2009.09704

Submission history

From: Qianqian Dong
[v1] Mon, 21 Sep 2020 09:19:07 UTC (7,683 KB)
[v2] Mon, 28 Dec 2020 07:28:44 UTC (671 KB)
[v3] Mon, 5 Apr 2021 12:36:42 UTC (1,296 KB)
