Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

Boito, Marcely Zanon; Yusuf, Bolaji; Ondel, Lucas; Villavicencio, Aline; Besacier, Laurent

arXiv:2106.04298 (cs)

[Submitted on 8 Jun 2021 (v1), last revised 18 May 2022 (this version, v2)]

Abstract: Documenting languages helps to prevent the extinction of endangered dialects, many of which are otherwise expected to disappear by the end of the century. When documenting oral languages, unsupervised word segmentation (UWS) from speech is a useful, yet challenging, task. It consists of producing time-stamps for slicing utterances into smaller segments corresponding to words, and it is performed from phonetic transcriptions or, in their absence, from the output of unsupervised speech discretization models. These discretization models are trained using raw speech only, producing discrete speech units that can be used for downstream (text-based) tasks. In this paper we compare five of these models, three Bayesian and two neural approaches, with regard to the exploitability of the produced units for UWS. For the UWS task, we experiment with two models, using as our target language Mboshi (Bantu C25), an unwritten language from Congo-Brazzaville. Additionally, we report results for Finnish, Hungarian, Romanian and Russian in equally low-resource settings, using only 4 hours of speech. Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length. We obtain our best UWS results by using Bayesian models that produce high-quality, yet compressed, discrete representations of the input speech signal.
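
To make the task in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how word time-stamps could be read off once a segmenter has placed boundaries over a sequence of discrete speech units. The unit labels, the times, the `Unit` alias and the `words_from_boundaries` helper are all hypothetical and chosen only for illustration; the sketch assumes each unit carries its own start and end time.

```python
from typing import List, Tuple

# Hypothetical representation: (unit label, start time in s, end time in s).
Unit = Tuple[str, float, float]


def words_from_boundaries(
    units: List[Unit], boundaries: List[int]
) -> List[Tuple[float, float, List[str]]]:
    """Slice a unit sequence into words and return their time-stamps.

    Each index i in `boundaries` means "a new word starts at units[i]".
    Returns one (start time, end time, unit labels) tuple per word.
    """
    if not units:
        return []
    # Keep only boundaries that fall strictly inside the sequence.
    cuts = sorted({b for b in boundaries if 0 < b < len(units)})
    starts = [0] + cuts
    ends = cuts + [len(units)]
    words = []
    for s, e in zip(starts, ends):
        chunk = units[s:e]
        # A word spans from its first unit's start to its last unit's end.
        words.append((chunk[0][1], chunk[-1][2], [u[0] for u in chunk]))
    return words


# Made-up example: five units, with word boundaries before units 2 and 4.
units = [
    ("u7", 0.00, 0.08),
    ("u2", 0.08, 0.21),
    ("u7", 0.21, 0.30),
    ("u5", 0.30, 0.47),
    ("u1", 0.47, 0.60),
]
segments = words_from_boundaries(units, boundaries=[2, 4])
for start, end, labels in segments:
    print(f"{start:.2f}-{end:.2f}s: {' '.join(labels)}")
```

In this sketch the segmenter only ever sees the label sequence, so a discretization model that emits many short units per word gives it a longer sequence to segment, which is one way to read the abstract's remark about limiting sequence length.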

Comments:	Accepted to SIGUL 2022
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2106.04298 [cs.CL]
	(or arXiv:2106.04298v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2106.04298

Submission history

From: Marcely Zanon Boito
[v1] Tue, 8 Jun 2021 12:50:37 UTC (2,322 KB)
[v2] Wed, 18 May 2022 13:10:17 UTC (2,234 KB)
