Online End-to-End Neural Diarization with Speaker-Tracing Buffer

Xue, Yawen; Horiguchi, Shota; Fujita, Yusuke; Watanabe, Shinji; Nagamatsu, Kenji

arXiv:2006.02616v1 (eess)

[Submitted on 4 Jun 2020 (this version), latest version 7 Mar 2021 (v2)]

Abstract: End-to-end speaker diarization using a fully supervised self-attention mechanism (SA-EEND) has achieved significant improvement over state-of-the-art clustering-based methods, especially in the overlapping-speech case. However, applications of the original SA-EEND are limited because it was developed on the basis of offline self-attention algorithms. In this paper, we propose a novel speaker-tracing mechanism to extend SA-EEND to online speaker diarization for practical use. First, we present oracle experiments showing that a straightforward online extension, in which SA-EEND is performed independently on each chunk of a recording, degrades the diarization error rate (DER) because the speaker permutation is inconsistent across chunks. To circumvent this inconsistency, our proposed method, called the speaker-tracing buffer, maintains the speaker permutation information determined in previous chunks within the self-attention mechanism for correct speaker tracing. Our experimental results show that the proposed online SA-EEND with the speaker-tracing buffer achieved DERs of 12.84% for CALLHOME and 21.64% for the Corpus of Spontaneous Japanese with 1 s latency. These results are significantly better than those of the conventional online clustering method based on x-vectors with 1.5 s latency, which achieved DERs of 26.90% and 25.45%, respectively.
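
The permutation inconsistency described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the buffer is filled or how the permutation is selected, so the sketch assumes that the model is run on the buffered frames concatenated with the new chunk, and that the output columns are reordered to best agree with the activities previously stored for the buffered frames. The function names `align_to_buffer` and `trace_chunk` and the dot-product agreement score are hypothetical choices made only for illustration.

```python
from itertools import permutations

import numpy as np


def align_to_buffer(buffer_probs, new_probs_on_buffer):
    """Pick the speaker (column) order of the new outputs that best agrees
    with the activities already stored for the same buffered frames.

    Both arrays have shape (frames, speakers) and hold per-frame speech
    activity probabilities. The agreement score is an assumption of this
    sketch, not something stated in the abstract.
    """
    n_speakers = buffer_probs.shape[1]
    best_perm, best_score = None, -np.inf
    for perm in permutations(range(n_speakers)):
        # Sum of elementwise products between stored and reordered outputs.
        score = float(np.sum(buffer_probs * new_probs_on_buffer[:, list(perm)]))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm)


def trace_chunk(buffer_probs, new_probs):
    """Reorder the new chunk's outputs to follow the buffer's speaker order.

    `new_probs` is assumed to be the model output for the buffered frames
    followed by the frames of the new chunk.
    """
    n_buf = buffer_probs.shape[0]
    perm = align_to_buffer(buffer_probs, new_probs[:n_buf])
    # Return only the new frames, with columns in the buffer's speaker order.
    return new_probs[n_buf:, perm]


# Stored activities for three buffered frames (speaker A, speaker B).
buffer_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
# Hypothetical model output for buffer + 2 new frames, with speakers swapped.
new_probs = np.array(
    [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.95, 0.05]]
)
aligned = trace_chunk(buffer_probs, new_probs)
```

In this toy input the second run emits the two speakers in swapped column order; matching on the buffered frames recovers the swap, so the new frames can be appended with a consistent speaker order. The exhaustive search over permutations is only practical for a small number of speakers.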

Comments:	Submitted to Interspeech 2020
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2006.02616 [eess.AS]
	(or arXiv:2006.02616v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2006.02616

Submission history

From: Yawen Xue
[v1] Thu, 4 Jun 2020 02:25:07 UTC (607 KB)
[v2] Sun, 7 Mar 2021 04:40:59 UTC (588 KB)
