Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments

Xu, Shitong; Yang, Yiyuan; Trigoni, Niki; Markham, Andrew

arXiv:2502.16611 (cs)

[Submitted on 23 Feb 2025 (v1), last revised 17 Jun 2025 (this version, v2)]

Abstract: Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have used clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi than prior works when extracting monaural speech from a mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%. Overall, our method achieves state-of-the-art performance in monaural target speaker extraction conditioned on noisy enrollments.
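
The abstract reports its headline result as SI-SNRi (scale-invariant signal-to-noise ratio improvement). As a minimal sketch of the commonly used definition, and not the authors' evaluation code, the metric can be computed as below. The function names `si_snr` and `si_snr_improvement` and the `eps` stabiliser are assumptions made for illustration; signals are plain lists of samples.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Project the estimate onto the target: this is the "signal" part.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    projection = [scale * t for t in tgt]
    # Whatever is left after the projection counts as noise.
    noise = [e - p for e, p in zip(est, projection)]
    proj_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((proj_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the extracted signal over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because the estimate is projected onto the target before energies are compared, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant. The improvement variant subtracts the score of the raw mixture, so it measures what the extraction model adds rather than how easy the input was.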

Comments:	11 pages, 6 figures
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2502.16611 [cs.SD]
	(or arXiv:2502.16611v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2502.16611

Submission history

From: Shitong Xu
[v1] Sun, 23 Feb 2025 15:33:44 UTC (5,159 KB)
[v2] Tue, 17 Jun 2025 06:10:07 UTC (778 KB)
