Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification

Jung, Youngmoon; Kim, Younggwan; Lim, Hyungjun; Choi, Yeunju; Kim, Hoirin

doi:10.21437/Interspeech.2019-2177

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:1906.08333 (eess)

[Submitted on 19 Jun 2019]

Authors: Youngmoon Jung, Younggwan Kim, Hyungjun Lim, Yeunju Choi, Hoirin Kim

Abstract: In this paper, we propose a new pooling method called spatial pyramid encoding (SPE) to generate speaker embeddings for text-independent speaker verification. We first partition the output feature maps from a deep residual network (ResNet) into increasingly fine sub-regions and extract speaker embeddings from each sub-region through a learnable dictionary encoding layer. These embeddings are concatenated to obtain the final speaker representation. The SPE layer not only generates a fixed-dimensional speaker embedding for a variable-length speech segment, but also aggregates the information of feature distribution from multi-level temporal bins. Furthermore, we apply deep length normalization by augmenting the loss function with ring loss. By applying ring loss, the network gradually learns to normalize the speaker embeddings using model weights themselves while preserving convexity, leading to more robust speaker embeddings. Experiments on the VoxCeleb1 dataset show that the proposed system using the SPE layer and ring loss-based deep length normalization outperforms both i-vector and d-vector baselines.
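The two ideas in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. The pyramid levels (1, 2, 4) and the function names `pyramid_pool` and `ring_loss` are assumptions chosen for illustration, and plain mean pooling stands in for the learnable dictionary encoding layer. The penalty follows the usual ring loss form, weight / (2m) * sum_i (||e_i||_2 - R)^2, where the target norm R would be a learned parameter during training but is passed in here as a fixed argument.

```python
import math


def pyramid_pool(frames, levels=(1, 2, 4)):
    """Concatenate per-bin mean vectors over increasingly fine temporal bins.

    frames: list of T feature vectors (each a list of D floats), T >= max(levels).
    Mean pooling is only a stand-in for the learnable dictionary encoding layer.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    embedding = []
    for bins in levels:
        for b in range(bins):
            # Integer bin edges that cover every frame exactly once.
            start = b * num_frames // bins
            end = (b + 1) * num_frames // bins
            chunk = frames[start:end]
            embedding.extend(
                sum(f[d] for f in chunk) / len(chunk) for d in range(dim)
            )
    return embedding


def ring_loss(embeddings, target_norm, weight=0.01):
    """Ring-loss-style penalty: weight / (2m) * sum_i (||e_i||_2 - R)^2."""
    m = len(embeddings)
    total = 0.0
    for e in embeddings:
        norm = math.sqrt(sum(x * x for x in e))
        total += (norm - target_norm) ** 2
    return weight * total / (2 * m)


# Segments of different lengths map to embeddings of the same size, D * sum(levels).
short_segment = [[float(t), 1.0] for t in range(8)]
long_segment = [[float(t), 1.0] for t in range(50)]
print(len(pyramid_pool(short_segment)), len(pyramid_pool(long_segment)))
```

In this sketch the penalty is zero when every embedding already has norm R and grows quadratically as norms drift away from it, which is what lets it be added to the main classification loss as a soft length constraint.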

Comments:	5 pages, 2 figures, Interspeech 2019
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Machine Learning (stat.ML)
Cite as:	arXiv:1906.08333 [eess.AS]
	(or arXiv:1906.08333v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.1906.08333
Journal reference:	Proc. of Interspeech 2019, 2019, pp. 4030-4034
Related DOI:	https://doi.org/10.21437/Interspeech.2019-2177

Submission history

From: Youngmoon Jung
[v1] Wed, 19 Jun 2019 20:13:27 UTC (756 KB)
