WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning

Rao, Rajath; Ganesan, Adithya; Kjell, Oscar; Luby, Jonah; Raghavan, Akshay; Feltman, Scott; Ringwald, Whitney; Boyd, Ryan L.; Luft, Benjamin; Ruggero, Camilo; Ryant, Neville; Kotov, Roman; Schwartz, H. Andrew

arXiv:2501.16344 (eess)

[Submitted on 15 Jan 2025 (v1), last revised 31 May 2025 (this version, v4)]

Abstract: Current speech encoding pipelines often rely on an additional text-based LM to obtain robust representations of human communication, even though SotA speech-to-text models often contain an LM themselves. This work proposes an approach to improve the LM within an audio model so that the subsequent text LM is unnecessary. We introduce WhiSPA (Whisper with Semantic and Psychological Alignment), which leverages a novel audio training objective: a contrastive loss with a language model embedding as a teacher. Using over 500k speech segments from mental health audio interviews, we evaluate the utility of aligning Whisper's latent space with semantic representations from a text autoencoder (SBERT) and lexically derived embeddings of basic psychological dimensions: emotion and personality. On self-supervised affective tasks and downstream psychological tasks, WhiSPA surpasses current speech encoders, achieving average error reductions of 73.4% and 83.8%, respectively. WhiSPA demonstrates that it is not always necessary to run a subsequent text LM on speech-to-text output to obtain a rich psychological representation of human communication.
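
The abstract names the training objective only at a high level: a contrastive loss in which a language model embedding acts as the teacher. As a rough illustration of that general idea, and not the paper's actual implementation, the sketch below computes an InfoNCE-style loss that pulls each student (audio) embedding toward the teacher (text) embedding of the same segment and away from the teacher embeddings of other segments in the batch. The function names, the temperature value, and the toy vectors are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_alignment_loss(student, teacher, temperature=0.1):
    """InfoNCE-style loss (hypothetical helper, not taken from the paper).

    student[i] is treated as the positive match for teacher[i]; every other
    teacher[j] in the batch serves as a negative. The temperature is an
    assumed hyperparameter.
    """
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of this student embedding to every teacher embedding.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching teacher as the target class.
        total += log_denom - logits[i]
    return total / len(s)


# Toy data: three "segments" with 2-d embeddings (illustrative values only).
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = contrastive_alignment_loss(aligned, teacher)
loss_shuffled = contrastive_alignment_loss(shuffled, teacher)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is lower when each student embedding sits near its own teacher embedding than when the pairing is shuffled, which is the behavior an alignment objective of this kind is meant to reward.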

Comments: 16 pages, 8 figures, ACL 2025
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Cite as: arXiv:2501.16344 [eess.AS] (or arXiv:2501.16344v4 [eess.AS] for this version)
https://doi.org/10.48550/arXiv.2501.16344

Submission history

From: Rajath Rao
[v1] Wed, 15 Jan 2025 06:30:17 UTC (3,541 KB)
[v2] Sun, 16 Feb 2025 23:25:21 UTC (10,462 KB)
[v3] Wed, 21 May 2025 00:00:05 UTC (10,416 KB)
[v4] Sat, 31 May 2025 16:37:32 UTC (1,490 KB)
