Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

Segev, Eliya; Alroy, Maya; Katsir, Ronen; Wies, Noam; Shenhav, Ayana; Ben-Oren, Yael; Zar, David; Tadmor, Oren; Bitterman, Jacob; Shashua, Amnon; Rosenwein, Tal

arXiv:2307.01715 (cs)

[Submitted on 4 Jul 2023 (v1), last revised 7 Mar 2024 (this version, v3)]

Abstract: Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (that yield the ground truth), at the expense of imperfect alignments. This binary differentiation of perfect and imperfect alignments falls short of capturing other essential alignment properties that hold significance in other real-world applications. Here we propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play framework}$ for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC with an additional loss term that prioritizes alignments according to a desired property. Our method does not require any intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation between both perfect and imperfect alignments. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report an improvement of up to 570ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on a scale of data as large as ours. Notably, our method can be implemented using only a few lines of code, and can be extended to other alignment-free loss functions and to domains other than ASR.
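The abstract describes the method only at a high level: the CTC loss is left untouched and an additional term prioritizes alignments that better satisfy a chosen property. The following is a minimal sketch of one way such a term could look, not the authors' implementation. It assumes a hinge-style penalty on alignment log-probabilities, a blank index of 0, and a hypothetical property function `emit_earlier` that moves emissions one frame earlier (a stand-in for the emission-time property). All function names, the margin, and the weight are illustrative assumptions.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def alignment_log_prob(log_probs, alignment):
    """Sum the per-frame log-probabilities along one frame-level alignment."""
    return sum(frame[tok] for frame, tok in zip(log_probs, alignment))


def emit_earlier(alignment):
    """Hypothetical property function: shift emissions one frame earlier.

    If the alignment starts with a blank, drop it and pad a blank at the
    end; the collapsed label sequence is unchanged. Otherwise return a copy.
    """
    if alignment and alignment[0] == BLANK:
        return alignment[1:] + [BLANK]
    return list(alignment)


def property_loss(log_probs, sampled_alignments, improve, margin=0.0):
    """Hinge-style term (an assumption): penalize each sampled alignment
    that scores higher than its 'improved' counterpart plus a margin."""
    losses = []
    for a in sampled_alignments:
        better = improve(a)
        gap = alignment_log_prob(log_probs, a) - alignment_log_prob(log_probs, better)
        losses.append(max(0.0, gap + margin))
    return sum(losses) / len(losses)


def total_loss(ctc_loss, extra_loss, weight=1.0):
    """Combine an unmodified CTC loss value with the additional term."""
    return ctc_loss + weight * extra_loss


# Toy example: 3 frames, vocabulary {blank, 'a'}; probabilities are made up.
log_probs = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.6), math.log(0.4)],
]
sampled = [[BLANK, 1, BLANK]]  # alignments would normally be sampled from the model
loss = property_loss(log_probs, sampled, emit_earlier)
```

In this toy case the sampled alignment is far more likely than its earlier-emitting counterpart, so the term is positive and would push probability mass toward the earlier emission. In a real training loop the same computation would be done on tensors so that gradients flow through `log_probs`.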

Comments:	ICLR 2024
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2307.01715 [cs.CL]
	(or arXiv:2307.01715v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2307.01715

Submission history

From: Noam Wies
[v1] Tue, 4 Jul 2023 13:34:47 UTC (1,079 KB)
[v2] Thu, 6 Jul 2023 07:02:45 UTC (1,079 KB)
[v3] Thu, 7 Mar 2024 17:59:25 UTC (3,058 KB)
