Keyword Transformer: A Self-Attention Model for Keyword Spotting

Berg, Axel; O'Connor, Mark; Cruz, Miguel Tairum

arXiv:2104.00769v1 (eess)

[Submitted on 1 Apr 2021 (this version), latest version 15 Jun 2021 (v3)]

Abstract: The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12- and 35-command tasks, respectively.
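
The abstract does not describe KWT's internals, so the following is only a generic illustration of what "fully self-attentional" can mean for an audio classifier: a single-head scaled dot-product self-attention layer applied directly to a sequence of feature frames, followed by pooling and a linear output layer, with no convolutional or recurrent encoder in front. The frame count, feature size, mean pooling, random weights and all function names are assumptions made for this sketch, not details taken from the paper; only the 12-class output size echoes the 12-command task mentioned above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (T, d) array, one row per input frame (hypothetical layout).
    Returns the attended sequence (T, d_v) and the (T, T) attention weights.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # pairwise frame similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


def classify(tokens, params):
    """Attention -> mean pooling -> linear layer -> class probabilities.

    Mean pooling is an assumption of this sketch, not the paper's choice.
    """
    attended, _ = self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"]
    )
    pooled = attended.mean(axis=0)
    return softmax(pooled @ params["w_out"])


# Hypothetical sizes: 98 frames of 40 features each, 12 output classes.
T, d, n_classes = 98, 40, 12
rng = np.random.default_rng(0)
params = {
    "w_q": rng.standard_normal((d, d)) * 0.1,
    "w_k": rng.standard_normal((d, d)) * 0.1,
    "w_v": rng.standard_normal((d, d)) * 0.1,
    "w_out": rng.standard_normal((d, n_classes)) * 0.1,
}
tokens = rng.standard_normal((T, d))  # stand-in for real audio features

probs = classify(tokens, params)
_, attn = self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
```

Because every frame attends to every other frame in a single step, the layer can relate distant parts of an utterance without stacking convolutions or unrolling a recurrence. A trained model would learn these weights and typically stack several such layers.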

Comments:	Submitted to INTERSPEECH
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2104.00769 [eess.AS]
	(or arXiv:2104.00769v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2104.00769

Submission history

From: Axel Berg
[v1] Thu, 1 Apr 2021 21:15:30 UTC (1,438 KB)
[v2] Thu, 15 Apr 2021 14:28:41 UTC (1,436 KB)
[v3] Tue, 15 Jun 2021 13:06:01 UTC (609 KB)
