AST: Audio Spectrogram Transformer
Yuan Gong, Yu-An Chung, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
{yuangong, andyyuan, glass}@mit.edu
[7] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM TASLP, vol. 28, pp. 2880–2894, 2020.
[8] Y. Gong, Y.-A. Chung, and J. Glass, “PSLA: Improving audio event classification with pretraining, sampling, labeling, and aggregation,” arXiv preprint arXiv:2102.01243, 2021.
[9] O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Laurenzo, “Streaming keyword spotting on mobile devices,” in Interspeech, 2020.
[10] P. Li, Y. Song, I. V. McLoughlin, W. Guo, and L.-R. Dai, “An attention pooling based representation learning method for speech emotion recognition,” in Interspeech, 2018.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” arXiv preprint arXiv:2012.12877, 2020.
[13] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token ViT: Training vision transformers from scratch on ImageNet,” arXiv preprint arXiv:2101.11986, 2021.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
[15] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017.
[16] K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Multimedia, 2015.
[17] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
[19] K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, and K. Takeda, “Convolution augmented transformer for semi-supervised sound event detection,” in DCASE, 2020.
[20] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, “Sound event detection of weakly labelled data with CNN-transformer and automatic threshold optimization,” IEEE/ACM TASLP, vol. 28, pp. 2450–2460, 2020.
[29] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Interspeech, 2019.
[30] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” in UAI, 2018.
[31] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[33] H. B. Sailor, D. M. Agrawal, and H. A. Patil, “Unsupervised filterbank learning using convolutional restricted Boltzmann machine for environmental sound classification,” in Interspeech, 2017.
[34] S. Majumdar and B. Ginsburg, “MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition,” arXiv preprint arXiv:2004.08531, 2020.
[35] J. Lin, K. Kilgour, D. Roblek, and M. Sharifi, “Training keyword spotters with limited and synthesized speech data,” in ICASSP, 2020.