AST: Audio Spectrogram Transformer
Yuan Gong, Yu-An Chung, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
{yuangong, andyyuan, glass}@mit.edu
[7] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM TASLP, vol. 28, pp. 2880–2894, 2020.
[8] Y. Gong, Y.-A. Chung, and J. Glass, “PSLA: Improving audio event classification with pretraining, sampling, labeling, and aggregation,” arXiv preprint arXiv:2102.01243, 2021.
[9] O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Laurenzo, “Streaming keyword spotting on mobile devices,” in Interspeech, 2020.
[10] P. Li, Y. Song, I. V. McLoughlin, W. Guo, and L.-R. Dai, “An attention pooling based representation learning method for speech emotion recognition,” in Interspeech, 2018.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” arXiv preprint arXiv:2012.12877, 2020.
[13] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token ViT: Training vision transformers from scratch on ImageNet,” arXiv preprint arXiv:2101.11986, 2021.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
[15] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017.
[16] K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Multimedia, 2015.
[17] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
[19] K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, and K. Takeda, “Convolution augmented transformer for semi-supervised sound event detection,” in DCASE, 2020.
[20] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, “Sound event detection of weakly labelled data with CNN-transformer and automatic threshold optimization,” IEEE/ACM TASLP, vol. 28, pp. 2450–2460, 2020.
[29] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Interspeech, 2019.
[30] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” in UAI, 2018.
[31] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[33] H. B. Sailor, D. M. Agrawal, and H. A. Patil, “Unsupervised filterbank learning using convolutional restricted Boltzmann machine for environmental sound classification,” in Interspeech, 2017.
[34] S. Majumdar and B. Ginsburg, “MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition,” arXiv preprint arXiv:2004.08531, 2020.
[35] J. Lin, K. Kilgour, D. Roblek, and M. Sharifi, “Training keyword spotters with limited and synthesized speech data,” in ICASSP, 2020.