Modeling Beats and Downbeats With A Time-Frequency Transformer
Yun-Ning Hung2,∗, Ju-Chiang Wang1, Xuchen Song1, Wei-Tsung Lu1, and Minz Won1
1 ByteDance, Mountain View, CA, USA
2 Center for Music Technology, Georgia Institute of Technology, Atlanta, GA, USA
[email protected], {ju-chiang.wang, xuchen.song, weitsung.lu, minzwon}@bytedance.com
ABSTRACT

Transformer is a successful deep neural network (DNN) architecture that has shown its versatility not only in natural language processing but also in music information retrieval (MIR). In this paper, we present a novel Transformer-based approach to tackle beat and downbeat tracking. This approach employs SpecTNT (Spectral-Temporal Transformer in Transformer), a variant of Transformer that models both the spectral and temporal dimensions of a time-frequency input of music audio. A SpecTNT model uses a stack of blocks, where each consists of two levels of Transformer encoders. The lower-level (or spectral) encoder handles the spectral features and enables the model to pay attention to harmonic components of each frame. Since downbeats indicate bar boundaries and are often accompanied by harmonic changes, this step may help downbeat modeling. The upper-level (or temporal) encoder aggregates useful local spectral information to pay attention to beat/downbeat positions. We also propose an architecture that combines SpecTNT with a state-of-the-art model, Temporal Convolutional Networks (TCN), to further improve the performance. Extensive experiments demonstrate that our approach can significantly outperform TCN in downbeat tracking while maintaining comparable results in beat tracking.

Index Terms— Beat, Downbeat, Transformer, SpecTNT

1. INTRODUCTION

Beat is regarded as one of the fundamental rhythmic units perceived by humans, and a downbeat is conceived to occur at the first beat of a bar (measure), which usually indicates the start of a chord. Beat and downbeat tracking is concerned with developing algorithms to detect the beats and downbeats as pulse signals in music audio. This is a well-defined problem in music information retrieval (MIR) and has been a regular task in the MIREX challenge¹ for over a decade.

In recent years, researchers proposed to employ deep neural network (DNN)-based methods to model beats and downbeats, which are regularly repeating events in the audio sequence. Böck et al. [1] successfully used a recurrent neural network (RNN) with input mel spectrogram features to jointly model beats and downbeats. Such a system requires annotated data to train and then predict accordingly. This has set up a data-driven paradigm for the subsequent systems, where improvement can be expected by either involving more training data or developing a more advanced machine learning algorithm.

Compared to beats, predicting downbeats still remains a challenge, because it requires a more complicated awareness of harmonic, timbral, and structural context, which may be related to chords, instrumentation, and percussive arrangements [2]. To better model these aspects, one needs to explore more local spectral information, and at the same time allow the important local information to be exchanged among temporal positions far apart in the audio sequence, since a measure may require a longer audio context to be discovered.

Recently, the Transformer architecture [3, 4] has attracted much attention due to its remarkable performance in natural language processing. For song-level classification tasks in MIR, researchers treat it as a temporal encoder to aggregate the temporal features to represent a music audio sequence [5, 6]. As a replacement for RNNs, it also works well for chord recognition [7, 8]. However, these prior works did not attempt to fully exploit the Transformer to model the spectral information that interacts with the temporal encoder as a whole. On the other hand, the Transformer is data-hungry in nature. For many MIR tasks, annotations are very expensive, and the size of annotated data is typically small. Therefore, training an effective Transformer-based model without modification and data augmentation is likely infeasible.

In this paper, we propose to use SpecTNT (Spectral-Temporal Transformer in Transformer) [9], a variant of Transformer that models the spectrogram along both the time and frequency axes. To the best of our knowledge, this is the first successful attempt to apply a Transformer-based model to beat and downbeat tracking. The basic principle of SpecTNT stems from the interaction between two levels of Transformer encoders, namely the spectral Transformer and the temporal Transformer. The former is responsible for extracting the spectral features of each frame, whereas the latter is capable of exchanging the local information over the time axis. Thanks to the hierarchical design [9], SpecTNT permits a smaller number of parameters compared to the original Transformer [4], so it can maintain good generalization ability for MIR tasks with a smaller size of training data. To further improve the performance, we also propose a novel structure that combines SpecTNT with a state-of-the-art (SOTA) model, temporal convolutional networks (TCN). We evaluate our proposed systems on various public datasets. Our results demonstrate that SpecTNT can achieve SOTA performance in downbeat tracking on most of the datasets while maintaining comparable performance in beat tracking. The combined structure of SpecTNT with TCN can further boost the performance.

∗ The author conducted this work as an intern at ByteDance.
¹ https://www.music-ir.org/mirex/wiki/MIREX_HOME
2. RELATED WORK

Early approaches in beat tracking mostly focused on incorporating heuristic information into the systems. For example, Dixon et al. [10] proposed to estimate the tempo and beats by detecting salient rhythmic events at various metrical levels. Klapuri et al. [11] proposed a probabilistic model to process the music accents and estimate beat patterns. On the other hand, musical events such as chord changes and drum patterns were explored to facilitate the analysis by [13] to estimate tempo and beat sequentially, where the process [...] effectively model the long sequence of beat events from audio. Multi-task approaches were also introduced to include more types of annotations in the learning system, such as tempo and downbeat [16, 17, 18, 19], owing to their high relevance to beats, and each task was improved by training multiple tasks altogether. These endeavors have improved beat tracking significantly, and the evaluation scores can easily achieve over 90% on many benchmarks.

As noted in the previous section, downbeat tracking still remains a challenge. Several prior works tackled this problem by using musical features such as time signature [20], onset positions [21], chords [22], and spectral difference in beats [23]. With machine learning approaches, we believe it can be improved by introducing a novel model architecture with more training data.

Fig. 1. Pipeline for the beat and downbeat tracking system (input audio, feature extraction, DNN model, beat/downbeat/non-beat activations, DBN).

Fig. 2. Model architecture overview: (a) SpecTNT, (b) TCN, (c) SpecTNT-TCN.

[...] harmonic and timbral components, which may better represent a chord or instrumentation. For the temporal encoder, we use 256 feature maps and 8 attention heads. It enables the important local spectral information to be exchanged via FCTs, paying attention to the beat/downbeat positions. More details can be found in [9]. Finally, we stack 5 SpecTNT blocks, followed by a linear layer to output three activation functions for beat, downbeat, and non-beat.
4.2. Experiment Setting

We use nine public datasets to evaluate the proposed methods. Since the beat/downbeat tracking datasets are commonly small in size, we tried to collect as many as possible and combined them to conduct [...]

The beat and downbeat tracking results are presented in Table 1 and Table 2, respectively. We include the results from prior literature for reference, although they are not necessarily comparable to ours, as the training data and experimental settings are different. The performance scores of Harmonix Set for Böck et al. [18] were evaluated on the predictions made by the madmom package [14], which uses the method proposed in [18]. Harmonix Set was not included in the training set of madmom.
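As context for these madmom-based reference numbers (and for the DBN post-processing stage shown in Fig. 1), the sketch below illustrates how beat and downbeat times are typically decoded from activation curves with the madmom package [14]. It assumes madmom's current processor API; the file name is a placeholder, the RNN activation front-end is madmom's own (any model's activations could take its place), and beats_per_bar=[3, 4] is a common configuration rather than the exact one used in this paper.

```python
# Minimal sketch: decoding beat/downbeat times from activations with madmom
# [14]. RNNDownBeatProcessor yields a (num_frames, 2) array of beat and
# downbeat activations at 100 fps; the DBN processor converts it into
# (time, beat_position) pairs, where position 1 marks a downbeat.
from madmom.features.downbeats import (RNNDownBeatProcessor,
                                        DBNDownBeatTrackingProcessor)

act = RNNDownBeatProcessor()("example.wav")           # (frames, 2) activations
dbn = DBNDownBeatTrackingProcessor(beats_per_bar=[3, 4], fps=100)
beats = dbn(act)                                       # columns: time, position

beat_times = beats[:, 0]
downbeat_times = beats[beats[:, 1] == 1, 0]
print(len(beat_times), "beats,", len(downbeat_times), "downbeats")
```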
                     RWC-POP                Harmonix Set
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Böck et al. [18]     .943   -      -        .933†  .841†  .938†
TCN (baseline)       .947   .922   .952     .946   .895   .942
SpecTNT              .953   .925   .957     .947   .896   .943
SpecTNT-TCN          .950   .925   .958     .953   .939   .959

                     SMC                    Beatles
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Böck et al. [18]     .516   .406   .575     .918   -      -
Böck et al. [17]     .544   .443   .635     -      -      -
TCN (baseline)       .560   .474   .621     .933   .870   .933
SpecTNT              .602*  .515*  .661     .940   .898   .929
SpecTNT-TCN          .605*  .514*  .663     .943   .896   .938

                     Ballroom               Hainsworth
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Davies et al. [15]   .933   .881   .929     .874   .795   .930
Böck et al. [17]     .962   .947   .961     .902   .848   .930
TCN (baseline)       .940   .870   .957     .860   .849   .915
SpecTNT              .927   .856   .939     .866   .865   .914
SpecTNT-TCN          .962*  .939*  .967     .877   .862   .915

Table 1. Beat tracking results using an 8-fold cross validation, where * denotes statistical significance compared to TCN (baseline), and † denotes the predictions made by madmom [14].

                     RWC-POP                Harmonix Set
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Böck et al. [18]     .861   -      -        .804†  .747†  .873†
TCN (baseline)       .930   .928   .938     .873   .839   .908
SpecTNT              .941   .929   .957     .897*  .862   .924*
SpecTNT-TCN          .945   .939   .959     .908*  .872*  .928*

                     Ballroom               Beatles
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Fuentes et al. [38]  .830   -      -        .860   -      -
Böck et al. [17]     .916   .913   .960     .837   .742   .862
TCN (baseline)       .841   .788   .937     .843   .767   .851
SpecTNT              .884*  .835*  .937     .867   .810*  .860
SpecTNT-TCN          .937*  .927*  .968     .870   .812*  .865

                     Hainsworth
                     F1     CMLt   AMLt
Böck et al. [17]     .722   .696   .872
TCN (baseline)       .682   .683   .852
SpecTNT              .729*  .734*  .879
SpecTNT-TCN          .748*  .738*  .870

Table 2. Downbeat tracking results using an 8-fold cross validation, where * denotes statistical significance compared to TCN (baseline), and † denotes the predictions made by madmom [14].

                     Beat                   Downbeat
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Böck et al. [17]     .885   .813   .931     .672   .640   .832
Transformer          .853   .741   .887     .667   .617   .843
TCN (baseline)       .879   .802   .911     .702   .660   .859
SpecTNT              .883   .809   .906     .745*  .696*  .875*
SpecTNT-TCN          .887   .812   .920     .756*  .715*  .881*

Table 3. Beat and downbeat tracking results on (test-only) GTZAN, where * denotes statistical significance compared to TCN (baseline).
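The tables report the standard F1, CMLt, and AMLt beat-tracking scores. For reference, a minimal sketch of how such scores can be computed from reference and estimated beat times with the mir_eval package [37] follows; it assumes mir_eval's current beat module API, the toy beat arrays are made up, and this is not the authors' evaluation code.

```python
# Minimal sketch: computing F1, CMLt, and AMLt for one track with mir_eval
# [37]. Beat times are in seconds; trim_beats drops beats before 5 s, the
# usual convention in beat-tracking evaluation.
import numpy as np
import mir_eval

reference_beats = np.arange(0.5, 8.5, 0.5)        # toy ground-truth beats
estimated_beats = reference_beats + 0.01          # toy estimate, 10 ms late

ref = mir_eval.beat.trim_beats(reference_beats)
est = mir_eval.beat.trim_beats(estimated_beats)

f1 = mir_eval.beat.f_measure(ref, est)            # F-measure (±70 ms window)
cmlc, cmlt, amlc, amlt = mir_eval.beat.continuity(ref, est)
print(f"F1={f1:.3f}  CMLt={cmlt:.3f}  AMLt={amlt:.3f}")
```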
Comparing the results among SpecTNT, TCN, and SpecTNT-TCN in Table 1, we observe that SpecTNT is generally better than TCN, and that SpecTNT-TCN outperforms SpecTNT in most cases. This demonstrates that SpecTNT is a promising DNN architecture for this task. In particular, SpecTNT and SpecTNT-TCN perform significantly better on SMC, one of the most challenging datasets in beat tracking. Our qualitative inspections indicate that SpecTNT is superior in handling expressive timing. For instance, it performs better in cases such as “SMC 00111” (with a pause) and “SMC 00165” (with varying tempo). However, it may suffer from non-trivial percussive patterns that yield phase errors (e.g., “SMC 00213”). Other than that, the results also show that SpecTNT seems less successful on Ballroom and Hainsworth. This could be attributed to the sub-optimal issue mentioned in Section 3.4. Further investigations reveal that SpecTNT tends to over-fit the beat annotations of these two datasets early during training, and then the performance drops slightly in the later epochs. This is expected, because Hainsworth covers more diverse genres (e.g., classical, folk, and jazz), and Ballroom is compiled with old-school dance music. These two datasets provide relatively few examples of their specific genres, as compared to the much larger Harmonix Set, which offers substantially more pop-style examples in the training set. Since our model selection criterion considers the joint performance of beat and downbeat, a model that performs much better in downbeat than in beat is picked eventually (and it requires more epochs). Nevertheless, SpecTNT-TCN is more robust to this issue and shows improvements over SpecTNT and TCN.

From the downbeat tracking results in Table 2, similar observations can also be made. The performance difference between TCN and SpecTNT is more obvious. Such results are in line with our research motivation that SpecTNT can help capture harmonic and timbral information and improve downbeat tracking. The improvement varies across different datasets. Our proposed models perform better than the baseline by a wider margin for the non-pop-centric datasets, Ballroom and Hainsworth. In particular, the performance difference between SpecTNT (or SpecTNT-TCN) and TCN is statistically significant in terms of F1 and CMLt. As mentioned before, SpecTNT may sacrifice its beat tracking performance for downbeat, as the performance gain of downbeat is larger. Qualitatively, we also found SpecTNT can avoid quite a few phase errors as compared to TCN.

Lastly, we present the results on GTZAN in Table 3. In addition to the models presented above, we also include the results from a regular Transformer [6], which contains only the temporal encoder [4]. As it is non-hierarchical and without the spectral encoder, we treat this comparison as an ablation study. It is clear that the regular Transformer does not work well in this case, most likely because the training data is insufficient, whereas SpecTNT's design can handle this well by leveraging spectral encoders in a stacked architecture. Moreover, we once again see similar results to the previous experiments when comparing among SpecTNT, SpecTNT-TCN, and TCN.

5. CONCLUSION

In this paper, we present a novel Transformer-based model, SpecTNT, to model beats and downbeats from audio signals. The design of spectral and temporal encoders enables a promising DNN approach, showing SOTA results in downbeat tracking.

For future work, we will try to mitigate the sub-optimal issue (see Section 4.4) with better solutions, as we expect it is currently the main limitation preventing SpecTNT from further improving beat tracking. We also plan to study the possibility of including additional relevant MIR tasks (e.g., music structure [39], chords, and melody) in a unified multi-task learning framework and model them jointly with SpecTNT.
6. REFERENCES

[1] S. Böck and M. Schedl, “Enhanced beat tracking with context-aware neural networks,” in Proc. DAFx, 2011.
[2] S. Durand, J. P. Bello, B. David, and G. Richard, “Robust downbeat tracking using an ensemble of convolutional networks,” IEEE/ACM Trans. Audio, Speech, and Language Process., vol. 25, no. 1, 2016.
[3] A. Vaswani et al., “Attention is all you need,” in Proc. NIPS, 2017, pp. 5998–6008.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL, 2019.
[5] M. Won, S. Chun, and X. Serra, “Toward interpretable music tagging with self-attention,” arXiv preprint arXiv:1906.04972, 2019.
[6] M. Won, K. Choi, and X. Serra, “Semi-supervised music tagging transformer,” in Proc. ISMIR, 2021.
[7] T.-P. Chen and L. Su, “Harmony Transformer: Incorporating chord segmentation into harmony recognition,” in Proc. ISMIR, 2019, pp. 259–267.
[8] J. Park, K. Choi, S. Jeon, D. Kim, and J. Park, “A bi-directional transformer for musical chord recognition,” in Proc. ISMIR, 2019.
[9] W.-T. Lu, J.-C. Wang, M. Won, K. Choi, and X. Song, “SpecTNT: A time-frequency transformer for music audio,” in Proc. ISMIR, 2021.
[10] S. Dixon, “Automatic extraction of tempo and beat from expressive performances,” J. New Music Research, vol. 30, no. 1, 2001.
[11] A. P. Klapuri, A. J. Eronen, and J. T. Astola, “Analysis of the meter of acoustic musical signals,” IEEE Trans. Audio, Speech, and Language Process., vol. 14, no. 1, 2005.
[12] M. Goto, “An audio-based real-time beat tracking system for music with or without drum-sounds,” J. New Music Research, vol. 30, no. 2, 2001.
[13] M. E. Davies and M. D. Plumbley, “Context-dependent beat tracking of musical audio,” IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 3, 2007.
[14] S. Böck et al., “Madmom: A new Python audio and music signal processing library,” in Proc. ACM Multimedia, 2016, pp. 1174–1178.
[15] M. E. P. Davies and S. Böck, “Temporal convolutional networks for musical audio beat tracking,” in Proc. EUSIPCO, 2019.
[16] S. Böck, M. E. P. Davies, and P. Knees, “Multi-task learning of tempo and beat: Learning one to improve the other,” in Proc. ISMIR, 2019.
[17] S. Böck and M. E. P. Davies, “Deconstruct, analyse, reconstruct: How to improve tempo, beat, and downbeat estimation,” in Proc. ISMIR, 2020.
[18] S. Böck, F. Krebs, and G. Widmer, “Joint beat and downbeat tracking with recurrent neural networks,” in Proc. ISMIR, 2016.
[19] M. Fuentes, B. McFee, H. C. Crayencour, S. Essid, and J. P. Bello, “A music structure informed downbeat tracking system using skip-chain conditional random fields and deep learning,” in Proc. ICASSP, 2019.
[20] F. Krebs, S. Böck, and G. Widmer, “Rhythmic pattern modeling for beat and downbeat tracking in musical audio,” in Proc. ISMIR, 2013.
[21] T. Jehan, “Downbeat prediction by listening and learning,” in Proc. IEEE WASPAA, 2005.
[22] H. Papadopoulos and G. Peeters, “Joint estimation of chords and downbeats from an audio signal,” IEEE Trans. Audio, Speech, and Language Process., vol. 19, no. 1, 2010.
[23] M. E. Davies and M. D. Plumbley, “A spectral difference approach to extracting downbeats in musical audio,” in Proc. EUSIPCO, 2006.
[24] M. Won, S. Chun, O. Nieto, and X. Serra, “Data-driven harmonic filters for audio representation learning,” in Proc. ICASSP, 2020, pp. 536–540.
[25] S. Böck, F. Krebs, and G. Widmer, “A multi-model approach to beat tracking considering heterogeneous music styles,” in Proc. ISMIR, 2014.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. ECCV, 2016.
[27] M. E. P. Davies, S. Böck, and M. Fuentes, Tempo, Beat and Downbeat Estimation, https://tempobeatdownbeat.github.io/tutorial/intro.html, 2021.
[28] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
[29] M. E. Davies, N. Degara, and M. D. Plumbley, “Evaluation methods for musical audio beat tracking algorithms,” Queen Mary University of London, Centre for Digital Music, 2009.
[30] F. Gouyon, A computational approach to rhythm description: Audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing, Ph.D. dissertation, Universitat Pompeu Fabra, 2005.
[31] S. W. Hainsworth and M. Macleod, “Particle filtering applied to musical tempo tracking,” EURASIP J. Advances in Signal Process., vol. 2004, no. 15, 2004.
[32] A. Holzapfel, M. E. P. Davies, J. R. Zapata, J. L. Oliveira, and F. Gouyon, “Selective sampling for beat tracking evaluation,” IEEE Trans. Audio, Speech, and Language Process., vol. 20, no. 9, pp. 2539–2548, 2012.
[33] J. Hockman, M. E. P. Davies, and I. Fujinaga, “One in the jungle: Downbeat detection in hardcore, jungle, and drum and bass,” in Proc. ISMIR, 2012, pp. 169–174.
[34] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC music database: Popular, classical and jazz music databases,” in Proc. ISMIR, 2002, vol. 2.
[35] O. Nieto et al., “The Harmonix Set: Beats, downbeats, and functional segment annotations of western popular music,” in Proc. ISMIR, 2019, pp. 565–572.
[36] U. Marchand and G. Peeters, “Swing ratio estimation,” in Proc. DAFx, Trondheim, Norway, 2015.
[37] C. Raffel, B. McFee, E. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, “mir_eval: A transparent implementation of common MIR metrics,” in Proc. ISMIR, 2014.
[38] M. Fuentes, B. McFee, H. Crayencour, S. Essid, and J. P. Bello, “Analysis of common design choices in deep learning systems for downbeat tracking,” in Proc. ISMIR, 2018.
[39] J.-C. Wang, Y.-N. Hung, and B. L. J. Smith, “To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions,” in Proc. ICASSP, 2022.