Modeling Beats and Downbeats With A Time-Frequency Transformer
Yun-Ning Hung2,∗, Ju-Chiang Wang1, Xuchen Song1, Wei-Tsung Lu1, and Minz Won1
1 ByteDance, Mountain View, CA, USA
2 Center for Music Technology, Georgia Institute of Technology, Atlanta, GA, USA
[email protected], {ju-chiang.wang, xuchen.song, weitsung.lu, minzwon}@bytedance.com
ABSTRACT

Transformer is a successful deep neural network (DNN) architecture that has shown its versatility not only in natural language processing but also in music information retrieval (MIR). In this paper, we present a novel Transformer-based approach to tackle beat and downbeat tracking. This approach employs SpecTNT (Spectral-Temporal Transformer in Transformer), a variant of Transformer that models both the spectral and temporal dimensions of a time-frequency input of music audio. A SpecTNT model uses a stack of blocks, where each consists of two levels of Transformer encoders. The lower-level (or spectral) encoder handles the spectral features and enables the model to pay attention to harmonic components of each frame. Since downbeats indicate bar boundaries and are often accompanied by harmonic changes, this step may help downbeat modeling. The upper-level (or temporal) encoder aggregates useful local spectral information to pay attention to beat/downbeat positions. We also propose an architecture that combines SpecTNT with a state-of-the-art model, Temporal Convolutional Networks (TCN), to further improve the performance. Extensive experiments demonstrate that our approach can significantly outperform TCN in downbeat tracking while maintaining comparable results in beat tracking.

Index Terms— Beat, Downbeat, Transformer, SpecTNT

1. INTRODUCTION

Beat is regarded as one of the fundamental rhythmic units perceived by humans, and a downbeat is conceived to occur at the first beat of a bar (measure), which usually indicates the start of a chord. Beat and downbeat tracking is concerned with developing algorithms to detect the beats and downbeats as pulse signals in music audio. This is a well-defined problem in music information retrieval (MIR) and has been a regular task in the MIREX challenge¹ for over a decade.

In recent years, researchers proposed to employ deep neural network (DNN)-based methods to model beats and downbeats, which are regularly repeating events in the audio sequence. Böck et al. [1] successfully used a recurrent neural network (RNN) with input mel spectrogram features to jointly model beats and downbeats. Such a system requires annotated data to train and then predict accordingly. This has set up a data-driven paradigm for the subsequent systems, where improvement can be expected by either involving more training data or developing a more advanced machine learning algorithm.

Compared to beats, predicting downbeats still remains a challenge, because it requires a more complicated awareness of harmonic, timbral, and structural context, which may be related to chords, instrumentation, and percussive arrangements [2]. To better model these aspects, one needs to explore more local spectral information, and at the same time allow the important local information to be exchanged among temporal positions far apart in the audio sequence, since a measure may require a longer audio context to be discovered.

Recently, the Transformer architecture [3, 4] has attracted much attention due to its remarkable performance in natural language processing. For song-level classification tasks in MIR, researchers treat it as a temporal encoder to aggregate the temporal features to represent a music audio sequence [5, 6]. As a replacement for RNNs, it also works well for chord recognition [7, 8]. However, these prior works did not attempt to fully exploit the Transformer to model the spectral information that interacts with the temporal encoder as a whole. On the other hand, the Transformer is data-hungry in nature. For many MIR tasks, annotations are very expensive, and the size of annotated data is typically small. Therefore, training an effective Transformer-based model without modification and data augmentation is likely infeasible.

In this paper, we propose to use SpecTNT (Spectral-Temporal Transformer in Transformer) [9], a variant of Transformer that models the spectrogram along both the time and frequency axes. To the best of our knowledge, this is the first successful attempt to apply a Transformer-based model to beat and downbeat tracking. The basic principle of SpecTNT stems from the interaction between two levels of Transformer encoders, namely the spectral Transformer and the temporal Transformer. The former is responsible for extracting the spectral features of each frame, whereas the latter is capable of exchanging the local information over the time axis. Thanks to the hierarchical design [9], SpecTNT permits a smaller number of parameters compared to the original Transformer [4], so it can maintain good generalization ability for MIR tasks with a smaller size of training data. To further improve the performance, we also propose a novel structure that combines SpecTNT with a state-of-the-art (SOTA) model, temporal convolutional networks (TCN). We evaluate our proposed systems on various public datasets. Our results demonstrate that SpecTNT can achieve SOTA performance in downbeat tracking on most of the datasets while maintaining comparable performance in beat tracking. The combined structure of SpecTNT with TCN can further boost the performance.

∗ The author conducted this work as an intern at ByteDance.
¹ https://www.music-ir.org/mirex/wiki/MIREX_HOME
2. RELATED WORK

Early approaches in beat tracking mostly focused on incorporating heuristic information into the systems. For example, Dixon et al. [10] proposed to estimate the tempo and beats by detecting salient rhythmic events at various metrical levels. Klapuri et al. [11] proposed a probabilistic model to process the music accents and estimate beat patterns. On the other hand, musical events such as chord changes and drum patterns were explored to facilitate the analysis by [13] to estimate tempo and beat sequentially, where the process [...] effectively model the long sequence of beat events from audio. Multi-task approaches were also introduced to include more types of annotations in the learning system, such as tempo and downbeat [16, 17, 18, 19], owing to their high relevance to beats, and each task was improved by training multiple tasks altogether. These endeavors have improved beat tracking significantly, and the evaluation scores can easily achieve over 90% on many benchmarks.

As noted in the previous section, downbeat tracking still remains a challenge. Several prior works tackled this problem by using musical features such as time signature [20], onset positions [21], chords [22], and spectral difference in beats [23]. With machine learning approaches, we believe it can be improved by introducing a novel model architecture with more training data.

Fig. 1. Pipeline for the beat and downbeat tracking system (input audio, feature extraction, DNN model, beat/downbeat/non-beat activations, DBN).

Fig. 2. Model architecture overview: (a) SpecTNT, (b) TCN, (c) SpecTNT-TCN.

[...] harmonic and timbral components, which may better represent a chord or instrumentation. For the temporal encoder, we use 256 feature maps and 8 attention heads. It enables the important local spectral information to be exchanged via FCTs, paying attention to the beat/downbeat positions. More details can be found in [9]. Finally, we stack 5 SpecTNT blocks, followed by a linear layer to output three activation functions for beat, downbeat, and non-beat.
4.2. Experiment Setting

We use nine public datasets to evaluate the proposed methods. Since the beat/downbeat tracking datasets are commonly small in size, we tried to collect as many as possible and combined them to conduct [...]

The beat and downbeat tracking results are presented in Table 1 and Table 2, respectively. We include the results from prior literature for reference, although they are not necessarily comparable to ours, as the training data and experimental settings are different. The performance scores of Harmonix Set for Böck et al. [18] were evaluated on the predictions made by the madmom package [14], which uses the method proposed in [18]. Harmonix Set was not included in the training set of madmom.
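As context for these madmom-based reference numbers (and for the DBN post-processing stage shown in Fig. 1), the sketch below illustrates how beat and downbeat times are typically decoded from activation curves with the madmom package [14]. It assumes madmom's current processor API; the file name is a placeholder, the RNN activation front-end is madmom's own (any model's activations could take its place), and beats_per_bar=[3, 4] is a common configuration rather than the exact one used in this paper.

```python
# Minimal sketch: decoding beat/downbeat times from activations with madmom
# [14]. RNNDownBeatProcessor yields a (num_frames, 2) array of beat and
# downbeat activations at 100 fps; the DBN processor converts it into
# (time, beat_position) pairs, where position 1 marks a downbeat.
from madmom.features.downbeats import (RNNDownBeatProcessor,
                                        DBNDownBeatTrackingProcessor)

act = RNNDownBeatProcessor()("example.wav")           # (frames, 2) activations
dbn = DBNDownBeatTrackingProcessor(beats_per_bar=[3, 4], fps=100)
beats = dbn(act)                                       # columns: time, position

beat_times = beats[:, 0]
downbeat_times = beats[beats[:, 1] == 1, 0]
print(len(beat_times), "beats,", len(downbeat_times), "downbeats")
```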
                     RWC-POP                Harmonix Set
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Böck et al. [18]     .943   -      -        .933†  .841†  .938†
TCN (baseline)       .947   .922   .952     .946   .895   .942
SpecTNT              .953   .925   .957     .947   .896   .943
SpecTNT-TCN          .950   .925   .958     .953   .939   .959

                     SMC                    Beatles
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Böck et al. [18]     .516   .406   .575     .918   -      -
Böck et al. [17]     .544   .443   .635     -      -      -
TCN (baseline)       .560   .474   .621     .933   .870   .933
SpecTNT              .602*  .515*  .661     .940   .898   .929
SpecTNT-TCN          .605*  .514*  .663     .943   .896   .938

                     Ballroom               Hainsworth
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Davies et al. [15]   .933   .881   .929     .874   .795   .930
Böck et al. [17]     .962   .947   .961     .902   .848   .930
TCN (baseline)       .940   .870   .957     .860   .849   .915
SpecTNT              .927   .856   .939     .866   .865   .914
SpecTNT-TCN          .962*  .939*  .967     .877   .862   .915

Table 1. Beat tracking results using an 8-fold cross validation, where * denotes statistical significance compared to TCN (baseline), and † denotes the predictions made by madmom [14].

                     RWC-POP                Harmonix Set
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Böck et al. [18]     .861   -      -        .804†  .747†  .873†
TCN (baseline)       .930   .928   .938     .873   .839   .908
SpecTNT              .941   .929   .957     .897*  .862   .924*
SpecTNT-TCN          .945   .939   .959     .908*  .872*  .928*

                     Ballroom               Beatles
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Fuentes et al. [38]  .830   -      -        .860   -      -
Böck et al. [17]     .916   .913   .960     .837   .742   .862
TCN (baseline)       .841   .788   .937     .843   .767   .851
SpecTNT              .884*  .835*  .937     .867   .810*  .860
SpecTNT-TCN          .937*  .927*  .968     .870   .812*  .865

                     Hainsworth
                     F1     CMLt   AMLt
Böck et al. [17]     .722   .696   .872
TCN (baseline)       .682   .683   .852
SpecTNT              .729*  .734*  .879
SpecTNT-TCN          .748*  .738*  .870

Table 2. Downbeat tracking results using an 8-fold cross validation, where * denotes statistical significance compared to TCN (baseline), and † denotes the predictions made by madmom [14].

                     Beat                   Downbeat
                     F1     CMLt   AMLt     F1     CMLt   AMLt
Böck et al. [17]     .885   .813   .931     .672   .640   .832
Transformer          .853   .741   .887     .667   .617   .843
TCN (baseline)       .879   .802   .911     .702   .660   .859
SpecTNT              .883   .809   .906     .745*  .696*  .875*
SpecTNT-TCN          .887   .812   .920     .756*  .715*  .881*

Table 3. Beat and downbeat tracking results on (test-only) GTZAN, where * denotes statistical significance compared to TCN (baseline).
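The tables report the standard F1, CMLt, and AMLt beat-tracking scores. For reference, a minimal sketch of how such scores can be computed from reference and estimated beat times with the mir_eval package [37] follows; it assumes mir_eval's current beat module API, the toy beat arrays are made up, and this is not the authors' evaluation code.

```python
# Minimal sketch: computing F1, CMLt, and AMLt for one track with mir_eval
# [37]. Beat times are in seconds; trim_beats drops beats before 5 s, the
# usual convention in beat-tracking evaluation.
import numpy as np
import mir_eval

reference_beats = np.arange(0.5, 8.5, 0.5)        # toy ground-truth beats
estimated_beats = reference_beats + 0.01          # toy estimate, 10 ms late

ref = mir_eval.beat.trim_beats(reference_beats)
est = mir_eval.beat.trim_beats(estimated_beats)

f1 = mir_eval.beat.f_measure(ref, est)            # F-measure (±70 ms window)
cmlc, cmlt, amlc, amlt = mir_eval.beat.continuity(ref, est)
print(f"F1={f1:.3f}  CMLt={cmlt:.3f}  AMLt={amlt:.3f}")
```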
Comparing the results among SpecTNT, TCN, and SpecTNT-TCN in Table 1, we observe that SpecTNT is generally better than TCN, and that SpecTNT-TCN outperforms SpecTNT in most cases. This demonstrates that SpecTNT is a promising DNN architecture for this task. In particular, SpecTNT and SpecTNT-TCN perform significantly better on SMC, one of the most challenging datasets in beat tracking. Our qualitative inspections indicate that SpecTNT is superior in handling expressive timing. For instance, it performs better in cases such as “SMC 00111” (with a pause) and “SMC 00165” (with varying tempo). However, it may suffer from non-trivial percussive patterns that yield phase errors (e.g., “SMC 00213”). Other than that, the results also show that SpecTNT seems less successful on Ballroom and Hainsworth. This could be attributed to the sub-optimal issue mentioned in Section 3.4. Further investigations reveal that SpecTNT tends to over-fit the beat annotations of these two datasets early during training, and then the performance drops slightly in the later epochs. This is expected, because Hainsworth covers more diverse genres (e.g., classical, folk, and jazz), and Ballroom is compiled with old-school dance music. These two datasets provide relatively few examples of their specific genres, as compared to the much larger Harmonix Set, which offers substantially more pop-style examples in the training set. Since our model selection criterion considers the joint performance of beat and downbeat, a model that performs much better in downbeat than in beat is picked eventually (and it requires more epochs). Nevertheless, SpecTNT-TCN is more robust to this issue and shows improvements over SpecTNT and TCN.

From the downbeat tracking results in Table 2, similar observations can also be made. The performance difference between TCN and SpecTNT is more obvious. Such results are in line with our research motivation that SpecTNT can help capture harmonic and timbral information and improve downbeat tracking. The improvement varies across different datasets. Our proposed models perform better than the baseline by a wider margin for the non-pop-centric datasets, Ballroom and Hainsworth. In particular, the performance difference between SpecTNT (or SpecTNT-TCN) and TCN is statistically significant in terms of F1 and CMLt. As mentioned before, SpecTNT may sacrifice its beat tracking performance for downbeat, as the performance gain of downbeat is larger. Qualitatively, we also found SpecTNT can avoid quite a few phase errors as compared to TCN.

Lastly, we present the results on GTZAN in Table 3. In addition to the models presented above, we also include the results from a regular Transformer [6], which contains only the temporal encoder [4]. As it is non-hierarchical and without the spectral encoder, we treat this comparison as an ablation study. It is clear that the regular Transformer does not work well in this case, most likely because the training data is insufficient, whereas SpecTNT's design can handle this well by leveraging spectral encoders in a stacked architecture. Moreover, we once again see similar results to the previous experiments when comparing among SpecTNT, SpecTNT-TCN, and TCN.

5. CONCLUSION

In this paper, we present a novel Transformer-based model, SpecTNT, to model beats and downbeats from audio signals. The design of spectral and temporal encoders enables a promising DNN approach, showing SOTA results in downbeat tracking.

For future work, we will try to mitigate the sub-optimal issue (see Section 4.4) with better solutions, as we expect it is currently the main limitation preventing SpecTNT from further improving beat tracking. We also plan to study the possibility of including additional relevant MIR tasks (e.g., music structure [39], chords, and melody) in a unified multi-task learning framework and model them jointly with SpecTNT.
6. REFERENCES

[1] S. Böck and M. Schedl, “Enhanced beat tracking with context-aware neural networks,” in Proc. DAFx, 2011.
[2] S. Durand, J. P. Bello, B. David, and G. Richard, “Robust downbeat tracking using an ensemble of convolutional networks,” IEEE/ACM Trans. Audio, Speech, and Language Process., vol. 25, no. 1, 2016.
[3] A. Vaswani et al., “Attention is all you need,” in Proc. NIPS, 2017, pp. 5998–6008.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL, 2019.
[5] M. Won, S. Chun, and X. Serra, “Toward interpretable music tagging with self-attention,” arXiv preprint arXiv:1906.04972, 2019.
[6] M. Won, K. Choi, and X. Serra, “Semi-supervised music tagging transformer,” in Proc. ISMIR, 2021.
[7] T.-P. Chen and L. Su, “Harmony Transformer: Incorporating chord segmentation into harmony recognition,” in Proc. ISMIR, 2019, pp. 259–267.
[8] J. Park, K. Choi, S. Jeon, D. Kim, and J. Park, “A bi-directional transformer for musical chord recognition,” in Proc. ISMIR, 2019.
[9] W.-T. Lu, J.-C. Wang, M. Won, K. Choi, and X. Song, “SpecTNT: A time-frequency transformer for music audio,” in Proc. ISMIR, 2021.
[10] S. Dixon, “Automatic extraction of tempo and beat from expressive performances,” J. New Music Research, vol. 30, no. 1, 2001.
[11] A. P. Klapuri, A. J. Eronen, and J. T. Astola, “Analysis of the meter of acoustic musical signals,” IEEE Trans. Audio, Speech, and Language Process., vol. 14, no. 1, 2005.
[12] M. Goto, “An audio-based real-time beat tracking system for music with or without drum-sounds,” J. New Music Research, vol. 30, no. 2, 2001.
[13] M. E. Davies and M. D. Plumbley, “Context-dependent beat tracking of musical audio,” IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 3, 2007.
[14] S. Böck et al., “Madmom: A new Python audio and music signal processing library,” in Proc. ACM Multimedia, 2016, pp. 1174–1178.
[15] M. E. P. Davies and S. Böck, “Temporal convolutional networks for musical audio beat tracking,” in Proc. EUSIPCO, 2019.
[16] S. Böck, M. E. P. Davies, and P. Knees, “Multi-task learning of tempo and beat: Learning one to improve the other,” in Proc. ISMIR, 2019.
[17] S. Böck and M. E. P. Davies, “Deconstruct, analyse, reconstruct: How to improve tempo, beat, and downbeat estimation,” in Proc. ISMIR, 2020.
[18] S. Böck, F. Krebs, and G. Widmer, “Joint beat and downbeat tracking with recurrent neural networks,” in Proc. ISMIR, 2016.
[19] M. Fuentes, B. McFee, H. C. Crayencour, S. Essid, and J. P. Bello, “A music structure informed downbeat tracking system using skip-chain conditional random fields and deep learning,” in Proc. ICASSP, 2019.
[20] F. Krebs, S. Böck, and G. Widmer, “Rhythmic pattern modeling for beat and downbeat tracking in musical audio,” in Proc. ISMIR, 2013.
[21] T. Jehan, “Downbeat prediction by listening and learning,” in Proc. IEEE WASPAA, 2005.
[22] H. Papadopoulos and G. Peeters, “Joint estimation of chords and downbeats from an audio signal,” IEEE Trans. Audio, Speech, and Language Process., vol. 19, no. 1, 2010.
[23] M. E. Davies and M. D. Plumbley, “A spectral difference approach to extracting downbeats in musical audio,” in Proc. EUSIPCO, 2006.
[24] M. Won, S. Chun, O. Nieto, and X. Serra, “Data-driven harmonic filters for audio representation learning,” in Proc. ICASSP, 2020, pp. 536–540.
[25] S. Böck, F. Krebs, and G. Widmer, “A multi-model approach to beat tracking considering heterogeneous music styles,” in Proc. ISMIR, 2014.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. ECCV, 2016.
[27] M. E. P. Davies, S. Böck, and M. Fuentes, Tempo, Beat and Downbeat Estimation, https://tempobeatdownbeat.github.io/tutorial/intro.html, 2021.
[28] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
[29] M. E. Davies, N. Degara, and M. D. Plumbley, “Evaluation methods for musical audio beat tracking algorithms,” Queen Mary University of London, Centre for Digital Music, 2009.
[30] F. Gouyon, A computational approach to rhythm description: Audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing, Ph.D. dissertation, Universitat Pompeu Fabra, 2005.
[31] S. W. Hainsworth and M. Macleod, “Particle filtering applied to musical tempo tracking,” EURASIP J. Advances in Signal Process., vol. 2004, no. 15, 2004.
[32] A. Holzapfel, M. E. P. Davies, J. R. Zapata, J. L. Oliveira, and F. Gouyon, “Selective sampling for beat tracking evaluation,” IEEE Trans. Audio, Speech, and Language Process., vol. 20, no. 9, pp. 2539–2548, 2012.
[33] J. Hockman, M. E. P. Davies, and I. Fujinaga, “One in the jungle: Downbeat detection in hardcore, jungle, and drum and bass,” in Proc. ISMIR, 2012, pp. 169–174.
[34] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC music database: Popular, classical and jazz music databases,” in Proc. ISMIR, 2002, vol. 2.
[35] O. Nieto et al., “The Harmonix Set: Beats, downbeats, and functional segment annotations of western popular music,” in Proc. ISMIR, 2019, pp. 565–572.
[36] U. Marchand and G. Peeters, “Swing ratio estimation,” in Proc. DAFx, Trondheim, Norway, 2015.
[37] C. Raffel, B. McFee, E. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, “mir_eval: A transparent implementation of common MIR metrics,” in Proc. ISMIR, 2014.
[38] M. Fuentes, B. McFee, H. Crayencour, S. Essid, and J. P. Bello, “Analysis of common design choices in deep learning systems for downbeat tracking,” in Proc. ISMIR, 2018.
[39] J.-C. Wang, Y.-N. Hung, and B. L. J. Smith, “To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions,” in Proc. ICASSP, 2022.