
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10095907

SOFT DYNAMIC TIME WARPING FOR MULTI-PITCH ESTIMATION AND BEYOND

Michael Krause, Christof Weiß, Meinard Müller

International Audio Laboratories Erlangen

ABSTRACT

Many tasks in music information retrieval (MIR) involve weakly aligned data, where exact temporal correspondences are unknown. The connectionist temporal classification (CTC) loss is a standard technique to learn feature representations based on weakly aligned training data. However, CTC is limited to discrete-valued target sequences and can be difficult to extend to multi-label problems. In this article, we show how soft dynamic time warping (SoftDTW), a differentiable variant of classical DTW, can be used as an alternative to CTC. Using multi-pitch estimation as an example scenario, we show that SoftDTW yields results on par with a state-of-the-art multi-label extension of CTC. In addition to being more elegant in terms of its algorithmic formulation, SoftDTW naturally extends to real-valued target sequences.

Index Terms— dynamic time warping, music processing, music information retrieval, multi-pitch estimation, music transcription

Fig. 1: Illustration of SoftDTW for aligning a learned feature sequence f(X) and a target sequence Y, where one may consider (a) single-label, (b) multi-label, or (c) real-valued targets.
1. INTRODUCTION

Many applications in music information retrieval (MIR) require alignments between sequences of music data. Often, the sequences given are only weakly aligned. For example, in audio-to-score transcription, pairs of audio and score excerpts are easy to find but exact temporal correspondences between these pairs are hard to establish [1]. Furthermore, music data sequences may involve different levels of complexity. For instance, given a single-instrument monophonic music recording, monophonic pitch estimation [2] aims at finding a single pitch value per time step (see also Figure 1a). Other scenarios with discrete, single-label targets include lyrics transcription or lyrics alignment for songs with a single singer [3, 4]. More complex sequences appear in multi-pitch estimation (MPE), where multiple pitches may be active simultaneously (Figure 1b). Finally, some scenarios involve alignment between real-valued sequences (Figure 1c), e.g., audio–audio synchronization [5, 6] or multi-modal alignment problems such as synchronizing dance videos with music [7].

The connectionist temporal classification (CTC) [8] loss, a fully differentiable loss function initially developed for speech recognition, is commonly used for learning features from weakly aligned data when the targets are sequences over a finite alphabet of labels. Recently, CTC was extended to handle multi-label learning problems [9], where the main idea was to locally transform the multi-label into the single-label case. However, in addition to its complicated algorithmic formulation, this approach is unsuitable for target sequences that do not originate from a discrete vocabulary.

A common technique used in MIR for finding an optimal alignment between weakly aligned sequences is dynamic time warping (DTW) in combination with hand-crafted features [10]. Such a pipeline can provide good alignment results for tasks like audio–audio synchronization [6], but the standard DTW-based cost function is not fully differentiable, which prevents its use in an end-to-end deep learning context. To resolve this issue, Cuturi and Blondel [11] proposed a differentiable variant of DTW, called SoftDTW, that approximates the original DTW cost. In recent work, SoftDTW and related techniques have been successfully used in computer vision applications such as action alignment [12, 13]. To our knowledge, the only prior work applying SoftDTW in an MIR context is by Agrawal et al. [17].

Our contributions are as follows: We demonstrate the use of SoftDTW for MPE. In particular, we show that SoftDTW performs on par with a multi-label extension of CTC, while being conceptually simpler. Furthermore, we show that the SoftDTW approach naturally generalizes to real-valued target sequences, as illustrated in Figure 1, making it applicable for a wide range of alignment tasks.

The remainder of the paper is structured as follows: In Section 2, we review the current state of the art for multi-pitch estimation from weakly aligned data with CTC. In Section 3, we formalize SoftDTW for general sequences and, in Section 4, apply it for MPE. Section 5 demonstrates the potential of SoftDTW for learning with real-valued targets. Finally, Section 6 concludes the paper with an outlook towards future applications.

This work was supported by the German Research Foundation (DFG MU 2686/7-2). The authors are with the International Audio Laboratories Erlangen, a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS. The authors gratefully acknowledge the compute resources and support provided by the Erlangen Regional Computing Center (RRZE).

2. WEAKLY ALIGNED TRAINING FOR MPE

In recent years, automated music transcription has become a central topic in MIR research, with deep learning techniques achieving state-of-the-art results [14–16]. We here focus on MPE as a subproblem of automated music transcription, where the goal is to transform an input music recording X into a piano-roll representation Y of pitches played. In particular, multiple pitches may be active at the same time. Most learning-based approaches for MPE require strongly aligned data for training, i.e., pitches are annotated for each audio frame of the input recording. Since annotating data in such a frame-wise fashion is very time consuming, most MPE datasets have been generated (semi-)automatically, e.g., by using MIDI pianos or by applying score–audio synchronization techniques (which may introduce labeling errors). Techniques that allow learning from pairs of X and Y that are not temporally aligned are therefore highly desirable.

As discussed in the introduction, a common technique for dealing with weakly aligned learning problems is CTC [8]. Here, the target sequences Y consist of symbols from a discrete alphabet L, including a special blank symbol necessary for distinguishing repetitions of symbols. For each frame in the input sequence X, a neural network outputs a probability distribution over L. The CTC loss then corresponds to the likelihood of Y given these network outputs, taking into account all possible alignments between X and Y. Note that CTC is agnostic about the durations of symbols in Y, i.e., even if information about symbol durations is available, CTC is unable to exploit this for alignment. An efficient dynamic programming algorithm for computing the CTC loss exists (with time complexity O(|L|² · N), where N is the length of X), but it requires special care in handling the blank symbol [8].

A naive extension of CTC towards multi-label target sequences would introduce unique network outputs for all possible symbol combinations, which leads to a combinatorial explosion. Instead, the authors in [9] propose to locally reduce the multi-label to the single-label case by only considering those symbol combinations that occur within a single training batch (called multi-label CTC, i.e., MCTC). This defines a "batch-dependent alphabet," avoiding the combinatorial explosion. The technical details of this process are tricky, and special care needs to be taken for handling the blank symbol. In [1], this idea is adapted for MPE by considering pitches as symbols and multi-pitch annotations as combinations of symbols. This formulation allows the authors to train networks for MPE on pairs of X and Y that are only weakly aligned, e.g., where X is a music recording and Y is a MIDI representation derived from the corresponding score. In this paper, using MPE from [1] as an example application, we show how the technically intricate MCTC can be replaced by a conceptually more elegant SoftDTW approach. SoftDTW does not involve the need for a blank symbol, which may be well-motivated in text applications but can be unnatural in MIR problems such as MPE.

3. SOFT DYNAMIC TIME WARPING

The objective of DTW is to find an optimal temporal alignment between two sequences. SoftDTW [11] is a differentiable approximation of DTW that allows for propagating gradients through the alignment procedure, making SoftDTW applicable for deep learning. Like classical DTW, SoftDTW admits an efficient dynamic programming (DP) recursion for computing the optimal alignment cost. Furthermore, there also exists a DP algorithm for efficiently computing the gradient of that cost. In this section, we briefly summarize the problem statement and DP recursion of SoftDTW for general sequences. We then apply this to our music scenarios in later sections.

Consider two sequences X = (x1, x2, ..., xN) and Y = (y1, y2, ..., yM) of lengths N, M ∈ N with elements coming from some feature spaces F1, F2 (i.e., xn ∈ F1, ym ∈ F2 for all n ∈ [1 : N], m ∈ [1 : M]). Given some differentiable cost function c : F1 × F2 → R defined on these feature spaces, we can construct a matrix C ∈ R^(N×M) of local costs where each entry

    C(n, m) = c(xn, ym)

contains the cost of locally aligning xn with ym. To determine an optimal global alignment¹ between the sequences X and Y, one computes an accumulated cost matrix Dγ ∈ R^(N×M) using the recursion

    Dγ(1, 1) = C(1, 1),
    Dγ(1, m) = Σ_{k=1}^{m} C(1, k), for m ∈ [1 : M],
    Dγ(n, 1) = Σ_{k=1}^{n} C(k, 1), for n ∈ [1 : N],
    Dγ(n, m) = C(n, m) + µγ({Dγ(n−1, m−1), Dγ(n−1, m), Dγ(n, m−1)}),

for n ∈ [2 : N], m ∈ [2 : M]. Here, µγ refers to a differentiable approximation of the minimum function given by

    µγ(S) = −γ log Σ_{s∈S} exp(−s/γ),

where S is some finite set of real numbers and γ ∈ R>0 is a temperature parameter that determines the "softness" of the approximation. One can show that µγ is a lower bound of the minimum function [12] and converges towards the true minimum for γ → 0. As a consequence, Dγ becomes the accumulated cost matrix from classical DTW for γ → 0. Thus, SoftDTW becomes DTW in the limit case.

After evaluating the SoftDTW recursion, the entry DTWγ(C) = Dγ(N, M) contains the approximate minimal cost of aligning the sequences X and Y, given the local costs C. A similar recursion exists for computing the gradient of DTWγ(C) with regard to any matrix coefficient C(n, m) for n ∈ [1 : N] and m ∈ [1 : M] [11, Algorithm 2]. The time and space complexity of the SoftDTW recursion, as well as of the gradient computation, is O(N · M), which is sufficiently fast for use in deep learning.

Note that SoftDTW requires no prior knowledge of the alignment between X and Y, which enables the use of DTWγ(C) as a loss function for learning problems with weakly aligned data. Furthermore, X and Y can come from arbitrary feature spaces, as long as an appropriate cost function c can be defined.

¹ Subject to some constraints, namely, the first and last elements of both sequences are aligned to each other (boundary constraint), no element is skipped (step-size constraint), and the alignment is monotonic (monotonicity constraint).

4. APPLICATION TO MULTI-PITCH ESTIMATION

We now apply SoftDTW to multi-pitch estimation. For a given piece of music, the sequence X corresponds to some representation of an input recording, while Y corresponds to a multi-hot encoding of pitches played. Note that Y does not need to be temporally aligned with X and could arise, e.g., from a score representation of the musical piece. An element ym of the sequence Y is encoded as a vector ym ∈ {0, 1}^72 and the entries of ym correspond to the 72 pitches from C1 to B6. In our experiments, rather than directly aligning Y with some fixed representation X, we use a neural network f that takes X as input and outputs a feature vector per frame in X. Thus,

we obtain a sequence f(X) = (z1, ..., zN) with the same length N as X. We construct f such that zn ∈ R^72 for the elements zn of f(X). Thus, both sequences Y and f(X) contain elements from the feature space F1 = F2 = R^72. We then align f(X) and Y, as illustrated in Figure 1.

To our knowledge, SoftDTW has not previously been used for MPE and is seldom explored in MIR. The authors in [4] used the classical, non-differentiable DTW recursion inside an attention mechanism for lyrics alignment, which led to training instabilities. The work by Agrawal et al. [17] constitutes the first use of SoftDTW for an MIR application. They successfully employ a variant of SoftDTW to train a system for score–audio synchronization. In their scenario, SoftDTW is applied to discrete-valued, one-dimensional, and strongly aligned sequences. In contrast, we employ SoftDTW for multi-dimensional sequences in weakly aligned settings.

Fig. 2: (a) Strongly aligned pitch annotations for an audio excerpt, (b) Annotations without note durations (as used by MCTC), (c) Annotations without note durations, stretched to excerpt length, (d) Score representation, not aligned to the audio excerpt, (e) Score representation, stretched to excerpt length.
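To make the alignment machinery concrete, the SoftDTW recursion from Section 3 and the squared-Euclidean local cost can be sketched in plain NumPy. This is a minimal, unoptimized reference for illustration only; the experiments use the fast GPU implementation from [19], and the gradient recursion of [11, Algorithm 2] is omitted here.

```python
import numpy as np

def softmin(values, gamma):
    # mu_gamma(S) = -gamma * log( sum_{s in S} exp(-s / gamma) ),
    # a differentiable lower bound of the minimum
    # (log-sum-exp shift for numerical stability).
    z = -np.asarray(values, dtype=float) / gamma
    zmax = z.max()
    return -gamma * (zmax + np.log(np.exp(z - zmax).sum()))

def cost_matrix(Z, Y):
    # Local costs C[n, m] = ||z_n - y_m||^2 (squared Euclidean distance).
    # Z: (N, d) learned features f(X); Y: (M, d) target sequence.
    diff = Z[:, None, :] - Y[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def soft_dtw(C, gamma):
    # Accumulated cost D^gamma via the SoftDTW recursion;
    # returns DTW_gamma(C) = D^gamma(N, M).
    N, M = C.shape
    D = np.empty((N, M))
    D[0, 0] = C[0, 0]
    for m in range(1, M):                 # first row: cumulative sum
        D[0, m] = D[0, m - 1] + C[0, m]
    for n in range(1, N):                 # first column: cumulative sum
        D[n, 0] = D[n - 1, 0] + C[n, 0]
    for n in range(1, N):
        for m in range(1, M):
            D[n, m] = C[n, m] + softmin(
                [D[n - 1, m - 1], D[n - 1, m], D[n, m - 1]], gamma)
    return D[-1, -1]
```

For γ → 0 the soft minimum approaches the true minimum, so the recursion reduces to classical DTW; during training, the resulting value would additionally be normalized by its value on the first batch (cf. Section 4.1) before backpropagation.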
4.1. Implementation Details and Evaluation Metrics

Since the focus of our work is on evaluating the efficacy of SoftDTW for MIR tasks, and in order to maintain comparability with the results presented in [1], we adopt the same training setup and network architecture. Thus, we use harmonic CQT (HCQT, [18]) excerpts of roughly ten seconds in length as input and pass them through a five-layer convolutional neural network to obtain a sequence of per-frame representations f(X) (see [1] for details on the network architecture and HCQT representation).

We train our networks by minimizing the soft alignment cost DTWγ(C).² In all experiments, we use the squared Euclidean distance for c and set γ = 10.0. We did not see improvements for alternative choices of c and obtained similar results for a wide range of values for γ ∈ [0.5, 20.0]. Furthermore, we use a fast GPU implementation of the SoftDTW recursion and gradient computation which was implemented in [19].

To compare network predictions with the strongly aligned pitch annotations of the test sets, we use common evaluation measures for MPE, including cosine similarity between predictions and annotations (CS), area under the precision-recall curve (also called average precision, AP), as well as F-measure and accuracy (Acc., introduced in [20]) at a threshold of 0.4 (which is a common choice in MPE systems, see also [21]).

² Note that we normalize DTWγ(C) by its value for the first training batch. Thus, the loss is exactly 1 for the first batch and its value range remains similar across training configurations, regardless of the sequence lengths N and M or other factors.

4.2. Comparison with MCTC

We begin by comparing our results with the main results reported in [1], which are obtained on the Schubert Winterreise Dataset (SWD) [22]. SWD provides strongly aligned annotations for all recordings. Due to this, one can consider a baseline trained on the aligned annotations with a per-frame cross-entropy loss (CE). The first line of Table 1 shows results for such an optimistic baseline (reprinted from [1]), which yields an F-measure of 0.70 and AP = 0.764. To train a network using MCTC instead, one must remove all information about note durations from the label sequence Y (see Figure 2b). The results obtained this way are just slightly lower at AP = 0.734, even though only weakly aligned labels are used. When performing the same experiment using SoftDTW (denoted by SoftDTWW1), we obtain much weaker results with an F-measure of 0.00 and AP = 0.297.³ In this experiment, the label sequence Y may be significantly shorter than the learned sequence f(X).⁴ We repeat the experiment by temporally stretching the sequence Y to match the number of frames in f(X) (illustrated in Figure 2c). When applying SoftDTW together with this trick (denoted by SoftDTWW2), results are again very similar to MCTC (AP = 0.737). Thus, SoftDTW may be used to replace MCTC in this scenario.

³ Note that the F-measure and accuracy scores can be improved to 0.32 and 0.20, respectively, by choosing a more suitable detection threshold. Still, these scores are notably worse compared to the results for MCTC.

⁴ A large discrepancy in sequence lengths is well known to cause problems for classical DTW. Further investigation is needed to understand how this affects the training process with SoftDTW.

Scenario     F-measure   CS      AP      Acc.
CE [1]       0.70        0.759   0.764   0.546
MCTC [1]     0.69        0.744   0.734   0.532
SoftDTWW1    0.00        0.465   0.297   0.002
SoftDTWW2    0.69        0.736   0.737   0.529

Table 1: Results for multi-pitch estimation on the Schubert Winterreise Dataset for SoftDTW compared with MCTC.

4.3. Incorporating Note Durations

In contrast to MCTC, SoftDTW is able to incorporate (approximate) note durations during training. SWD, for example, contains non-aligned score representations of the pieces performed. We now use these score representations as target sequences Y (denoted by SoftDTWW3, see Figure 2d for an illustration). Table 2 shows evaluation results, which are slightly improved compared to training without note durations (F-measure of 0.71 compared to 0.69 and CS = 0.756 compared to 0.736 for SoftDTWW2). Here, there is only a moderate difference between the lengths of excerpt and label sequence, and stretching the label sequence to the length of the input yields nearly identical results (denoted by SoftDTWW4, see Figure 2e). Finally, we may also use SoftDTW with strongly aligned label sequences (denoted by SoftDTWS). In this very optimistic scenario, no alignment is necessary, but SoftDTW may compensate for inaccuracies introduced by the dataset annotation procedures. Indeed, this scenario yields the best results (F-measure of 0.72 and AP = 0.769), even slightly improving upon the cross-entropy baseline in Table 1.

4.4. Cross-Dataset Experiment

We also perform a cross-dataset experiment (again following the setup in [1]), where we train on the popular MAESTRO [23] and MusicNet [21] datasets. Both contain strongly aligned pitch annotations for the training recordings, but they do not provide non-aligned score representations of the pieces, so SoftDTWW3 and

SoftDTWW4 are not applicable here. We then evaluate on the four smaller datasets SWD, Bach10 [24], TRIOS [25] and Phenicx Anechoic [26]. Note that the latter three datasets each contain less than ten minutes of audio. This is a difficult scenario since some styles and instruments in the test datasets are not present during training. For example, Phenicx Anechoic contains orchestral instruments, while MAESTRO and MusicNet contain piano and chamber music.

Scenario     F-measure   CS      AP      Acc.
SoftDTWW3    0.71        0.756   0.755   0.552
SoftDTWW4    0.71        0.757   0.750   0.555
SoftDTWS     0.72        0.761   0.769   0.563

Table 2: Results on the Schubert Winterreise Dataset for incorporating note durations with SoftDTW.

The results of this experiment are given in Table 3. Here, MCTC and a cross-entropy baseline perform roughly on par. SoftDTW yields slightly lower results, especially on Phenicx (AP = 0.788 compared to 0.833 for MCTC). Given that this evaluation scenario is harder and the training datasets are larger, we also repeat this experiment with a larger network architecture (increasing the number of channels for all convolutional layers in the network). The resulting architecture has roughly 600 000 parameters, compared to 50 000 parameters in the default architecture. Results are shown in the lower half of Table 3. Average precision scores improve consistently across all methods and datasets, e.g., AP = 0.896 for SoftDTW on Bach10 compared to 0.835 using the smaller architecture. In particular, SoftDTW now outperforms MCTC on all test datasets except for Phenicx, where the performance gap is now much smaller (AP = 0.838 compared to 0.850 for MCTC).

                             AP
Scenario       SWD     Bach10   TRIOS   Phenicx
Default network architecture
CE [1]         0.684   0.864    0.825   0.829
MCTC [1]       0.666   0.861    0.824   0.833
SoftDTWW2      0.665   0.835    0.812   0.788
Larger network architecture
CE [1]         0.701   0.886    0.863   0.846
MCTC [1]       0.677   0.871    0.849   0.850
SoftDTWW2      0.682   0.896    0.864   0.838

Table 3: Results for multi-pitch estimation in a cross-dataset experiment. Here, MAESTRO and MusicNet have been used for training while four different smaller datasets are used for testing.

All in all, we conclude that the results for MCTC and SoftDTW are roughly comparable, even in a challenging cross-dataset evaluation. Thus, MCTC may be replaced with SoftDTW without sacrificing alignment quality. In addition, SoftDTW can generalize to other kinds of target sequences, as discussed in the next section.

5. EXTENSION TO REAL-VALUED TARGETS

As explained in Section 3, the two sequences X and Y that are used as input to SoftDTW may come from arbitrary feature spaces. In order to illustrate the potential of using SoftDTW for learning from arbitrary sequences, we now perform two experiments with real-valued targets, i.e., ym ∈ R^72 for the elements ym of Y. Note that MCTC is unable to handle such a setting.

5.1. Pitch Estimation with Overtone Model

First, we consider a straightforward extension of MPE, where we transform the binary, multi-hot target vectors of MPE to real-valued vectors by adding energy according to a simple overtone model, see Figure 1c. Here, we consider 10 overtones for each active pitch, with amplitude (1/3)^n for the n-th overtone. As a baseline utilizing strongly aligned labels, we compare with a model trained using an ℓ2 regression loss at each frame (similar to the cross-entropy baseline in Section 4). To evaluate, we use the cosine similarity CS between network outputs and annotations. Note that other MPE evaluation metrics are not applicable for real-valued vectors.

When performing this experiment on the SWD dataset, we obtain CS = 0.794 for per-frame training with strongly aligned labels, which is higher than for MPE on SWD (cf. Table 1). Training without strongly aligned labels using SoftDTWW2 yields only slightly lower cosine similarities at 0.770. This illustrates that SoftDTW also works for settings with real-valued target sequences.

5.2. Cross-Version Training

Second, as a scenario with more realistic target sequences, we choose Y to be the CQT representation of another version (i.e., a different performance) of the piece played in X. In this case, the two sequences f(X) and Y will not correspond temporally, but SoftDTW can be used to find an appropriate alignment during training. We perform this experiment using SWD, which provides multiple versions of the same musical pieces. In particular, we choose one version (OL06) as the target version and train our network using SoftDTW to align input excerpts from other versions to excerpts from OL06. Finally, we pass versions unseen during training through the trained network and evaluate against excerpts from OL06 using cosine similarity. As a learning-free baseline, we also compute CS between the original CQT representations of the test recordings and the OL06 representations. To compute the cosine similarities during testing, we use the ground truth alignments between OL06 and all other versions provided by the dataset, but we do not need ground truth alignments during training.

Directly comparing the CQT representations of input version and target yields an average cosine similarity of 0.576. Training (using SoftDTWW3) yields much higher results at CS = 0.720. Thus, the network trained using SoftDTW is able to produce real-valued outputs that are similar to the target version.

6. CONCLUSION

In this paper, we have considered SoftDTW as a tool for dealing with weakly aligned learning problems in MIR, in particular, multi-pitch estimation. We showed that a network trained with SoftDTW performs on par with the same network trained using a state-of-the-art multi-label CTC loss. We further demonstrated that SoftDTW can be used to learn features when the target sequences have real-valued entries, something that is not possible with CTC.

In future work, SoftDTW may be applied to more diverse MIR tasks, such as lyrics alignment, audio–audio synchronization, or cross-modal learning from unaligned video–audio pairs. Furthermore, one may explore the possibility of combining both strongly aligned and non-aligned data within the same training. All these options are supported by the same algorithmic framework.
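As an illustration of the target construction used in Sections 4 and 5.1, the following sketch converts a set of active MIDI pitches into the 72-dimensional multi-hot vector (pitches C1 to B6) and into a real-valued overtone target with amplitude (1/3)^n for the n-th overtone. The mapping of C1 to MIDI note 24, the full-amplitude fundamental, and the rounding of each overtone to the nearest semitone bin are simplifying assumptions of this sketch; the paper does not specify these details.

```python
import math

LOW_MIDI = 24   # C1, assuming the convention C4 = MIDI 60
NUM_BINS = 72   # pitches C1 .. B6

def multi_hot(midi_pitches):
    # Binary multi-hot target vector for MPE (Figure 1b).
    v = [0.0] * NUM_BINS
    for p in midi_pitches:
        if LOW_MIDI <= p < LOW_MIDI + NUM_BINS:
            v[p - LOW_MIDI] = 1.0
    return v

def overtone_target(midi_pitches, num_overtones=10):
    # Real-valued target (Figure 1c): fundamental with amplitude 1.0
    # plus energy (1/3)**n for the n-th overtone. The (n+1)-th partial
    # lies 12*log2(n+1) semitones above the fundamental; here it is
    # rounded to the nearest semitone bin (a simplifying assumption).
    v = [0.0] * NUM_BINS
    for p in midi_pitches:
        for n in range(0, num_overtones + 1):
            q = p + round(12 * math.log2(n + 1))
            if LOW_MIDI <= q < LOW_MIDI + NUM_BINS:
                v[q - LOW_MIDI] += (1.0 / 3.0) ** n
    return v
```

Stacking such vectors over the frames of a score or annotation yields the target sequence Y that is aligned against f(X) with SoftDTW.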

7. REFERENCES

[1] Christof Weiß and Geoffroy Peeters, "Learning multi-pitch estimation from weakly aligned score-audio pairs using a multi-label CTC loss," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2021, pp. 121–125.

[2] Juan J. Bosch, Rachel M. Bittner, Justin Salamon, and Emilia Gómez, "A comparison of melody extraction methods based on source-filter modelling," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), New York City, New York, USA, 2016, pp. 571–577.

[3] Ye Wang, Min-Yen Kan, Tin Lay Nwe, Arun Shenoy, and Jun Yin, "LyricAlly: Automatic synchronization of acoustic musical signals and textual lyrics," in Proc. ACM Int. Conf. Multimedia, New York, NY, USA, 2004, pp. 212–219.

[4] Kilian Schulze-Forster, Clement S. J. Doire, Gaël Richard, and Roland Badeau, "Phoneme level lyrics alignment and text-informed singing voice separation," IEEE/ACM Trans. on Audio, Speech and Language Processing, vol. 29, pp. 2382–2395, 2021.

[5] Simon Dixon and Gerhard Widmer, "MATCH: A music alignment tool chest," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), London, UK, 2005, pp. 492–497.

[6] Sebastian Ewert, Meinard Müller, and Peter Grosche, "High resolution audio synchronization using chroma onset features," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, Apr. 2009, pp. 1869–1872.

[7] Shuhei Tsuchida, Satoru Fukayama, Masahiro Hamasaki, and Masataka Goto, "AIST dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information processing," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), Delft, The Netherlands, 2019, pp. 501–510.

[8] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. Int. Conf. Machine Learning (ICML), Pittsburgh, Pennsylvania, USA, 2006, pp. 369–376.

[9] Curtis Wigington, Brian L. Price, and Scott Cohen, "Multi-label connectionist temporal classification," in Proc. Int. Conf. Document Analysis and Recognition (ICDAR), Sydney, Australia, 2019, pp. 979–986.

[10] Meinard Müller, Fundamentals of Music Processing – Using Python and Jupyter Notebooks, Springer Verlag, 2nd edition, 2021.

[11] Marco Cuturi and Mathieu Blondel, "Soft-DTW: a differentiable loss function for time-series," in Proc. Int. Conf. Machine Learning (ICML), Sydney, NSW, Australia, 2017, pp. 894–903.

[12] Isma Hadji, Konstantinos G. Derpanis, and Allan D. Jepson, "Representation learning via global temporal alignment and cycle-consistency," in IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Virtual, 2021, pp. 11068–11077.

[13] Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and Juan Carlos Niebles, "D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation," in IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 3546–3555.

[14] Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse H. Engel, Sageev Oore, and Douglas Eck, "Onsets and frames: Dual-objective piano transcription," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), Paris, France, 2018, pp. 50–57.

[15] Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Böck, Andreas Arzt, and Gerhard Widmer, "On the potential of simple framewise approaches to piano transcription," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), New York City, New York, USA, 2016, pp. 475–481.

[16] Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, and Dorien Herremans, "Revisiting the onsets and frames model with additive attention," in Proc. Int. Joint Conf. Neural Networks (IJCNN), Shenzhen, China, 2021.

[17] Ruchit Agrawal, Daniel Wolff, and Simon Dixon, "A convolutional-attentional neural framework for structure-aware performance-score synchronization," IEEE Signal Processing Letters, vol. 29, pp. 344–348, 2021.

[18] Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan P. Bello, "Deep salience representations for F0 tracking in polyphonic music," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), Suzhou, China, 2017, pp. 63–70.

[19] Mehran Maghoumi, Eugene Matthew Taranta, and Joseph LaViola, "DeepNAG: Deep non-adversarial gesture generation," in Proc. Int. Conf. Intelligent User Interfaces (IUI), College Station, Texas, USA, 2021, pp. 213–223.

[20] Graham E. Poliner and Daniel P. W. Ellis, "A discriminative model for polyphonic piano transcription," EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, 2007.

[21] John Thickstun, Zaïd Harchaoui, and Sham M. Kakade, "Learning features of music from scratch," in Proc. Int. Conf. Learning Representations (ICLR), Toulon, France, 2017.

[22] Christof Weiß, Frank Zalkow, Vlora Arifi-Müller, Meinard Müller, Hendrik Vincent Koops, Anja Volk, and Harald Grohganz, "Schubert Winterreise dataset: A multimodal scenario for music analysis," ACM Journal on Computing and Cultural Heritage (JOCCH), vol. 14, no. 2, pp. 25:1–18, 2021.

[23] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse H. Engel, and Douglas Eck, "Enabling factorized piano music modeling and generation with the MAESTRO dataset," in Proc. Int. Conf. Learning Representations (ICLR), New Orleans, Louisiana, USA, 2019.

[24] Zhiyao Duan, Bryan Pardo, and Changshui Zhang, "Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions," IEEE Trans. on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2121–2133, 2010.

[25] Joachim Fritsch and Mark D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 888–891.

[26] Marius Miron, Julio J. Carabias-Orti, Juan J. Bosch, Emilia Gómez, and Jordi Janer, "Score-informed source separation for multichannel orchestral recordings," Journal of Electrical and Computer Engineering, vol. 2016, pp. 8363507:1–8363507:19, 2016.
