2023 ICASSP Soft Dynamic Time Warping For Multi-Pitch Estimation and Beyond
ABSTRACT

Many tasks in music information retrieval (MIR) involve weakly aligned data, where exact temporal correspondences are unknown. The connectionist temporal classification (CTC) loss is a standard technique to learn feature representations based on weakly aligned training data. However, CTC is limited to discrete-valued target sequences and can be difficult to extend to multi-label problems. In this article, we show how soft dynamic time warping (SoftDTW), a differentiable variant of classical DTW, can be used as an alternative to CTC. Using multi-pitch estimation as an example scenario, we show that SoftDTW yields results on par with a state-of-the-art multi-label extension of CTC. In addition to being more elegant in terms of its algorithmic formulation, SoftDTW naturally extends to real-valued target sequences.

Index Terms— dynamic time warping, music processing, music information retrieval, multi-pitch estimation, music transcription

Fig. 1: Illustration of SoftDTW for aligning a learned feature sequence f(X) and a target sequence Y, where one may consider (a) single-label, (b) multi-label, or (c) real-valued targets.
1. INTRODUCTION
Many applications in music information retrieval (MIR) require alignments between sequences of music data. Often, the sequences given are only weakly aligned. For example, in audio-to-score transcription, pairs of audio and score excerpts are easy to find, but exact temporal correspondences between these pairs are hard to establish [1]. Furthermore, music data sequences may involve different levels of complexity. For instance, given a single-instrument monophonic music recording, monophonic pitch estimation [2] aims at finding a single pitch value per time step (see also Figure 1a). Other scenarios with discrete, single-label targets include lyrics transcription or lyrics alignment for songs with a single singer [3, 4]. More complex sequences appear in multi-pitch estimation (MPE), where multiple pitches may be active simultaneously (Figure 1b). Finally, some scenarios involve alignment between real-valued sequences (Figure 1c), e.g., audio-audio synchronization [5, 6] or multi-modal alignment problems such as synchronizing dance videos with music [7].

The connectionist temporal classification (CTC) [8] loss, a fully differentiable loss function initially developed for speech recognition, is commonly used for learning features from weakly aligned data when the targets are sequences over a finite alphabet of labels. Recently, CTC was extended to handle multi-label learning problems [9], where the main idea was to locally transform the multi-label into the single-label case. However, in addition to its complicated algorithmic formulation, this approach is unsuitable for target sequences that do not originate from a discrete vocabulary.

A common technique used in MIR for finding an optimal alignment between weakly aligned sequences is dynamic time warping (DTW) in combination with hand-crafted features [10]. Such a pipeline can provide good alignment results for tasks like audio-audio synchronization [6], but the standard DTW-based cost function is not fully differentiable, which prevents its use in an end-to-end deep learning context. To resolve this issue, Cuturi and Blondel [11] proposed a differentiable variant of DTW, called SoftDTW, that approximates the original DTW cost. In recent work, SoftDTW and related techniques have been successfully used in computer vision applications such as action alignment [12, 13]. To our knowledge, the only prior work applying SoftDTW in an MIR context is by Agrawal et al. [17].

Our contributions are as follows: We demonstrate the use of SoftDTW for MPE. In particular, we show that SoftDTW performs on par with a multi-label extension of CTC, while being conceptually simpler. Furthermore, we show that the SoftDTW approach naturally generalizes to real-valued target sequences, as illustrated in Figure 1, making it applicable for a wide range of alignment tasks.

The remainder of the paper is structured as follows: In Section 2, we review the current state of the art for multi-pitch estimation from weakly aligned data with CTC. In Section 3, we formalize SoftDTW for general sequences and, in Section 4, apply it for MPE. Section 5 demonstrates the potential of SoftDTW for learning with real-valued targets. Finally, Section 6 concludes the paper with an outlook towards future applications.

This work was supported by the German Research Foundation (DFG MU 2686/7-2). The authors are with the International Audio Laboratories Erlangen, a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS. The authors gratefully acknowledge the compute resources and support provided by the Erlangen Regional Computing Center (RRZE).

2. WEAKLY ALIGNED TRAINING FOR MPE

In recent years, automated music transcription has become a central topic in MIR research, with deep learning techniques achieving state-of-the-art results [14-16]. We here focus on MPE as a subproblem of automated music transcription, where the goal is to transform an input music recording X into a piano-roll representation Y of the pitches played. In particular, multiple pitches may be active at the same time. Most learning-based approaches for MPE require strongly aligned data for training, i.e., pitches are annotated for each audio frame of the input recording. Since annotating data in such a frame-wise fashion is very time-consuming, most MPE datasets have been generated (semi-)automatically, e.g., by using MIDI pianos or by applying score-audio synchronization techniques (which may introduce labeling errors). Techniques that allow learning from pairs of X and Y that are not temporally aligned are therefore highly desirable.

As discussed in the introduction, a common technique for dealing with weakly aligned learning problems is CTC [8]. Here, the target sequences Y consist of symbols from a discrete alphabet L, including a special blank symbol necessary for distinguishing repetitions of symbols. For each frame in the input sequence X, a neural network outputs a probability distribution over L. The CTC loss then corresponds to the likelihood of Y given these network outputs, taking into account all possible alignments between X and Y. Note that CTC is agnostic about the durations of symbols in Y, i.e., even if information about symbol durations is available, CTC is unable to exploit this for alignment. An efficient dynamic programming algorithm for computing the CTC loss exists (with time complexity O(|L|^2 · N), where N is the length of X), but it requires special care in handling the blank symbol [8].

A naive extension of CTC towards multi-label target sequences would introduce unique network outputs for all possible symbol combinations, which leads to a combinatorial explosion. Instead, the authors in [9] propose to locally reduce the multi-label to the single-label case by only considering those symbol combinations that occur within a single training batch (called multi-label CTC, i.e., MCTC). This defines a "batch-dependent alphabet," avoiding the combinatorial explosion. The technical details of this process are tricky, and special care needs to be taken for handling the blank symbol. In [1], this idea is adapted for MPE by considering pitches as symbols and multi-pitch annotations as combinations of symbols. This formulation allows them to train networks for MPE on pairs of X and Y that are only weakly aligned, e.g., where X is a music recording and Y is a MIDI representation derived from the corresponding score. In this paper, using MPE from [1] as an example application, we show how the technically intricate MCTC can be replaced by a conceptually more elegant SoftDTW approach. SoftDTW does not involve the need for a blank symbol, which may be well-motivated in text applications but can be unnatural in MIR problems such as MPE.

3. SOFT DYNAMIC TIME WARPING

The objective of DTW is to find an optimal temporal alignment between two sequences. SoftDTW [11] is a differentiable approximation of DTW that allows for propagating gradients through the alignment procedure, making SoftDTW applicable for deep learning. Like classical DTW, SoftDTW admits an efficient dynamic programming (DP) recursion for computing the optimal alignment cost. Furthermore, there also exists a DP algorithm for efficiently computing the gradient of that cost. In this section, we briefly summarize the problem statement and DP recursion of SoftDTW for general sequences. We then apply this to our music scenarios in later sections.

Consider two sequences X = (x_1, x_2, ..., x_N) and Y = (y_1, y_2, ..., y_M) of lengths N, M ∈ N with elements coming from some feature spaces F_1, F_2 (i.e., x_n ∈ F_1, y_m ∈ F_2 for all n ∈ [1:N], m ∈ [1:M]). Given some differentiable cost function c : F_1 × F_2 → R defined on these feature spaces, we can construct a matrix C ∈ R^(N×M) of local costs where each entry

  C(n, m) = c(x_n, y_m)

contains the cost of locally aligning x_n with y_m. To determine an optimal global alignment¹ between the sequences X and Y, one computes an accumulated cost matrix D^γ ∈ R^(N×M) using the recursion

  D^γ(1, 1) = C(1, 1),
  D^γ(1, m) = Σ_{k=1}^{m} C(1, k), for m ∈ [1:M],
  D^γ(n, 1) = Σ_{k=1}^{n} C(k, 1), for n ∈ [1:N],
  D^γ(n, m) = C(n, m) + μ_γ({D^γ(n−1, m−1), D^γ(n−1, m), D^γ(n, m−1)}),

for n ∈ [2:N], m ∈ [2:M]. Here, μ_γ refers to a differentiable approximation of the minimum function given by

  μ_γ(S) = −γ log Σ_{s∈S} exp(−s/γ),

where S is some finite set of real numbers and γ ∈ R_{>0} is a temperature parameter that determines the "softness" of the approximation. One can show that μ_γ is a lower bound of the minimum function [12] and converges towards the true minimum for γ → 0. As a consequence, D^γ becomes the accumulated cost matrix from classical DTW for γ → 0. Thus, SoftDTW becomes DTW in the limit case.

After evaluating the SoftDTW recursion, the entry DTW_γ(C) = D^γ(N, M) contains the approximate minimal cost of aligning the sequences X and Y, given the local costs C. A similar recursion exists for computing the gradient of DTW_γ(C) with regard to any matrix coefficient C(n, m) for n ∈ [1:N] and m ∈ [1:M] [11, Algorithm 2]. The time and space complexity of the SoftDTW recursion as well as of the gradient computation is O(N · M), which is sufficiently fast for use in deep learning.

Note that SoftDTW requires no prior knowledge of the alignment between X and Y, which enables the use of DTW_γ(C) as a loss function for learning problems with weakly aligned data. Furthermore, X and Y can come from arbitrary feature spaces, as long as an appropriate cost function c can be defined.

¹ Subject to some constraints, namely, the first and last elements of both sequences are aligned to each other (boundary constraint), no element is skipped (step-size constraint), and the alignment is monotonous (monotonicity constraint).

4. APPLICATION TO MULTI-PITCH ESTIMATION

We now apply SoftDTW to multi-pitch estimation. For a given piece of music, the sequence X corresponds to some representation of an input recording, while Y corresponds to a multi-hot encoding of the pitches played. Note that Y does not need to be temporally aligned with X and could arise, e.g., from a score representation of the musical piece. An element y_m of the sequence Y is encoded as a vector y_m ∈ {0, 1}^72, and the entries of y_m correspond to the 72 pitches from C1 to B6. In our experiments, rather than directly aligning Y with some fixed representation X, we use a neural network f that takes X as input and outputs a feature vector per frame in X. Thus,
we obtain a sequence f(X) = (z_1, ..., z_N) with the same length N as X. We construct f such that z_n ∈ R^72 for the elements z_n of f(X). Thus, both sequences Y and f(X) contain elements from the feature space F_1 = F_2 = R^72. We then align f(X) and Y, as illustrated in Figure 1.

To our knowledge, SoftDTW has not previously been used for MPE and is seldom explored in MIR. The authors in [4] used the classical, non-differentiable DTW recursion inside an attention mechanism for lyrics alignment, which led to training instabilities. The work by Agrawal et al. [17] constitutes the first use of SoftDTW for an MIR application. They successfully employ a variant of SoftDTW to train a system for score-audio synchronization. In their scenario, SoftDTW is applied to discrete-valued, one-dimensional, and strongly aligned sequences. In contrast, we employ SoftDTW for multi-dimensional sequences in weakly aligned settings.

Fig. 2: (a) Strongly aligned pitch annotations for an audio excerpt, (b) annotations without note durations (as used by MCTC), (c) annotations without note durations, stretched to excerpt length, (d) score representation, not aligned to the audio excerpt, (e) score representation, stretched to excerpt length.
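The accumulated-cost recursion and the soft minimum μ_γ from Section 3 can be sketched in a few lines of NumPy. This is an illustrative reference implementation only (the experiments use the GPU implementation from [19]); the function and variable names are our own.

```python
import numpy as np

def soft_min(values, gamma):
    """Differentiable approximation of min: -gamma * log(sum(exp(-s / gamma)))."""
    v = np.asarray(values, dtype=float) / -gamma
    m = v.max()  # log-sum-exp shift for numerical stability
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(X, Y, gamma=1.0):
    """SoftDTW cost DTW_gamma(C) between sequences X (N x d) and Y (M x d),
    using the squared Euclidean distance as the local cost c."""
    N, M = len(X), len(Y)
    # local cost matrix C(n, m) = ||x_n - y_m||^2
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    D = np.zeros((N, M))
    D[0, 0] = C[0, 0]
    for m in range(1, M):            # first row: cumulative sum along Y
        D[0, m] = D[0, m - 1] + C[0, m]
    for n in range(1, N):            # first column: cumulative sum along X
        D[n, 0] = D[n - 1, 0] + C[n, 0]
    for n in range(1, N):
        for m in range(1, M):
            D[n, m] = C[n, m] + soft_min(
                [D[n - 1, m - 1], D[n - 1, m], D[n, m - 1]], gamma)
    return D[N - 1, M - 1]
```

For small γ the soft cost approaches the classical DTW cost from below, consistent with μ_γ being a lower bound of the minimum.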
4.1. Implementation Details and Evaluation Metrics

Since the focus of our work is on evaluating the efficacy of SoftDTW for MIR tasks, and in order to maintain comparability with the results presented in [1], we adopt the same training setup and network architecture. Thus, we use harmonic CQT (HCQT, [18]) excerpts of roughly ten seconds in length as input and pass them through a five-layer convolutional neural network to obtain a sequence of per-frame representations f(X) (see [1] for details on the network architecture and the HCQT representation).

We train our networks by minimizing the soft alignment cost DTW_γ(C).² In all experiments, we use the squared Euclidean distance for c and set γ = 10.0. We did not see improvements for alternative choices of c and obtained similar results for a wide range of values for γ ∈ [0.5, 20.0]. Furthermore, we use a fast GPU implementation of the SoftDTW recursion and gradient computation which was implemented in [19].

To compare network predictions with the strongly aligned pitch annotations of the test sets, we use common evaluation measures for MPE, including the cosine similarity between predictions and annotations (CS), the area under the precision-recall curve (also called average precision, AP), as well as the F-measure and accuracy (Acc., introduced in [20]) at a threshold of 0.4 (which is a common choice in MPE systems, see also [21]).

Scenario       F-measure   CS      AP      Acc.
CE [1]         0.70        0.759   0.764   0.546
MCTC [1]       0.69        0.744   0.734   0.532
SoftDTW_W1     0.00        0.465   0.297   0.002
SoftDTW_W2     0.69        0.736   0.737   0.529

Table 1: Results for multi-pitch estimation on the Schubert Winterreise Dataset for SoftDTW compared with MCTC.

4.2. Comparison with MCTC

We begin by comparing our results with the main results reported in [1], which are obtained on the Schubert Winterreise Dataset (SWD) [22]. SWD provides strongly aligned annotations for all recordings. Due to this, one can consider a baseline trained on the aligned annotations with a per-frame cross-entropy loss (CE). The first line of Table 1 shows results for such an optimistic baseline (reprinted from [1]), which yields an F-measure of 0.70 and AP = 0.764. To train a network using MCTC instead, one must remove all information about note durations from the label sequence Y (see Figure 2b). The results obtained this way are just slightly lower at AP = 0.734, even though only weakly aligned labels are used. When performing the same experiment using SoftDTW (denoted by SoftDTW_W1), we obtain much weaker results with an F-measure of 0.00 and AP = 0.297.³ In this experiment, the label sequence Y may be significantly shorter than the learned sequence f(X).⁴ We repeat the experiment by temporally stretching the sequence Y to match the number of frames in f(X) (illustrated in Figure 2c). When applying SoftDTW together with this trick (denoted by SoftDTW_W2), results are again very similar to MCTC (AP = 0.737). Thus, SoftDTW may be used to replace MCTC in this scenario.

4.3. Incorporating Note Durations

In contrast to MCTC, SoftDTW is able to incorporate (approximate) note durations during training. SWD, for example, contains non-aligned score representations of the pieces performed. We now use these score representations as target sequences Y (denoted by SoftDTW_W3, see Figure 2d for an illustration). Table 2 shows evaluation results, which are slightly improved compared to training without note durations (F-measure of 0.71 compared to 0.69 and CS = 0.756 compared to 0.736 for SoftDTW_W2). Here, there is only a moderate difference between the lengths of the excerpt and the label sequence, and stretching the label sequence to the length of the input yields nearly identical results (denoted by SoftDTW_W4, see Figure 2e). Finally, we may also use SoftDTW with strongly aligned label sequences (denoted by SoftDTW_S). In this very optimistic scenario, no alignment is necessary, but SoftDTW may compensate for inaccuracies introduced by the dataset annotation procedures. Indeed, this scenario yields the best results (F-measure of 0.72 and AP = 0.769), even slightly improving upon the cross-entropy baseline in Table 1.

4.4. Cross-Dataset Experiment

We also perform a cross-dataset experiment (again following the setup in [1]), where we train on the popular MAESTRO [23] and MusicNet [21] datasets. Both contain strongly aligned pitch annotations for the training recordings, but they do not provide non-aligned score representations of the pieces, so SoftDTW_W3 and

² Note that we normalize DTW_γ(C) by its value for the first training batch. Thus, the loss is exactly 1 for the first batch and its value range remains similar across training configurations, regardless of the sequence lengths N and M or other factors.
³ Note that the F-measure and accuracy scores can be improved to 0.32 and 0.20, respectively, by choosing a more suitable detection threshold. Still, these scores are notably worse compared to the results for MCTC.
⁴ A large discrepancy in sequence lengths is well known to cause problems for classical DTW. Further investigation is needed to understand how this affects the training process with SoftDTW.
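The stretching trick behind SoftDTW_W2 (Figure 2c) repeats each target vector so that Y matches the number of frames in f(X). The paper does not spell out the exact resampling scheme; the following nearest-index version is one plausible NumPy sketch, with the function name our own.

```python
import numpy as np

def stretch_targets(Y, num_frames):
    """Stretch a target sequence Y (M x d) to num_frames frames by
    repeating each element according to a linear index mapping."""
    M = len(Y)
    # map each output frame to a source index in [0, M)
    idx = np.floor(np.linspace(0, M, num=num_frames, endpoint=False)).astype(int)
    return Y[idx]
```

For example, stretching a two-element sequence to four frames repeats each element twice, so the stretched Y keeps the original label order while matching the frame rate of f(X).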
Scenario       F-measure   CS      AP      Acc.
SoftDTW_W3     0.71        0.756   0.755   0.552
SoftDTW_W4     0.71        0.757   0.750   0.555
SoftDTW_S      0.72        0.761   0.769   0.563

Table 2: Results on the Schubert Winterreise Dataset for incorporating note durations with SoftDTW.

                          AP
Scenario       SWD     Bach10   TRIOS   Phenicx
Default network architecture
CE [1]         0.684   0.864    0.825   0.829
MCTC [1]       0.666   0.861    0.824   0.833
SoftDTW_W2     0.665   0.835    0.812   0.788
Larger network architecture
CE [1]         0.701   0.886    0.863   0.846
MCTC [1]       0.677   0.871    0.849   0.850
SoftDTW_W2     0.682   0.896    0.864   0.838

5.1. Pitch Estimation with Overtone Model

First, we consider a straightforward extension of MPE, where we transform the binary, multi-hot target vectors of MPE into real-valued vectors by adding energy according to a simple overtone model, see Figure 1c. Here, we consider 10 overtones for each active pitch, with amplitude (1/3)^n for the n-th overtone. As a baseline utilizing strongly aligned labels, we compare with a model trained using an ℓ2 regression loss at each frame (similar to the cross-entropy baseline in Section 4). To evaluate, we use the cosine similarity CS between network outputs and annotations. Note that other MPE evaluation metrics are not applicable for real-valued vectors.

When performing this experiment on the SWD dataset, we obtain CS = 0.794 for per-frame training with strongly aligned labels, which is higher than for MPE on SWD (cf. Table 1). Training without strongly aligned labels using SoftDTW_W2 yields only slightly lower cosine similarities at 0.770. This illustrates that SoftDTW also works for settings with real-valued target sequences.
7. REFERENCES

[1] Christof Weiß and Geoffroy Peeters, "Learning multi-pitch estimation from weakly aligned score-audio pairs using a multi-label CTC loss," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2021, pp. 121–125.

[2] Juan J. Bosch, Rachel M. Bittner, Justin Salamon, and Emilia Gómez, "A comparison of melody extraction methods based on source-filter modelling," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), New York City, New York, USA, 2016, pp. 571–577.

[3] Ye Wang, Min-Yen Kan, Tin Lay Nwe, Arun Shenoy, and Jun Yin, "Lyrically: automatic synchronization of acoustic musical signals and textual lyrics," in Proc. ACM Int. Conf. Multimedia, New York, NY, USA, 2004, pp. 212–219.

[4] Kilian Schulze-Forster, Clement S. J. Doire, Gaël Richard, and Roland Badeau, "Phoneme level lyrics alignment and text-informed singing voice separation," IEEE/ACM Trans. on Audio, Speech and Language Processing, vol. 29, pp. 2382–2395, 2021.

[5] Simon Dixon and Gerhard Widmer, "MATCH: A music alignment tool chest," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), London, UK, 2005, pp. 492–497.

[6] Sebastian Ewert, Meinard Müller, and Peter Grosche, "High resolution audio synchronization using chroma onset features," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, Apr. 2009, pp. 1869–1872.

[7] Shuhei Tsuchida, Satoru Fukayama, Masahiro Hamasaki, and Masataka Goto, "AIST dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information processing," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), Delft, The Netherlands, 2019, pp. 501–510.

[8] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. Int. Conf. Machine Learning (ICML), Pittsburgh, Pennsylvania, USA, 2006, pp. 369–376.

[9] Curtis Wigington, Brian L. Price, and Scott Cohen, "Multi-label connectionist temporal classification," in Proc. Int. Conf. Document Analysis and Recognition (ICDAR), Sydney, Australia, 2019, pp. 979–986.

[10] Meinard Müller, Fundamentals of Music Processing – Using Python and Jupyter Notebooks, Springer Verlag, 2nd edition, 2021.

[11] Marco Cuturi and Mathieu Blondel, "Soft-DTW: a differentiable loss function for time-series," in Proc. Int. Conf. Machine Learning (ICML), Sydney, NSW, Australia, 2017, pp. 894–903.

[12] Isma Hadji, Konstantinos G. Derpanis, and Allan D. Jepson, "Representation learning via global temporal alignment and cycle-consistency," in IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Virtual, 2021, pp. 11068–11077.

[13] Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and Juan Carlos Niebles, "D3TW: discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation," in IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 3546–3555.

[14] Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse H. Engel, Sageev Oore, and Douglas Eck, "Onsets and frames: Dual-objective piano transcription," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), Paris, France, 2018, pp. 50–57.

[15] Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Böck, Andreas Arzt, and Gerhard Widmer, "On the potential of simple framewise approaches to piano transcription," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), New York City, New York, USA, 2016, pp. 475–481.

[16] Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, and Dorien Herremans, "Revisiting the onsets and frames model with additive attention," in Proc. Int. Joint Conf. Neural Networks (IJCNN), Shenzhen, China, 2021.

[17] Ruchit Agrawal, Daniel Wolff, and Simon Dixon, "A convolutional-attentional neural framework for structure-aware performance-score synchronization," IEEE Signal Processing Letters, vol. 29, pp. 344–348, 2021.

[18] Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan P. Bello, "Deep salience representations for F0 tracking in polyphonic music," in Proc. Int. Soc. Music Information Retrieval Conf. (ISMIR), Suzhou, China, 2017, pp. 63–70.

[19] Mehran Maghoumi, Eugene Matthew Taranta, and Joseph LaViola, "DeepNAG: Deep non-adversarial gesture generation," in Proc. Int. Conf. Intelligent User Interfaces (IUI), College Station, Texas, USA, 2021, pp. 213–223.

[20] Graham E. Poliner and Daniel P. W. Ellis, "A discriminative model for polyphonic piano transcription," EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, 2007.

[21] John Thickstun, Zaïd Harchaoui, and Sham M. Kakade, "Learning features of music from scratch," in Proc. Int. Conf. Learning Representations (ICLR), Toulon, France, 2017.

[22] Christof Weiß, Frank Zalkow, Vlora Arifi-Müller, Meinard Müller, Hendrik Vincent Koops, Anja Volk, and Harald Grohganz, "Schubert Winterreise dataset: A multimodal scenario for music analysis," ACM Journal on Computing and Cultural Heritage (JOCCH), vol. 14, no. 2, pp. 25:1–18, 2021.

[23] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse H. Engel, and Douglas Eck, "Enabling factorized piano music modeling and generation with the MAESTRO dataset," in Proc. Int. Conf. Learning Representations (ICLR), New Orleans, Louisiana, USA, 2019.

[24] Zhiyao Duan, Bryan Pardo, and Changshui Zhang, "Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions," IEEE Trans. on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2121–2133, 2010.

[25] Joachim Fritsch and Mark D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 888–891.

[26] Marius Miron, Julio J. Carabias-Orti, Juan J. Bosch, Emilia Gómez, and Jordi Janer, "Score-informed source separation for multichannel orchestral recordings," Journal of Electrical and Computer Engineering, vol. 2016, pp. 8363507:1–8363507:19, 2016.