From HEAR to GEAR: Generative Evaluation of Audio Representations
Editors: Joseph Turian, Björn W. Schuller, Dorien Herremans, Katrin Kirchhoff, Paola Garcia Perera, and Philippe Esling
Abstract
The “Holistic Evaluation of Audio Representations” (HEAR) is an emerging research
program towards statistical models that can transfer to diverse machine listening tasks.
The originality of HEAR is to conduct a fair, “apples-to-apples” comparison of many deep
learning models over many datasets, resulting in multitask evaluation metrics that are readily
interpretable by practitioners. On the flip side, this comparison incurs a neural architecture
search: as such, it is not directly interpretable in terms of audio signal processing. In this
paper, we propose a complementary viewpoint on the HEAR benchmark, which we name
GEAR: Generative Evaluation of Audio Representations. The key idea behind GEAR is to
generate a dataset of sounds with few independent factors of variability, analyze it with
HEAR embeddings, and visualize it with an unsupervised manifold learning algorithm.
Visual inspection reveals stark contrasts in the global structure of the nearest-neighbor
graphs associated with logmelspec, Open-L3, BYOL, CREPE, wav2vec2, GURA, and YAMNet.
Although GEAR currently lacks mathematical refinement, we intend it as a proof of concept
to show the potential of parametric audio synthesis in general-purpose machine listening
research.
1. Introduction
1.1. Towards machine listening
Machine listening (Rowe, 1992), also known as audio content analysis (Lerch, 2012), aims
to extract information from a digital audio signal in the same way a human listener
would. Since the 1950s and the earliest spoken digit recognizer (Davis et al., 1952), this
technology has gained in sophistication thanks to several concurrent factors: the
falling costs of audio acquisition hardware, the acceleration of personal computing, and
the massification of user-generated content (Gemmeke et al., 2017), to name a few.
Over the past decade, the renewed interest in deep learning has spurred the development
of a new generation of machine listening systems. These systems tend to have certain
traits in common: the use of a mel-frequency or CQT representation as input, the "deep"
stacking of convolutional or recurrent layers, and training from a random initialization by
some variant of stochastic gradient descent (McFee, 2018). Yet, the debate surrounding
the best computational architecture for machine listening remains intense: for example,
recent research has shown that deep learning in the "raw waveform" domain may outperform
predefined time–frequency representations (Zeghidour, 2019). The same can be said of the
"Transformer," a feedforward layer with multiplicative interactions which may outperform the
well-established convolutional layer (Gong et al., 2021). Lastly, many variants of pre-training,
either supervised or self-supervised, have shown strong potential (Tagliasacchi et al., 2020).

∗ Work done during an exchange with Centrale Nantes.
Download the source code for our experiments: https://github.com/yy945635407/GEAR
1.5. Contribution
In the present paper, we propose a first attempt at remedying the two aforementioned
shortcomings, which we perceive in the HEAR benchmark: a lack of visual interpretation
and a dependency on a hyperparameter grid. To do so, we first generate a synthetic dataset
with known independent factors of variability. Second, we run each audio sample in the dataset
through the HEAR model under study, which we regard as a feature extractor. Third,
we visualize the nearest-neighbor graph between the resulting feature vectors
by means of an unsupervised manifold learning algorithm. Lastly, we color the vertices
of this graph according to the values of the latent variables governing the known factors
of variability in the dataset. In this way, we hope to shed light on the model's ability to
discover these factors of variability without any supervision.
Figure 1 illustrates our protocol. As a nod to HEAR, we call our method “Generative
Evaluation of Audio Representations,” or GEAR for short.
With GEAR, we do not have the ambition to compete with HEAR, nor to present an
alternative leaderboard. On the contrary, we intend to provide a complementary viewpoint
on the same benchmark: the strengths of GEAR are the weaknesses of HEAR and vice versa.
Indeed, the main drawback of GEAR is that it is not grounded in any real-world use case
or “task:” disentangling factors of variability is a necessary first step for high-dimensional
representation learning (Bengio, 2013) but rarely suffices on its own to correctly assign
patterns to classes. Instead, GEAR operates as a visual "smoke test" which looks for some
of the most basic attributes of auditory perception. In this paper, we have experimented
with pitched sounds and varied three important parameters of the spectral envelope; but
we stress that GEAR is a general methodology and could, in the future, apply to more
sophisticated generative models than the one we present.
Our results are presented in Section 4. They reveal that the nearest-neighbor graphs which
proceed from seven of the embeddings in HEAR exhibit qualitatively different topologies:
some appear as 1-D strands; others as a 2-D sheet; others as a dense 3-D volume.
Nevertheless, we acknowledge that the interpretation of the point clouds in Figure 3 is hampered
by shortcomings in the design of our synthetic dataset, which only became
evident once the study was complete. In particular, distances in the space of synthesis
parameters do not necessarily correlate with perceptual judgments of auditory similarity.
Thus, although the current formulation of GEAR may serve to check local properties of
continuity and independence between factors of variability, one should not draw conclusions
about the global geometric properties of audio representations from GEAR visualizations
alone. Still, we believe that finely manipulating audio data via parametric synthesizers
holds strong potential for improving the interpretability of deep audio representations. Section
4.3 summarizes the known limitations of GEAR, shares some lessons learned from running
it on HEAR challenge submissions, and offers perspectives for future work.
1.7. Outline
Section 2 presents our parametric generative model; our synthetic audio dataset; and
our chosen graph-based algorithm for manifold learning, Isomap. Section 3 presents the
application of GEAR to the HEAR baseline, as well as six open-source audio representations.
Section 4 summarizes our findings in both qualitative and quantitative terms.
2. Methods
2.1. Additive Fourier synthesis
We build a dataset of complex tones according to the following additive synthesis model:
$$ y_{\theta}(t) = \sum_{p=1}^{P} \frac{1 + (-1)^{p}\, r}{p^{\alpha}} \, \cos(p f_{1} t) \, \phi_{T}(t), \qquad (1) $$
where ϕT is a Hann window of duration T . This additive synthesis model depends upon
three parameters: the fundamental frequency f1 , the Fourier decay α, and the relative
odd-to-even amplitude difference r. We denote the triplet (f1 , α, r) by θ. The Fourier decay
affects the perceived brightness of the sound, while the relative odd-to-even amplitude
difference is linked to the boundary conditions of the underlying wave equation: a value
of r = 1 suggests a semi-open 1-D resonator, such as a clarinet, whereas a value of r = 0
suggests a closed resonator such as a flute. Hence, our audio signal has three degrees of
freedom.
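To make the synthesis model concrete, here is a minimal NumPy sketch of Equation 1. The sample rate, signal duration, number of partials P, and the 2π factor (which interprets f1 in Hertz) are illustrative assumptions; the excerpt does not specify them.

```python
import numpy as np

def synthesize(f1, alpha, r, sr=44100, duration=1.0, n_partials=32):
    """Additive synthesis of Equation (1): a Hann-windowed harmonic tone.

    f1: fundamental frequency (Hz), alpha: Fourier decay exponent,
    r: relative odd-to-even amplitude difference. Sample rate, duration,
    and number of partials are illustrative choices, not taken from the paper.
    """
    t = np.arange(int(sr * duration)) / sr
    y = np.zeros_like(t)
    for p in range(1, n_partials + 1):
        amplitude = (1 + (-1) ** p * r) / p ** alpha
        y += amplitude * np.cos(2 * np.pi * p * f1 * t)
    return y * np.hanning(len(t))  # phi_T: Hann window of duration T

# Example: a complex tone with theta = (f1, alpha, r) = (440 Hz, 1.4, 1.2).
y = synthesize(440.0, 1.4, 1.2)
```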
The reason why we focus on sustained pitched sounds for this study is that they are
present in music, speech, and some bioacoustic sounds. Besides, note that one-dimensional
audio descriptors such as spectral brightness and zero-crossing rate correlate with both f1
and α, hence a lack of disentanglement. Likewise, a previous publication has shown that
mel-frequency cepstral coefficients (MFCC), arguably the most widespread set of engineered
features for speech and music processing, are incapable of disentangling f1 from α and r
(Lostanlen et al., 2020). Thus, simulating the ability of our auditory system to represent
the stimuli in Equation 1 is more challenging than it might seem at first glance.
Figure 2: Visualization of 25 samples in our synthetic dataset: r varies across columns, α across
rows, and f1 varies randomly. Blue: time domain. Red: Fourier domain.
The first stage of Isomap builds a nearest-neighbor graph between feature vectors; we set
the number of neighbors to K = 100. That being said, we note that the hyperparameter K
has an impact on the visualization, and its role in GEAR deserves future work.
The second stage of Isomap is classically done by a shortest-path algorithm; in our case,
Dijkstra's algorithm. It yields an N × N matrix of geodesic distances, which we square
and recenter to zero mean.
Lastly, multidimensional scaling (MDS) extracts the top three eigenvalues of the
matrix above and their associated eigenvectors in a space of dimension N. Displaying the
entries of these eigenvectors as a scatter plot produces a 3-D cloud of N points, each
corresponding to a different sound, which lends itself to visual interpretation: any two points
appearing nearby are similar in feature space, in the sense that there exists a short Euclidean
path connecting them. Furthermore, looking up the value of α, r, or f1 for these points
assigns them a color on a continuous scale: in our case, red–white–blue.
Hence, the core hypothesis of GEAR is that a desirable audio representation should
produce a dense arrangement of points in 3-D, in which all three parameters of interest
appear as smooth color progressions over orthogonal coordinates. This hypothesis may
be simply checked by visual inspection, or quantified automatically by defining a task of
parameter regression (see Figure 1).
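As a minimal sketch of this visualization pipeline, the snippet below relies on scikit-learn's Isomap implementation, which bundles the three stages described above (nearest-neighbor graph, Dijkstra shortest paths, and MDS), and colors each 3-D scatter plot by one synthesis parameter. Array shapes, variable names, and plotting details are our own assumptions, not the authors' code.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3-D projection)
from sklearn.manifold import Isomap

def gear_visualize(features, params, param_names=("f1", "alpha", "r"), k=100):
    """Embed (N, D) feature vectors in 3-D with Isomap and color them by parameter.

    features: one row per synthetic sound.
    params:   the matching (N, 3) array of ground-truth (f1, alpha, r) values.
    """
    coords = Isomap(n_neighbors=k, n_components=3).fit_transform(features)
    fig, axes = plt.subplots(1, 3, figsize=(12, 4), subplot_kw={"projection": "3d"})
    for ax, name, values in zip(axes, param_names, params.T):
        # Red-white-blue color scale, as in Figure 3.
        sc = ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2],
                        c=values, cmap="coolwarm", s=5)
        ax.set_title(name)
        fig.colorbar(sc, ax=ax, shrink=0.6)
    return coords, fig
```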
3.2. Open-L3
Open-L3 is a deep convolutional network (convnet) that is trained entirely by self-supervision
(Cramer et al., 2019). In Open-L3 , the “open” prefix stands for open-source while the suffix
L3 is short for “Look, Listen and Learn” (Arandjelovic and Zisserman, 2017). Open-L3
consists of two subnetworks: a video subnetwork and an audio subnetwork. The two
subnetworks are trained jointly to distinguish whether a video frame and a one-second audio
segment are from the same video file; a task known as audio-visual correspondence. These
files are sampled from a large unlabeled dataset of 60 million videos. The audio subnetwork
has reached state-of-the-art results in urban sound classification. We use this subnetwork as
a 6144-dimensional feature extractor. Open-L3 is one of the three (non-naive) baselines
provided by the organizers of the HEAR benchmark.
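For illustration, the snippet below computes a 6144-dimensional Open-L3 embedding with the open-source openl3 Python package. The content type, input representation, and the time-averaging of frame-wise embeddings are our own assumptions and may differ from the HEAR wrapper; the file name is a placeholder.

```python
import openl3
import soundfile as sf

# Load one synthetic sound (placeholder path) and embed it with Open-L3.
audio, sr = sf.read("sample.wav")
emb, timestamps = openl3.get_audio_embedding(
    audio, sr,
    content_type="env",    # illustrative; "music" is the other option
    embedding_size=6144,   # matches the dimensionality quoted above
)
clip_embedding = emb.mean(axis=0)  # average frame-wise embeddings over time
print(clip_embedding.shape)        # (6144,)
```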
Figure 3: Isomap visualization of our synthetic dataset after feature extraction by seven audio
representations. Top to bottom: HEAR baseline, Open-L3, BYOL, CREPE, wav2vec2, GURA, YAMNet.
Shades of red (resp. blue) denote greater (resp. lower) values of the fundamental frequency f1 (left),
the spectral decay exponent α (center), and the odd-to-even harmonic energy ratio r (right).
3.3. BYOL
BYOL, short for "Bootstrap Your Own Latent" (Grill et al., 2020), is a self-supervised learning
method that was adapted to general-purpose audio as BYOL-A (BYOL for Audio; Niizumi et al., 2021)
and, more recently, to speech as BYOL-S (BYOL for Speech; Elbanna et al., 2022b). The authors then
extended BYOL-S so as to learn from handcrafted features, yielding Hybrid BYOL-S. Hybrid BYOL-S
was found to outperform BYOL-S in the HEAR 2021 benchmark, which is why we prioritized it for
inclusion in GEAR.
3.4. CREPE
CREPE stands for "Convolutional Representation for Pitch Estimation" (Kim et al., 2018):
it was initially proposed to solve the problem of monophonic pitch tracking. It is a deep
convolutional neural network whose originality is to operate directly in the raw waveform
domain rather than upon a time–frequency representation. After supervised training on
synthetic data, CREPE outperforms other popular models for pitch estimation on two
real-world datasets. CREPE is made available as a (non-naive) baseline by the organizers
of HEAR.
3.5. Wav2vec2
Wav2vec2 (Baevski et al., 2020) is a self-supervised learning model trained on a contrastive
learning task. It is also one of the baselines in the HEAR competition. Its feature encoder
consists of a convnet in the raw waveform domain, followed by a Transformer and a vector
quantization module for sequence modeling. Wav2vec2 has been shown to extract cross-linguistic
speech units. On a downstream task of automatic speech recognition, wav2vec2 matches the
previous state-of-the-art model with 100 times fewer labeled samples.
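By way of illustration, the sketch below extracts a clip-level wav2vec2 embedding with the Hugging Face transformers implementation. The checkpoint name and the time-averaging of hidden states are illustrative assumptions and may differ from the wrapper submitted to HEAR.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-base-960h"  # illustrative checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)  # placeholder: one second of audio at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)
clip_embedding = hidden.mean(dim=1).squeeze(0)  # time-averaged feature vector
```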
3.6. GURA
GURA is a set of ensemble methods applied to three models: HuBERT (Hsu et al.,
2021), wav2vec2, and CREPE. The authors have considered several aggregation strategies:
feature concatenation, averaging, and fusion. For GEAR, we choose to visualize the
feature concatenation variant, referred to as "GURA Cat H+w+C" on the leaderboard and
"fusion cat xwc" in their repository. It concatenates three 1024-dimensional embeddings and
produces a 3072-dimensional embedding.
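The concatenation itself is straightforward; here is a toy PyTorch sketch with placeholder tensors standing in for the actual HuBERT, wav2vec2, and CREPE clip embeddings:

```python
import torch

hubert_emb  = torch.randn(1024)  # placeholder HuBERT embedding
wav2vec_emb = torch.randn(1024)  # placeholder wav2vec2 embedding
crepe_emb   = torch.randn(1024)  # placeholder CREPE embedding

# "Cat H+w+C": concatenate the three 1024-d vectors into one 3072-d vector.
fused = torch.cat([hubert_emb, wav2vec_emb, crepe_emb], dim=0)
print(fused.shape)  # torch.Size([3072])
```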
3.7. YAMNet
YAMNet is a supervised convnet based on the MobileNet architecture (Howard et al., 2017)
and trained on AudioSet (Gemmeke et al., 2017) to recognize 521 classes of everyday sound
events. Its penultimate layer provides a 1024-dimensional embedding, which serves as the
feature extractor in our experiments.
4. Results
4.1. Visual inspection
Figure 3 shows the result of Isomap for all seven embeddings listed in the previous section.
Each row corresponds to a different embedding, while each column corresponds to a different
synthesis parameter: f1, α, and r.
We make the following observations:
base The logmelspec baseline is strongly sensitive to pitch (f1 ), especially for bright spectra
(low α) comprising a sparse harmonic series (high r).
Open-L3 Discontinuities in the color scale associated to f1 indicate that the topology of the
pitch axis is not preserved. This is consistent with the previous findings of Lostanlen
et al. (2020).
BYOL A 2-D manifold in which f1 appears as the dominant factor of variability.
CREPE Strongly sensitive to pitch (f1) and quasi-invariant to spectral envelope (α and r).
This is to be expected since CREPE was trained for fundamental frequency estimation.
wav2vec2 A 3-D manifold, but one whose coordinates do not align with the original degrees of
variability of the data.
GURA Similarly to BYOL, a 2-D manifold in which f1 is the dominant factor of variability.
YAMNet Arguably the best representation in the GEAR benchmark: YAMNet produces
a dense 3-D manifold in which the underlying parameters of audio synthesis (f1 , α,
and r) appear as smoothly changing over perpendicular directions.
4.2. Parameter regression
Given a synthetic sample i, we average the synthesis parameters over the K nearest neighbors
of its feature vector,

$$ \tilde{\theta}_{i} = \frac{1}{K} \sum_{\theta_{j} \in \mathcal{N}_{K}(\theta_{i})} \theta_{j}, \qquad (2) $$

take the elementwise log-ratio log(θ̃_i/θ_i),
and repeat the same operation for every synthetic sample i. Intuitively, an ideal audio
representation should yield small absolute values for the log-ratio above; i.e., ratios θ̃_i/θ_i
that are all close to one.

Figure 4: Unsupervised benchmark results on our synthetic dataset after feature extraction by
different embedding models. Subplot (a) corresponds to the unsupervised results for parameter f1,
subplot (b) to α, and subplot (c) to r. Colors distinguish the different embedding models. The
x-axis shows the score on a logarithmic scale and the y-axis lists model names. Scores closer to 1
(the central vertical line) are judged to be better.
For the regression task, we lower the number of neighbors to K = 40, compared to K = 100
in Isomap. Figure 4 displays the resulting distribution of log-ratios for each parameter and
each representation in GEAR. Unfortunately, the regression benchmark remains inconclusive:
all embeddings fare similarly in terms of estimation error for α and r. The only clear finding
is that the naive logmelspec baseline and CREPE are highly sensitive to f1, which was to be
expected. Hence, future work should not evaluate regression error at the local scale of nearest
neighbors, but rather at the global scale of invariance and disentanglement.
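For concreteness, here is one possible implementation of the nearest-neighbor parameter regression of Equation 2 using scikit-learn. Whether the query point is excluded from its own neighborhood and whether the logarithm is natural or base-10 are not specified in the text; both choices below are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def regression_log_ratios(features, params, k=40):
    """Nearest-neighbor parameter regression (Equation 2) and log-ratios.

    features: (N, D) embeddings; params: (N, 3) ground-truth (f1, alpha, r).
    Returns an (N, 3) array of log-ratios log(theta_tilde / theta).
    """
    # Query k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nn.kneighbors(features)
    neighbor_idx = idx[:, 1:]                        # drop the query point itself
    theta_tilde = params[neighbor_idx].mean(axis=1)  # Equation (2): neighborhood mean
    return np.log(theta_tilde / params)
```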
4.3. Perspectives
GEAR is an attempt at evaluating audio representations via generative models. In this way,
it takes a different approach than HEAR, which is based on real-world classification tasks.
Our paper has shown that GEAR is feasible in practice while incurring a moderate workload.
Specifically, the work behind this paper was carried out by two MSc students working part-time
under the supervision of one faculty member. In comparison, data collection and preparation for
HEAR required the work of 23 challenge organizers. Hence, we believe that the conceptual
simplicity of the GEAR methodology has the potential to expand the accessibility and
attractiveness of HEAR to newcomers, particularly at the undergraduate and graduate levels.
However, we acknowledge that GEAR currently suffers from a lack of direct applicability
to research questions in machine listening. This leads to two avenues of methodological
clarification. First, to be meaningful for hearing scientists, parameters in the GEAR
synthesizer ought to be sampled according to a perceptually uniform progression. These
progressions are often nonlinear in terms of physical units: for example, the human perception
of pitch is nonlinear with respect to fundamental frequency in Hertz. In order to account for
these distinctions, it would be necessary to include some prior knowledge about auditory
perception into the design of the synthesizer. Such prior knowledge could take the form
of just-noticeable differences (JND) and could apply to low-level attributes such as pitch,
loudness, or roughness. It could also involve relative dissimilarity judgments such as those
involved in the qualitative study of timbre. This first avenue of research connects with
ongoing work on the perceptual control of differentiable synthesizers.
A second avenue of research arises from the fact that GEAR is currently tied to Euclidean
nearest-neighbor search in feature space. Meanwhile, the HEAR benchmark is not based
on nearest-neighbor classification but on some form of representation learning: namely, a
shallow neural network. Even though this shallow neural network is continuous with respect
to its input, it stretches distances nonuniformly and nonlinearly, thus yielding a higher-level
metric space in which comparisons are no longer Euclidean in terms of embeddings. In order
to align better with the formulation of HEAR, GEAR should not be restricted to the raw
feature space but should also be performed at deeper levels of representation. The shallow
neural network could be trained in a supervised way, by performing parameter regression; or
in an unsupervised way, e.g., via self-supervised contrastive learning. Following this protocol
would incur a stage of neural architecture search. It would certainly be heavier in terms of
workload and computation than nearest-neighbor search in feature space, but also more
informative and more consistent with real-world audio classification in HEAR.
5. Conclusion
The HEAR benchmark provides a common API for sharing and improving general-purpose
machine listening models. In this paper, we have taken this opportunity to download HEAR
submissions en masse and run them as feature extractors on a synthetic dataset of pitched
sounds. In doing so, we have visualized deep audio embeddings as points in a 3-D space, with
colors denoting the parameters underlying synthesis. Our contribution, named GEAR, serves
as a qualitative counterpoint to HEAR: although it does not fulfil any real-world “task,”
it sheds light on the respective abilities of audio representations to disentangle auditory
attributes, without depending on a choice of supervised learning architecture downstream.
The companion website of our paper (see footnote of first page) contains all the necessary
source code to reproduce our findings, as well as to replicate them on future editions of
HEAR.
One limitation of GEAR in its current formulation is that it largely relies on visual
inspection. Our parameter regression benchmarks from Section 4 provide some quantitative
evidence for local neighborhoods, but not for global disentanglement. As such, GEAR would
not easily scale to hundreds of audio representations, nor to dozens of degrees of freedom in
the synthetic data. Furthermore, we note that Isomap may occasionally produce spurious
graphical patterns which are not structural properties of the underlying manifold (Donoho
and Grimes, 2003), hence deceiving the human eye. In this context, it would be interesting to
perform topological data analysis (TDA) on the nearest-neighbor graphs so as to automate
the characterization of the feature space (Hensel et al., 2021).
Although we have only experimented with fundamental frequency and two low-level
attributes of the spectral envelope (α and r, see Section 2), we stress that the GEAR methodology
is very generic and could easily be transferred to different synthetic datasets in the future,
insofar as the underlying synthesizer is parametric with continuous independent parameters.
For example, beyond the case of sustained harmonic tones, one might run GEAR on a
physical synthesis model for virtual drum shapes (Han and Lostanlen, 2020); on a neural
audio synthesizer with perceptually relevant control (Roche et al., 2021); or on a text-to-
speech rendering engine with global style tokens for expressive conditioning of prosody
(Wang et al., 2018).
6. Acknowledgment
We thank Gaëtan Garcia, Mathieu Lagrange, Mira Rizkallah, Joseph Turian, and Cyrus
Vahidi for helpful discussions.
References
Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the
IEEE International Conference on Computer Vision, pages 609–617, 2017.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A
framework for self-supervised learning of speech representations. Advances in Neural
Information Processing Systems, 33:12449–12460, 2020.
Juan P Bello, Claudio Silva, Oded Nov, R Luke Dubois, Anish Arora, Justin Salamon,
Charles Mydlarz, and Harish Doraiswamy. SONYC: A system for monitoring, analyzing,
and mitigating urban noise pollution. Communications of the ACM, 62(2):68–77, 2019.
Jason Cramer, Ho-Hsiang Wu, Justin Salamon, and Juan Pablo Bello. Look, Listen, and
Learn More: Design Choices for Deep Audio Embeddings. In Proceedings of the IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
3852–3856. IEEE, 2019.
Ken H Davis, R Biddulph, and Stephen Balashek. Automatic recognition of spoken digits.
The Journal of the Acoustical Society of America, 24(6):637–642, 1952.
Gauri Deshpande, Anton Batliner, and Björn W Schuller. AI-based human audio processing
for COVID-19: A comprehensive overview. Pattern Recognition, 122:108289, 2022.
David L Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques
for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):
5591–5596, 2003.
Gasser Elbanna, Alice Biryukov, Neil Scheidwasser-Clow, Lara Orlandic, Pablo Mainar,
Mikolaj Kegler, Pierre Beckmann, and Milos Cernak. Hybrid handcrafted and learnable
audio representation for analysis of speech under cognitive and physical load. arXiv
preprint arXiv:2203.16637, 2022a.
Gasser Elbanna, Neil Scheidwasser-Clow, Mikolaj Kegler, Pierre Beckmann, Karl El Hajal,
and Milos Cernak. BYOL-S: Learning self-supervised speech representations by bootstrapping.
In Joseph Turian, Björn W. Schuller, Dorien Herremans, Katrin Kirchhoff,
Paola Garcia Perera, and Philippe Esling, editors, HEAR: Holistic Evaluation of Audio
Representations, Proceedings of Machine Learning Research. PMLR, 2022b.
Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing
Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An Ontology and Human-labeled
Dataset for Audio Events. In Proceedings of the IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.
Yuan Gong, Yu-An Chung, and James Glass. PSLA: Improving audio tagging with pretrain-
ing, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 29:3292–3306, 2021.
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond,
Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad
Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised
learning. Advances in neural information processing systems, 33:21271–21284, 2020.
Han Han and Vincent Lostanlen. wav2shape: Hearing the shape of a drum machine. In
Proceedings of Forum Acusticum, pages 647–654, 2020.
Felix Hensel, Michael Moor, and Bastian Rieck. A survey of topological machine learning
methods. Frontiers in Artificial Intelligence, 4:681108, 2021.
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias
Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural
networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov,
and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning
by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 29:3451–3460, 2021.
Corey Kereliuk, Bob L Sturm, and Jan Larsen. Deep learning, audio adversaries, and music
content analysis. In 2015 IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics (WASPAA), pages 1–5. IEEE, 2015.
Jaehun Kim, Julián Urbano, Cynthia CS Liem, and Alan Hanjalic. Are nearby neighbors
relatives? Testing deep music embeddings. Frontiers in Applied Mathematics and Statistics,
5:53, 2019.
Jaehun Kim, Julián Urbano, Cynthia Liem, and Alan Hanjalic. One deep music representation
to rule them all? A comparative analysis of different representation learning strategies.
Neural Computing and Applications, 32(4):1067–1093, 2020.
Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A convolutional
representation for pitch estimation. In 2018 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 161–165. IEEE, 2018.
Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu, Yuta Kawachi, and Noboru Harada.
Unsupervised detection of anomalous sound based on deep learning and the Neyman–Pearson
lemma. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27
(1):212–224, 2018.
Vincent Lostanlen, Alice Cohen-Hadria, and Juan Pablo Bello. One or Two Frequencies?
The Scattering Transform Answers. In Proceedings of the European Signal Processing
Conference, 2020.
Brian McFee. Statistical methods for scene and event classification. In Computational
Analysis of Sound Scenes and Events, pages 103–146. Springer, 2018.
Alessandro B Melchiorre, Verena Haunschmid, Markus Schedl, and Gerhard Widmer. LEMONS:
Listenable explanations for music recommender systems. In European Conference on
Information Retrieval, pages 531–536. Springer, 2021.
Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino.
BYOL for Audio: Self-supervised learning for general-purpose audio representation. In 2021
International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
Geoffroy Peeters and Gaël Richard. Deep learning for audio and music. In Multi-Faceted
Deep Learning, pages 231–266. Springer, 2021.
Fanny Roche, Thomas Hueber, Maëva Garnier, Samuel Limier, and Laurent Girin. Make that
sound more metallic: Towards a perceptually relevant control of the timbre of synthesizer
sounds using a variational autoencoder. Transactions of the International Society for
Music Information Retrieval (TISMIR), 4:52–66, 2021.
Robert Rowe. Interactive music systems: machine listening and composing. MIT press,
1992.
Dan Stowell. Computational bioacoustics with deep learning: A review and roadmap. PeerJ,
10:e13152, 2022.
Marco Tagliasacchi, Beat Gfeller, Félix de Chaumont Quitry, and Dominik Roblek. Pre-training
audio representations with self-supervision. IEEE Signal Processing Letters, 27:
600–604, 2020.
Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework
for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
Joseph Turian and Max Henry. I’m Sorry For Your Loss: Spectrally-Based Audio Distances
Are Bad at Pitch. In Proceedings of the NeurIPS “I Can’t Believe It’s Not Better”
Workshop, 2020.
Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J.
Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry,
Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca,
Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu
Jin, and Yonatan Bisk. HEAR: Holistic Evaluation of Audio Representations. In Douwe
Kiela, Marco Ciccone, and Barbara Caputo, editors, Proceedings of the NeurIPS 2021
Competitions and Demonstrations Track, volume 176 of Proceedings of Machine Learning
Research, pages 125–145. PMLR, 06–14 Dec 2022. URL https://proceedings.mlr.press/
v176/turian22a.html.
Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ-Skerry Ryan, Eric Battenberg, Joel Shor, Ying
Xiao, Ye Jia, Fei Ren, and Rif A Saurous. Style Tokens: Unsupervised Style Modeling,
Control and Transfer in End-to-end Speech Synthesis. In Proceedings of the International
Conference on Machine Learning, pages 5180–5189. PMLR, 2018.
Neil Zeghidour. Learning representations of speech from the raw waveform. PhD thesis, PSL
Research University, 2019.