
A TIMBRE ANALYSIS AND CLASSIFICATION TOOLKIT FOR PURE DATA

William Brent
University of California, San Diego
Center for Research in Computing and the Arts

ABSTRACT

This paper describes example applications of a timbre analysis and classification toolkit for Pure Data (Pd). The timbreID collection of Pd externals enables both systematic and casual exploration of sound characteristics via objects that are streamlined and easy to use. Details surrounding signal buffering, blocking, and windowing are handled independently by the objects, so that analyses can be obtained with very little patching. A modular design allows for adaptable configurations and many possible creative ends. The applications described here include vowel classification, target-driven concatenative synthesis, ordering sounds by timbre, and mapping of a sound set in timbre space.

1. INTRODUCTION

Several projects have been developed for the purpose of organizing sounds and/or querying an audio corpus based on timbral similarity. CataRT and Soundspotter are among the most widely recognized open source options [7][3]. The former is available as a Max/MSP implementation, while the latter is intended for multiple platforms—including Pd. Soundspotter's Pd realization is primarily designed for real-time target-driven concatenative synthesis. More general tools for creative work centered on timbre similarity are limited in Pd.

timbreID is a Pd external collection developed by the author. It is composed of a group of objects for extracting timbral features, and a classification object that manages the resulting database of information. The objects are designed to be easy to use and adaptable for a number of purposes, including real-time timbre identification, ordering of sounds by timbre, target-driven concatenative synthesis, and plotting of sounds in a user-defined timbre space that can be auditioned interactively. This paper will summarize the most relevant features of the toolkit and describe its use in the four applications listed above.

2. FEATURE EXTRACTION OBJECTS

In general, timbreID's feature extraction objects have four important qualities. First, each object maintains its own signal buffer based on a user-specified window size. This eliminates the need for sub-patches in Pd to set window size using the block∼ object. Second, Hann windowing is automatically applied within each object so that input signals do not need to be multiplied against a window table using the tabreceive∼ object. Third, analysis timing is sample-accurate. Each object outputs analysis results upon receiving a bang, capturing the desired slice of audio regardless of Pd's default 64-sample block boundaries. Thus, there is no need to set overlap values with block∼ in order to define a particular time resolution. Fourth, because the objects perform analysis on a per-request basis, the only computational overhead incurred during periods of analysis inactivity is that of buffering. Combined, these four qualities make signal analysis in Pd straightforward and accessible.

2.1. Available Features

The following external objects for measuring basic features are provided with timbreID: magSpec∼, specBrightness∼, specCentroid∼, specFlatness∼, specFlux∼, specIrregularity∼, specKurtosis∼, specRolloff∼, specSkewness∼, specSpread∼, and zeroCrossing∼. The more processed features in the set (generated by barkSpec∼, cepstrum∼, mfcc∼, and bfcc∼) are generally the most powerful for classification. Mathematical definitions for many of these measurements are given in a previous paper, along with an evaluation of their effectiveness [1]. Detailed information on sound descriptors in general is available elsewhere [8][9]. Although an understanding of the various analysis techniques is not required for use, a general idea of what to expect can be very helpful. To that effect, a simple demonstration and straightforward explanation of each feature is given in its accompanying help file.

In order to facilitate as many types of usage as possible, non-real-time versions of all feature externals are provided for analyzing samples directly from graphical arrays in Pd.

2.2. Open-ended analysis strategies

Independent, modular analysis objects allow for flexible analysis strategies. Each of the objects reports its results as either a single number or a list that can be further manipulated in Pd. Feature lists of any size can be packed together so that users can design a custom approach that best suits their particular sound set. Figure 1 demonstrates how to generate a feature list composed of MFCCs, spectral centroid, and spectral brightness. Subsets of mel-frequency cepstral coefficients (MFCCs) are frequently used for economically representing spectral envelope, while spectral centroid and brightness provide information about the distribution of spectral energy in a signal. Each time the button in the upper right region of the patch is clicked, a multi-feature analysis snapshot composed of these features will be produced.

Figure 1. Generating a mixed feature list.
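The kind of mixed feature snapshot shown in Figure 1 can be sketched outside of Pd in a few lines of Python. The function below packs spectral centroid and brightness from a single Hann-windowed frame; the MFCC portion is omitted for brevity, and the function name and the 1200 Hz brightness boundary are illustrative choices, not timbreID defaults.

```python
import numpy as np

def mixed_feature_list(frame, sr, brightness_cutoff=1200.0):
    """Pack a small mixed feature list from one audio frame, in the
    spirit of Figure 1 (a real patch would prepend MFCCs via mfcc~).
    Illustrative sketch, not timbreID code."""
    # Hann windowing, as timbreID objects apply internally
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # centroid: magnitude-weighted mean frequency
    centroid = float(np.sum(freqs * mag) / np.sum(mag))
    # brightness: proportion of magnitude above a boundary frequency
    brightness = float(np.sum(mag[freqs >= brightness_cutoff]) / np.sum(mag))
    return [centroid, brightness]
```

As in the patch, each call corresponds to one analysis snapshot; longer lists are formed by concatenating the outputs of several such functions.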

Capturing the temporal evolution of audio features requires some additional logic. In Figure 2, a single feature list is generated based on 5 successive analysis frames, spaced 50 milliseconds apart. The attack of a sound is reported by bonk∼ [6], turning on a metro that fires once every 50 ms before turning off after almost a quarter second. Via list prepend, the initial moments of the sound's temporally evolving MFCCs are accumulated to form a single list. By the time the fifth mel-frequency cepstrum measurement is added, the complete feature list is allowed to pass through a spigot for routing to timbreID, the classification object described below in section 3. Recording changes in MFCCs (or any combination of features) over time provides detailed information for the comparison of complex sounds.

Figure 2. Generating a time-evolving feature list.

These patches illustrate some key differences from the Pd implementation of libXtract, a well-developed multi-platform feature extraction library described in [2]. Extracting features in Pd using the libXtract∼ wrapper requires sub-patch blocking, Hann windowing, and an understanding of libXtract's order of operations. For instance, to generate MFCCs, it is necessary to generate a magnitude spectrum with a separate object, then chain its output to a separate MFCC object. The advantage of libXtract's cascading architecture is that the spectrum calculation occurs only once, yet two or more features can be generated from the results. While timbreID objects are wasteful in this sense (each object redundantly calculates its own spectrum), they are more efficient with respect to downtime. As mentioned above, features are not generated constantly, only when needed. Further, from a user's perspective, timbreID objects require less knowledge about analysis techniques, and strip away layers of patching associated with blocking and windowing.

In order to have maximum control over algorithm details, all feature extraction and classification functions were written by the author, and timbreID has no non-standard library dependencies.

3. THE CLASSIFICATION OBJECT

Features generated with the objects described in section 2 can be used directly as control information in real-time performance. In order to extend functionality, however, a multi-purpose classification external is provided as well. This object, timbreID, functions as a storage and routing mechanism that can cluster and order the features it stores in memory, and classify new features relative to its database. Apart from the examples package described in the following section, an in-depth help patch accompanies timbreID, demonstrating how to provide it with training features and classify new sounds based on training. Figure 3 depicts the most basic network required for this task.

Training features go to the first inlet, and features intended for classification go to the second inlet. Suppose the patch in Figure 3 is to be used for percussive instrument classification.
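The basic store-train-classify behavior described here can be approximated in ordinary code. The sketch below is a hypothetical miniature of the idea: a flat instance store with a Euclidean metric, reporting a match index, a distance, and a confidence derived from the ratio of the two best match distances. It is not timbreID's actual implementation, and the confidence formula is only one plausible form of that ratio.

```python
import math

class FeatureStore:
    """Hypothetical sketch of timbreID-style matching: store training
    feature lists, then report (index, distance, confidence) for new
    input. Not the external's actual implementation."""

    def __init__(self):
        self.instances = []

    def train(self, features):
        # returns the index assigned during training
        self.instances.append(list(features))
        return len(self.instances) - 1

    def classify(self, features):
        # rank all stored instances by Euclidean distance to the input
        ranked = sorted(
            (math.dist(inst, features), i)
            for i, inst in enumerate(self.instances)
        )
        dist, index = ranked[0]
        # confidence from the ratio of the first and second best match
        # distances: 1.0 is unambiguous, 0.0 is a tie
        if len(ranked) > 1 and ranked[1][0] > 0:
            confidence = 1.0 - dist / ranked[1][0]
        else:
            confidence = 1.0
        return index, dist, confidence
```

In the Pd object, the three values are sent out of three separate outlets rather than returned together.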
In order to train the system, each instrument should be struck a few times at different dynamic levels. For each strike, an onset detector like bonk∼ will send a bang message to bfcc∼—the bark-frequency cepstral analysis object. Once a training database has been accumulated in this manner, bfcc∼'s output can be routed to timbreID's second inlet, so that any new instrument onsets will generate a nearest match report from the first outlet. A match result is given as the index of the nearest matching instance as assigned during training. For each match, the second outlet reports the distance between the input feature and its nearest match, and the third outlet produces a confidence measure based on the ratio of the first and second best match distances.

Figure 3. timbreID in a training configuration.

For many sound sets, timbreID's clustering function will automatically group features by instrument. A desired number of clusters corresponding to the number of instruments must be given with the "cluster" message, and an agglomerative hierarchical clustering algorithm will group instances according to current similarity metric settings. Afterward, timbreID will report the associated cluster index of the nearest match in response to classification requests.

Once training is complete, the resulting feature database can be saved to a file for future use. There are four file formats available: timbreID's binary .timid format, a text format for users who wish to inspect the database, ARFF format for use in WEKA¹, and .mat format for use in either MATLAB or GNU Octave.

¹ WEKA is a popular open source machine learning package described in [4].

3.1. timbreID settings

Nearest match searches are performed with a k-nearest neighbor strategy, where k can be chosen by the user. Several other settings related to the matching process can also be specified. Four different similarity metrics are available: Euclidean, Manhattan (taxicab), correlation, and cosine similarity. For feature databases composed of mixed features, feature attribute normalization can be activated so that features with large ranges do not inappropriately weight the distance calculation. Specific weights can be dynamically assigned to any attribute in the feature list in order to explore the effects of specific proportions of features during timbre classification or sound set ordering. Alternatively, the feature attributes used in nearest match calculations can be restricted to a specific range or subset. Or, the attribute columns of the feature database can be ordered by variance, so that match calculations will be based on the attributes with the highest variance.

Further aspects of timbreID's functionality are best illustrated in context. The following section describes four of the example patches that accompany the timbreID package.

4. APPLICATIONS

4.1. Vowel recognition

Identification of vowels articulated by a vocalist is a task best accomplished using the cepstrum∼ object. Under the right circumstances, cepstral analysis can achieve a rough deconvolution of two convolved signals. In the case of a sung voiced vowel, glottal impulses at a certain frequency are convolved with a filter corresponding to the shape of the vocalist's oral cavity. Depending on fundamental frequency, the cepstrum of such a signal will produce two distinctly identifiable regions: a compact representation of the filter component at the low end, and higher up, a peak associated with the pitch of the note being sung. The filter region of the cepstrum should hold its shape reasonably steady in spite of pitch changes, making it possible to identify vowels no matter which pitch the vocalist happens to be singing. As pitch moves higher, the cepstral peak actually moves lower, as the so-called "quefrency" axis corresponds to period—the inverse of frequency. If the pitch is very high, it will overlap with the region representing the filter component, and destroy the potential for recognizing vowels regardless of pitch².

² These qualities of cepstral analysis can be observed by sending cepstrum∼'s output list to an array and graphing the analysis continuously in real-time.

Having acknowledged these limitations, a useful pitch-independent vowel recognition system can nevertheless be arranged using timbreID objects very easily. Figure 4 shows a simplified excerpt of an example patch where cepstral coefficients 2 through 40 are sent to timbreID's training inlet every time the red snapshot button is clicked. Although identical results could be achieved without splitting off a specific portion of the cepstrum³, pre-processing the feature with two instances of Pd's list splitting object keeps timbreID's feature database more compact. The choice of cepstral coefficient range 2 through 40 is somewhat arbitrary, but it is very easy to experiment with different ranges by changing the arguments of the two list split objects.

³ The alternative would be to pass the entire cepstrum, but set timbreID's active attribute range to use only the 2nd through 40th coefficients in similarity calculations.

Figure 4. Sending training snapshots and continuous overlapping cepstral analyses to timbreID.

In order to train the system on 3 vowels, about 5 snapshots must be captured during training examples of each sung vowel. In order to distinguish background noise, 5 additional snapshots should be taken while the vocalist is silent. Next, the "cluster" message is sent with an argument of 4, which automatically groups similar analyses so that the first vowel is represented by cluster 0, the second vowel by cluster 1, and so on. The cluster associated with background noise will end up as cluster 3. It is not necessary to ensure that each vowel receives the same number of analyses. If there were 7 training examples for the first vowel and only 5 for the others, the clustering algorithm should still group the analyses correctly. Clustering results can be verified by sending the "cluster list" message, which sends a list of any particular cluster's members out of timbreID's fourth outlet.

To switch from training to classification, cepstrum∼'s pre-processed output must be connected to timbreID's second inlet. The actual example patch contains a few routing objects to avoid this type of manual re-patching, but they are omitted here for clarity. Activating the metro in Figure 4 enables continuous overlapping analysis. If finer time resolution is desired for even faster response, the metro's rate can be set to a shorter duration. Here, the rate is set to half the duration of the analysis window size in milliseconds, which corresponds to an overlap of 2. As each analysis is passed from cepstrum∼ to timbreID, a nearest match is identified and its associated cluster index is sent out timbreID's first outlet. The example patch animates vowel classifications as they occur.

4.2. Target-based concatenative synthesis

Some new challenges arise in the case of comparing a constant stream of input features against a large database in real time. The feature database in the vowel recognition example only requires about 20 instances. To obtain interesting results from target-based concatenative synthesis, the database must be much larger, with thousands rather than dozens of instances. This type of synthesis can be achieved using the systems mentioned in section 1, and is practiced live by the artist sCrAmBlEd?HaCkZ! using his own software design [5]. The technique is to analyze short, overlapping frames of an input signal, find the most similar sounding audio frame in a pre-analyzed corpus of unrelated audio, and output a stream of the best-matching frames at the same rate and overlap as the input.

The example included with timbreID provides an audio corpus consisting of 5 minutes of bowed string instrument samples. As an audio signal comes in, an attempt at reconstructing the signal using grains from the bowed string corpus is output in real time. Audio examples demonstrating the results can be accessed at www.williambrent.com.

In these types of applications, timbreID's third inlet can be used in order to search large feature databases. Classification requests sent to the third inlet are restricted by a few additional parameters. For instance, the search for a nearest match can be carried out on a specified subset of the database by setting the "search center" and "neighborhood" parameters.

The concatenative synthesis example provides options for different grain sizes and analysis rates, but with default settings, the process of computing a BFCC feature for the input signal, comparing it with 2500 instances in the feature database, and playing back the best-matching grain occurs at a rate of 43 times per second. Using a 2.91 GHz Intel Core 2 Duo machine running Fedora 11 with 4 GB of RAM, the processor load is about 17%. By lowering the neighborhood setting, this load can be reduced. However, reducing processor load is not the only reason that restricted searches are useful. A performer may also wish to control the region of the audio corpus from which to synthesize.

A third parameter, "reorient", causes the search center to be continually updated to the current best match during active synthesis. With matches occurring 43 times per second, the search range adapts very quickly to changes in the input signal, finding an optimal region of sequential grains from which to draw.
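The restricted search described above amounts to a windowed nearest-neighbor scan over the corpus. The following sketch is illustrative only: the function name, the symmetric window around the center, and the Euclidean metric are assumptions for the sake of the example, not timbreID's actual API.

```python
import math

def nearest_grain(corpus, target, center, neighborhood, reorient=True):
    """Restricted nearest-match search in the spirit of timbreID's
    "search center" and "neighborhood" parameters (hypothetical sketch)."""
    # only corpus indices within the neighborhood are candidates
    lo = max(0, center - neighborhood)
    hi = min(len(corpus), center + neighborhood + 1)
    best = min(range(lo, hi), key=lambda i: math.dist(corpus[i], target))
    # with "reorient" enabled, the next search is centered on this match,
    # so the range tracks a region of sequential grains in the corpus
    return best, (best if reorient else center)
```

Feeding the returned center back into the next call reproduces the adaptive behavior described above; with reorient disabled, the search stays anchored to the region chosen by the performer.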
4.3. Timbre ordering

The timbre ordering examples use two different approaches to sound segmentation: the first reads in pre-determined onset/offset times for each of 51 percussion instrument attacks, and the second automatically divides loaded samples into grains that are 4096 samples in length by default. Onset/offset labels for the first example were generated manually in Audacity, exported to a text file, then imported to a table in Pd. The percussive sound set included with this example is small, and is intended to provide a clear demonstration of timbreID's ordering capabilities. Figure 5 shows a region of the patch that includes the table where ordering information is stored and 5 sliders that control feature weighting.

Figure 5. 51 percussion sounds ordered based on a user-specified weighting of 5 features.

Ordering is always performed relative to a user-specified starting point. With 51 instruments, when an instrument index between 0 and 50 is supplied along with the "order" message, timbreID will output the ordering list at its fourth outlet for graphing. Using the 5 feature weight sliders, it is possible to boost or cut back the influence of any particular feature in the ordering process. The features implemented in this patch are temporally evolving spectral centroid, spectral flatness, zero crossing rate, loudness, and BFCCs.

After hearing the results of a particular ordering, the levels of the feature weight sliders can be changed in order to produce a new ordering and gain an understanding of the effects of various features in the process. An ordering is shown in the graph of Figure 5, where the y axis represents instrument indices 0 through 50, and the x axis indicates each instrument's position in the ordering. It begins at instrument 0 with a drum and progresses through other drum strikes followed by snares, a sequence of cymbal strikes, and a sequence of wooden instruments. Ordering the set by starting with a wooden instrument will produce a different result that retains similarly grouped sequences. An expanded version of this patch could be useful as a compositional aid for exploring relationships between sounds in a much larger set, offering paths through the sounds that are smooth with respect to different sonic characteristics.

Two types of ordering are available: "raw" and "relative". The graph in Figure 5 was produced with relative ordering, which starts with the user-specified instrument, finds the nearest match in the set, then finds the nearest match to that match (without replacement), and so on. The point of reference is always shifting. Raw ordering begins with the given instrument, then finds the closest match, the second closest match, the third closest match (also without replacement), and so on. Orderings of this type start with a sequence of very similar sounds that slowly degrades into randomness, and usually finish with a sequence of similar sounds—those that are all roughly equal in distance from the initial sound, and hence, roughly similar to each other.

The second ordering example loads and segments arbitrary sound files. Loading a speech sample generates sequences of similar phonemes with a surprisingly continuous pitch contour. Audio generated from this and other ordering examples can be accessed at the author's website.

4.4. Mapping sounds in timbre space

Another way to understand how the components of a sound set relate to one another is to plot them in a user-defined timbre space. CataRT is the most recognized and well-developed system for this task; timbreID makes it possible within Pd using GEM for two- and three-dimensional plotting. In the provided example, the axes of the space can be assigned to a number of different spectral features, zero crossing rate, amplitude, frequency, or any of 47 Bark-frequency cepstral coefficients. By editing the analysis sub-patch, additional features can be included. Figure 6 shows the speech grains described in the previous section plotted in a space where values of the second and third BFCCs are mapped to the x and y axes respectively. RGB color can be mapped to any available features as well.

Figure 6. 847 speech grains mapped with respect to the 2nd and 3rd BFCC.

Mousing over a point in the space plays back its appropriate grain, enabling exploration aimed at identifying regions of timbral similarity. The upper left region of Figure 6 contains a grouping of "sh" sounds, while the central lower region contains a cluster of "k" and "ch" grains. Other phonemes can be located as well. In order to explore dense regions of the plot, keyboard navigation can be enabled to zoom with respect to either axis (or both simultaneously), and move up, down, left, or right in the space.

Figure 7. 2400 string grains mapped with respect to amplitude and fundamental frequency.

Figure 7 shows a plot of string sample grains mapped according to RMS amplitude and fundamental frequency. Because the frequencies in this particular sound file fall into discrete pitch classes, its grains are visibly stratified along the vertical dimension.

Mapping is achieved by recovering features from timbreID's database with the "feature list" message, which is sent with a database index indicating which instance to report. The feature list for the specified instance is then sent out of timbreID's fifth outlet, and used to determine the instance's position in feature space.

5. CONCLUSION

This paper has introduced some important features of the timbreID analysis/classification toolkit for Pd, and demonstrated its adaptability to four unique tasks. Pd external source code, binaries, and the example patches described above are all available for download at the author's website: www.williambrent.com. The remaining patches in the example package—a cepstrogram plotting interface and a percussion classification system that identifies instruments immediately upon attack—were not described. The example patches are simple in some respects and are intended to be starting points that can be expanded upon by the user. Future development will be focused on adding new features to the set of feature extraction objects, implementing a kD-tree for fast searching of large databases in order to make concatenative synthesis more efficient, and developing strategies for processing multiple-frame features of different lengths in order to compare sounds of various durations.

6. REFERENCES

[1] W. Brent, "Cepstral analysis tools for percussive timbre identification," in Proceedings of the 3rd International Pure Data Convention, São Paulo, Brazil, 2009.

[2] J. Bullock, "LibXtract: A lightweight library for audio feature extraction," in Proceedings of the International Computer Music Conference, 2007.

[3] M. Casey and M. Grierson, "Soundspotter/Remix-TV: fast approximate matching for audio and video performance," in Proceedings of the International Computer Music Conference, Copenhagen, Denmark, 2007.

[4] G. Holmes, A. Donkin, and I. Witten, "WEKA: a machine learning workbench," in Proceedings of the Second Australia and New Zealand Conference on Intelligent Information Systems, Brisbane, Australia, 1994, pp. 357–361.

[5] S. König, http://www.popmodernism.org/scrambledhackz.

[6] M. Puckette, T. Apel, and D. Zicarelli, "Real-time audio analysis tools for Pd and MSP," in Proceedings of the International Computer Music Conference, 1998, pp. 109–112.

[7] D. Schwarz, G. Beller, B. Verbrugghe, and S. Britton, "Real-time corpus-based concatenative synthesis with CataRT," in Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFx), Montreal, Canada, 2006, pp. 279–282.

[8] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.

[9] X. Zhang and Z. Ras, "Analysis of sound features for music timbre recognition," in Proceedings of the IEEE CS International Conference on Multimedia and Ubiquitous Engineering, 2007, pp. 3–8.
