
A computationally efficient speech/music discriminator for radio recordings

Aggelos Pikrakis, Theodoros Giannakopoulos and Sergios Theodoridis


University of Athens
Department of Informatics and Telecommunications
Panepistimioupolis, 15784, Athens, Greece
{pikrakis, tyiannak, stheodor}@di.uoa.gr

Abstract

This paper presents a speech/music discriminator for radio recordings, based on a new and computationally efficient region growing technique that bears its origins in the field of image segmentation. The proposed scheme operates on a single feature, a variant of the spectral entropy, which is extracted from the audio recording by means of a short-term processing technique. The proposed method has been tested on recordings from radio stations broadcasting over the Internet and, despite its simplicity, has proved to yield performance results comparable to more sophisticated approaches.

Keywords: Speech/music discrimination, spectral entropy, region growing techniques

1. Introduction

The problem of speech/music discrimination is important in a number of audio content characterization applications. Since the first attempts in the mid-90s, a number of algorithms have been implemented in various application fields. The majority of the proposed methods deal with the problem in two separate steps: firstly, the audio signal is split into segments by detecting abrupt changes in the signal statistics and, at a second step, the extracted segments are classified as speech or music by using standard classification schemes. One of the first methods focused on the real-time, automatic monitoring of radio channels, using energy and zero-crossing rate (ZCR) as features [1]. In [2], thirteen audio features were used to train different types of multidimensional classifiers, including a Gaussian MAP estimator and a nearest neighbor classifier. In [3], energy, ZCR and fundamental frequency were used as features and segmentation/classification was achieved by means of a procedure based on heuristic rules. A similar approach was proposed in [4]. Frameworks based on combinations of standard Hidden Markov Models, Multilayer Perceptrons and Bayesian Networks were used in [5] and [6]. An AdaBoost-based algorithm, applied on the spectrogram of the audio samples, was used in [7]. The authors in [8] have used Gaussian Mixture Modeling on a single feature, called the warped LPC-based spectral centroid, for the classification of pre-segmented audio data into speech and music.

In this paper, a different philosophy is adopted, one that bears its origins in the field of image segmentation. The main idea is that, if speech/music discrimination is treated as a segmentation problem (where each segment is labeled as either speech or music), then each of the segments can be the result of a segment (region) growing technique, where one starts from small regions (segments) and keeps expanding them as long as certain criteria are fulfilled. This approach has been used in the past in the context of image segmentation, where a number of pixels are usually selected as candidates (seeds) for region growing. In image segmentation, regions grow by attaching neighboring pixels, provided that certain criteria are fulfilled. These criteria usually examine the relationship between statistics drawn from the region and the pixel values to be attached.

Following this philosophy, a feature sequence is first extracted from the audio recording by means of a short-term processing technique. To this end, a variant of the spectral entropy is extracted per short-term frame. Once the feature sequence is generated, a number of frames are selected as candidates for region expansion. Starting from these seeds, segments grow and keep expanding as long as the standard deviation of the feature values in each region remains below a pre-defined threshold. In the end, adjacent segments are merged and short segments are eliminated. All segments that have survived are labeled as music, whereas the rest of the feature sequence is tagged as speech. The novelty of our approach lies in the fact that ideas from the field of image segmentation are applied in the context of off-line speech/music discrimination, yielding a computationally efficient algorithm that achieves high discrimination accuracy.

The paper is organized as follows: Section 2 focuses on feature extraction, Section 3 presents the region growing technique, the performance of the proposed algorithm is discussed in Section 4 and, finally, conclusions are drawn in Section 5.
2. Feature extraction
At a first step, the audio recording is broken into a sequence
of non-overlapping short-term frames (46.4 ms long). From each frame, a variant of the spectral entropy [9] is extracted as follows, by taking into account the frequency range up to approximately 2 kHz (by definition, entropy is a measure of the uncertainty or disorder in a given distribution [10]):

• All computations are carried out on a mel scale, i.e., the frequency axis is warped according to the equation

    f = 1127.01048 · log(f_l / 700 + 1),

where f_l is the frequency value on a linear scale.

• The mel-scaled spectrum of the short-term frame is divided into L sub-bands (bins). The center frequencies of the sub-bands are chosen to coincide with the frequencies of the semitones of the chromatic scale, i.e.,

    f_k = 1127.01048 · log(f_0 · 2^(k/12) / 700 + 1),   k = 0, ..., L − 1,

where f_0 is the center frequency of the lowest sub-band of interest (on a linear scale).

• The energy X_i of the i-th sub-band, i = 0, ..., L − 1, is then normalized by the total energy of all the sub-bands, yielding

    n_i = X_i / Σ_{j=0}^{L−1} X_j,   i = 0, ..., L − 1.

The entropy of the normalized spectral energy is then computed by the equation

    H = − Σ_{i=0}^{L−1} n_i · log2(n_i).    (1)

In the sequel we will also refer to this feature by the term "chromatic entropy".
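Note that H ranges from 0, when all the energy is concentrated in a single sub-band, to log2(L), when the energy is spread uniformly across the L sub-bands. As an illustration only (not the authors' implementation), the following Python sketch computes the chromatic entropy of a single frame; the lowest center frequency f0, the upper limit of 2 kHz taken literally, and the placement of the band edges are assumptions made here to keep the example concrete:

    import numpy as np

    def mel(f_linear):
        # Mel warping used in the paper: f = 1127.01048 * log(f_linear / 700 + 1)
        return 1127.01048 * np.log(f_linear / 700.0 + 1.0)

    def chromatic_entropy(frame, fs=16000, f0=110.0, f_max=2000.0):
        # frame: 1-D array of samples (a 46.4 ms frame at 16 kHz is ~742 samples)
        # f0:    assumed center frequency of the lowest sub-band (Hz, linear scale)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2            # frame power spectrum
        freqs_mel = mel(np.fft.rfftfreq(len(frame), d=1.0 / fs))

        # Semitone-spaced center frequencies up to ~2 kHz, warped to the mel scale
        n_bands = int(np.floor(12 * np.log2(f_max / f0))) + 1
        centers = mel(f0 * 2.0 ** (np.arange(n_bands) / 12.0))

        # Assumed band edges: midpoints between adjacent centers on the mel axis
        half = np.diff(centers) / 2.0
        edges = np.concatenate(([centers[0] - half[0]],
                                centers[:-1] + half,
                                [centers[-1] + half[-1]]))

        # Sub-band energies X_i and normalized energies n_i
        X = np.array([spectrum[(freqs_mel >= lo) & (freqs_mel < hi)].sum()
                      for lo, hi in zip(edges[:-1], edges[1:])])
        n = X / (X.sum() + 1e-12)
        n = n[n > 0]
        return -np.sum(n * np.log2(n))                        # Eq. (1)

The midpoint band edges used above are only one reasonable choice; any partition whose center frequencies fall on the semitones of the chromatic scale fits the description given in the text.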
At the end of the feature extraction stage, the audio recording is thus represented by the feature sequence F, i.e., F = {O_1, O_2, ..., O_T}, where T is the number of short-term frames. Figure 1 presents the feature sequence that has been extracted from a BBC radio recording, the first half of which corresponds to speech and the second half to music. It can be observed that the standard deviation of the chromatic entropy is significantly lower for the case of music.

[Figure 1. Chromatic entropy (vertical axis) versus time in seconds (horizontal axis) for 26 seconds of a BBC radio recording.]

3. Segmentation Algorithm

Once the feature sequence has been extracted, speech/music discrimination is achieved by means of a region growing technique. The main idea behind this approach is that, at an initialization stage, a number of frames are selected as "seeds", i.e., as candidates that will serve as the basis to form regions (segments). Subsequently, by means of an iterative procedure, these regions grow (expand) while a criterion related to the standard deviation of the chromatic entropy is fulfilled. The procedure is repeated until no more region growing takes place. At a final step, neighboring regions are merged and, after merging, regions that do not exceed a pre-specified length are eliminated. At the end of this procedure, all segments that have survived correspond to music, whereas all frames that do not belong to such segments are considered to be speech. The procedure is described in detail as follows:

Initialization step - Seed generation: If T is the length of the feature sequence, a "seed" is chosen every M frames, M being a pre-defined constant. If K is the total number of seeds and i_k is the frame index of the k-th seed, then the frame indexes of the seeds form the set

    {i_1, i_2, ..., i_K}.

The k-th seed is considered to form a region, R_k, consisting of a single frame, i.e., R_k = {O_{i_k}}, where O_{i_k} is the feature value of the respective frame.

Iteration: In this step, every region R_k is expanded by examining the feature values of the two frames that are adjacent to the boundaries of R_k. To this end, let l_k and r_k be the indexes that correspond to the leftmost and rightmost frames of R_k, respectively. Clearly, if R_k consists of a single frame, then l_k = r_k = i_k. Following this notation, l_k − 1 and r_k + 1 are the indexes of the two frames which are adjacent to the left and right boundary of R_k, respectively. Our algorithm decides to expand R_k to include O_{l_k−1} if O_{l_k−1} is not already part of any other region and if the standard deviation of the feature values of this expanded region is below a pre-defined threshold T_h, common to all regions. In other words, if the standard deviation of the feature values of {O_{l_k−1}} ∪ R_k is less than T_h, then, at the end of this step, R_k will have grown by one frame to the left. Similarly, if O_{r_k+1} is not already part of any other region and if the standard deviation of the feature values in R_k ∪ {O_{r_k+1}} is less than T_h, then R_k will also grow by one frame to its right. At the end of this step, each R_k will have grown by at most two frames. It has to be noted that certain regions may not grow at all, because both frames that are adjacent to their boundaries already belong to other regions. At the end of the step, it is examined whether at least one region has grown by at least one frame. If this is the case, this step is repeated until no more region growing takes place.

Termination: After region growing has been completed, adjacent regions (if any) are merged to form larger segments. Finally, after merging is complete, short regions are eliminated by comparing their length with a pre-defined threshold, say T_min.
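To make the three steps concrete, the following Python sketch (an illustrative reimplementation under the assumptions stated in the comments, not the authors' code; all names are our own) grows regions from seeds placed every M frames, merges adjacent regions and discards regions shorter than T_min frames:

    import numpy as np

    def grow_regions(F, M, Th, Tmin):
        # F: feature sequence (chromatic entropy per frame), M: seed distance (frames),
        # Th: std threshold for growing, Tmin: minimum surviving region length (frames)
        F = np.asarray(F, dtype=float)
        T = len(F)
        owner = np.full(T, -1, dtype=int)          # region index owning each frame (-1: none)
        regions = []                               # [leftmost, rightmost] frame per region
        for k, s in enumerate(range(0, T, M)):     # one seed every M frames
            owner[s] = k
            regions.append([s, s])

        grew = True
        while grew:                                # repeat until no region grows
            grew = False
            for k, (l, r) in enumerate(regions):
                # try to grow one frame to the left
                if l - 1 >= 0 and owner[l - 1] == -1 and np.std(F[l - 1:r + 1]) < Th:
                    owner[l - 1] = k
                    regions[k][0] = l - 1
                    l -= 1
                    grew = True
                # try to grow one frame to the right
                if r + 1 < T and owner[r + 1] == -1 and np.std(F[l:r + 2]) < Th:
                    owner[r + 1] = k
                    regions[k][1] = r + 1
                    grew = True

        # termination: merge adjacent regions, then drop short ones
        regions.sort()
        merged = []
        for l, r in regions:
            if merged and l <= merged[-1][1] + 1:
                merged[-1][1] = max(merged[-1][1], r)
            else:
                merged.append([l, r])
        return [(l, r) for l, r in merged if r - l + 1 >= Tmin]   # music frame ranges

The returned frame ranges are labeled as music and all remaining frames as speech. For orientation only: with 46.4 ms frames, the seed distance of 2.0 s and the minimum duration of 3.0 s reported in Section 4 correspond to roughly M ≈ 43 and T_min ≈ 65 frames.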
Ideally, all segments (regions) that survive at the end of the algorithm should correspond to music and any frame outside these regions should correspond to speech. This is because the proposed scheme relies on the assumption that music segments exhibit a low standard deviation in terms of the adopted feature (see Figure 1). As will be explained in the next section, our approach, despite its simplicity, exhibits a high discrimination accuracy.

Finally, it has to be noted that the proposed algorithm depends on three parameters, namely M, the distance (measured in frames) between successive seeds, T_h, the threshold for the standard deviation (based on which region growing takes place), and T_min, the minimum segment length (used in the final stage of the algorithm). The choice of values for these parameters is explained in the next section.

4. Experiments

We carried out two sets of experiments, each on a separate dataset, in order to:

a) Determine the values of the parameters of the method, subject to two different maximization criteria, namely overall discrimination accuracy and music precision. The latter refers to the proportion of audio data in the recording that corresponds to music and was also classified as music (see also subsections 4.2 and 4.3). It has to be noted that, by maximizing music precision, overall discrimination accuracy is likely to decrease. Although this is undesirable if the proposed method is used as a standalone discriminator, it may not be a restriction if it is used as a low-complexity preprocessing step for music detection. In this case, high music precision (close to 100%) ensures that all detected segments are correctly classified as music, and all remaining parts of the audio recording can be subsequently fed to other, more sophisticated discrimination schemes for further processing.

b) Assess the algorithm's performance, using the parameters extracted from maximizing the desired criteria.

4.1. Datasets

The audio data used for the above purposes was collected from seven different BBC Internet radio stations, covering a wide range of music genres and speakers. Obviously, the dataset used for testing system performance was different from the dataset used for parameter tuning. More specifically, 30 minutes of audio recordings (dataset D1) were used for estimating parameter values. To this end, an exhaustive approach was adopted, i.e., each parameter was allowed to vary within a predefined range of values. For system testing, a different dataset, D2, was created, consisting of audio recordings with a total duration of 160 minutes. A 16 kHz sampling rate was used in all cases. All recordings (of both datasets) were manually segmented and labeled as speech or music. It has to be noted that "silent" segments (i.e., segments with almost zero energy) were treated as speech, under the observation that such segments usually occur between speech segments. This manual segmentation procedure revealed that 69.77% of the data was music and 30.23% was speech.

4.2. Segmentation results for maximizing the overall accuracy

The parameter estimation process, subject to maximizing discrimination accuracy for dataset D1, led to the values of Table 1.

    Threshold    Min. Duration    Seed Dist
    0.50         3.0 sec          2.0 sec

Table 1. Parameter values subject to maximizing discrimination accuracy over D1

Using the above parameter values, our method was then tested on D2. The proposed scheme classified 75.10% of the data in D2 as music and 24.90% as speech. Table 2 presents the average confusion matrix C of the discrimination results. Each element C_{i,j} of the matrix corresponds to the percentage of data whose true class label was i and which was classified to class j.

               Music      Speech
    Music      69.13%     0.65%
    Speech     5.97%      24.25%

Table 2. Average confusion matrix for D2 using the parameter values in Table 1

From C one can directly extract the following measures for each class:

1. Recall (R_i). R_i is the proportion of data with true class label i that was correctly classified in that class. For example, the recall of music is calculated as R_1 = C_{1,1} / (C_{1,1} + C_{1,2}).

2. Precision (P_i). P_i is the proportion of data classified as class i whose true class label is indeed i. Therefore, music precision is P_1 = C_{1,1} / (C_{1,1} + C_{2,1}).

According to the confusion matrix, the overall discrimination accuracy of our system is equal to 93.38% (C_{1,1} + C_{2,2}). Table 3 presents recall and precision values for both speech and music. From this table, it can be seen that more than 99% of the "true" music data was detected, while the "false alarm" for the music class was below 8%.
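As a quick check of these definitions against Table 2, music recall is R_1 = 69.13 / (69.13 + 0.65) ≈ 99.07% and music precision is P_1 = 69.13 / (69.13 + 5.97) ≈ 92.05%, which are exactly the values listed for music in Table 3; the corresponding speech figures follow in the same way.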
    Music Recall        99.07%
    Speech Recall       80.25%
    Music Precision     92.05%
    Speech Precision    97.39%

Table 3. Recall and precision of both classes for the parameter values in Table 1

4.3. Segmentation results for maximizing music precision

As explained above, we have also estimated the parameter values subject to maximizing music precision. The resulting values are presented in Table 4.

    Threshold    Min. Duration    Seed Dist
    0.30         5.0 sec          2.0 sec

Table 4. Parameter values subject to maximizing music precision

The average confusion matrix in this case is presented in Table 5. It can be seen that 58.09% of the data was classified as music and 41.91% as speech, and the overall accuracy of the system was 87.58%, almost 6% lower than the accuracy presented in Section 4.2. However, as can be seen in Table 6, music precision is now equal to 99.36%. This leads us to the conclusion that, when the proposed algorithm is fed with this second set of parameter values, it can be used as a low-complexity preprocessing step in a more sophisticated audio characterization system (e.g. [6]), for the initial detection of a smaller proportion of the music segments (smaller music recall), but with an almost zero "false alarm" (0.4%).

               Music      Speech
    Music      57.72%     12.05%
    Speech     0.37%      29.86%

Table 5. Average confusion matrix for the results obtained with the parameter values in Table 4

    Music Recall        82.73%
    Speech Recall       98.78%
    Music Precision     99.36%
    Speech Precision    71.25%

Table 6. Recall and precision of both classes for the parameter values in Table 4
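Again, these figures follow directly from Table 5: music precision is 57.72 / (57.72 + 0.37) ≈ 99.36%, music recall drops to 57.72 / (57.72 + 12.05) ≈ 82.73%, and the "false alarm" of 0.4% quoted above is the 0.37% of the data that is speech misclassified as music.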
5. Conclusions

This paper presented a computationally efficient, off-line speech/music discriminator based on a region growing technique that operates on a single feature, which we call chromatic entropy. The system was tested on recorded Internet radio broadcasts (of almost 3 hours total duration) and achieved an average discrimination accuracy of 93.38%. This is comparable to the performance obtained with other, computationally more complex methods. It is worth noticing that, if the method's parameters are tuned to maximize music precision, although the system's overall accuracy drops to 87.58%, music precision almost reaches 100% (99.36%), i.e., practically all surviving segments correspond to music. Taking these results into account, it can be concluded that the proposed algorithm is capable of working both:

1. As a standalone speech/music discriminator of high performance.

2. As a computationally efficient preprocessing stage for music detection in audio streams (when the parameters are tuned to maximize music precision). In this latter case, non-music segments can be further processed by more complex discrimination schemes.

References

[1] J. Saunders, "Real-time discrimination of broadcast speech/music", in Proc. of ICASSP 1996, Vol. 2, pp. 993-996, Atlanta, USA, May 1996.

[2] E. Scheirer and M. Slaney, "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator", in Proc. of ICASSP 1997, pp. 1331-1334, Munich, Germany, 1997.

[3] Tong Zhang and C.-C. Jay Kuo, "Audio Content Analysis for Online Audiovisual Data Segmentation and Classification", IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 4, pp. 441-457, May 2001.

[4] C. Panagiotakis and G. Tziritas, "A Speech/Music Discriminator Based on RMS and Zero-Crossings", IEEE Transactions on Multimedia, Vol. 7, No. 1, pp. 155-166, Feb. 2005.

[5] Jitendra Ajmera, Iain McCowan and Herve Bourlard, "Speech/music segmentation using entropy and dynamism features in a HMM classification framework", Speech Communication, Vol. 40, pp. 351-363, 2003.

[6] Aggelos Pikrakis, Theodoros Giannakopoulos and Sergios Theodoridis, "Speech/Music Discrimination for radio broadcasts using a hybrid HMM-Bayesian Network architecture", in Proc. of the 14th European Signal Processing Conference (EUSIPCO-06), Florence, Italy, September 4-8, 2006.

[7] N. Casagrande, D. Eck and B. Kegl, "Frame-level audio feature extraction using AdaBoost", in Proc. of ISMIR 2005, pp. 345-350, London, UK, 2005.

[8] J. E. Munoz-Exposito et al., "Speech/Music discrimination using a single Warped LPC-based feature", in Proc. of ISMIR 2005, pp. 614-617, London, UK, 2005.

[9] Hemant Misra, Shajith Ikbal, Herve Bourlard and Hynek Hermansky, "Spectral entropy based feature for robust ASR", in Proc. of ICASSP 2004, Vol. 1, pp. 193-196, Montreal, Canada, 2004.

[10] A. Papoulis and S. Unnikrishna Pillai, Probability, Random Variables and Stochastic Processes, 4th edition, McGraw-Hill, NY, 2001.
