Fundamental Frequency Estimation Techniques for Multi-microphone Speech Input
Federico Flego
Advisor:
Maurizio Omologo
March 2006
to Clara
Acknowledgments
My deepest gratitude goes to Clara, who I met just before starting the PhD
and who has always supported me during the many difficult moments. I
have no words to thank her enough, and her unconditional love is the most
important lesson I have ever received.
I also thank my parents Gianfranco and Laura and my sister Francesca,
who have always encouraged me and supported me, and not just morally!
It makes me very proud to know that I can count on them, wherever I may
be and at any time.
I wish to thank all my friends who, despite my neglecting them for some
time, never stopped their encouragement and enthusiasm. I send my deep-
est affection to Sabrina and Stefano, colleagues during the PhD and real
friends. Both of them had a baby during their studies, and it is also for
this reason that I admire them... to the point that I would like to emulate
them soon! An affectionate thought goes also to Leonardo, Emanuela, Italo
and Valentina, with whom I need just a few words and a glass of good wine
to let our complicity emerge.
A particular thanks also goes to Alfiero and Luca, who “squeezed” into
their office to make room for me when I moved to the ITC-irst. They
helped me a lot, Alfiero always available for a technical and less technical
chat, Luca with his witty and sharp jokes. Besides them, I would really like
to thank all the guys of the “gioviniitc”, who helped me in the difficult
moments with their contagious free-and-easy spirit.
I would also like to thank Arianna and Paolo for everything they have
done and continue doing respectively as representatives of the PhD stu-
dents of the DIT and president of the ADI of Trento.
Finally, thanks and a wish of serenity to all those people who apply them-
selves in what they are doing with seriousness and humbleness.
Abstract
In speech processing, the estimation of fundamental frequency (f0) aims
to measure the frequency with which the vocal folds vibrate during voiced
speech. This task is generally performed exploiting signal processing tech-
niques applied to the speech signal previously acquired by an acoustic sen-
sor. f0 represents a high-level speech feature which is exploited by many
speech processing applications, such as speech recognition, speech coding and
speech synthesis, to improve their performance. After decades of research
and innovation, the performance of these pitch based speech applications
has improved to the point that they are now robust for most practical appli-
cations. However, phenomena such as noise and reverberation, characteristic of real-world acoustic scenarios, still have to be coped with. Currently, the performance of f0 estimation techniques, conceived to work on high-quality speech signals, drops dramatically whenever such adverse acoustic conditions are considered. To overcome these limitations, the proposed f0 estimation algorithm exploits the information redundancy provided by a Distributed Microphone Network (DMN), which consists of a generic set of microphones localized in space without any specific geometry. The DMN outputs are processed in parallel in the frequency domain, and the reliability of each channel is evaluated to derive a common representation from which f0 is finally obtained. Compared to state-of-the-art f0 estimation techniques, this approach proved to be particularly robust. To demonstrate this, experimental results were obtained from tests conducted on international speech databases acquired in real noisy and reverberant scenarios.
As a second example of an f0-based application in distant-talking contexts, a Blind Source Separation (BSS) system was addressed. To improve its separation performance, an f0-based post-processing scheme relying on adaptive comb filters was designed. Tests conducted on reverberant speech data confirmed the advantages of the proposed solution.
Keywords
Fundamental frequency, pitch, noise, reverberation, blind source separation
Contents
1 Introduction 1
1.1 The Context . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 The Solution . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Innovative Aspects . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . 6
3.2.1 The Discrete Fourier Transform (DFT) . . . . . . . 73
3.2.2 The spectrogram . . . . . . . . . . . . . . . . . . . 78
3.3 Applications of f0 estimation techniques . . . . . . . . . . 80
3.3.1 Speech coding . . . . . . . . . . . . . . . . . . . . . 80
3.3.2 Signal processing hearing aids . . . . . . . . . . . . 81
3.3.3 Glottal-synchronous speech analysis . . . . . . . . . 82
3.3.4 Music transcription . . . . . . . . . . . . . . . . . . 82
3.3.5 Speaker recognition . . . . . . . . . . . . . . . . . . 83
3.3.6 Automatic Speech Recognition (ASR) . . . . . . . . 85
3.3.7 Blind Source Separation (BSS) . . . . . . . . . . . 89
3.3.8 Dereverberation . . . . . . . . . . . . . . . . . . . . 91
3.4 Noise and Reverberation . . . . . . . . . . . . . . . . . . . 92
3.4.1 Environmental noise . . . . . . . . . . . . . . . . . 94
3.4.2 Reverberation . . . . . . . . . . . . . . . . . . . . . 99
3.4.3 Modeling noise and reverberation . . . . . . . . . . 105
5.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . 157
Bibliography 193
List of Tables
List of Figures
3.2 source-filter model: time domain . . . . . . . . . . . . . . . 69
3.3 source-filter model: frequency domain . . . . . . . . . . . . 71
3.4 F 1/F 2 chart of Italian vowels . . . . . . . . . . . . . . . . 72
3.5 Fourier coefficients computed on a voiced speech segment . 74
3.6 Spectrograms of speech signal from a female speaker . . . . 79
3.7 Simplified model of an HMM based ASR system . . . . . . 87
3.8 Example of a Blind Source Separation system . . . . . . . 90
3.9 WAUTOC of (white) noisy speech signal . . . . . . . . . . 96
3.10 CMNDF of (white) noisy speech signal . . . . . . . . . . . 97
3.11 Spectrogram of a noisy speech signal . . . . . . . . . . . . 99
3.12 Reverberant room impulse response . . . . . . . . . . . . . 102
3.13 WAUTOC of a reverberant speech signal . . . . . . . . . . 104
3.14 CMNDF of a reverberant speech signal . . . . . . . . . . . 105
5.10 CHIL meeting room at the Karlsruhe University . . . . . . 156
5.11 Gross error rates (20%): CHIL database . . . . . . . . . . 160
5.12 Gross error rates (5%): CHIL database . . . . . . . . . . . 162
Chapter 1
Introduction
1.1 The Context

The context of this thesis is speech fundamental frequency (or pitch) es-
timation based on a multi-microphone speech input. Pitch estimation be-
longs to the Speech Processing area which, in turn, comprises many re-
search disciplines such as electrical engineering (computer science, signal
processing and acoustics), psychology (psychoacoustics and cognition) and
linguistics (phonetics, phonology and syntax). The objective of pitch es-
timation is to measure the oscillation frequency (f0 ) of the vocal folds in
voiced speech. The estimated f0 represents a useful source of information
for many speech applications such as, among others, speech recognition,
speech coding and speech synthesis. The particular acoustic scenario addressed in this thesis considers speech signals acquired by a set of far-field microphones, whose outputs are therefore severely degraded by environmental noise and reverberation effects. Pitch estimation based on such a microphone setup has not been widely addressed in the literature so far. However, it is likely to become a reference scenario, considering the growing interest in pitch based speech applications designed to work in distant-talking contexts.
1.2 The Problem
During the speech production process the airflow produced by the lungs passes through the larynx, the pharyngeal, oral and nasal cavities, finally radiating through the lips and nose. The overall shape of the vocal tract acts as a resonator which modulates the airflow to produce the desired sounds. When voiced speech is generated, as for example during vowel production, the vocal folds at the top of the larynx open and close in a quasi-periodic fashion, due to the air pressure which accumulates below them. The frequency of these oscillations is called the speech fundamental frequency (f0) and is responsible for the perceived pitch of the produced sound. For this reason f0 estimation algorithms are also referred to as Pitch Detection Algorithms (PDAs), although what is actually measured is the oscillation frequency of the vocal folds, a physical quantity, not the consequent subjective perception.
To detect and estimate f0 in voiced speech, modern approaches rely
on digital signal processing techniques applied to the signals provided by
one or more microphones, which are used to record the speaker. In these
signals f0 manifests itself as a periodic pattern in the time domain, or as
a series of peaks in the frequency domain. In the first case the period
with which the pattern repeats itself (T0 ) coincides with the inverse of f0
while, in the second case, f0 determines the position of the first peak as
well as the spacing between two adjacent peaks. Pitch estimation is thus generally carried out in one or the other domain, trying to detect signal self-similarities or frequency peak positions, respectively.
The difficulties encountered during the estimation procedure are mainly related to the inherent variability of the human voice on one side, and to the inevitable quality loss which occurs during speech signal acquisition on the other.
contexts. The rest of the chapter is thus devoted to the analysis of the adverse effects of noise and reverberation on pitch estimation.
Chapter 4 shows the limitations of state of the art pitch extraction algorithms when tested on noisy and reverberant speech signals. These drawbacks become more evident as the acoustic sensors employed for speech signal acquisition are placed farther from the end-user of pitch based systems. To overcome these limitations and to allow the talker to move freely in space, independently of the microphone positions, the concept of Distributed Microphone Network (DMN) is introduced. This microphone setup is then exploited to derive a new pitch extraction algorithm based on the Multi-microphone Periodicity Function (MPF). The performance of the proposed algorithm is then measured employing real world speech data and compared with that of other state of the art algorithms. Detailed results are presented in Chapter 5, where two different acoustic scenarios are addressed.
Chapter 2
State of the Art
This chapter describes the state of the art of speech fundamental frequency (f0) estimation algorithms. The term pitch is also used to indicate the fundamental frequency in this context, although pitch derives from psychoacoustics, where it refers more properly to the subjective perception produced by voiced speech. Both terms, though, refer to the phenomenon occurring when voiced sounds are uttered, that is, to the periodic oscillation of the vocal folds. This oscillation is responsible for intonation and manifests itself in the sound pressure waveform as a periodic pattern.
The aim of pitch estimation algorithms is to detect and measure the period length of the repeating pattern characteristic of voiced speech. This measure is referred to as the fundamental period (T0) and is the inverse of the fundamental frequency, that is, T0 = 1/f0.
In this chapter several state of the art pitch estimation algorithms are described, along with examples of their working principle applied to voiced speech segments. Their description follows the classification given in [43]: the first section describes the algorithms that operate in the time domain, while the second gives an account of those based on short-term analysis.
The expression “time domain” refers to algorithms which directly analyze the signal waveform.
2.1 Introduction
The algorithms which deal with the problem of estimating the fundamental frequency of a periodic or quasi-periodic signal are commonly referred to as Pitch Detection Algorithms (PDAs). What these techniques actually provide is an estimate of the fundamental frequency, bearing in mind, however, the psychological link between f0 and pitch, which is defined by the American National Standards Institute (ANSI) as the auditory attribute of sound according to which sounds can be ordered on a scale from low to high. This definition is not universally accepted, though, and there is considerable debate, mostly related to the fact that it is not possible to build a one-to-one relationship between pitch and frequency. In practice, the term pitch is given two different meanings depending on the context in which it is used.
Figure 2.1: Fundamental processing blocks of a PDA. The speech signal is first preprocessed in order to reduce data complexity or to apply linear or non-linear transformations. The extractor block is responsible for estimating the signal fundamental frequency, while the post-processing block can perform error detection or smoothing on the previously estimated f0 values.
The classification adopted here distinguishes between time domain and short-term analysis based pitch extraction algorithms. The rule for labeling an algorithm as belonging to one or the other domain is to consider the domain of the input to the extractor block. When the input to the extractor is the signal itself, suitably conditioned by the preprocessor, and pitch estimation is carried out on a period-by-period basis, that is, the algorithm is capable of determining each individual period length, the algorithm belongs to the time domain category. When instead the preprocessor takes short-term intervals (frames) of the input signal, each including several periods, and provides the extractor with an alternative representation such as, for example, autocorrelation values (lag domain) or spectral values (frequency domain), the algorithm is said to belong to the short-term analysis category. In this case each provided period length estimate can be considered an “average” of several contiguous period length values.
Figure 2.2: Simplified diagram of time domain pitch determination algorithms [43].
Time domain based PDAs can be further grouped into: those which aim to extract the fundamental frequency directly, such as the zero (or non-zero) threshold crossing based ones; those which perform structural analysis, relying on the periodic exponential decay characteristic of voiced sounds; those which perform waveform structure simplification in order to extract a sequence of extremes from which the fundamental frequency is estimated; and those which perform parallel processing by means of multichannel1 analysis.
The simplest time domain based PDAs are the Zero-crossing Analysis Basic Extractor (ZXABE) and the Threshold Analysis Basic Extractor (TABE) [43], dating back to the 1960s and developed on analog systems. That period also coincided with the advance of the computer in the domain of signal processing, and these simple PDAs, and their derivations, were suitable for being implemented as computer programs. The basic principle of the ZXABE, as shown in the left panel of Figure 2.3, is to produce a marker each time
1 The term “multichannel” used here is inherited from [43]. It indicates that the data flow of a single input speech signal is duplicated to be processed by more than one processing unit, in a parallel fashion. It should not be confused with the term “multi-microphone”, which is used in this thesis to indicate several input speech signals coming from different acoustic sensors.
the signal changes polarity, that is, each time the signal amplitude changes from a negative to a positive value. As is evident from the figure, in this way many markers are produced even within a single pitch period of the signal. The problem comes from the presence in the signal of harmonics other than the fundamental frequency. This poses the necessity to preprocess the signal in order to provide the ZXABE with a signal that has only two zero crossings per period. The latter is not easy to achieve, since it requires isolating the fundamental frequency or, at least, enhancing it while attenuating the other harmonics. The phase of each harmonic should also be taken into account, since these values determine how clearly the waveform reveals the presence of the fundamental.
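To make the principle concrete, the following is a minimal Python/NumPy sketch of a zero-crossing based extractor (illustrative code, not part of the original thesis; function and variable names are arbitrary): positive-going zero crossings are taken as period markers and the median spacing between markers is converted into an f0 estimate.

import numpy as np

def zxabe_f0(signal, fs):
    # Naive zero-crossing extractor: mark positive-going zero crossings and
    # derive f0 from the spacing between consecutive markers.
    s = np.asarray(signal, dtype=float)
    markers = np.flatnonzero((s[:-1] < 0) & (s[1:] >= 0)) + 1
    if len(markers) < 2:
        return None                      # not enough crossings to estimate a period
    periods = np.diff(markers)           # candidate period lengths in samples
    return fs / np.median(periods)       # median gives some robustness to spurious markers

# Example on a synthetic 120 Hz signal with a weak second harmonic; on real
# speech, low-pass preprocessing is needed before the markers become reliable.
fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)
x = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(zxabe_f0(x, fs))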
Figure 2.3: Examples of PDA with zero-crossing (left) and threshold analysis (right) based extractor applied to a voiced segment of speech.
A threshold based extractor may miss some target peaks or detect others which are not. This comes from the difficulty of fixing a threshold which guarantees perfect period length detection, since the amplitude varies with time and there are cases where more than one high peak occurs in each period. In the latter case, setting the threshold high enough to avoid false detections and low enough to avoid missing any target peak is not easy to accomplish.
Another solution is to set two positive thresholds, a lower and a higher one, and to mark the beginning of a period only when the two are crossed in succession. Signal values which repeatedly cross just one threshold will not generate a marker. This extractor is named “TABE with hysteresis” and improves the performance with respect to the above described extractors. Still, there remains the need to fix threshold values, which may work for certain speech segments and not for others.
In the three cases a common preprocessing technique is to low-pass filter the speech signal in order to attenuate higher harmonics by about 6 to 12 dB per octave. The objective is to clear out higher harmonics so that threshold crossings are due only to the presence of the fundamental frequency in the signal. Though it improves the performance, this rule of thumb is not well suited to the voiced speech signal, in which the position and amplitude of the fundamental frequency and its harmonics continuously vary. Additionally, the fundamental frequency is not always present in voiced speech, even if it is perceived as dominant by the human ear. A lot of effort was put into designing complex preprocessors implementing adaptive low-pass and non-linear filtering, in order to enhance the frequency region where f0 was expected to be and to cope with its variations over time.
Several non-linear filtering techniques were used in the preprocessors of these extractors, with the objective of flattening the voiced speech spectrum. The effect of formants is to enhance some harmonics while attenuating others. Designing a preprocessor which can provide the pitch extractor with an input independent of the particular sound uttered, that is, with a flat spectrum, was demonstrated to be particularly beneficial for estimating f0 correctly. To this aim, half-wave or full-wave rectification, as well as squaring or peak clipping of the signal, were introduced in the preprocessor [103, 85].
Using thresholds makes it necessary to normalize the signal amplitude, but this cannot be accomplished in advance on the whole signal, since its dynamics vary with time. The solution in this case can be the use of a dynamic compressor to normalize over short-term portions of the signal.
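As an illustration of this kind of preprocessing (a sketch, not the thesis implementation), the code below half-wave rectifies the signal and normalizes each short-term portion to unit peak amplitude, so that a fixed threshold keeps its meaning across segments with different dynamics; the frame length of 400 samples is an arbitrary choice.

import numpy as np

def rectify_and_compress(signal, frame_len=400):
    # Half-wave rectification followed by frame-wise peak normalization
    # (a crude dynamic compressor).
    s = np.maximum(np.asarray(signal, dtype=float), 0.0)
    out = s.copy()
    for start in range(0, len(out), frame_len):
        frame = out[start:start + frame_len]     # view into `out`
        peak = np.max(np.abs(frame))
        if peak > 0:
            frame /= peak                         # in-place short-term normalization
    return out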
two algorithms were devised: the envelope modeling and the sequence of
extremes based algorithms.
Envelope modeling
Figure 2.4: Examples of PDAs with an envelope modeling based extractor applied to a voiced segment of speech. Correctly determining the time constant τ of the decaying envelope is crucial for the correct behaviour of the algorithm. On the left, an example with a correctly estimated τ = 4 ms shows period markers at sample instants 57, 165 and 272. On the right, the same speech segment processed with τ = 2 ms, which detects many false period markers.
imposed by the hardware capabilities of that time [20, 21, 3, 26, 27].
Among others, the peak-picker algorithm [44] was based on the envelope modeling approach and on earlier work by [41]. Being suitable for real-time applications, given its small input-output delay, it was also implemented as part of a cochlear implant prosthesis for pattern processing hearing aids at University College London [45].
1. data reduction: eliminate samples of the incoming signal which do not belong to a chosen feature;
This class of algorithms is not always suitable for real time applications because, although they provide f0 estimates on a period-by-period basis, they need to process (see step 3) segments of signal larger than one single period in order to choose the correct sequence of delimiters.
One of the best known algorithms belonging to this class is the one based on peak detection and global correction [86, 87]. This algorithm was designed for a speech recognizer tool and made use of maxima and minima localization on a 25 ms frame basis. Within each frame, all maxima and minima were first searched and tested, applying a set of conditions to verify whether they could be candidate period delimiters. These conditions involved comparisons with other maxima and minima, and with the absolute maxima and minima, as well as with other values obtained from those by linear interpolation. Period length prediction is also used to correct errors based on the previously estimated period lengths; in particular, markers can be shifted, removed or inserted to make the final f0 contour consistent. Nevertheless, as happens with many algorithms which include pitch correction features based on past estimated values, whenever too many errors are encountered the global correction routine fails and can severely compromise the correctness of future estimates [53].
Pitch chaining
order to reduce the number of ECs and “isolate” the “significant” ones,
which mark the beginning of each pitch period.
Mixed-feature
Figure 2.5: Mixed-feature based PDA which exploits both regularity and peakedness of the incoming signal. Six Primary Extractors (PE) elaborate sequences of maxima and minima, as well as inter-peak measurements, to produce six period values from which a final estimate is selected.
Figure 2.6: Example of the six individual peak functions Mi, i = 1 . . . 6 on a voiced speech segment. Each stream of measurements with the same label feeds a Primary Extractor (PE), which in turn returns a period estimate.
Each column of the resulting matrix corresponds to an extractor, while row one represents the direct estimates from the PEs; rows two and three report the estimates of the two previous periods, respectively; rows four, five and six represent the sums of the estimates of the first and second, the second and third, and the first to the third rows, respectively.
The final period estimate is computed as the one that has the highest degree of coincidence with the other values in the matrix, evaluated using the absolute difference between values.
The reason for also including rows four to six stems from the fact that all extractors are biased toward too-high f0 errors, that is, toward the second and third harmonics still present in the signal. When this is the case, these matrix rows contribute to providing the correct result.
The inverse transfer function H^{-1}(z) = 1/H(z) can thus be written as:
2 X(z), S(z) and H(z) are the z-transforms of the discrete signals x(n), s(n) and h(n), respectively. If the complex variable z is set to e^{j2πf}, the frequency spectrum is obtained.
H^{-1}(z) = Σ_{i=0}^{p} d_i z^{-i},   (2.3)
In the time domain, this corresponds to the linear prediction model
x(n) = Σ_{i=1}^{p} a_i x(n − i) + e(n),   (2.4)
where x(n) represents the speech signal, a_i the filter coefficients and e(n) the error signal. Equation 2.4 is a purely recursive digital filter, that is, an all-pole model of the vocal tract transfer function.
The vocal tract parameters vary during the speech process, implying that the coefficients a_i must be time variant as well, in order to follow these variations. This in turn implies that parameter estimation must be performed on a short-term basis, usually over frames of 10–30 ms.
A typical approach to determine the predictor coefficients is to minimize the energy of the error signal e(n) within a given frame. Solving for e(n) gives
e(n) = x(n) − Σ_{i=1}^{p} a_i x(n − i),   (2.5)
defining with
x̂(n) = Σ_{i=1}^{p} a_i x(n − i)   (2.6)
the predicted sample. Equations 2.5 and 2.6 represent non-recursive digital filters, and the first one has the same structure as Equation 2.2.
From linear filter theory, in the case of a stationary signal the predictor would be able to estimate perfectly and the error would be zero. The speech signal can be considered stationary only between glottal pulse excitation instants, where it consists of a sum of decaying sinusoids.
LPC analysis assumes the source excitation signal to be impulsive and provides a solution for the coefficients a_i so that the error function e(n) can be considered, within a certain approximation and considering the purpose of the analysis3, the glottal pulse function s(n).
To obtain this, the mean square of e(n) is expressed as a function of the predictor coefficients and a set of linear equations is solved by exploiting autocorrelation and covariance functions [55, 57].
In Figure 2.7 an example of LPC analysis is shown. In the upper left panel three periods of a vowel sound are represented, and the corresponding magnitude spectrum is plotted at the right. The latter clearly shows the harmonic structure of the signal as a sequence of narrow peaks. Applying LPC analysis with 18 coefficients a_i, the inverse transfer function H^{-1}(z) is estimated; its inverse, that is H(z), is shown in the middle right panel, with formant labels F_i, i = 1, . . . , 4. Once H^{-1}(z) is known, it is possible to obtain the glottal pulse transfer function S(z) = X(z) · H^{-1}(z), which is reported in the bottom right panel. The formant structure has been almost removed while the harmonic structure has been preserved. In the time domain, the residual signal e(n) shows a series of peaks corresponding to the excitation signal s(n), and is shown in the bottom left panel.
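A minimal sketch of this procedure with NumPy/SciPy (illustrative, not the thesis code): the predictor coefficients are obtained with the autocorrelation method and the residual e(n) is computed by inverse filtering; on voiced speech the residual shows peaks near the glottal excitation instants.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_residual(x, order=18):
    # LPC coefficients via the autocorrelation method, then inverse filtering.
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = solve_toeplitz((r[:order], r[:order]), r[1:])   # normal equations R a = r
    e = lfilter(np.concatenate(([1.0], -a)), [1.0], x)  # A(z) = 1 - sum_i a_i z^-i
    return e, a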
LPC analysis is not free from drawbacks: the minimization operation
3
The actual excitation signal is not preserved by LPC analysis which cannot really distinguish between
components of the vocal tract and those belonging to the glottal pulse and retains the latter just to an
impulsive extent.
Figure 2.7: Example of LPC analysis. Upper left panel: three periods of a vowel sound; upper right panel: corresponding magnitude spectrum showing the signal harmonic structure as a sequence of narrow peaks; middle right panel: vocal tract transfer function H(z) estimated by LPC analysis (18 coefficients), with formants labeled F_i, i = 1, . . . , 4; bottom right panel: glottal pulse transfer function obtained as S(z) = X(z)/H(z). The formant structure has been almost removed while the harmonic structure has been preserved; bottom left panel: residual signal e(n) in the time domain. The excitation signal s(n) can be approximated with the series of peaks shown.
applied to the error signal does not always preserve the excitation signal [35]; additionally, when the first formant frequency coincides with the fundamental frequency f0, removing the formant effect tends to cancel the latter from the residual signal, frustrating any further attempt to detect the correct pitch period.
Epoch detection
Figure 2.8: Example of epoch detection on a voiced speech segment. The top left graph shows a voiced speech segment. Below it, the output of each of the 19 bandpass filter channels (center frequencies from 120 Hz to 3800 Hz) is shown. On the right side of the figure are the rectified and smoothed bandpass filter outputs, while the bottom right graph shows their sum.
• The multi-feature principle: in this case several PDAs perform parallel processing. Each PDA operates independently of the others and provides different signal features or, alternatively, all PDAs compute the same set of features obtained with different techniques. A common preprocessing stage for all PDAs can be present.
In this setup a data fusion technique has to be used to select from all channel outputs, or to combine them, in order to provide a single pitch estimate. Whenever a selection among channel results must be made, deciding how to detect the channel providing the correct estimate is not trivial. A basic decision rule can be to choose the channel providing the longest period estimate, or the one that shows the highest number of occurrences of a certain pitch value. Another issue related to multichannel analysis concerns the phase of the period markers. Different channels can provide period markers which differ in phase, that is, the beginning and end of a glottal cycle may be detected differently by each channel. This is explained by the fact that different signal features are involved in each channel. In case the particular application does not require pitch phase information, an average can be computed among all phases.
Short-term analysis based PDAs differ from time domain based algorithms in that pitch estimation is performed on a short segment of the input speech signal. This implies that the estimated fundamental frequency no longer refers to a specific time instant (or glottal cycle) but may cover several pitch periods, representing their average.
Not being a period-by-period processing technique, short-term analysis does not estimate glottal cycle phase information. In case this information is not needed, this represents an advantage, since these algorithms are more robust to phase distortion, which can severely affect the performance of time domain techniques.
This class of algorithms also turns out to be more robust to noise and signal corruption. This is because, for each estimate, a signal segment longer than a single pitch period is processed.
x_s(n, q) = x(n) · w(n − q),   with   w(n) ≠ 0 for 0 ≤ n ≤ N,   w(n) = 0 otherwise.   (2.7)
Bartlett:   w(n) = 2n/N for 0 ≤ n ≤ N/2;   2 − 2n/N for N/2 < n ≤ N;   0 otherwise.
Blackman:   w(n) = 0.42 − 0.5 cos(2πn/N) + 0.08 cos(4πn/N) for 0 ≤ n ≤ N;   0 otherwise.
Table 2.1: Some commonly used windows of length N + 1 samples (assuming N even), symmetric with respect to sample N/2 [78].
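For reference, the two windows surviving in the table can be generated directly, for example with NumPy (a small illustrative sketch; NumPy also provides the equivalent helpers np.bartlett and np.blackman):

import numpy as np

def bartlett_window(N):
    # Triangular (Bartlett) window of length N + 1, symmetric about N/2
    n = np.arange(N + 1)
    return np.where(n <= N / 2, 2.0 * n / N, 2.0 - 2.0 * n / N)

def blackman_window(N):
    # Blackman window of length N + 1
    n = np.arange(N + 1)
    return 0.42 - 0.5 * np.cos(2 * np.pi * n / N) + 0.08 * np.cos(4 * np.pi * n / N)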
Figure 2.9 shows the absolute values of the Fourier transforms, expressed in decibels, of the windows listed in Table 2.1. The main window characteristics in the frequency domain are the resolution capability, the peak-sidelobe level and the side lobe roll-off. Resolution refers to the capability to distinguish different tones and is inversely proportional to the main lobe width (plotted in red in the figure). The peak-sidelobe level refers to the maximum response outside the main lobe and determines whether signals with small peaks in the frequency domain are hidden by nearby stronger ones. The side lobe roll-off is measured as the side lobe decay per decade4 and is traded off against the peak-sidelobe level [42, 78, 85].
Each set of samples involved at each processing step is referred to as a frame. Successive frames can overlap to a certain extent, so that the inter-
4
A frequency decade is a 10-fold increase or decrease in frequency.
Figure 2.9: Fourier transforms (log magnitude) of windows listed in Table 2.1.
val between successive estimates can be set shorter than the frame length. The window parameter N is important for PDAs based on short-term analysis. It has to be large enough to include a sufficient number of signal samples for correct f0 estimation, and small enough to capture fundamental frequency variations within short intervals. Usually a value between 20 ms and 50 ms is used, depending on the application. If the minimum allowed fundamental frequency is set to 50 Hz, this implies that between one and two and a half periods, respectively, will be contained in one frame. This concept is strongly related to the PDA performance in case of signal perturbations. In fact, when a local perturbation occurs in the speech signal, due to noise or other causes, the behaviour of the PDA depends on how much of the considered frame the irregularity covers. If the analysis frame is so short that it contains only, or mainly, perturbed signal, the estimate will be wrong. However, if the frame length is such that the contribution of perturbed signal portions is small, the algorithm will still be capable of giving a correct estimate. It has to be recalled, though, that the speech signal can only be regarded as quasi-stationary, implying that the pitch period values are not constant within a given frame. Consequently, the estimate will be an average of several consecutive signal periods, becoming less accurate as the analysis frame gets longer.
The different processing steps involved in short-term analysis are summarized in the scheme of Figure 2.10. As shown, the input signal can undergo a pre-processing step, where low-pass filtering, centre clipping or inverse filtering are applied to reduce the signal temporal complexity. After this step, frame division of the incoming signal takes
complexity. After this step, frame division of the incoming signal takes
place and the specified short-term transformation is applied to each frame.
The output of this process is generally a signal with a peak(s) whose posi-
5
Generally the fundamental frequency of speech in adult humans is in the range of about 50 − 500 Hz.
Figure 2.10: Processing steps of short-term analysis based PDAs: optional time-domain preprocessing (linear, non-linear or adaptive), subdivision into frames, short-term transformation, and a post-processor performing interpolation and smoothing to produce the final pitch contour.
X = W · x   (2.8)
where x is the vector containing all samples of the frame being processed, W is the transformation matrix and X is the output vector, or short-term spectrum. When this transformation is computed by means of a direct implementation, the computational complexity increases with the square of the length of vector x. Therefore, to keep complexity low, it is important to limit the frame length. However, when spectral transformations are involved, the achieved frequency resolution increases proportionally to the number of signal samples used. To fulfill both requirements, a commonly adopted solution is to reduce the computational complexity of the transformation W and to perform interpolation on the values of the output vector X. For example, in case the Fourier transform is applied, the Fast Fourier Transform (FFT) algorithm [15] can be used, with a computational complexity proportional to N log N, where N is the length of the input data. Another solution was represented by the Average Magnitude Difference Function (AMDF) algorithm [89], which is based on summations instead of the more computationally expensive multiplications.
Figure 2.11: Simplified diagram of short-term analysis pitch determination algorithms: correlation based methods (ACF, AMDF, WAUTOC), harmonic analysis methods (HPS, cepstrum) and statistical approaches.
r(τ) = lim_{N→∞} [1/(2N + 1)] Σ_{n=−N}^{N} x(n) x(n + τ),   (2.9)
From Equations 2.10 and 2.11 it is possible to state that the autocorrelation function r(τ), applied to a periodic signal x(n), will show a series of peaks at positions τ = kT0. In practice the ACF is computed on a short-term basis:
r(τ, q) = (1/N) Σ_{n=0}^{N−1} [x(q + n) w(n)] [x(q + n + τ) w(n + τ)],   (2.13)
where q is the starting sample and w(n) is a window function which is null for values of n outside the interval 0 ≤ n ≤ N − 1.
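A compact sketch of Equation 2.13 and of the corresponding pitch estimate (illustrative code, not from the thesis; a rectangular window and a single frame are assumed):

import numpy as np

def acf_pitch(frame, fs, f0_min=50.0, f0_max=500.0):
    # Short-term autocorrelation pitch estimate: pick the lag of the highest
    # ACF value inside the plausible pitch-period range.
    x = np.asarray(frame, dtype=float)
    N = len(x)
    lags = np.arange(int(fs / f0_max), int(fs / f0_min) + 1)
    r = np.array([np.dot(x[:N - tau], x[tau:]) / N for tau in lags])
    return fs / lags[int(np.argmax(r))]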
Figure 2.12: Voiced speech segment x(n) (top) and its autocorrelation function ACF(τ) (bottom).
y(n) = clc[x(n)] = x(n) − C_L  if x(n) ≥ C_L;   0  if |x(n)| < C_L;   x(n) + C_L  if x(n) ≤ −C_L.   (2.14)
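Equation 2.14 translates directly into code (a sketch; the clipping level C_L is typically set to a fraction of the frame peak amplitude, e.g. 0.35 as in Figure 2.13):

import numpy as np

def centre_clip(x, cl):
    # Compressed centre clipping, Equation 2.14.
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    y[x >= cl] = x[x >= cl] - cl
    y[x <= -cl] = x[x <= -cl] + cl
    return y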
Figure 2.13: Top panel: compressed centre clipping function; second panel: voiced speech segment and centre clipping threshold set to CL = 0.35; third panel: output of the flattener function y(n) = clc[x(n)]; bottom panel: autocorrelation function (ACF) computed on the centre-clipped signal y(n). The largest non-zero-offset peak is found at lag τ = 195.
flat spectrum.
ρ_τ(x, y) = (x, y)_τ / (|x|_τ |y|_τ),   (2.15)
where (x, y)_τ is the inner product of the two segments x and y taken as if they were vectors of length τ, and the normalization factors |x|_τ and |y|_τ represent the energy of each segment. This method provides a first pitch period estimate T0 whose maximum resolution is limited by the sampling frequency, that is, it is an exact multiple of the sampling period. To estimate the pitch period with “infinite resolution”, as reported by the authors, linear interpolation is applied to the second segment y so that it perfectly matches the first segment x.
The algorithm, tested on synthetic as well as on real speech data, is reported to perform very well, also from the accuracy point of view. Octave errors still occur, but principally during voiced/unvoiced transitions.
Figure 2.14: Block diagram of the SIFT algorithm. The input signal is low-pass filtered and decimated, and then processed applying LPC analysis to obtain the excitation source signal. Autocorrelation is then applied to obtain the pitch period estimate, and interpolation is used to recover the original resolution.
The first processing step is a low-pass filter which removes all signal content above 900 Hz. After this, down-sampling can be applied to obtain a signal with a sampling frequency of 2 kHz. This frequency range was shown to include all the information necessary for pitch estimation, and the down-sampling reduces the computational load. The next step applies the inverse filtering technique by means of LPC analysis, as described in Section 2.2.3. The order of the LPC analysis is set to four, since the frequency range 0 ÷ 1 kHz generally includes just two formants. The estimated coefficients are then used to drive a filter which approximates the inverse vocal tract transfer function and whose output y(n) represents the glottal excitation source. Autocorrelation is then applied to y(n) to estimate its periodicity. Since at this point the resolution is limited by the 2 kHz sampling rate, interpolation around the detected autocorrelation peak is necessary to provide a more precise estimate.
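Putting the blocks of Figure 2.14 together, a rough self-contained SIFT-like pipeline might look as follows (a sketch under simplifying assumptions — input sampling rate a multiple of 2 kHz, no voiced/unvoiced decision, final interpolation omitted — and not the original implementation):

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import butter, filtfilt, decimate, lfilter

def sift_pitch(frame, fs, f0_min=50.0, f0_max=500.0):
    # Low-pass at 900 Hz, decimate to 2 kHz, order-4 LPC inverse filtering,
    # then autocorrelation of the residual.
    x = np.asarray(frame, dtype=float)
    b, a = butter(4, 900.0 / (fs / 2.0))
    x = filtfilt(b, a, x)
    x = decimate(x, int(fs // 2000))
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(5)])
    coeffs = solve_toeplitz((r[:4], r[:4]), r[1:])            # order-4 LPC
    residual = lfilter(np.concatenate(([1.0], -coeffs)), [1.0], x)
    lags = np.arange(int(2000 / f0_max), int(2000 / f0_min) + 1)
    acf = [np.dot(residual[:len(residual) - t], residual[t:]) for t in lags]
    return 2000.0 / lags[int(np.argmax(acf))]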
This scheme, compared to the autocorrelation approach, proved to be
more robust to formant effects and provided better results. It can also estimate voiced/unvoiced activity, since the peaks of the autocorrelation function, when applied to the estimated source excitation signal, better reveal the degree of periodicity of the analyzed signal. The main limitations of this algorithm are instead those associated with LPC analysis, such as the cancellation of the excitation signal whenever a formant position coincides with the fundamental frequency.
AMDF(τ, q) = (1/N) Σ_{n=q}^{q+N−1} |x(n) − x(n + τ)|   (2.16)
Equation 2.16 was originally presented in [63] and, a few years later, also in [89]. Similarly to the ACF, the AMDF compares two segments of the signal x(n) which are delayed by τ samples with respect to each other. When a voiced speech signal is analyzed and τ equals its fundamental period T0, the function exhibits a minimum. The AMDF is based on summations and is thus computationally faster than the ACF. Nevertheless, it is more sensitive to changes in signal amplitude [43], and is thus more prone to pitch estimation errors.
Figure 2.15 shows the AMDF applied to a segment of voiced speech
from a male speaker. The minimum of the function is found at τ = 198
samples, in accordance with the actual value of the signal pitch period.
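Equation 2.16 translates almost literally into code (an illustrative sketch, not from the thesis; the frame must span at least two of the longest expected periods):

import numpy as np

def amdf_pitch(frame, fs, f0_min=50.0, f0_max=500.0):
    # AMDF pitch estimate: the lag of the deepest valley in the search range.
    x = np.asarray(frame, dtype=float)
    N = len(x) // 2                      # length of the compared segments
    lags = np.arange(int(fs / f0_max), int(fs / f0_min) + 1)
    amdf = np.array([np.mean(np.abs(x[:N] - x[tau:tau + N])) for tau in lags])
    return fs / lags[int(np.argmin(amdf))]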
Figure 2.15: Voiced speech segment x(n) (top) and its AMDF (bottom). The minimum of the function is found at lag τ = 198 samples.
The ACF and AMDF exhibit similar characteristics: while the Autocorrelation Function produces a peak in correspondence of the pitch period, the AMDF produces a valley there. The Weighted Autocorrelation function (WAUTOC) combines the two measures:
wautoc(τ, q) = { Σ_{n=0}^{N−1} [x(q + n) w(n)] [x(q + n + τ) w(n + τ)] } / { k + Σ_{n=q}^{q+N−1} |x(n) − x(n + τ)| }   (2.17)
Equation 2.17 is the ratio of Equations 2.13 and 2.16, where the parameters q and N indicate the starting sample of the signal x(n) and the number of samples involved in the computation, respectively. The constant term k in the denominator is necessary to avoid division by zero in case the AMDF summation is null.
An example of the WAUTOC function applied to a segment of voiced speech is shown in Figure 2.16. The estimated pitch period is τ = 198 samples, in accordance with the estimates provided by the ACF and AMDF individually, shown in Figures 2.12 and 2.15.
This method proved more robust in noisy conditions than the previously described methods. In fact, the signal components belonging to the noise source produce different effects in the numerator and denominator of Equation 2.17, while the periodic components coming from the voice source show a common behaviour, which is exploited by the WAUTOC function.
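A sketch of Equation 2.17, combining the ACF numerator with the AMDF denominator (illustrative code, not from the thesis; the small constant k is the one discussed above and its value, 1e-6, is an arbitrary choice):

import numpy as np

def wautoc_pitch(frame, fs, f0_min=50.0, f0_max=500.0, k=1e-6):
    # Weighted autocorrelation: ACF divided by (k + AMDF), maximum picked.
    x = np.asarray(frame, dtype=float)
    N = len(x) // 2
    lags = np.arange(int(fs / f0_max), int(fs / f0_min) + 1)
    scores = []
    for tau in lags:
        acf = np.dot(x[:N], x[tau:tau + N]) / N
        amdf = np.mean(np.abs(x[:N] - x[tau:tau + N]))
        scores.append(acf / (k + amdf))
    return fs / lags[int(np.argmax(scores))]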
Figure 2.16: Voiced speech segment x(n) (top) and the WAUTOC function (bottom). The maximum is found at lag τ = 198 samples.
The YIN algorithm is a time domain algorithm derived from the autocorrelation function and represents one of the state-of-the-art pitch detection algorithms [17].
The basic building block of this algorithm is the difference function:
d(τ, q) = Σ_{n=0}^{N−1} [x(q + n) − x(q + n + τ)]²,   (2.18)
which, expanding the term inside the square brackets, can be expressed in terms of the Autocorrelation Function r(τ, q) of Equation 2.13:
d(τ, q) = r(0, q) + r(0, q + τ) − 2 r(τ, q).   (2.19)
The first two terms on the right-hand side of Equation 2.19 are energy terms. Assuming them constant, the difference function d(τ, q) would express the opposite of the variations of the autocorrelation function r(τ, q). This is not always true, since the second term depends on the variable τ and may vary with the signal amplitude. Nevertheless, as reported by the author, Equation 2.18 proved to behave better than the Autocorrelation Function: it is less sensitive to changes in signal amplitude, and thus less prone to “too low/too high” f0 estimation errors.
The difference function, like the ACF and AMDF, has an absolute minimum for τ = 0 and can produce additional dips at lags corresponding to a strong first formant F1. The frequency regions of F1 and of the fundamental frequency f0 overlap, making it difficult to set a lower limit for the pitch period search range.
To overcome these limitations, the Cumulative Mean Normalized Difference Function was derived:
d′(τ, q) = 1   if τ = 0;   d(τ, q) / [ (1/τ) Σ_{j=1}^{τ} d(j, q) ]   otherwise.   (2.20)
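The difference function of Equation 2.18 and its cumulative mean normalization of Equation 2.20 in compact form (an illustrative sketch; the complete YIN algorithm also applies an absolute threshold and parabolic interpolation, which are not shown here):

import numpy as np

def yin_cmndf(frame, max_lag):
    # Difference function d(tau) and cumulative mean normalized d'(tau).
    x = np.asarray(frame, dtype=float)
    N = len(x) - max_lag                     # number of terms in each sum
    d = np.zeros(max_lag + 1)
    for tau in range(1, max_lag + 1):
        diff = x[:N] - x[tau:tau + N]
        d[tau] = np.dot(diff, diff)
    d_prime = np.ones(max_lag + 1)           # d'(0) = 1 by definition
    running_mean = np.cumsum(d[1:]) / np.arange(1, max_lag + 1)
    d_prime[1:] = d[1:] / np.maximum(running_mean, 1e-12)
    return d, d_prime

# The pitch period is the lag of the lowest (or first sufficiently low) dip of d_prime.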
Figure 2.17: Voiced speech segment x(n) (top) and the Cumulative Mean Normalized Difference Function d′(τ) (bottom).
Since the spectral resolution is constant with frequency, the relative resolution becomes lower for decreasing values of the estimated f0.
To overcome these limitations, other approaches measure the spacing between higher harmonics of the fundamental frequency and estimate f0 by computing their weighted average.
P(k) = log Π_{m=1}^{M} |X(mk)|² = 2 Σ_{m=1}^{M} log |X(mk)|,   (2.21)
Figure 2.18: Example of Harmonic Product Spectrum (HPS) computation. Top: voiced speech segment x(n) and its log power spectrum log |X(k)|²; middle: spectra compressed by factors 1:2, 1:3, 1:4 and 1:5; bottom: the resulting HPS, shown over the full range and over 0 ÷ 250 Hz.
The bottom panels of the figure show the result, where the largest peak represents the estimated fundamental frequency, f0 ≈ 100 Hz.
This algorithm has two main advantages: it is particularly robust to noise and it does not need the fundamental frequency to be particularly strong in order to provide the correct estimate.
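A minimal sketch of the Harmonic Product Spectrum of Equation 2.21 (illustrative code, not from the thesis; a single Hann-windowed frame is assumed and the search is restricted to a plausible f0 range):

import numpy as np

def hps_pitch(frame, fs, n_compressions=5, f0_min=50.0, f0_max=500.0):
    # Sum the log power spectra compressed by factors 1..M and pick the
    # strongest peak in the allowed f0 range.
    x = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(x)) ** 2
    hps = np.log(spec + 1e-12)
    for m in range(2, n_compressions + 1):
        compressed = spec[::m]                 # |X(mk)|^2 sampled at every m-th bin
        hps[:len(compressed)] += np.log(compressed + 1e-12)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    band = (freqs >= f0_min) & (freqs <= f0_max)
    return freqs[band][int(np.argmax(hps[band]))]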
60], “maximizes the energy of the signal frequency components that pass through a spectral comb”; the second, reported in [79], “minimizes the difference between the input spectrum and reference spectra”. Both the spectral comb and the reference spectra depend on the parameter p, which represents the trial fundamental frequency. The term “trial” [43] refers to the fact that p is varied within a given range of frequency values and, for each of them, the score provided by the matching procedure is evaluated. The frequency value that obtains the best score is then output as the estimated f0.
where m is the spectrum bin index and s a positive integer. Given the discrete frequency spectrum X(k) computed from the speech signal x(n), its absolute value X′(k) = |X(k)| is used to compute the harmonic estimator function X_C(p) as follows:
X_C(p) = Σ_{l=1}^{N/(2p)} X′(lp) C(lp, p),   (2.23)
where N/2 is the spectrum bin index corresponding to the Nyquist frequency, so that lp never exceeds it. Equation 2.23 reaches its maximum when the spacing between the peaks of the spectral comb C(m, p) matches the harmonic structure of the voiced speech spectrum X(k). The value of p corresponding to this maximum thus provides the fundamental frequency estimate.
6 Actually, p represents here the index position in the discrete spectrum relative to the trial frequency.
The reason for assigning a decreasing amplitude to the comb filter peaks in Equation 2.22 lies in the fact that a fundamental frequency harmonic that matches a certain peak of the comb C(m, p), with p corresponding to the actual fundamental frequency, will also match a peak, with lower weight, of the comb filter C(m, p′), with p′ a sub-multiple of p.
In the latter case, the difference in weighting guarantees that the value of X_C(p′) will be less than X_C(p), thus preventing a sub-multiple of f0 from being provided as the final estimate.
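A sketch of harmonic-comb matching in the spirit of Equation 2.23 (illustrative; since Equation 2.22 is not reproduced here, the decaying comb weights 1/l are an assumption rather than the exact definition used in the original work):

import numpy as np

def comb_pitch(frame, fs, f0_min=50.0, f0_max=500.0, n_harmonics=8):
    # Evaluate each trial f0 by summing the magnitude spectrum at its
    # harmonics with decreasing weights, and return the best-scoring trial.
    x = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(x))
    df = fs / len(x)                                  # spectral bin spacing
    best_f0, best_score = f0_min, -np.inf
    for f0 in np.arange(f0_min, f0_max, df):
        idx = (np.arange(1, n_harmonics + 1) * f0 / df).astype(int)
        idx = idx[idx < len(mag)]
        score = np.sum(mag[idx] / np.arange(1, len(idx) + 1))
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0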
Another approach, similar to the one just described, is based on the difference between the magnitude spectrum X′(k) and a reference spectrum, which is defined as
R(m, p) = |H(m)|  if m = lp, l ∈ Z⁺;   0 otherwise,   (2.24)
where the function H(m) represents the vocal tract transfer function, estimated by applying LPC analysis to the speech signal as described in Section 2.2.3. The frequency comb resulting from Equation 2.24 is similar to that of Equation 2.22, with the difference that each peak weight is now related to the current vocal tract configuration.
To estimate the fundamental frequency, the spectral distance function
is calculated, over L harmonics, as follows:
D(p) = (1/L) Σ_{l=1}^{L} | log R(lp, p) − log X′(lp) |,   (2.25)
and the value of p for which D(p) reaches its minimum provides the estimated fundamental frequency. The main advantage of this approach is that the formant positions and amplitudes do not affect the computation of the spectral distance. In fact, taking the difference of the logarithmic
Cepstrum processing
Figure 2.19: Example of LPC based spectral distance function. Top panel: voiced speech signal x(n); second panel: logarithmic magnitude of the signal spectrum X(k); third panel: logarithm of the vocal tract transfer function H(m), estimated by means of LPC analysis; bottom panel: distance function D(p) computed as the log difference between X(k) and H(m). Its minimum determines the estimated fundamental frequency, f0 = 125 Hz.
Equation 2.26. For this purpose, the logarithm of the signal power spectrum is computed, so that the product turns into a sum:
log |X(m)|² = log |S(m)|² + log |H(m)|²,   (2.27)
and its inverse discrete Fourier transform is calculated, providing the power cepstrum7, denoted with x(d), where the variable d takes the name of “quefrency” and is a measure of time, like the lag variable τ in the autocorrelation function:
x(d) = s(d) + h(d).   (2.28)
As shown in the center panel of Figure 2.20, the log power spectrum of a voiced speech signal has the shape of a high frequency cosine-like ripple, due to the harmonics, modulated by a low frequency ripple (plotted with dashes) due to the vocal tract effect. These two components are additive in the log domain. If they are thought of as time domain signals, their Fourier transforms will ideally be a spectrum with a pulse corresponding to the fundamental frequency and a spectrum with energy only in the low frequency region, respectively. This is approximately the behaviour shown by the functions s(d) and h(d), respectively8.
The sum of these functions (Equation 2.28) is plotted in the bottom panel of the figure, which shows the peak due to the high-frequency component at quefrency d = 129 samples; this will be the final pitch estimate.
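A sketch of the cepstrum computation of Equations 2.27 and 2.28 (illustrative code, not from the thesis; the peak search is restricted to quefrencies corresponding to plausible pitch periods):

import numpy as np

def cepstrum_pitch(frame, fs, f0_min=50.0, f0_max=500.0):
    # Power cepstrum: IDFT of the log power spectrum, then pick the strongest
    # peak among the quefrencies of plausible pitch periods.
    x = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    log_power = np.log(np.abs(np.fft.fft(x)) ** 2 + 1e-12)
    cep = np.real(np.fft.ifft(log_power))
    q_min, q_max = int(fs / f0_max), int(fs / f0_min)
    d = q_min + int(np.argmax(cep[q_min:q_max]))
    return fs / d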
Ideally, the contribution of the excitation source in the cepstrum domain, s(d), should be a train of impulses. In practice, due to the windowing operation
7 Cepstrum and complex cepstrum are obtained when the power spectrum in Equation 2.27 is substituted by the amplitude spectrum and the complex spectrum, respectively.
8 Computing the inverse or direct discrete Fourier transform of even functions, such as log |X(m)|², returns the same result.
Figure 2.20: Example of cepstrum processing. Top panel: voiced speech signal x(n); middle panel: logarithm of the power spectrum of signal x(n), where the high frequency cosine-like ripple is plotted with a continuous line and the vocal tract contribution with a dashed line; bottom panel: cepstrum function for signal x(n). The peak at quefrency d = 129 represents the estimated signal fundamental period.
On the other hand, it needs several harmonics for a peak to be produced in the cepstrum, and it is thus not suitable for estimating the pitch of sinusoidal signals. Even so, after it was first published, cepstrum processing became a reference pitch estimation technique against which many subsequent pitch detection algorithms were compared.
The dominance spectrum was first proposed in [67, 66], where a robust fundamental frequency estimation technique is presented. The method is regarded by the authors as a frequency domain based method, since it exploits the Instantaneous Frequency (IF), defined as the phase derivative with respect to time of a sinusoidal component [1].
Denoting with φ(f) the phase of the speech signal component output by a narrow band-pass filter with center frequency f, the IF φ̇(f) is its phase derivative with respect to time.
The degree of dominance D′(f_i) is thus defined as:
D′(f_i) = log [ 1 / B(f_i)² ],   with   B(f_i)² = [ Σ_{k=i−K/2}^{i+K/2} (φ̇(f_k) − f_i)² X(f_k)² ] / [ Σ_{k=i−K/2}^{i+K/2} X(f_k)² ],   (2.29)
where X(fk ) represents the value of the discrete Fourier transform of x(n)
at the frequency value relative to the k-th bin. Function B(fi )2 is derived as
the weighted average of the squared difference between the center frequency
fi , and the IFs φ̇(fk ), computed over a frequency range of K + 1 frequency
bins.
When a harmonic component of a voiced speech signal coincides with
the bin center frequency fi , the instantaneous frequency φ̇(fk ) takes a value
Figure 2.21: Top: dominance spectrum of a voiced speech signal with f0 ≈ 117 Hz; Bottom: power spectrum computed on the same segment of speech signal.
D_{t0}(f_i) = \sum_{l=1}^{L} \left\{ D_0(l f_i) - E\!\left(D_0(f_i)\right) \right\}, \qquad (2.30)
where E(D0(fi)) is the average value of D0(fi) over all frequency bins and L is the number of harmonics considered in the computation. The value of fi for which Equation 2.30 takes its maximum is then chosen as the final fundamental frequency estimate.
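A simplified sketch of Equations 2.29 and 2.30 follows; it assumes that the per-bin instantaneous frequencies have already been estimated (the narrow band-pass filtering and phase differentiation of [67, 66] are not shown), and the values of K, L and the search range are illustrative assumptions.

import numpy as np

def degree_of_dominance(inst_freq, mag, bin_freq, K):
    """D0(fi) of Equation 2.29, given per-bin instantaneous frequencies and magnitudes."""
    D0 = np.full(len(bin_freq), -np.inf)
    half = K // 2
    for i in range(half, len(bin_freq) - half):
        sl = slice(i - half, i + half + 1)             # K + 1 bins centred on fi
        w = mag[sl] ** 2
        B2 = np.sum((inst_freq[sl] - bin_freq[i]) ** 2 * w) / np.sum(w)
        D0[i] = np.log(1.0 / (B2 + 1e-12))
    return D0

def dominance_pitch(D0, bin_freq, df, L=5, f0_min=60.0, f0_max=400.0):
    """Accumulate D0 over L harmonics (Equation 2.30) and pick the best candidate."""
    mean_D0 = D0[np.isfinite(D0)].mean()
    best_f0, best_score = None, -np.inf
    for i, fi in enumerate(bin_freq):
        if not (f0_min <= fi <= f0_max):
            continue
        idx = [int(round(l * fi / df)) for l in range(1, L + 1)]   # harmonic bins
        if max(idx) >= len(D0):
            continue
        score = sum(D0[j] - mean_D0 for j in idx)
        if score > best_score:
            best_score, best_f0 = score, fi
    return best_f0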
A further class of methods relies on maximum-likelihood estimation, modeling the observed signal a as the sum of a periodic component x and a Gaussian noise term g_n:

a = x + g_n. \qquad (2.33)

The objective is to find f̂0, σ̂² and x̂ such that they are the most likely values, in the least-squares sense, for f0, σ² and x.
As shown in [33, 111], this can be achieved by exploiting the Gaussian characteristics of the noise source and modeling the signal a(n) as a stochastic process as follows:

g(a \mid f_0, x, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{K/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=0}^{K-1} [a(n) - x(n)]^2 \right\}. \qquad (2.34)
Chapter 3

From speech modeling to pitch based applications
The speech production mechanism has been studied since ancient times
and, nowadays, the functions of the organs involved during speech pro-
duction as well as their effects on the uttered sound characteristics are
well known. When voiced sounds are produced, the vocal folds oscillate regularly under the air pressure that accumulates below them, and this phenomenon is mainly responsible for the pitch perceived by a listener. Pitch is thus a subjective perception which is strongly related to the speech fundamental frequency f0, which measures the frequency of such oscillations. In the context of speech applications, and particularly when fundamental frequency estimators are concerned, the terms "pitch" and "fundamental frequency" are usually used with the same meaning.
To estimate f0 a common approach consists in acquiring the speech
signal by means of an acoustic sensor (microphone) and analyzing the pro-
vided waveform. The analysis is generally carried out relying on signal
processing techniques and on the source-filter model, which approximates
the speech production mechanism as a vocal tract filter driven by an exci-
tation signal.
In case the processed speech signal is not degraded by the ambient
a resonator modulating the airflow that is finally radiated through lips and
nose.
Figure 3.1: Vocal tract configuration with raised soft palate for articulating non-nasal
sounds [19].
As soon as this happens, the airflow velocity increases and, due to the Bernoulli effect, the air pressure in the larynx decreases. This causes the
vocal folds to close rapidly and the process repeats this way in a quasi-
periodic fashion as long as a steady supply of pressurized air is generated
by the lungs through the larynx.
The frequency with which the glottis vibrates during phonation determines the fundamental frequency f0 of the laryngeal source and largely
depends on the tension of the laryngeal muscles and the air pressure gen-
erated by the lungs, contributing to the perceived pitch of the produced
sound.
While frequency is a physical measurement of a vibration, pitch is related to human perception; the relationship between the two has been studied in depth and involves complex psychoacoustic phenomena.1
Although what the solutions proposed in this field actually do is f0 estimation, they are often regarded as pitch detection algorithms. Since the psychological relationship between the f0 of a given signal and the corresponding perceived pitch is well known,2 the above distinction is not so important, given that a true pitch detector should take perceptual models into account in order to estimate pitch and give a result on a pitch scale.
Although, in most European languages, individual phonemes are recognizable regardless of the pitch, pitch is mostly responsible for the intonation patterns associated with questions and statements and carries information about the speaker's emotional state. In tonal languages, instead, the pitch motion of an utterance contributes to the lexical information in a word.
The frequency spectrum of voiced speech reveals high energy in the frequency regions relative to the fundamental frequency and its harmonics, which falls off gradually for increasing values of frequency.
1. For example, the note A above middle C is perceived to be of the same pitch as a pure tone of 440 Hz, but does not necessarily contain that frequency.
2. Pitch is loosely related to the base-2 logarithm of the fundamental frequency; that is, for every doubling of f0 the perceived pitch increases by about an octave. The relation is, however, biased by many factors such as the sound frequency, intensity, harmonic content, etc. [16, 109].
Figure 3.2: source-filter model, time domain. At the left is the randomly varying waveform in the case of unvoiced speech (top) or the glottal-pulse-shaped waveform representing voiced speech (bottom). One of these sources (or a mixture of both) is filtered by the vocal tract
(centre) which is represented in gray as a nasal plus oral cavity. The output (right) is an
example of voiced speech which is the result of filtering the periodic source signal with the
impulse response of the vocal tract and considering the lips radiation effect.
According to this model, the speech signal can be expressed as

x(n) = s(n) * h(n), \qquad (3.1)

where x(n) is the sampled speech signal, s(n) is the sampled version of the glottal pulse excitation signal (voiced speech) or a random discrete function (unvoiced speech), and h(n) is the sampled vocal tract impulse response, which here includes the lips radiation effect.
In the z-transform domain (or discrete frequency domain) Equation 3.1 turns into

X(z) = S(z) \cdot H(z), \qquad (3.2)

being X(z), S(z) and H(z) the z-transforms of x(n), s(n) and h(n), respectively. Equation 3.2 is very useful since, in the z-domain, the convolution operator '∗' turns into a multiplication; this makes it possible to obtain S(z) by simply multiplying X(z) by the inverse filter 1/H(z).
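The source-filter decomposition of Equation 3.1 can be illustrated by a small synthesis sketch: a periodic impulse train s(n) is filtered by a crude all-pole approximation of the vocal tract (the formant frequencies, bandwidths and sample rate below are purely illustrative assumptions).

import numpy as np
from scipy.signal import lfilter

fs, f0, dur = 16000, 120.0, 0.5                    # sampling rate, pitch, duration (s)
n = np.arange(int(fs * dur))
s = np.zeros(len(n))
s[::int(round(fs / f0))] = 1.0                     # s(n): impulse train of period fs/f0

a = np.array([1.0])                                # build H(z) as a cascade of resonators
for f_c, bw in [(700, 130), (1220, 70), (2600, 160)]:
    r = np.exp(-np.pi * bw / fs)
    a = np.convolve(a, [1.0, -2 * r * np.cos(2 * np.pi * f_c / fs), r * r])

x = lfilter([1.0], a, s)                           # x(n) = s(n) * h(n), as in Eq. 3.1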
Figure 3.2 shows the source-filter model in the time domain: a schematic model of the human speech production system where the source is a combination of periodic pulses, generated by the vocal cord vibrations at the glottis, and of a contribution from turbulent airflow. When only the first contribution is present, the output of the generation process is called voiced speech, while, when only turbulent flow is generated, unvoiced speech is produced.
Figure 3.3 shows the same schematic model, but in the frequency domain. The periodic source is represented here as a series of frequency lines, spaced f0 Hz apart and falling off gradually.
The oral and nasal airways, as well as the lips radiation effect, are shown
here as a time-varying acoustic filter which reflects the overall shape, length
and volume of the vocal tract.
Figure 3.3: source-filter model: frequency domain. At the left (top) is an almost flat spec-
trum of a random signal representing unvoiced speech source and the spectrum (bottom)
of a periodic glottal source, characterized by equally spaced (f0 ) spectral lines. One of
these sources (or a mixture of both) is filtered by the vocal tract transfer function (centre).
In case of voiced speech, the articulatory organs are positioned so that specific frequency
regions, i.e. formants (F1 , F2 , . . . ), of the input source are amplified. The output (right)
is an example of voiced speech whose spectrum is the product of the equally spaced spectral
lines by the vocal tract transfer function.
Figure 3.4: F 1/F 2 chart of Italian vowels for males, females, children and infants deter-
mined from four groups consisting of four subjects each. Each ellipse shows the area of existence of a vowel in the F1−F2 plane and is centered on the mean values of the
estimated formants. The axis lengths and their orientation are determined by the standard
deviation and covariance of F 1 and F 2 respectively [113].
Speech produced by female and male speakers has different formant frequency ranges, these being determined by the different sizes of their vocal tracts. However, the ratios between formant frequencies remain consistent across males and females. This is depicted in Figure 3.4, where an F1/F2 chart of Italian vowels for males, females, children and infants is shown.
X_m = \frac{1}{T} \int_{-T/2}^{T/2} x(t)\, e^{-j \frac{2\pi m}{T} t}\, dt, \qquad x(t) = \sum_{m=-\infty}^{\infty} X_m\, e^{j \frac{2\pi m}{T} t}. \qquad (3.3)
Figure 3.5: Left: Example of voiced speech segment (‘a’ vowel) with period T ; Right:
subset of Fourier coefficients Xn relative to the voiced speech segment (labels are relative
to even coefficients only).
In particular, the expression on the right shows that x(t) can be written as an infinite summation of weighted complex exponentials, each with a frequency that is a multiple of the fundamental frequency, that is, mf0 = m/T. The complex weights Xm (right panel of Figure 3.5) are obtained from a single period of the signal, as shown by the left equation of (3.3), and represent the harmonic contribution of the signal at the particular frequency mf0. Those frequencies are harmonically related, meaning that the ratio between each of them and the lowest one,3 f0, is a whole number.
It is important to note that the integration limits of the left equation of (3.3) can be changed to include any whole number of periods T, adjusting the normalization factor 1/T accordingly, and the result for the coefficients Xm will be the same.
In practice, any processed signal is represented by a discrete set of values x(nTs), n ∈ Z, obtained by sampling the continuous periodic signal x(t) with sampling frequency fs = 1/Ts. The corresponding version of the Fourier decomposition in the discrete-time domain is represented
3. f0 is considered here as the lowest frequency, since X0 represents the mean of the signal over a period, a component that shows no oscillating behaviour and is easily removable. The components X−i for i ≥ 1 are, as Fourier theory shows for real x(t), the complex conjugates of the Xi.
then by the Discrete Fourier Transform (DFT) and can be applied to any
discrete periodic signal [78]:
X_m = \frac{1}{N} \sum_{n=0}^{N-1} x(nT_s)\, e^{-j 2\pi \frac{nm}{N}}, \qquad x(nT_s) = \sum_{m=0}^{N-1} X_m\, e^{j 2\pi \frac{nm}{N}}. \qquad (3.4)
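A small numerical check of Equation 3.4: for a signal that is exactly periodic over the analysis length, the DFT coefficients are non-zero only at multiples of f0 = 1/T (the sampling rate and the two sinusoidal components below are illustrative assumptions).

import numpy as np

fs, f0 = 8000, 200.0                       # one period = 40 samples
period = int(fs / f0)
N = 4 * period                             # a whole number of periods
n = np.arange(N)
x = 0.8 * np.sin(2 * np.pi * f0 * n / fs) + 0.3 * np.sin(2 * np.pi * 3 * f0 * n / fs)

X = np.fft.fft(x) / N                      # Xm as in Eq. 3.4 (1/N normalization)
freqs = np.fft.fftfreq(N, d=1.0 / fs)
peaks = freqs[np.abs(X) > 1e-6]            # only +-f0 and +-3*f0 survive
print(sorted({round(abs(float(f)), 1) for f in peaks}))   # -> [200.0, 600.0]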
In practice, a voiced speech signal is far from showing such perfectly periodic behaviour. During phonation, articulatory movements continuously take place to permit transitions between different phonemes, thus changing formant positions and amplitudes. Pitch is not stationary either: the glottis changes its fundamental frequency depending on intonation and emotional state. Both these phenomena entail that the output signal cannot be regarded as stationary. Therefore each instantaneous period, that
is, the signal segment between each pair of glottal closure instants, changes
its duration and shape slowly over time.
In addition, when the DFT is computed on voiced segments, the period
length is not known in advance, since, if it were known, there would be no need to perform f0 estimation. This makes it impossible to fulfill the second requirement listed above, posing the need to introduce some approximations.
The resulting short-time analysis computes, for each analysis instant n0, the DFT of a windowed signal segment:

X_{n_0}(m) = \frac{1}{N} \sum_{n=0}^{N-1} x(n_0 + n)\, w(n)\, e^{-j 2\pi \frac{mn}{N}}, \qquad (3.5)
Figure 3.6: Wide-band (top) and narrow-band (middle) spectrograms are shown along with the speech signal from a female speaker (bottom) used to derive them. The spoken sentence is: "to the third class".
the expense of a poorer resolution in the time dimension. This time, the
graph is characterized by horizontal striations, indicating that neighboring
spectra vary slowly and smoothly with time. The reason for this is that
the vocal tract movements can be considered slow compared to the length
of the analysis window.
In telecommunications, one of the primary concerns when designing new speech related applications is the available bandwidth, which heavily affects the final system design in terms of feasibility, performance and cost. The main consequence of this was that speech coding algorithms started to be developed in order to reduce the bit rate of the speech data to be transmitted or stored. Many solutions were devised in this field, all exploiting the high correlation of the speech signal between adjacent time samples. In fact, as the source-filter model of Figures 3.2 and 3.3 shows, the speech signal production process can be decomposed into a chain of basic blocks, each one driven by its own parameters.
These parameters, that is, voiced/unvoiced information, fundamental frequency and formant positions, are sufficient to synthesize the original speech signal and need very little bandwidth to be transmitted. A well known speech coding method is the Code Excited Linear Prediction (CELP) algorithm, which is based on LPC analysis (see Section 2.2.3). This technique is capable of compressing a speech signal sampled at 8 kHz with 16-bit resolution down to 2.4–4.8 kbit/s [95].
Single-pitch estimation techniques, such as those presented in Chapter 2, cannot be used for signals where multiple pitch streams are present.
The current approaches for multiple pitch estimation involve complex algorithms which exploit the findings of the speech pitch estimation field and are based on auditory scene analysis, trying to mimic the human auditory system. Perceptual cues are also used, exploiting spatial proximity of the events in the time and frequency domains, harmonic relationships and signal changes such as onsets, offsets, and amplitude and frequency modulations. However, differently from the speech processing field, where the interest in improvements in speech pitch estimation is high and often driven by economic interests, relatively little work has been done so far in the musical field.
Therefore, nowadays, there exist very few completely automatic systems able to transcribe real-world music performances and, usually, many restrictions have to be set on the analyzed musical piece. These restrictions concern the type of instruments, the musical genre and the maximum polyphony allowed, as well as the presence of percussive sounds or other effects.
An interesting description of f0 based music scene description systems
can be found in [39, 40].
Humans have the natural ability to recognize persons they are acquainted with just by hearing their voice. In fact, the speech signal carries many clues which are characteristic of each individual.
These clues can be divided into high-level features, such as dialect, pronunciation preferences, melody, prosodic patterns, talking style, etc., and low-level features, such as pitch period, formant transitions, timbre, rhythm, tone, etc.
Automatic speaker recognition systems are designed to recognize who is speaking by means of different approaches, such as dynamic time warping
The security field is mainly responsible for the growing interest in speaker recognition systems. Possible target applications, among others,
that the ASR based technology reached the commercial market. These earliest systems had a limited vocabulary of about a thousand words and could not work in real time, usually being three times slower than humans. These limitations were mostly attributable to the lack of fast computation capabilities, such as those provided by modern computers.
However, thanks to the advances in computer technology, during the past decade there has been very significant progress in this field, and many ASR based products and services started to appear on the market. As an example, modern ASR based dictation systems are capable of accuracy levels of more than 95%, with a transcription speed of more than 160 words per minute. Also, dictation systems that can work in real time with a 100,000-word vocabulary are not far from being achieved.
Figure 3.7: Simplified model of an HMM based Automatic Speech Recognition system.
Pitch information plays an important role in the ASR process and represents one of the distinctive features useful to distinguish the different parts of an utterance. Pitch, in fact, besides providing a measure of the signal periodicity, implicitly provides information about voicing. The type of phonation, that is, voiced or unvoiced speech, is very useful to dis-
7. The adaptation is carried out by training the recognizer on speech databases previously corrupted by noise and reverberation effects. These new datasets are usually obtained with computer simulations, which can recreate the desired acoustic conditions by means of noise and reverberation models, as described in Section 3.4.3.
ambiguate between certain phonemes which are otherwise very similar, as, for example, /z/ (voiced) and /s/ (unvoiced). Additionally, its variation with time conveys prosodic information, useful for deciding between statements, questions or other sentence patterns [99, 105].
Moreover, in scenarios characterized by adverse acoustic conditions (noise and reverberation), pitch related high-level features, such as voiced/unvoiced and prosodic information, turn out to be more robust than low-level features, such as short-term information related to the speech spectrum (e.g. MFCCs). This can be explained by considering that high-level features extracted from successive signal frames are generally correlated with each other. Pitch values, for example, vary slowly with time and can be approximated with a lognormal distribution [104].
For all the above reasons, if f0 is accurately estimated in a noisy and reverberant context, it can be used to improve the robustness of ASR systems designed to work in distant-talking scenarios [54, 75].
In the example of Figure 3.8, three speech sources are active at the same time and two microphones provide the BSS system with the captured speech mixture. The signals plotted at the right of the BSS block represent the extracted signals, obtained from the mixture.
Figure 3.8: Example of an underdetermined Blind Source Separation system. The three speech sources plotted at the left are active at the same time. The BSS system captures the speech mixture by means of two microphones and recovers the original individual signals.
Being able to recover the individual speech sources, BSS systems are very useful in all applications that have to cope with the cocktail party problem, i.e., several speakers talking simultaneously. Noise robust speech recognition, high-quality hands-free telecommunication systems or speech enhancement in hearing aids are, for example, some of the possible applications of these systems. Also, speech encryption applications based on BSS systems exist, where the objective is to separate speech signals that were intentionally mixed beforehand.
3.3.8 Dereverberation
The panels on the left side of the figure show the original clean speech segment x1(n) (top) and the signals x2(n) (middle) and x3(n) (bottom), obtained from x1(n) by adding white noise with a signal-to-noise ratio (SNR) of 0 and −5 dB, respectively. The right panels report the effect of the noise on the weighted autocorrelation function, which is computed for each signal shown on the left. White noise is, by definition, uncorrelated and, as reported in Section 2.3.1, the WAUTOC function is capable of detecting periodic components while rejecting uncorrelated ones, such as white noise. This is evident from the figure in the case of SNR = 0 dB, where the estimated pitch period is τ = 197, in accordance with the actual pitch period value.
However, for lower SNR levels, the analyzed signal is dominated by the noise components. For signal x3(n) in the bottom panel, the pitch period was incorrectly estimated at τ = 104 samples. The latter is a typical octave error, which happens when the estimated pitch period is twice or half the actual value. In this case, the noise components significantly affected the frequency region relative to the signal fundamental frequency, and the first harmonic was wrongly detected as f0.
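For reference, noisy signals such as x2(n) and x3(n) can be obtained by scaling white noise to a prescribed SNR before adding it to the clean frame; a minimal sketch follows (the function name and random seed are illustrative assumptions).

import numpy as np

def add_white_noise(x, snr_db, seed=0):
    noise = np.random.default_rng(seed).standard_normal(len(x))
    p_signal, p_noise = np.mean(x ** 2), np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_signal / p_noise_scaled) equals snr_db.
    noise *= np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return x + noise

# e.g. x2 = add_white_noise(x1, 0.0); x3 = add_white_noise(x1, -5.0)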
Figure 3.9: Weighted autocorrelation function computed on noisy signals. Left panels
show, from the top, a clean voiced speech signal, the same signal with white noise added
with a signal to noise ratio of 0, and −5 dB, respectively. The WAUTOC function is
computed for each speech signal and plotted in the right panels. The estimated pitch
period values are τ = 198, 197 and 104 samples, respectively.
make, consequently, the CMNDF very noisy, thus providing the wrong pitch estimate at τ = 304 samples.
Figure 3.10: Cumulative Mean Normalized Difference Function computed on noisy signals.
Left panels show, from the top, a clean voiced speech signal, the same signal with white
noise added with a signal to noise ratio of 0, −5, and −10 dB, respectively. The CMNDF
is computed for each speech signal and plotted in the right panels. The estimated pitch
period values are τ = 197, 198, 199, and 304 samples, respectively.
12. Unfortunately, the military interest in speech processing applications is very high. It is quite common, recently, to come across articles about robust speech recognition in the presence of tank, jet cockpit, machine gun or helicopter rotor noise.
The last distinction refers to the way the noise and the target signal in-
teract: additive noise is supposed to add linearly to the target signal in the
time domain, while convolutional noise is more related to the room acous-
tical properties. The former represents ideal acoustical conditions, which
are never met in reality, but is useful for system analysis purposes. The
latter is a closer representation of what actually happens in any real con-
text, even though it does not take into account possible non-linear acoustic
effects.
13. Speech babble is audible during, for example, a cocktail party, and it is one of the most difficult noises that pitch extraction algorithms have to deal with. In fact, depending on its intensity, several additional pitches and formants, belonging to the different voices in the babble, can add to the target speech signal.
14. See Note 12.
Figure 3.11: Effect of train coach noise on a segment of speech signal. The two panels at
the top show the spectrogram and the waveform of a clean speech signal recorded from a
female speaker. In the two panels at the bottom, noise recorded in a train coach was added
to the clean speech signal with an SNR of 10 dB.
3.4.2 Reverberation
Whenever a pitch extractor algorithm has to deal with speech signals cap-
tured by a microphone positioned far from the talker, its performance de-
creases. Compared to the close-talk recording case, when the speech sound has to propagate through the environment to reach a distant acoustic sensor it is more subject to noise and reverberation effects.
The word "reverberation" has its root in the Latin reverberare, meaning to "beat back".
As pointed out in [107], the chirp-like signals with a flat overall power
spectrum have the important property that their autocorrelation is an
almost perfect Dirac delta function. As a consequence, the sequence y(n)
of Equation 3.7 can be easily deconvolved by simply cross-correlating it
with the original sequence p(n). The result, apart from the contribution
of the loudspeaker frequency response, is the impulse response h(n) of the
considered acoustic channel.
In case several reverberant speech datasets are needed, various microphones can be placed so as to record the chirp-like signal at the same time. Once the impulse response of the acoustic channel from the loudspeaker to each microphone has been estimated, it can be used to convolve the signals of any available clean speech database.
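A sketch of the measurement idea follows: a synthetic chirp is convolved with a toy impulse response and then cross-correlated with the original chirp to recover that response (the excitation parameters and the toy response are illustrative assumptions, and the loudspeaker frequency response is neglected).

import numpy as np
from scipy.signal import chirp, fftconvolve

fs = 16000
t = np.arange(0, 2.0, 1.0 / fs)
p = chirp(t, f0=20.0, t1=t[-1], f1=fs / 2 - 100.0, method="linear")   # excitation p(n)

h_true = np.zeros(2048)                       # toy room: direct path + two reflections
h_true[[40, 220, 600]] = [1.0, 0.45, 0.25]
y = fftconvolve(p, h_true)[: len(p)]          # recorded signal y(n) = p(n) * h(n)

# Cross-correlation with p(n) approximates deconvolution, since the chirp
# autocorrelation is close to a Dirac delta.
xcorr = fftconvolve(y, p[::-1])
h_est = xcorr[len(p) - 1 : len(p) - 1 + len(h_true)]
print(np.argsort(h_est)[-3:])                 # indices of the strongest estimated taps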
In Figure 3.12, the top panel shows a segment of voiced speech signal recorded by a close-talk microphone. The middle panel shows an example of room impulse response. The first peak from the left, occurring at about 35 ms, determines the delay with which the direct sound propagates from the source position to the particular point in the room where the impulse response was estimated. The successive peaks, such as that occurring at about 42 ms, are due to early reflections. These are directional reflections gener-
16. A linear swept-frequency cosine signal.
ally well defined and are directly related to the shape and size of the room,
as well as to the furniture and to wall surface materials. The tail of the im-
pulse response is formed by diffuse reverberation, or late reflections, which
are more random and difficult to relate to the physical characteristics of
the room.
Figure 3.12: Top panel: voiced speech signal. Middle panel: reverberant room impulse
response. Bottom Panel: reverberant voiced speech signal obtained as the convolution of
clean signal and room impulse response.
The bottom panel shows the result of the convolution of the close-talk
speech signal (top panel) with the room impulse response (middle panel).
To show the effect of reverberation on each pitch epoch, the result has been time-aligned with the plot in the top panel. In the clean speech signal, the glottal closure instants are clearly visible at lag values τ = 35, 229, 427 and 634. The reverberation effect significantly changes the waveform and, as evident from the figure, only the local minima at lag values τ = 427 and 634 are still unambiguously detectable. The negative peaks at τ = 37 and 228, corresponding to the peaks of the clean segment at τ = 35 and 229, are instead exceeded by those at τ = 90 and 260.
This is the reason why reverberation is regarded as a convolutive noise that degrades speech quality and intelligibility. Examples of the effects of this degradation are visible in Figures 3.13 and 3.14, where the weighted autocorrelation (WAUTOC) and YIN algorithms have been applied to the reverberant speech segment shown in the bottom panel of Figure 3.12. In Figure 3.13, the WAUTOC function is calculated first on the clean speech signal, providing a pitch estimate of τ = 198 samples, then on the reverberant signal, providing the wrong pitch estimate τ = 340 samples.
It is interesting to note that not even a peak is present in the bot-
tom right panel around τ = 200. The reverberation effect has completely
canceled out the period component relative to the fundamental frequency.
Nevertheless, the reverberant speech signal is perfectly decoded by the human auditory system, thus resulting perfectly intelligible.
The same considerations apply to Figure 3.14, where the Cumulative Mean Normalized Difference Function (CMNDF) is computed. Again, the estimation of the pitch period on the reverberant signal fails, providing τ = 378 samples, while the correct one is around τ = 197 samples, the value provided by the CMNDF on the clean speech segment.
Besides all the above considerations, it has to be pointed out that
[Figure 3.13: WAUTOC function applied to the clean voiced speech signal x1(n) (top) and to x1(n) convolved with a room impulse response (bottom); time and lag in samples.]
convolving the room impulse responses with clean speech signals produces just an approximation of the reverberation effects. Reverberation, in fact, also includes non-linear phenomena which cannot be modeled by this method. To test a system in a real reverberant scenario, it is thus preferable to acquire the speech data by recording it directly in the environment where the talker is speaking, or where a loudspeaker is used to reproduce a given database. This operation, though more time-consuming and less versatile than the room impulse response method, provides the test conditions closest to the real environment. The latter approach, that is, the direct acquisition of reverberant speech data, has been used for testing the pitch extraction algorithms proposed in this thesis. The databases resulting from the recordings, which have been carried out in different real noisy and reverberant scenarios, will be described in
[Figure 3.14: CMNDF applied to the clean voiced speech signal x1(n) (top) and to x1(n) convolved with a room impulse response (bottom); time and lag in samples.]
Chapter 5.
When a PDA has to be tested on real-world speech data, that is, on data
affected by environmental noise and reverberation, a speech dataset re-
flecting such conditions is needed, along with the reference pitch values
necessary for performance evaluation. To obtain the speech dataset for the
target scenario a solution is to record a talker (or several talkers) in the
considered environment. Once the data is collected, the pitch reference
values, needed to evaluate a PDA's performance, have to be derived. Since this results in a time-consuming procedure, an alternative is to reproduce, in the selected scenario, an already pitch-labeled speech database by means of a loudspeaker. The latter procedure makes it possible, in fact, to obtain a new
speech database with relatively little effort, and the original pitch references can be reused to evaluate a given PDA.
When one of the above methods is applied, several microphones are usually employed, placed in different positions in the environment. This makes it possible to obtain several versions of the original database, each characterized by different noisy and reverberant conditions. The whole set of microphone outputs can also be needed, in case a PDA with multi-microphone processing capabilities has to be tested. The speaker or loudspeaker position is usually fixed during the recordings, although when spontaneous speech is needed the talker is not constrained to stand in a given position. An example of spontaneous speech used to test PDA performance in this thesis is represented by the seminar session recordings described in Section 5.3.
The signal captured by the i-th microphone of the network can thus be modeled as

y_i(n) = x(n) * h_i(n) + r_i(n), \qquad (3.8)

where x(n), h_i(n) and r_i(n) represent the speech signal, the acoustic impulse response between the speech source and the i-th microphone, and the noise signal affecting the considered sensor, respectively. Here x(n) represents the speech signal recorded by a close-talk microphone, which is convolved with the room impulse response h_i(n) to obtain its reverberant version, as previously introduced in Section 3.4.2.
Considering the additive term r_i(n) in Equation 3.8, this model makes it possible to artificially add noise in the time domain using random signal generators that can simulate noise patterns with a given distribution. Alternatively, the noise can be added using recordings from real environments. The result provided by this model is not fully realistic, since it considers the noise effects as additive and independent of the environment acoustics.
A more precise model, which takes into account both convolutional and
additive noise effects, can thus be written as
y_i(n) = x(n) * h_i(n) + \sum_{k=1}^{K} r_k(n) * \hat{h}_{ki}(n) + r'_i(n), \qquad (3.9)
where the new variables K and ĥ_ki(n) indicate the number of noise sources, each located in a known position, and the impulse response between the k-th noise source r_k(n) and the i-th microphone, respectively. The term r'_i(n) can still be used to model possible additive noise components.
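A minimal sketch of Equation 3.9 for a single microphone is given below, assuming that the clean signal, the impulse responses and the noise recordings are already available as arrays (all names are illustrative).

import numpy as np
from scipy.signal import fftconvolve

def simulate_channel(x, h_i, noise_sources, h_noise_i, r_extra=None):
    """y_i(n) = x*h_i + sum_k r_k*h_hat_ki + r'_i, truncated to the length of x."""
    y = fftconvolve(x, h_i)[: len(x)]
    for r_k, h_ki in zip(noise_sources, h_noise_i):
        y += fftconvolve(r_k[: len(x)], h_ki)[: len(x)]
    if r_extra is not None:
        y += r_extra[: len(x)]
    return y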
When the pitch extraction algorithm has to be tested on close-talk signals, the above described models can be used by setting the term h_i(n) in Equations 3.8 and 3.9 so that it introduces just the delay with which the speech signal reaches the acoustic sensor,17 and possibly an attenuation. Whenever the reverberation effects have to be modeled, instead, the term h_i(n) is set to the room impulse response, measured as described in Section 3.4.2.
17. A delayed version of a discrete signal x(n) is obtained by convolving it with the Kronecker delta function centered at the time sample corresponding to the propagation time.
Chapter 4
Multi-Microphone Approach
One of the scenarios on which the interest of the speech processing community has recently focused is the "meeting room" context, where several talkers are involved in the speech recognition process. In this context, a uniform coverage must be guaranteed so that each speaker position can be estimated and tracked. Along with distant ASR, this new environment
2. The maximum inter-microphone distance d allowed for a linear array is given by d ≤ c/(2 f_max), where c ≅ 330.7 m/s is the speed of sound and f_max is the maximum frequency present in the signal. For a 16 kHz sampled speech signal, considering f_max = 8 kHz, this yields approximately d ≤ 2 cm.
wautoc_i(\tau, q) = \frac{\sum_{n=0}^{N-1} [x_i(q+n)\, w(n)] \cdot [x_i(q+n+\tau)\, w(n+\tau)]}{k + \sum_{n=q}^{q+N-1} |x_i(n) - x_i(n+\tau)|} \qquad (4.1)

wautoc_M(\tau, q) = \sum_{i=1}^{M} w_i \cdot wautoc_i(\tau, q) \qquad (4.2)
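A direct sketch of Equations 4.1 and 4.2 follows; the analysis window, the small constant in the denominator and the uniform channel weights are assumptions of this sketch.

import numpy as np

def wautoc_channel(x_i, q, N, tau_max, eps=1e-3):
    """wautoc_i(tau, q) of Eq. 4.1 for tau = 0 .. tau_max."""
    w = np.hamming(N + tau_max)
    out = np.zeros(tau_max + 1)
    for tau in range(tau_max + 1):
        num = np.sum(x_i[q:q + N] * w[:N] * x_i[q + tau:q + tau + N] * w[tau:tau + N])
        den = eps + np.sum(np.abs(x_i[q:q + N] - x_i[q + tau:q + tau + N]))
        out[tau] = num / den
    return out

def wautoc_multimic(channels, q, N, tau_max, weights=None):
    """wautoc_M(tau, q) of Eq. 4.2 as a weighted sum of the channel functions."""
    M = len(channels)
    weights = np.ones(M) / M if weights is None else weights
    return sum(w_i * wautoc_channel(x_i, q, N, tau_max)
               for w_i, x_i in zip(weights, channels))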
Figure 4.1: A clean voiced speech signal and its reverberant version are shown in the top
and middle panels, respectively. The bottom panel shows the WAUTOC computed on:
the clean speech signal (black line), providing the correct pitch estimate τ = 121; the
reverberant speech signal (blue line), providing the wrong estimate τ = 179; ten outputs of
a Distributed Microphone Network (red line), providing, within a small error, the correct
estimate τ = 123.
The wrong estimate τ = 179 corresponds, in fact, to a fundamental frequency of about 2/3 f0. This is due to the regular presence of secondary peaks in the speech signal, such as those occurring at samples 27, 155, 280, etc., which, in the reverberant signal, mix with those produced by the glottal closure instants.
To plot the red line, a Distributed Microphone Network consisting of 10 microphones was used. The reverberant speech outputs x_i(n), 1 ≤ i ≤ 10, were used in the computation of the multi-microphone WAUTOC (Equation 4.2), which peaks at τ = 123. The correct pitch period value is thus obtained despite the distorted speech signals used. The small error in
the estimate is not considered a serious issue here since, as stated in [17], if an initial estimate is correct to within 20% of the actual one, several techniques are available to refine it.
The YIN algorithm was introduced in Section 2.3.1 as one of the state-of-the-art pitch detection algorithms [17]. For this reason it was chosen here to derive a multichannel version, suitable to work with the inputs provided by a DMN.
Given a speech signal x_i(n) captured by the i-th microphone, and recalling the difference function d(τ, q) of Equation 2.18, a channel dependent difference function can be written as

d_i(\tau, q) = \sum_{n=0}^{N-1} [x_i(q+n) - x_i(q+n+\tau)]^2, \qquad (4.3)

from which, following the single-channel YIN formulation, the channel dependent cumulative mean normalized difference function is obtained:

d'_i(\tau, q) = \begin{cases} 1, & \tau = 0 \\ d_i(\tau, q) \Big/ \left[ \tfrac{1}{\tau} \sum_{j=1}^{\tau} d_i(j, q) \right], & \text{otherwise.} \end{cases} \qquad (4.4)
The reason for deriving Equation 4.4 is twofold: on the one hand, it does not have a dip at the zero lag (τ = 0), as Equation 4.3 does. This implies that no limit on the search range of τ is needed. On the other hand, it provides a normalized function to which a threshold
can be applied to avoid subharmonic errors due to other dips deeper than the one relative to the pitch period. Normalization is also exploited by the YIN algorithm to apply post-processing in a later step, so that pitch estimation errors are further reduced.
The channel contributions are then combined into a single multi-microphone function by averaging their normalized versions:

d_M(\tau) = \frac{1}{M} \sum_{i=1}^{M} \frac{d_i(\tau)}{\max_{\tau}\{ d_i(\tau) \}}. \qquad (4.5)
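Equations 4.3 and 4.5 can be sketched as follows; the final dip-picking step (the deepest dip beyond a minimum lag) is an illustrative simplification of the YIN post-processing.

import numpy as np

def diff_function(x_i, q, N, tau_max):
    """d_i(tau, q) of Eq. 4.3."""
    frame = x_i[q:q + N]
    return np.array([np.sum((frame - x_i[q + tau:q + tau + N]) ** 2)
                     for tau in range(tau_max + 1)])

def multimic_difference(channels, q, N, tau_max):
    """d_M(tau) of Eq. 4.5: average of the max-normalized channel contributions."""
    d = [diff_function(x_i, q, N, tau_max) for x_i in channels]
    return np.mean([d_i / np.max(d_i) for d_i in d], axis=0)

def pitch_period(d_m, tau_min):
    """Pick the pitch period as the lag of the deepest dip beyond tau_min."""
    return tau_min + int(np.argmin(d_m[tau_min:]))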
Figure 4.2: A clean voiced speech signal and its reverberant version are shown in the
top and middle panels, respectively. The bottom panel shows the CMNDF computed on:
the clean speech signal (black line), providing the correct pitch estimate τ = 122; the
reverberant speech signal (blue line), providing the wrong estimate τ = 181; ten outputs of
a Distributed Microphone Network (red line), providing, within a small error, the correct
estimate τ = 125.
The f0 extraction algorithm based on the MPF can be classified under the
frequency domain category and, in particular, it includes a processing that
resembles that described in [90].
In a Distributed Microphone Network context, the different paths from the source to each microphone are affected differently by the non-linear reverberation effects, which can enhance some frequencies while
attenuating others.
The peaks in the magnitude spectrum which refer to f0 and its harmonics are thus altered by the linear and non-linear reverberation effects in both their dynamics and their frequency location. Nevertheless, as shown in Section 3.4.2, the reverberation effects that common speech processing applications usually deal with can be linearly approximated by means of the room impulse responses. This implies that the actual amount of non-linear distortion introduced by the environment can be considered limited and, as a consequence, the peak frequency shifts will be confined to a small frequency interval. Hence, the common harmonic structure across the different magnitude spectra can be exploited to better estimate the fundamental frequency.
An example of this is shown in Figure 4.3, where each output of a DMN
consisting of 10 microphones, is plotted (left) along with the correspond-
ing frequency spectrum (right). The corresponding clean speech segment,
captured by a close-talk microphone, and its spectrum, are shown at the
top of the figure, plotted in blue.
The considered clean speech segment corresponds to that used to test the multi-microphone versions of the WAUTOC and YIN algorithms in Sections 4.1.1 and 4.1.2, respectively. The pitch periods then estimated were τ = 121 and 122 samples, respectively. Considering the sampling frequency fs = 20 kHz, this corresponds to a fundamental frequency falling approximately in the range 164 ≤ f0 ≤ 165.3 Hz.
Comparing the reverberant signal spectra in the right column of the figure with that of the clean signal (top), it is interesting to note how the detrimental effects of reverberation change the shape of the spectra. The peaks relative to f0 and its harmonics are attenuated differently and their positions are not constant across the different channels. Also, spurious peaks appear beside those corresponding to f0 and its multiples.
Figure 4.3: Voiced speech segments output from a Distributed Microphone Network con-
sisting of 10 microphones. Left: time domain; Right: frequency domain. The top panels
show the speech segment captured by a close-talk microphone and its spectrum.
being k the frequency bin index. The real-valued contributions X_i(k) are then normalized and used to compute a weighted sum over all DMN channels:

X_{ave}(k) = \sum_{i=1}^{M} c_i \cdot \frac{X_i(k)}{\max_k\{ X_i(k) \}}, \qquad 1 \le k \le \frac{N_f}{2} + 1, \qquad (4.8)
where the weights ci represent the reliability of each channel and their
expression will be derived in the following. The index k is limited to N_f/2 + 1 since the result of Equation 4.7 is an even function with respect to the index N_f/2 + 1.
The last step to derive the Multi-microphone Periodicity Function consists in computing the Inverse FFT (IFFT) of X_ave(k), as follows:

mpf(\tau) = \mathrm{IFFT}\{ X_{ave}([1, \ldots, \tfrac{N_f}{2}+1, \tfrac{N_f}{2}, \ldots, 2]) \}, \qquad (4.9)
where the argument of the IFFT is a vector whose Nf elements are the
Xave (k) values, with k first ranging from 1 to Nf /2+1, then decreasing from
Nf /2 to 2, so that the original symmetry of Xi (k) is restored. The function
mpf(τ) is thus a minimum phase signal with characteristics similar to those of the autocorrelation function described in Section 2.3.1. The main difference is that, while the ACF can be obtained as the inverse Fourier transform of the magnitude spectrum raised to the second power (the power spectrum), the mpf is obtained by raising the magnitude spectra provided by Equation 4.7 to the first power. The reason for this choice will be given
in Appendix C.
The resulting mpf function thus has the same properties as a generalized autocorrelation function, and the lag value at which a maximum is found can be considered the fundamental period T0 of the analyzed frame. Interpolation can also be applied before searching for its maximum, in order to compensate for the resolution loss that occurred when the input signal was originally down-sampled. Once the minimum and maximum values that the estimated pitch period can assume, τ_min and τ_max, have been established, T0 is computed as

T_0 = \arg\max_{\tau_{min} \le \tau \le \tau_{max}} \, mpf(\tau), \qquad (4.10)

and the whole process is repeated for the next frame of speech. Estimating the reliability of each channel contribution in Equation 4.8, that is, evaluating
how close each X_i(k) gets to the spectrum of the close-talk signal, is carried out by estimating the weights c_i in a blind fashion. This is accomplished in two steps. First, a reference spectrum is derived as the product of the max-normalized channel magnitude spectra:

X_P(k) = \prod_{i=1}^{M} \frac{X_i(k)}{\max_k\{ X_i(k) \}}. \qquad (4.11)
This results in a function X_P(k) that retains the information common to the different channels while rejecting frequency patterns not common to all channels. The result of Equation 4.11 can be thought of as an estimate of the close-talk speech spectrum.
The second step computes each weight c_i based on the Cauchy–Schwarz inequality applied to the functions X_i(k) and X_P(k), considering them as if they were vectors:

c_i = \frac{\sum_{k=1}^{K} X_P(k)\, X_i(k)}{\sqrt{\sum_{k=1}^{K} X_P^2(k)} \; \sqrt{\sum_{k=1}^{K} X_i^2(k)}}, \qquad K = \frac{N_f}{2} + 1. \qquad (4.12)
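The combination and weighting steps of Equations 4.8, 4.9, 4.11 and 4.12 can be sketched as follows, assuming the channel magnitude spectra of Equation 4.7 are already available as the rows of a matrix X (down-sampling and interpolation are omitted; all names are illustrative).

import numpy as np

def mpf_from_spectra(X, tau_min, tau_max):
    """X has shape (M, Nf/2 + 1): one magnitude spectrum per channel."""
    Xn = X / X.max(axis=1, keepdims=True)          # per-channel max normalization
    X_P = np.prod(Xn, axis=0)                      # reference spectrum, Eq. 4.11
    c = np.array([np.dot(X_P, Xi) / (np.linalg.norm(X_P) * np.linalg.norm(Xi))
                  for Xi in Xn])                   # channel weights, Eq. 4.12
    X_ave = c @ Xn                                 # weighted average spectrum, Eq. 4.8
    full = np.concatenate([X_ave, X_ave[-2:0:-1]]) # restore the even symmetry
    mpf = np.real(np.fft.ifft(full))               # Eq. 4.9
    T0 = tau_min + int(np.argmax(mpf[tau_min:tau_max + 1]))
    return mpf, T0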
Figure 4.4: Simplified scheme of the pitch extractor based on the Multi-microphone Periodicity Function. The speech signal
down-sampling blocks and the MPF interpolation blocks are not shown.
The frequency positions of f0 and its harmonics are plotted across all the spectra panels of Figure 4.3. The first harmonic, at 329.6 Hz, is the one that is best matched in the several channel spectra: disregarding the first and second channels, in all the others it reflects quite well the amplitude and position of the corresponding peak in the close-talk speech spectrum. A similar consideration can be made for the second harmonic at 492.3 Hz: except for the fourth and tenth channels, where it is completely misplaced, in the other spectra it matches the reference one. For f0 and the other harmonics, the spectral peaks fail to match either in amplitude (as mostly happens for the third harmonic), or in frequency position (fourth harmonic), or in both amplitude and position (f0).
Figure 4.5: Top panel: spectrum of the close-talk voiced speech segment plotted in the top left panel of Figure 4.3. Middle panel: reference spectrum XP and averaged spectrum Xave computed on the channel contributions shown in the left column of Figure 4.3. Bottom panel: the resulting channel weights ci.
The function XP(k) is used to obtain the channel weights, which are plotted in the bottom panel of the figure. The situation described above is coherently reflected by the ci values: channels 4 and 10 are considered the least reliable, while channels 8 and 9 show the best similarity to the spectrum of the close-talk speech signal.
4. In the cases where a single speech input is used, it represents the only term in the summation of Equation 4.8, and the corresponding ci coefficient is set to 1.
Figure 4.6: A clean voiced speech signal and its reverberant version are shown in the
top and middle panels, respectively. The bottom panel shows the MPF computed on: the
clean speech signal (black line); the reverberant speech signal (blue line); ten outputs of a
Distributed Microphone Network (red line). All tests provided, within a small error, the
correct pitch estimate, that is, τ = 121, 127 and 124, respectively.
Chapter 5
Experimental Results
In the previous chapters, many pitch extraction algorithms have been de-
scribed and discussed. However, the focus has been restricted to two state
of the art algorithms, the Weighted Autocorrelation and the YIN algo-
rithms, and to the Multi-microphone Periodicity Function, proposed in this
thesis to overcome some limitations imposed by the traditional approaches,
when tested in reverberant and noisy scenarios. This chapter describes the
speech material that was collected and adapted to test these algorithms,
the error measure adopted for their evaluation and the obtained results.
Particular care and effort were dedicated to this testing phase, since it is very important to have a large amount of speech data and a particularly precise pitch reference against which to compare the PDAs' output. Some further considerations about the characteristics that pitch reference labels should possess are given in Appendix B.
It is not rare that the majority of the pitch extractors agree on an erroneous f0 estimate. To be sure of the reliability of the final labels, a manual check is needed anyway. This method was used to obtain the CHIL database, described in Section 5.3.
There are various measures that can be used to evaluate the quality of a
pitch extraction algorithm. The principal ones are response time, accuracy,
resolution and complexity. The response time measures the delay with
which the device adapts to a sudden change in the pitch or provides its
estimate after an unvoiced/voiced transition occurs. Accuracy is related to the reliability of the result, while resolution concerns the precision with which the provided value matches the reference pitch. Complexity mainly concerns the amount of hardware resources needed to run
the algorithm, in terms of memory and computational requirements. This
measure turns out to be very important in real-time applications, where
no delay is allowed between the current analyzed frame and the provided
pitch estimate.
In this thesis, the principal method used to evaluate the PDAs' performance is the Gross Error Rate (GER). This is calculated considering the number of f0 estimates which differ by more than a certain percentage θ from the reference values. Considering a total of N_fr estimated pitch values, its formulation can be written as

GER(\theta) = \frac{100}{N_{fr}} \sum_{i=1}^{N_{fr}} \left\{ \frac{|\hat{f}_{0i} - f_{0i}|}{f_{0i}} > \theta\% \right\}, \qquad (5.1)
where f̂0i and f0i are the estimated fundamental frequency and the reference one relative to the i-th frame, respectively. The term in curly brackets evaluates to 1 when the condition is satisfied and to 0 otherwise.
where NGER(θ) equals the number of elements in the set Ω(θ). Although this measure provides a very precise indication of the PDA resolution capabilities, the GER is mainly used here for the evaluation of the experiments. In fact, as stated in [17], once the initial estimate provided is within 20% of being correct, many further processing techniques are available to refine its value.
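A sketch of Equation 5.1 follows, assuming arrays of estimated and reference f0 values for the voiced frames (names are illustrative).

import numpy as np

def gross_error_rate(f0_est, f0_ref, theta=20.0):
    """GER(theta): percentage of frames whose relative f0 error exceeds theta percent."""
    f0_est, f0_ref = np.asarray(f0_est, float), np.asarray(f0_ref, float)
    errors = 100.0 * np.abs(f0_est - f0_ref) / f0_ref > theta
    return 100.0 * np.mean(errors)

# e.g. gross_error_rate([100, 210, 150], [100, 100, 148]) -> 33.33...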
It may happen that, if the V/UV information provided by each PDA is used to evaluate its estimates, these voicing segmentations do not coincide for all devices. Therefore it would not be possible to establish whether a particular device performs better than another due to its good pitch estimation capabilities or to its precision in V/UV decisions. To avoid this possible ambiguity, the tests carried out in the following assume a common V/UV segmentation for all tested algorithms.
5.2 Keele
The Keele database consists of five male and five female English speakers who pronounced phonetically balanced sentences from "The North Wind Story", for a total duration of 9 minutes. A close-talk microphone was used in a soundproof room to record the readers, while a laryngograph was simultaneously employed to track the signal generated by the speakers' vocal folds. The sampling frequency used to digitize the signal was set to fs = 20 kHz with 16-bit resolution. The pitch reference files contain V/UV information and a pitch estimate for every 10 ms frame of the speech signal. Pitch values were extracted by applying the autocorrelation function to the laryngograph output [80].
Figure 5.1: Example of pitch references based on laryngograph signal. A voiced speech
segment of a male speaker and the relative laryngograph signal lx(n) are plotted in the
first and second panels, respectively. The third panel shows the high-pass filtered version,
lxhp (n), obtained from lx(n). The bottom panel compares the pitch references obtained
using the Praat tool [11], applied to lxhp (n), and the original ones, provided with the
Keele database.
The first and second panels of Figure 5.1 show a voiced speech segment of a male speaker, x(n), and the corresponding laryngograph signal lx(n), respectively. The latter is affected by a slowly varying bias, more evident in the leftmost part of the panel, probably due to movements of the speaker during the original recordings. To eliminate this bias, a high-pass linear-phase filter was applied to lx(n), obtaining the signal lxhp(n) shown in the third panel.
5.2.1 Scenario
Figure 5.2: Office with ten microphones and a loudspeaker placed in two positions: the first,
marked P1, oriented 30 degrees towards the top right; the second, P2, directed from left
to right. The room is quiet, except for a computer fan marked with a "∗".
5.2.2 Results
The graphs of Figures 5.3, 5.4, 5.5 and 5.6 refer to pitch estimation results
measured in terms of GER(20) and GER(5). On the x-axis, the index of the
considered DMN microphone is indicated, while on the y-axis the measured
GER is shown. Each algorithm is marked with a specific symbol, that is,
"H" for WAUTOC, "•" for YIN and "" for MPF. To distinguish between
the different analyzed contexts, results have been plotted with different
colors: red for the single distant channel context, black for the joint
multi-microphone scenario, and blue for results obtained using the close-talk
signal. Tables 5.1, 5.2, 5.3 and 5.4 numerically summarize the results
shown in the graphs.
The top right part of Figure 5.3 shows the office environment with the
DMN and the speaker position and orientation. As shown by the red curves,
which report the results obtained by the three PDAs applied to each distant
microphone individually, the 3rd, 4th, 5th and 10th microphones provided
the most corrupted signals. The high GER(20) values obtained are due to
the presence of windows located above the first group of microphones,
which are characterized by a higher reflection coefficient compared to the
surrounding walls. Microphone 10, instead, falls almost outside the sound
field produced by the speaker. Consequently, it cannot
properly capture sound arriving along the direct path and, thus, the
reflected components have a greater detrimental influence. This holds, to
some extent, also for microphones 1 and 9. The best result is obtained
using the speech signal recorded by the 8th microphone, which is positioned
almost in front of the source and far enough from the reflecting windows
mentioned above.
Among the single-channel versions of the tested algorithms, WAUTOC
provided the worst results, while MPF provided the best; the YIN algorithm
gave GER(20) values in between.
GER(20) WAUTOC YIN MPF
close-talk 4.51 2.04 2.68
single-mic 14.56 11.56 9.42
multi-mic 8.09 6.80 5.39
Table 5.1: Gross error rates (20%) obtained applying WAUTOC, YIN and MPF, respec-
tively, to the Keele speech dataset. Values refer to the curves depicted in Figure 5.3 and
in the second row the averages, computed for each red curve, are reported. Bold font is
used to indicate the best result obtained in each acoustic condition.
Figure 5.3: The three red curves show gross error rates (20%) derived by applying
WAUTOC (H), YIN (•) and MPF (), respectively, to each of the 10 microphone sig-
nals. Results refer to the loudspeaker in position P 1. The six horizontal lines indicate the
performance provided by each of the three algorithms on the close-talking signals (blue)
and the corresponding performance obtained by their multichannel version applied to all
the far microphone signals (black).
(black lines), the trend provided by the single distant microphones is
confirmed. In this case too, the MPF provided the lowest GER(20), 5.39%,
compared to the values of 6.8% and 8.09% given by YIN and WAUTOC,
respectively. It is interesting to note that the MPF further reduced the
GER with respect to the best result it had previously achieved on any single
microphone. This confirms, as pointed out in Section 4.1.3, the ability of
the proposed algorithm to exploit the information redundancy offered by
the DMN and to reject the reverberant contributions which affect each
channel differently.
Besides, comparing the above results with the value of 16.2% obtained
by applying traditional beamforming techniques, as shown in [6], confirms
how the latter techniques are unsuitable for such a microphone disposition
(Section 4.1).
Figure 5.4: The three red curves show gross error rates (5%) derived by applying
WAUTOC (H), YIN (•) and MPF (), respectively, to each of the 10 microphone sig-
nals. Results refer to the loudspeaker in position P 1. The six horizontal lines indicate the
performance provided by each of the three algorithms on the close-talking signals (blue)
and the corresponding performance obtained by their multichannel version applied to all
the far microphone signals (black).
Table 5.2: Gross error rates (5%) obtained applying WAUTOC, YIN and MPF, respec-
tively, to the Keele speech dataset. Values refer to the curves depicted in Figure 5.4 and
in the second row the averages, computed for each red curve, are reported. Bold font is
used to indicate the best result obtained in each acoustic condition.
mation of the strength against high signal distortion, better values for each
single distant microphone are obtained. This behaviour will be confirmed
by the tests carried out with the loudspeaker placed in position P2, where
the overall reverberation effect is stronger and MPF provides even better
results.
The first thing to note when considering the experiments run with the
loudspeaker located in position P2 is the higher reverberation which affects
the speech signals. This is visible in Figure 5.5: the red curves, obtained by
testing the three algorithms on each single distant microphone, have a
higher average value compared to those of the P1 scenario. In particular,
it is interesting to note that the presence of windows on the top part of the
right wall still negatively affects the signal acquisition by the microphones
below it. This is true especially for the 5th microphone, which provides one
of the worst contributions, as shown by its high GER.
The best acquisitions were those of microphones 8, 9 and 10, owing to
their placement close to the sound source and far from the windowed wall.
As in the P1 case, the YIN results still lay in between those provided by
the MPF algorithm and the WAUTOC results, which turned out to be
the worst.
GER(20) WAUTOC YIN MPF
close-talk 4.51 2.04 2.68
single-mic 17.10 14.50 10.98
multi-mic 10.14 8.97 7.00
Table 5.3: Gross error rates (20%) obtained applying WAUTOC, YIN and MPF, respec-
tively, to the Keele speech dataset. Values refer to the curves depicted in Figure 5.5 and
in the second row the averages, computed for each red curve, are reported. Bold font is
used to indicate the best result obtained in each acoustic condition.
The results obtained in the close-talk scenario (blue lines) were already
commented on in the previous section. They represent a GER lower bound
for the three tested algorithms and are reported in this graph for comparison
purposes only. What is interesting to note here is the general worsening of
the GER figures when the reverberant signals are used, both in the single-
and in the multi-channel fashion.
In the latter case (black lines), the MPF achieved the best result, GER(20) =
7%, followed by the YIN and WAUTOC algorithms with a GER(20) of 8.97%
and 10.14%, respectively. Also in this scenario, MPF further reduced the GER
with respect to the best result it had previously achieved on any single
microphone. This happened also for the WAUTOC algorithm, which obtained
the largest improvement compared with the single distant microphone scenario.
However, this time domain based algorithm demonstrated its ineffectiveness
in processing reverberant signals, compared to the other PDAs that have
been considered.
Figure 5.5: The three red curves show gross error rates (20%) derived by applying
WAUTOC (H), YIN (•) and MPF (), respectively, to each of the 10 microphone sig-
nals. Results refer to the loudspeaker in position P 2. The six horizontal lines indicate the
performance provided by each of the three algorithms on the close-talking signals (blue)
and the corresponding performance obtained by their multichannel version applied to all
the far microphone signals (black).
As pointed out in the previous section, despite the fact that the MPF
algorithm does not perform any post-processing for pitch estimate refinement,
it is able to provide the best performance under strong reverberation. As
shown in Figure 5.6, in the close-talk case (blue line) YIN and WAUTOC
performed better, being designed to cope with the clear periodicity of
close-talk speech signals. Instead, when reverberant signals are considered,
that is, in the single distant and multi-microphone contexts, MPF still
provided the pitch estimates with the best resolution.
GER(5) WAUTOC YIN MPF
close-talk 8.49 7.42 10.46
single-mic 29.99 27.80 25.77
multi-mic 22.89 21.88 21.59
Table 5.4: Gross error rates (5%) obtained applying WAUTOC, YIN and MPF, respec-
tively, to the Keele speech dataset. Values refer to the curves depicted in Figure 5.6 and
in the second row the averages, computed for each red curve, are reported. Bold font is
used to indicate the best result obtained in each acoustic condition.
Figure 5.6: The three red curves show gross error rates (5%) derived by applying
WAUTOC (H), YIN (•) and MPF (), respectively, to each of the 10 microphone sig-
nals. Results refer to the loudspeaker in position P 2. The six horizontal lines indicate the
performance provided by each of the three algorithms on the close-talking signals (blue)
and the corresponding performance obtained by their multichannel version applied to all
the far microphone signals (black).
From this, two other datasets were derived by adding white noise to the
3rd channel at SNRs of 0 and −5 dB, respectively. The procedure was
then repeated to derive two more datasets using babble noise instead of
white noise.
The GER(20) results provided by the multichannel versions of WAUTOC
(blue line), YIN (black line) and MPF (red line), applied to the five speech
datasets obtained, are shown in Figure 5.7. The upper panel of the figure
shows the tests conducted on the speech signals contaminated by white noise,
while the lower panel describes the performance obtained employing the speech
data to which babble noise was added. In both scenarios two versions of
the MPF were tested: the first with all weights ci set to 1 (dashed line), so
that all microphone contributions were equally considered in Equation 4.8,
and the second with the weights provided by Equation 4.12.
The common x-axis reports which of the three speech datasets was
considered for each scenario, indicating, from left to right, decreasing SNR
levels measured on the third microphone output.
The upper right part of the figure shows the loudspeaker position and
direction and the three microphones considered, M1, M2 and M3. Microphone
M3 is plotted in red to indicate that its output was contaminated with
different SNR values.
As shown in the figure, the GER(20) provided by the three algorithms, in
both the white and babble noise scenarios, worsened as the SNR of the third
channel decreased. A trend common to all tests is that MPF provided the
lowest GER(20), while YIN and WAUTOC performed worse.
As indicated by the results relative to the "no noise added" case, that
is, when the original speech signals were used, the MPF provided almost
the same GER(20) value with its two versions. Channel reliability estimation
thus turned out not to be particularly advantageous in this noise-free
scenario. The opposite can be stated for the
Figure 5.7: Gross error rates obtained by the multichannel version of each algorithm
under different noisy conditions. Only three microphones were used and noise was added
to channel 3 at different SNR levels. The upper panel shows results obtained on speech
data contaminated with white noise, while the lower panel refers to the babble noise scenario.
results obtained in noisy conditions: for decreasing SNR levels, the MPF
version with the weight estimation provided by Equation 4.12 proved to be
the most robust in both the white and babble noise conditions.
As was also the case for WAUTOC and YIN in the tests shown in
Figures 3.9 and 3.10, the three algorithms were more robust to white noise;
in fact, the curves plotted in the upper panel of Figure 5.7 are flatter than
those of the lower panel.
Babble noise represents a more difficult noise for the algorithms to cope
with, since its spectrum, rather than being flat as in the case of white noise,
can resemble that of voiced speech, thus becoming a misleading source of
information for the f0 estimator. This can be seen in the lower panel,
observing that the GER(20) provided by both WAUTOC and YIN increased
by almost 6% passing from the clean scenario to the −5 dB SNR one. Also
the MPF version with all weights ci set to 1 performed considerably worse
with decreasing babble noise SNRs, passing from a GER(20) of almost 7%
in clean conditions to about 11% in the worst conditions.
When the MPF channel reliability estimation was exploited instead, the
GER(20) increase with decreasing SNR values was the lowest compared to
all other cases. The reason for this is that channel reliability estimation
allowed the f0 estimation to rely mostly on the least noisy channels, that
is, those corresponding to microphones M1 and M2.
A second test carried out to assess the usefulness of the weights ci
considered the whole set of DMN channels. In this case microphones M5,
M6 and M7 were contaminated with babble noise at decreasing SNRs. As
Figure 5.8 reports, the GER(20) values estimated in the noise-free scenario
are the same as those shown with black curves in Figure 5.3. All algorithms
performed worse with decreasing SNR values, although MPF
provided the best results and its weight-estimation based version limited
the performance deterioration caused by the more difficult acoustic
conditions. The behaviour of the three algorithms was similar to that shown
in the lower panel of Figure 5.7, thus confirming the conclusions already
drawn for that scenario.
Figure 5.8: Gross error rates obtained by the multichannel version of each algorithm under
different noisy conditions. The whole set of DMN outputs was used and babble noise was
added to channels 5, 6 and 7 at different SNR levels.
5.3 CHIL
One of the speech corpora collected under the CHIL project1 consists of
13 recordings, each about 5 minutes long, from female and male speakers,
extracted from real seminar sessions. These were scientific presentations
held at the Karlsruhe University, and the main difference from the Keele
corpus is that in this case spontaneous speech is dealt with. During the
talk, each speaker wore a "Countryman E6" close-talking microphone, to
capture a noise-free, non-reverberant speech signal, and moved freely in
the area labeled "speaker area" shown in Figure 5.10. Other distant-talk
microphones were used for the recordings and will be described in the next
section. The sampling frequency of the recorded signals was set to 44.1 kHz.
To obtain the reference pitch labels, three existing pitch extractor algo-
rithms were used:
SFS: a free computing environment for conducting research into the nature
of speech [47];
To merge the triplet of pitch estimates provided by the three PDAs at each
processed frame, their variance was computed and, if it was below a
certain threshold δ, the mean was retained as the merged pitch value.
Otherwise, 0 was assigned to the final reference value, with the convention
that a null pitch estimate means that the underlying speech segment is to
be considered unvoiced.
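A minimal sketch of this merging rule is given below; it assumes the per-frame pitch tracks from the three PDAs are already time-aligned, and the numeric value of the threshold δ is only a placeholder, not the one actually used for the CHIL references.

```python
import numpy as np

def merge_references(f0_praat, f0_wavesurfer, f0_sfs, delta=25.0):
    """Merge three per-frame pitch estimates into a single reference track.

    For each frame, if the variance of the three estimates is below delta,
    their mean is retained; otherwise the frame is marked as unvoiced (0).
    delta is a hypothetical value chosen only for illustration.
    """
    estimates = np.vstack([f0_praat, f0_wavesurfer, f0_sfs]).astype(float)
    var = estimates.var(axis=0)      # variance of the three estimates per frame
    mean = estimates.mean(axis=0)    # candidate merged value
    return np.where(var < delta, mean, 0.0)
```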
1
Computers in the Human Interaction Loop (CHIL) is an Integrated Project (IP 506909) under the
European Commission's Sixth Framework Program. A description of the used speech corpora can be
found at http://chil.server.de, http://www.nist.gov/speech and http://www.clear-evaluation.org.
The pitch values obtained with this method can be considered a very
reliable reference against which to test the performance of the algorithm
proposed in this thesis. It is unlikely, even if not impossible, that all three
PDAs described above provide a wrong estimate: each PDA is based on a
different internal algorithm, and the probability that all of them provide
exactly the same wrong estimate can be considered very low. Nevertheless,
a few references may still turn out to be wrong but, considering the amount
of data collected for testing, these reduce to an insignificant percentage.
2
This corresponded to about 20 minutes of voicing parts, for a total amount of about 125000 pitch
reference values.
Figure 5.9: The top panel shows the pitch estimates obtained from the CHIL speech corpora
using the Praat, WaveSurfer, and SFS algorithms, respectively. The merging procedure
which creates pitch reference labels for this speech dataset, considers only values from each
PDA which are very close to each other. The resulting reference is plotted in the bottom
panel.
5.3.1 Scenario
Figure 5.10 shows the plan of the CHIL room prepared at the Karlsruhe
University for recording seminars and meetings. The room is
7.10 m × 5.90 m wide and the ceiling height is 3 m. There is one entrance
in the north wall, and two more doors in the south wall leading to other
offices. The room was filled with different audio/video sensors, since it
was prepared to be used in the CHIL project context, which will be briefly
outlined in Chapter 7. Among other devices, some of which are not shown
in the figure, there are 4 fixed color cameras positioned in the corners and
4 inverted "T"-shaped microphone arrays (drawn in magenta), as well as
4 single distant microphones placed on top of a table.
Figure 5.10: Plan of the CHIL seminar and meeting room at the Karlsruhe University.
Four cameras were placed at each corner and four inverted “T”-shaped microphone arrays
(magenta color), labeled with letters “A”, “B”, “C” and “D”, are positioned as shown.
The four single microphones on the table and other devices not shown in the figure were
not used for the test described in this thesis.
As shown in the figure, the microphone arrays are labeled with the
letters "A", "B", "C" and "D", and their layout and coordinates are shown
in Table 5.5.
Table 5.5: Microphone coordinates of the inverted "T"-shaped arrays used in the CHIL
meeting room at Karlsruhe University. The left table reports the coordinates x, y and z of
the bottom-left microphone (labeled 1) of each array. The figure on the right shows the
frontal view of each array and the relative positions and distances of its microphones.
For the experiments carried out in this thesis, the 16 outputs of the
microphone arrays were recorded synchronously with the signal coming
from the close-talk microphone worn by the speaker. They were then
aligned to compensate for the propagation delay with which the talker's
speech reached each far microphone. Since the talker was moving while
speaking, her/his average position was calculated, considering the speaker
area shown in the room map.
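The following sketch illustrates one way such a delay compensation can be performed, assuming the microphone coordinates and the average speaker position are known; the integer-sample shifts and the variable names are simplifying assumptions, not the exact procedure used for the CHIL recordings.

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

def align_to_reference(signals, mic_positions, speaker_pos, fs):
    """Compensate the propagation delay of each far-microphone signal.

    signals: list of 1-D arrays, one per microphone.
    mic_positions: (M, 3) array of microphone coordinates in metres.
    speaker_pos: average speaker position (3,) estimated from the speaker area.
    Each channel is advanced by its extra delay with respect to the closest
    microphone (integer-sample shifts only, for simplicity).
    """
    dists = np.linalg.norm(np.asarray(mic_positions) - np.asarray(speaker_pos), axis=1)
    delays = (dists - dists.min()) / C              # extra delay w.r.t. closest mic
    shifts = np.round(delays * fs).astype(int)      # delay expressed in samples
    aligned = []
    for x, s in zip(signals, shifts):
        x = np.asarray(x, dtype=float)
        aligned.append(np.concatenate([x[s:], np.zeros(s)]))
    return aligned
```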
5.3.2 Results
algorithms. In all three conditions, the analysis step was set to 10 ms,
that is, each PDA computed 100 pitch estimates per second. To optimize
performance, each algorithm was tested varying the analysis window length
and type (the latter only for WAUTOC and MPF). The best results were
achieved setting the analysis windows to 30 ms, 40 ms and 60 ms for the
YIN, WAUTOC (rectangular window) and MPF (Hamming window) based
algorithms, respectively.
The graphs reported in Figures 5.11 and 5.12 refer to pitch estimation
results measured in terms of GER(20) and GER(5), respectively. The
two-letter labels on the x-axis indicate the array and which of its microphones
is considered, in accordance with the convention explained in Table 5.5.
The top-right part of the figure recalls the relative position between the
DMN elements and the talker, so that the dependency of the results on the
latter can be verified. On the y-axis the measured GER is shown, and
different symbols are used to mark the results obtained from each PDA:
"H" for WAUTOC, "•" for YIN and "" for MPF. To distinguish between
the different analyzed contexts, the results have been plotted with different
colors: red for the single distant channel context, black for the joint
multi-microphone scenario and blue for the results obtained using the
close-talk signal. Tables 5.6 and 5.7 numerically summarize the results
shown in the graphs.
are, in fact, the lowest ones, considering each method separately. This is
due to the proximity of the capturing device to the speaker and to the fact
that the latter often turns her/his head toward the screen, which is situated
just beneath the microphone array. The curves then show an increasing
GER(20) value as the talker-microphone distance increases, and the trend
is confirmed for each method.
In these conditions, WAUTOC provided the worst results, while MPF
the best; the YIN algorithm gave GER(20) values in between, even if closer
to the WAUTOC curve.
GER(20) WAUTOC YIN MPF
close-talk 1.60 0.13 0.14
single-mic 15.84 13.83 6.30
multi-mic 7.05 4.05 2.15
Table 5.6: Gross error rates (20%) obtained applying WAUTOC, YIN and MPF, respec-
tively, to the CHIL speech dataset. Values refer to the curves depicted in Figure 5.11 and
in the second row the averages, computed for each red curve, are reported. Bold font is
used to indicate the best result obtained in each acoustic condition.
Figure 5.11: The three red curves show gross error rates (20%) derived by applying
WAUTOC (H), YIN (•) and MPF (), respectively, to each of the 16 microphone signals.
The six horizontal lines indicate the performance provided by each of the three algorithms
on the close-talking signals (blue) and the corresponding performance obtained by their
multichannel version applied to all the far microphone signals (black).
(black lines), the trend provided by the single distant microphones is
confirmed. Although the YIN and WAUTOC algorithms demonstrated the
largest relative improvement compared to the single channel case, in this
case too MPF provided the lowest GER(20), 2.16%, compared to the values
of 4.05% and 7.05% given by YIN and WAUTOC, respectively.
Table 5.7: Gross error rates (5%) obtained applying WAUTOC, YIN and MPF, respec-
tively, to the CHIL speech dataset. Values refer to the curves depicted in Figure 5.12 and
in the second row the averages, computed for each red curve, are reported. Bold font is
used to indicate the best result obtained in each acoustic condition.
Figure 5.12: The three red curves show gross error rates (5%) derived by applying
WAUTOC (H), YIN (•) and MPF (), respectively, to each of the 16 microphone signals.
The six horizontal lines indicate the performance provided by each of the three algorithms
on the close-talking signals (blue) and the corresponding performance obtained by their
multichannel version applied to all the far microphone signals (black).
Chapter 6
f0 in Blind Source Separation
x_j(n) = \sum_{i=1}^{N} \sum_{l=1}^{L} h_{ji}(l)\, s_i(n-l+1), \qquad j = 1, \cdots, M, \qquad (6.1)
where si (n) represents the i-th source, xj (n) the signal observed by the j-th
sensor, and hji (n) the room impulse response of length L, which models
1
The activity presented in this chapter was conducted while I was at the NTT Communication Science
Laboratories, Kyoto, JAPAN.
the delay and room reverberation effects from the i-th source to the j-th
sensor.
Here, the under-determined case is addressed, that is, N > M, with
N = 3 and M = 2, and separation is carried out in the time-frequency
domain. In this domain, speech signal sparseness can be assumed [12], and
the convolutive mixtures of Equation 6.1 can be written in terms of
instantaneous mixtures
\begin{bmatrix} X_1(\omega, m) \\ X_2(\omega, m) \end{bmatrix} = \begin{bmatrix} H_{11}(\omega, m) & H_{12}(\omega, m) & H_{13}(\omega, m) \\ H_{21}(\omega, m) & H_{22}(\omega, m) & H_{23}(\omega, m) \end{bmatrix} \begin{bmatrix} S_1(\omega, m) \\ S_2(\omega, m) \\ S_3(\omega, m) \end{bmatrix}, \qquad (6.2)
The variables ω and m indicate the frequency and frame indexes of the
short-time Fourier transforms of the sources S(ω, m), of the observed signals
X(ω, m), and of the mixing matrix H(ω, m), respectively. Each (j, i)-th
component of the latter 2 × 3 matrix represents the transfer function from
the i-th source to the j-th sensor.
In the determined or overdetermined case, the inverse of the mixing
matrix H(ω, m) can be computed and used to easily solve Equation 6.3 for
the source values Si. In the underdetermined case considered here, however,
the solution is not straightforward, since the mixing matrix, as in this
example, is not invertible. To solve the under-determined BSS problem,
several methods based on source sparseness have been proposed [12, 88].
The method explained in the following, whose building blocks are reported
in Figure 6.1, exploits the sparseness assumption and consequently supposes
that most of the signal samples can be considered
null in the given domain. This makes it possible to assume that sources
overlap at rare intervals [10]. Given this hypothesis, each target speaker
can be extracted by selecting from the mixture just those time-frequency
bins at which the speaker is considered to be active or predominant.
Figure 6.1: The scheme shows the basic building blocks of an underdetermined (three
speakers, two microphones) BSS system. A binary mask, designed exploiting the Direc-
tion Of Arrival (DOA) of each speaker signal, is applied to the common time-frequency
representation to extract each output.
For each time-frequency bin, the phase difference between the two microphone observations is computed as
\varphi(\omega, m) = \angle \frac{X_1(\omega, m)}{X_2(\omega, m)}. \qquad (6.4)
The result of Equation 6.4 then allows the Direction Of Arrival (DOA)
to be obtained for each time-frequency bin, computed as
\theta(\omega, m) = \cos^{-1} \frac{\varphi(\omega, m) \cdot c}{\omega \cdot d}, \qquad (6.5)
where c is the speed of sound and d is the microphone spacing. For each
frequency index, computing the histogram of θ(ω, m) reveals three peaks
centered approximately on the actual DOA of the sources (an example
is given in Figure 6.2), which can therefore be estimated by employing a
clustering algorithm such as k-means.
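A minimal sketch of this DOA estimation and clustering step is given below; it assumes that ω in Equation 6.5 denotes angular frequency, and the STFT parameters and the use of scipy's k-means are illustrative choices, not the thesis implementation.

```python
import numpy as np
from scipy.signal import stft
from scipy.cluster.vq import kmeans2

C, D = 343.0, 0.04   # speed of sound (m/s) and microphone spacing (m)

def estimate_doas(x1, x2, fs, n_fft=1024):
    """Per-bin DOA estimation (Eqs. 6.4 and 6.5) followed by k-means clustering."""
    f, _, X1 = stft(x1, fs, nperseg=n_fft)
    _, _, X2 = stft(x2, fs, nperseg=n_fft)
    phi = np.angle(X1 / (X2 + 1e-12))                        # phase difference, Eq. 6.4
    omega = 2.0 * np.pi * f[:, None]                         # angular frequency per bin
    arg = np.clip(phi * C / (omega * D + 1e-12), -1.0, 1.0)  # clip before arccos
    theta = np.degrees(np.arccos(arg))                       # DOA per bin, Eq. 6.5
    centroids, _ = kmeans2(theta[1:, :].ravel(), 3, minit="++")  # skip the DC row
    return theta, np.sort(centroids)
```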
Figure 6.2: Histogram of the DOAs computed from Equation 6.5. The peaks of the his-
togram are centered on the actual directions of arrival, θ1 , θ2 and θ3 , of the three speakers
talking at the same time.
Once the three DOA centroids θ̃k have been estimated, a binary mask is designed for each speaker as
M_k(\omega, m) = \begin{cases} 1, & \tilde{\theta}_k - \Delta \leq \theta(\omega, m) \leq \tilde{\theta}_k + \Delta \\ 0, & \text{otherwise} \end{cases} \qquad k = 1, 2, 3 \qquad (6.6)
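A small sketch of how the binary masks of Equation 6.6 can be built and applied to one mixture spectrogram is shown below; the width Δ is a free parameter and the value used here is only an example.

```python
import numpy as np

def binary_masks(theta, centroids, delta=10.0):
    """Binary masks of Eq. 6.6: one mask per estimated DOA centroid (degrees)."""
    return [((theta >= c - delta) & (theta <= c + delta)).astype(float)
            for c in centroids]

def extract_speaker(X1, mask):
    """Keep only the time-frequency bins attributed to one speaker.

    The masked spectrogram is then resynthesized with an inverse STFT.
    """
    return mask * X1
```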
Figure 6.3: On the left side are represented the spectrograms computed on a speech segment
uttered by speakers F 5, M 1 and M 2, respectively. The mixture spectrogram provided
by each microphone is shown at the top of the center column. Below it, the estimated
binary mask is plotted, using the blue, red and green colors, to indicate the time-frequency
locations that will be used for the spectrogram reconstruction of the F 5, M 1 and M 2
speakers, respectively. The latter are shown in the right column.
Figure 6.4: The graph shows the continuous mask obtained by means of linear inter-
polation of the DOA of each speaker. Given the current estimated DOA, θ(ω, m), the
time-frequency bin with coordinates (ω, m) of mask M1 (ω, m), M2 (ω, m) and M3 (ω, m),
is assigned the value marked with the blue, red and green circle, respectively.
As shown in the graph, the mask value M3(ω, m) is set to 0, since the
estimated DOA lies between the first and the second actual DOAs. Moreover,
since the estimated DOA is closer to θ̃2, M2(ω, m) is given the highest
coefficient (red circle), as it is more likely that the mixture bin at position
(ω, m) belongs to the second speaker. Other alternatives to linear
interpolation are polynomial interpolation or directivity pattern based
masks, as described in [5].
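The sketch below shows one plausible implementation of such a linearly interpolated mask, assuming that, for each bin, the two centroids bracketing the estimated DOA share weights that sum to one while the remaining mask is set to zero; this is a reading of Figure 6.4, not necessarily the exact rule used in the thesis.

```python
import numpy as np

def continuous_masks(theta, centroids):
    """Continuous masks obtained by linear interpolation of the per-bin DOA.

    Each bin receives, for the two centroids that bracket its DOA, weights
    that decrease linearly with the distance from the respective centroid
    (and sum to one); the remaining mask gets 0.
    """
    c = np.sort(np.asarray(centroids, dtype=float))
    masks = [np.zeros_like(theta) for _ in c]
    t = np.clip(theta, c[0], c[-1])                 # DOAs outside the span saturate
    for k in range(len(c) - 1):
        inside = (t >= c[k]) & (t <= c[k + 1])
        w = (t - c[k]) / (c[k + 1] - c[k])          # 0 at centroid k, 1 at centroid k+1
        masks[k][inside] = 1.0 - w[inside]
        masks[k + 1][inside] = w[inside]
    return masks
```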
As stated in the previous section, the outputs of a binary mask based BSS
system turn out to be distorted as a consequence of the fact that the
sparseness assumption is not always satisfied. Applying continuous masks
instead of binary ones proved to be beneficial mainly for reducing the
overall distortion, rather than the cross-speaker interference.
In fact, each signal yk(n) extracted by means of continuous masks accounts
for the target speaker si(n), i = k, and for a certain amount of residual
interference due to the interfering speakers si(n), i ≠ k. To improve
separation and sound quality, thus reducing musical noise, an extra
processing stage is employed, as shown in Figure 6.5. In the scheme
proposed here, the f0−VUV estimation block is responsible for estimating
both the fundamental frequency and the voiced/unvoiced (V/UV) information
from each of the extracted signals yk(n). Each signal f0k is then used to
tune a different adaptive FIR or IIR filter, which is active only on the
voiced segments indicated by the VUVk signal with which it is driven.
The FIR filter is responsible for the harmonic enhancement of the target
speaker yk(n), while the IIR filters suppress the interference caused by the
other speakers in the mixture. The final output y′k(t) is then obtained by
selecting the FIR filter output for speech segments labeled as voiced, and
the IIR filter output for unvoiced segments. To drive this selection, the
VUVk signals are exploited.
Figure 6.5: After blind source separation, each output yk (n) is processed by a PDA to
extract the pitch information f0k , as well as the voiced/unvoiced (V/UV) information
VUVk . These signals are used to drive FIR and IIR comb filters that enhance the dete-
riorated harmonic structure of voiced segments (FIR) and remove the interference due to
the voicing parts of the competing speakers.
Figure 6.6: Adaptive FIR filter (red) and speech waveform (black) with varying pitch
period. The FIR filter coefficients ai are plotted with red circles superimposed to the
voiced speech segment currently being processed. The spacing between each coefficient is
adjusted using the pitch information, which, at each time instant, provides the different
pitch periods Tr values.
periods of the target speaker, so that they add constructively. Since the
residual components from interfering speakers do not exhibit such periodic
behaviour, they are further reduced by the averaging procedure. This results
in the restoration of the continuity of the harmonic components, which is
advantageous for reducing musical noise.
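A simplified sketch of this pitch-synchronous FIR averaging is given below; the tap positions at integer multiples of the local pitch period and the uniform weights are assumptions made for illustration, not the exact coefficients ai of Figure 6.6.

```python
import numpy as np

def fir_harmonic_enhance(y, pitch_periods, n_taps=5):
    """Pitch-synchronous FIR enhancement: average n_taps pitch periods around
    each sample so that the periodic (target) component adds constructively.

    pitch_periods: per-sample pitch period in samples, obtained from the
    interpolated f0 track; samples with period <= 0 are treated as unvoiced.
    """
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    half = n_taps // 2
    for n in range(len(y)):
        T = int(round(pitch_periods[n]))
        if T <= 0:                       # unvoiced sample: leave untouched
            out[n] = y[n]
            continue
        idx = n + T * np.arange(-half, half + 1)
        idx = idx[(idx >= 0) & (idx < len(y))]
        out[n] = y[idx].mean()           # average of neighbouring pitch periods
    return out
```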
While the FIR filter enhances the voiced sections of the target speaker
yk(n), the IIR filters are given the task of removing the interference of the
competing speakers. This is carried out by filtering those sections of yk(n)
which are unvoiced while the competing speakers are voicing. The filter
used is an adaptive IIR comb filter [68], with transfer function given by
H(z) = \frac{\prod_{k=1}^{N_{IIR}} (1 + \alpha_k z^{-1} + z^{-2})}{\prod_{k=1}^{N_{IIR}} (1 + \rho \alpha_k z^{-1} + z^{-2})}, \qquad (6.8)
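The sketch below builds such a comb notch filter as a cascade of second-order sections, assuming αk = −2 cos(2πk f0/fs) so that the zeros fall on the harmonics of the interfering speaker; the ρ² term on the z⁻² denominator coefficient follows the standard adaptive comb filter of [68] and may differ from the exact thesis formulation.

```python
import numpy as np
from scipy.signal import lfilter

def iir_comb_remove(y, f0_interferer, fs, n_harm=5, rho=0.995):
    """Adaptive IIR comb notch filter suppressing the harmonics of an
    interfering speaker (one second-order section per harmonic, cf. Eq. 6.8).

    f0_interferer is treated as constant over the processed segment; in the
    adaptive case the coefficients are updated as f0 changes.
    """
    out = np.asarray(y, dtype=float)
    for k in range(1, n_harm + 1):
        alpha = -2.0 * np.cos(2.0 * np.pi * k * f0_interferer / fs)
        b = [1.0, alpha, 1.0]                  # zeros on the unit circle (notches)
        a = [1.0, rho * alpha, rho ** 2]       # poles just inside the unit circle
        out = lfilter(b, a, out)
    return out
```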
harmonics removal. This is because it provides a more abrupt and higher
cutoff ratio at the frequency locations of interest than its FIR counterpart.
The latter, in fact, must have a short impulse response to satisfy the
quasi-stationarity assumption valid for voiced segments.
6.2 BSS performance
This section describes the results obtained by applying the f0 based method
just described to enhance the quality of the outputs of a binary mask based
Blind Source Separation (BSS) system. Several factors affect the overall
performance of such an extended BSS system, such as the reverberation
level of the considered environment, the characteristics of the speech
inputs, and the ability of the PDAs to estimate the pitch values correctly.
Therefore, to evaluate the performance of the proposed BSS system, the
speech input data was carefully prepared to include both the reverberant
and the non-reverberant scenario, and several pitch extraction techniques
were tested. Results are thus given in terms of Signal to Distortion Ratio
(SDR) and Signal to Interference Ratio (SIR), as well as in terms of
GER(20) and RMSE(20).
SIR_k = 10 \log \frac{\sum_n y_{k s_k}^2(n)}{\sum_n \left( \sum_{i \neq k} y_{k s_i}(n) \right)^2} \qquad (6.9)
SDR_k = 10 \log \frac{\sum_n x_{j s_k}^2(n)}{\sum_n \left( x_{j s_k}(n) - \alpha\, y_{k s_k}(n - D) \right)^2} \qquad (6.10)
Denoting with sk the speech signal generated by the k-th speaker, and
with yk the corresponding output provided by the BSS system, the variables
of Equations 6.9 and 6.10 have the following meaning: yksi is the k-th
separation system output when only si is active and sl, l ≠ i, is silent;
xjsk is the observation provided by microphone j when only sk is active.
The parameters α and D are used to compensate for the amplitude and phase
difference between xjsk and yksk . To evaluate the performance of the pro-
posed method, SIR and SDR are computed using measurements from both
microphones and the best value is retained.
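A minimal sketch of the SIR and SDR computation of Equations 6.9 and 6.10 is given below; the estimation of α and D is not shown, and the function signature is illustrative.

```python
import numpy as np

def sir_sdr(y_target, y_interferers, x_target, alpha=1.0, delay=0):
    """SIR_k (Eq. 6.9) and SDR_k (Eq. 6.10) for one separated output.

    y_target:      y_{k,s_k}(n), the k-th output with only the target active.
    y_interferers: list of y_{k,s_i}(n), i != k, the same output with only one
                   interferer active at a time.
    x_target:      x_{j,s_k}(n), one microphone observation with only the
                   target active.  alpha and delay compensate amplitude and
                   phase differences; their estimation is not shown here.
    """
    y_target = np.asarray(y_target, dtype=float)
    x_target = np.asarray(x_target, dtype=float)
    interf = np.sum(np.vstack(y_interferers), axis=0)
    sir = 10.0 * np.log10(np.sum(y_target ** 2) / np.sum(interf ** 2))
    err = x_target - alpha * np.roll(y_target, delay)   # crude delay compensation
    sdr = 10.0 * np.log10(np.sum(x_target ** 2) / np.sum(err ** 2))
    return sir, sdr
```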
The proposed BSS system exploits pitch information to improve its sep-
aration performance which, in turn, depends on the accuracy and the reso-
lution with which the employed PDA estimates the pitch values. To show
the dependency of separation performance on pitch estimation quality, the
GER(20) and RMSE (or “fine pitch error”) measures will be computed.
These error measures were previously described in Section 5.1.1.
Figure 6.8 shows the setup used for the BSS experiments. The speakers'
positions are indicated with loudspeaker symbols, each of which refers to
signals s1, s2 and s3, respectively. The DOAs of the three speakers were set
to 45◦, 90◦ and 135◦, respectively, and the microphone-speaker distance,
for the reverberant case, was set to 1.1 meters. Two omnidirectional
microphones, spaced 4 cm apart, were used and are marked with circles.
To simulate an anechoic environment, i.e. T60 = 0 ms, the mixtures Xj(ω, m)
were obtained by computing Equation 6.2 with the values Hji(ω) set as follows
\begin{bmatrix} X_1(\omega, m) \\ X_2(\omega, m) \end{bmatrix} = \begin{bmatrix} e^{j\omega\tau_{11}} & e^{j\omega\tau_{12}} & e^{j\omega\tau_{13}} \\ e^{j\omega\tau_{21}} & e^{j\omega\tau_{22}} & e^{j\omega\tau_{23}} \end{bmatrix} \begin{bmatrix} S_1(\omega, m) \\ S_2(\omega, m) \\ S_3(\omega, m) \end{bmatrix}, \qquad (6.11)
where τji represents the time delay with which sound propagates from the
i-th speaker to the j-th microphone. Its value is computed as
τji = (dj/c) cos θi, where dj is the j-th microphone position and θi the i-th
source direction.
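The following sketch builds the anechoic mixtures of Equation 6.11 in the STFT domain; the microphone positions along the array axis and the sign convention of the exponent are assumptions consistent with the formulas above, and the array shapes are illustrative.

```python
import numpy as np

def anechoic_mixing(S, freqs, thetas_deg, d=0.04, c=343.0):
    """Build the two anechoic mixtures of Eq. 6.11 in the STFT domain.

    S: array of shape (3, n_freq, n_frames) with the source STFTs.
    freqs: STFT bin centre frequencies in Hz (omega = 2*pi*f assumed).
    The two microphones are taken at positions 0 and d along the array axis.
    """
    omega = 2.0 * np.pi * np.asarray(freqs)                  # rad/s, one per bin
    mic_pos = np.array([0.0, d])                             # d_j for j = 1, 2
    thetas = np.deg2rad(thetas_deg)                          # source directions
    tau = mic_pos[:, None] / c * np.cos(thetas)[None, :]     # tau_ji = (d_j/c)cos(theta_i)
    H = np.exp(1j * omega[None, None, :] * tau[:, :, None])  # H_ji(omega), shape (2, 3, F)
    X = np.einsum("jif,ift->jft", H, S)                      # X_j = sum_i H_ji * S_i
    return X
```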
Figure 6.8: Room for BSS tests. The setup used comprises 2 microphones (black circles)
and 3 loudspeakers (used to reproduce messages si , i = 1, 2, 3) positioned as shown.
For the reverberant case instead, the speech data was convolved with
room impulse responses recorded in a real room, 4.45 m × 3.55 m wide
and 2.50 m high, as shown in the figure. The measured reverberation time
was T60 = 130 ms and each impulse response hji (n) has been used to model
the reverberation effects on a sound propagating from source si to the j-th
microphone.
6.2.3 Results
For estimating the pitch values necessary to drive the comb filters described
in Section 6.1.2, three algorithms were tested: WAUTOC, YIN and MPF2.
WAUTOC and YIN were introduced in Chapter 2 and MPF in Section 4.1.3.
Although the latter algorithm is not used here in its multi-microphone
derivation, thus becoming similar to the ACF approach, it is tested for
comparison purposes and to show the advantages of frequency domain
analysis applied to reverberant signals.
For f0 estimation, the frame size was set to 30 ms, 40 ms, and 60 ms
considering the YIN, WAUTOC (rectangular window) and the MPF (Ham-
ming window) based algorithms, respectively. Pitch values were estimated
every 1 ms and the same VUVk signals were used in all experiments to
provide uniform test conditions to the different PDAs employed. These
signals were derived from the re-estimated Keele pitch reference values, as
explained in Section 5.2.
The parameters of the comb FIR and IIR filters were instead set to
NFIR = 5, NIIR = 5 and ρ = 0.995.
2
The proposed BSS system was also tested using the pitch estimated values provided by the REPS
algorithm [67], and the obtained results were reported in [30].
The values used for the several parameters involved were chosen to obtain
the best performance from each pitch estimation algorithm and from the
proposed BSS system.
Binary mask
The binary mask based BSS approach, described in Section 6.1, is assumed
here as the baseline system against which to compare the proposed BSS
system. The results obtained with this system, in terms of SIR and SDR
are reported in Table 6.1, where the left column refers to the anechoic
scenario, and the right column to the reverberant (or echoic) one.
Speech signals acquired in the anechoic scenario better satisfy the
sparseness assumption. As a consequence, the histograms (Figure 6.2)
computed on the estimated DOAs θ(ω, m) have well localized and sharp
peaks along the θ axis, making the estimation of the θ̃i values more reliable.
When the reverberant scenario is considered instead, reverberation causes
signals to overlap more in the time-frequency domain. This makes the esti-
mation of θ(ω, m) more difficult and less reliable. This in turn explains the
performance degradation shown in the table, for both the SIR and SDR
values.
Correctly estimating the pitch values from the outputs provided by the
binary mask based BSS system turns out to be a difficult task. In fact, if
Table 6.1: SIR and SDR values obtained by the binary mask based BSS system. Left
column refers to tests performed in an anechoic scenario, right column reports results
measured in a reverberant context.
the GER(20) and RMSE(20) results obtained from the three considered
PDAs on the original Keele signals are compared with those computed on
the outputs of the BSS system, an evident performance degradation can be observed.
Table 6.2 shows the results obtained using the unprocessed Keele signals,
while Table 6.3 shows those obtained after the mixtures were processed
by the BSS system. It turns out that the most difficult scenario is the
reverberant one, where the best GER(20), provided by the MPF algorithm,
was not lower than 16.59%. A better trend is observable for the anechoic
case, where the best performance is achieved by the YIN algorithm, with
GER(20) = 4.72%.
The better performance demonstrated by the MPF algorithm in rever-
berant conditions, which reduces to an ACF computed through FFT in the
single channel case, further strengthens the hypothesis that the frequency
domain based approach is more suitable for processing signals severely de-
teriorated by reverberation.
Continuous mask
When the estimated DOA for a particular mixture time-frequency bin dif-
fers considerably from any estimated centroid θ˜i , the probability of speaker
superposition is considered to be higher than when the DOA coincides with
one of the centroids. In such a case, this time-frequency bin will generate
Table 6.2: The GER(20) and RMSE(20) obtained processing the Keele database with
the WAUTOC, YIN and MPF algorithms, are compared. The original Keele signals were
down-sampled to 8 kHz before the estimation was carried out.
Table 6.3: Performance evaluation of the WAUTOC, YIN and MPF algorithms applied
to the output of the binary mask based BSS system. Results, given in terms of GER(20)
and RMSE(20), show the deterioration that occurs when reverberant signals are processed
(bottom) if compared with the anechoic scenario (top).
distortion in the signal selected for the target speaker, whereas information
will be missing in the spectrograms of the other extracted signals. To
partially overcome this problem, continuous masks are employed: for every
bin under consideration, each mask weight is assigned a value that decreases
linearly with the distance of the estimated DOA from the corresponding centroid.
The resulting SIR and SDR, measured on the output of the continuous
mask based BSS system, are reported in Table 6.4. Although the SIR
measured in echoic conditions slightly decreases, there is an overall
improvement in interference as well as in distortion reduction, for both the
echoic and anechoic scenarios. The larger improvement obtained in terms of
SDR demonstrates the advantages of using continuous masks.
Table 6.4: SIR and SDR values obtained by the continuous mask based BSS system. Left
column refers to tests performed in an anechoic scenario, right column reports results
measured in a reverberant context.
Table 6.5: Performance evaluation of the WAUTOC, YIN and MPF algorithms applied to
the output of the continuous mask based BSS system. The better quality of the output
signals provided by this BSS system is reflected in lower GER(20) and RMSE(20) values,
compared to the binary mask scenario. The first two rows show the results obtained in the
anechoic context, while the bottom part of the table reports the higher error rates relative
to the reverberant case.
After applying comb filtering to the BSS outputs obtained with continuous
masks, the results shown in Table 6.6 were obtained. Comparing the SIR
and SDR values with those from Table 6.4, it can be noted that SIR values
generally increased at the expense of SDR values. That is, the f0 based
comb filtering technique proved to be effective in eliminating interference
and restoring signal harmonics, though at the cost of introducing a small
amount of distortion. The largest improvement was that of the SIR value
in reverberant conditions, which increased from 10.55 dB to 11.45 dB after
comb filtering was applied. This was obtained employing the f0 values
provided by the MPF algorithm, which in turn produced the lowest GER(20)
compared to WAUTOC and YIN in the same conditions.
Table 6.6: Results obtained in terms of SIR and SDR after applying f0 based comb
filtering to the outputs provided by the continuous masks BSS system.
In the anechoic scenario, there is very little variation between the SIR
and SDR values obtained employing the different PDAs to estimate f0.
Although YIN provided the best GER(20) score in anechoic conditions
(Table 6.5), this is not clearly reflected by the figures in the upper part of
Table 6.6. This can be explained considering that, during processing, the
impulse response h(n) and the transfer function H(z) of the FIR and IIR
comb filters, respectively, are updated at the sample level.
Given that the original time interval between consecutive f0 estimates is
1 ms, parabolic interpolation was applied to obtain the pitch values needed
in between. Possible octave errors in pitch estimation will inevitably affect
the result of the interpolation, but in a way that is not easy to foresee,
since it also depends on how these errors group together.
The fine precision with which the f0 values are estimated also influences
the comb filtering processing, above all that based on the FIR filter. The
GER(1) measured for the YIN and MPF algorithms in anechoic conditions
was 31.48% and 29.77%, respectively. This could explain why the use of
these algorithms provided almost the same SIR and SDR values in anechoic
conditions, while their performance in Table 6.5 was different. This
consideration does not imply, in this context, the superiority of one algorithm
over the other. In fact, none of the considered PDAs was designed to provide
very precise f0 values since, once a pitch estimate is correct within a
neighborhood of the reference, its value is easily refined by many available
techniques. Instead, the main point here is the important role of f0
information in restoring and enhancing the harmonic structure of voiced
speech sections.
This can be verified by observing the relative improvement given by the
proposed BSS approach over the baseline system, as reported in Table 6.7.
Even though the comb filtering procedure slightly reduced the SDR values
obtained after the continuous mask application, combining the two
techniques provided an overall increase of both SIR and SDR values, in
both the anechoic and reverberant scenarios.
Table 6.7: Relative improvement obtained by the continuous mask + f0 based BSS system
with respect to the reference BSS system. Results are presented in terms of relative
improvement percentages, calculated comparing values from Table 6.6 with those from
Table 6.1.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
7.2 Future work
1
Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy.
2
This information can be used, in turn, to prevent the speech recognizer from associating a nonsense
transcription with a non-speech event, such as, for example, a cough.
Bibliography
[9] L. L. Beranek. Concert and Opera Halls: How They Sound. Acoustical Society of America, New York, 1996.
[19] P. B. Denes and E. N. Pinson. The Speech Chain; the Physics and
Biology of Spoken Language. Freeman, New York, 2nd edition, 1993.
[36] B. Gold and N. Morgan. Speech and audio signal processing; Processing and perception of speech and music. John Wiley & Sons, 2000.
[68] A. Nehorai and B. Porat. Adaptive comb filtering for harmonic signal
enhancement. IEEE Trans. ASSP, ASSP-34:1124–1138, Oct. 1986.
[71] A. M. Noll. Short time spectrum and cepstrum techniques for vocal
pitch detection. J. Acoust. Soc. Am., 36:296–302, 1964.
[90] S. Sagayama and S. Furui. Pitch extraction using the lag window
method. Proc. of IECEJ, 1978. (in Japanese).
[101] C. P. Smith. Device for extracting the excitation function from speech
signals. United States Patent No. 2,691,137, Oct 1954.
[112] L. A. Yaggi. Full duplex digital vocoder. Technical Report Nr. sp16-A63, Texas Instruments, 1963.
Appendix A
Time-frequency Uncertainty Principle
Let us consider the Fourier transform of a generic continuous signal x(t),
X(f) = \int_{-\infty}^{\infty} x(t) \left( e^{j 2 \pi f t} \right)^{*} dt, \qquad (A.1)
the complex conjugate of the term e^{j2πft}, against which x(t) is integrated,
represents the transformation kernel. Denoting by bp(t) a generic kernel
function, where p is a parameter, the above Fourier kernel can be written as
bp(t) = e^{j2πpt} and provides poor time resolution. In fact, this particular
kernel is defined over all t ∈ (−∞, ∞) and its Fourier transform Bp(f) is
a Dirac delta centered at p, ∆(f − p), thus being well localized in frequency.
On the other hand, if the kernel in the transformation (A.1) were set to
bp(t) = ∆(t − p), optimal time resolution would be provided, but no
frequency resolution at all would be available. This is also evident considering
the absolute value of the Fourier transform of bp(t), which is |Bp(f)| = 1
for all frequencies, regardless of the parameter p.
Denoting by ∆t² and ∆f² the variances of bp(t) and Bp(f), respectively,
the time-bandwidth product ∆t ∆f depends on the particular choice of bp(t)
and obeys the time-frequency uncertainty principle
\Delta_t \Delta_f \geq \frac{1}{2}. \qquad (A.2)
This limits the time and frequency resolutions achievable with a particular
kernel, since they are tied together by equation (A.2). The lowest achievable
value for the time-bandwidth product is ∆t ∆f = 1/2 and is provided by
the Gaussian pulse,
b(t) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} t^2}, \qquad B(f) = e^{-\frac{1}{2} f^2}, \qquad (A.3)
which thus provides the best joint time-frequency resolution.
Both functions in (A.3) are neither band-limited nor time-limited, but are
concentrated around their mean (which is zero), thus providing the lowest
achievable time-bandwidth product.
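As a numerical illustration, the short script below checks the time-bandwidth product for the Gaussian pulse of (A.3), under the assumption that Δt and Δf are the standard deviations of the energy densities |b(t)|² and |B(ω)|², with ω expressed as angular frequency; with this convention the product attains the bound of (A.2).

```python
import numpy as np

# Numerical check of the uncertainty bound (A.2) for the Gaussian pulse of (A.3).
t = np.linspace(-20.0, 20.0, 2 ** 16)
dt_step = t[1] - t[0]
b = np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def std_of(axis, density):
    """Standard deviation of a sampled density over a uniform grid."""
    step = axis[1] - axis[0]
    density = density / (density.sum() * step)     # normalize to a unit-area density
    mean = (axis * density).sum() * step
    return np.sqrt(((axis - mean) ** 2 * density).sum() * step)

B = np.fft.fftshift(np.fft.fft(b)) * dt_step        # approximate continuous transform
omega = 2.0 * np.pi * np.fft.fftshift(np.fft.fftfreq(len(t), d=dt_step))

print(std_of(t, np.abs(b) ** 2) * std_of(omega, np.abs(B) ** 2))   # -> 0.5
```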
Appendix B
Characteristics of the Reference Pitch Values
Method
Characteristics
a given speech database, shall have the same characteristics, that is, they
shall be both very precise at the pitch period level and reliable. Such
reference pitch values would thus represent the upper bound of the quality
achievable by a PDA and shall represent, in my humble opinion, the sole
term of comparison for all PDAs that have to be tested. Otherwise, adapting
the reference pitch values to the reliability or precision characteristics of
the tested PDA will provide biased feedback on its performance, not even
useful for comparison with the performance obtained by other devices.
Appendix C
Generalized Autocorrelation
emphasizes spectral peaks in relation to noise but, at the same time, flattens
the spectrum dynamics.
On the contrary, the cepstrum was reported to perform better than ACF2
on clean speech signals, but rather poorly on noisy signals. The authors'
conclusions on the four considered approaches are that PDAs based on ACF0.5
and ACF1 proved to be "less sensitive to noise than the cepstrum and less
sensitive to strong formants than the autocorrelation PDA and thus represent
a good compromise when the environmental signal conditions are unknown".
In this thesis a value of g = 1 was chosen for deriving the MPF function
described in Section 4.1.3. This value was suggested by some preliminary
tests conducted on a large amount of reverberant speech data and reported
in Figure C.1. In these experiments the MPF algorithm was tested on both
the Keele and CHIL databases and the corresponding scenarios (see
Sections 5.2 and 5.3), varying the g parameter of Equation C.2, which was
used instead of Equation 4.7, in the range 0.2–2.
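For illustration, the sketch below computes a generalized autocorrelation with spectral exponent g, assuming Equation C.2 has the form r_g(τ) = IFFT(|X(f)|^g); this assumption, and the zero-padding used, are not taken from the thesis.

```python
import numpy as np

def generalized_acf(frame, g=1.0, n_fft=None):
    """Generalized autocorrelation with spectral exponent g.

    Assumed form: r_g(tau) = IFFT(|FFT(x)|^g).  g = 2 gives the ordinary
    autocorrelation function, while smaller exponents (e.g. g = 1 or g = 0.5)
    compress the spectral dynamics before the inverse transform.
    """
    frame = np.asarray(frame, dtype=float)
    n_fft = n_fft or 2 * len(frame)                 # zero-padding avoids circular wrap
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** g
    r = np.fft.irfft(spectrum, n_fft)
    return r[: len(frame)]
```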
The figure shows the GER(20) obtained for each value of g considering
first the close-talk signals (left panels), then the reverberant outputs of the
employed Distributed Microphone Network (right panels).
The results show that in all conditions the lowest GER(20), indicated
with a red circle, is obtained using g < 2. Apart from the close-talk version
of the CHIL spontaneous speech database, for which g = 1.7 provided the
lowest GER(20), in all other cases the best MPF performance was obtained
with 0.5 ≤ g < 1.
Figure C.1: Dependency of the MPF GER(20) on the parameter g of Equation C.2.
there is a strong indication that a value of g close to 0.5 is beneficial for
pitch extraction from reverberant speech signals.
Given these results, it would have been possible to set g so as to achieve
the best result for each given scenario. Alternatively, the average of the best
g values measured in the different contexts could have been used.
However, to avoid introducing a critical parameter to be estimated from
the data, which would make the performance of the proposed MPF function
strongly dependent on the given task, g = 1 was used for all the tests
described in Chapter 5. Setting g = 0.5 would have provided even better
performance than that reported, but it was not considered a good general
choice, given that for g < 0.5, as shown in the figure, the GER(20) starts
to increase noticeably.