
Master's Thesis MEE04:20

Acoustic speech localization with microphone array in real time

Mikael Swartling

Degree thesis for Teknologie Magisterexamen (Master of Science) in Electrical Engineering

Blekinge Tekniska Högskola
January 2005

Blekinge Tekniska Högskola
School of Engineering (Sektionen för Teknik)
Department of Signal Processing (Avdelningen för Signalbehandling)
Examiner: Nedelko Grbic
Supervisor: Nedelko Grbic

Acoustic speech localization with microphone array in real time

Mikael Swartling
Blekinge Institute of Technology

Abstract
The purpose of this thesis is to evaluate and implement algorithms
for robust localization and tracking of moving acoustic sources in real
time using a microphone array. To identify inter-sensor delays, the
generalized cross correlation is used together with a filter bank. From
the inter-sensor delays, position is estimated using a linear intersection
algorithm. Position estimates are associated with tracks, which are filtered by a Kalman filter. Results from two real-room experiments are
presented to demonstrate the localization and tracking performance,
along with a discussion on real time implementation issues.

Contents

1 Introduction
2 Delay estimation
  2.1 Signal model
  2.2 The generalized cross correlation method
  2.3 Angle of arrival
  2.4 Multiple sensors
  2.5 Optimizing the cross correlation function
3 Filter banks
4 Position estimation
  4.1 Source localization problem
  4.2 Linear intersection
5 Track association and filtering
  5.1 Track association
  5.2 Filtering
6 Experiments
  6.1 Testing the angle of arrival
    6.1.1 Bias and variance
  6.2 Testing the localization and tracking
    6.2.1 Two fixed talkers
    6.2.2 Single moving talker
7 Real time implementation
8 Conclusion and further development

List of Figures

1 Delay due to extra propagation distance.
2 Path of possible source locations.
3 Sensor arrangement and delays when using multiple sensors.
4 Uniform DFT analysis filter bank.
5 Linear intersection.
6 Room with moderate echo.
7 Room with low echo.
8 Bias of estimated angles.
9 Standard deviation of estimated angles.
10 Two speakers having a conversation.
11 Two speakers having a conversation.
12 Single speaker moving in a circle.
13 Single talker moving in a circle.

1 Introduction

An array of microphones can be steered electronically to change its directivity pattern so that it only receives sounds from certain directions. This ability can be used to replace directed microphones, as the array has the advantage of rapidly changing its directivity pattern, allowing it to pick up new sources and follow source movements. Instead of steering the array's directivity pattern to a specific location, the array can also be used to search for acoustic sources by dynamically forming the directivity pattern to sweep over the surrounding environment.

The problem of locating a source is often split into three parts: inter-sensor delay estimation, position estimation, and track association and filtering. The most important of these is a precise and robust algorithm for inter-sensor delay estimation, since the delay estimates form the basis for further calculations and location estimates. To work in real time, it must also be computationally inexpensive, so that the signals can be processed as they are sampled and a continuous flow of inter-sensor delay estimates can be provided to the location estimator.

All three parts are discussed in this report. Experiments are performed to demonstrate the performance, along with a discussion on real time implementation issues; finally, conclusions and possible further developments are given.

2 Delay estimation

2.1 Signal model

Given two spatially separated sensors (in this thesis, the sensors are microphones), the signal received from an acoustic source at one sensor will be shifted in time relative to the other sensor due to the extra propagation distance from source to sensor. Figure 1 illustrates this delay when the source is located in the near and far field, respectively. In the near field case, the direction of arrival is different for the two sensors. In the far field case, the directions of arrival can be considered parallel and will therefore be the same for both sensors.

Assuming the relative attenuation between the two sensors is negligible,

(a) Near field source.

(b) Far field source.

Figure 1: Delay due to extra progagation distance.

the received signals x_0(t) and x_1(t) can be modelled as

    x_0(t) = s(t - \tau_0) + n_0(t)
    x_1(t) = s(t - \tau_1) + n_1(t)    (1)

where s(t) is the acoustic source signal, \tau_0 and \tau_1 are the propagation delays from the source to the sensors, and n_0(t) and n_1(t) are noise signals. The noise received at the sensors is considered mutually uncorrelated and also uncorrelated with the source signal. The relative delay between the sensors, \tau = \tau_1 - \tau_0, is the delay caused by the extra propagation distance.

The task is to estimate the delay \tau from finite-size blocks of data from x_0(t) and x_1(t). To track a talker, to locate new sources and to alternate quickly between several sources, a method to quickly estimate the delay is required.

2.2 The generalized cross correlation method

The method used to estimate inter-sensor delays in this thesis is based on the generalized cross correlation method, described in [KC76]. The delay is estimated by maximizing the cross correlation between the two signals x_0(t) and x_1(t), and can be expressed as

    \hat{\tau} = \arg\max_\tau R_{x_0 x_1}(\tau)    (2)

The cross correlation R_{x_0 x_1}(\tau) is related to the cross power spectrum G_{x_0 x_1}(\omega) by the Fourier transform as

    R_{x_0 x_1}(\tau) = \int_{-\infty}^{\infty} G_{x_0 x_1}(\omega) e^{j \omega \tau} \, d\omega    (3)

The cross power spectrum of x_0 and x_1, G_{x_0 x_1}(\omega), is calculated as

    G_{x_0 x_1}(\omega) = X_0(\omega) X_1^*(\omega)    (4)

where X_0(\omega) and X_1(\omega) are the Fourier transforms of x_0 and x_1, respectively, and * denotes complex conjugate.

The generalized cross correlation is defined in [KC76] as

    R_{x_0 x_1}(\tau) = \int_{-\infty}^{\infty} \Psi(\omega) G_{x_0 x_1}(\omega) e^{j \omega \tau} \, d\omega    (5)

where \Psi(\omega) is a general weighting function. The generalized correlation method known as phase transform, or PHAT, is obtained by setting the weighting function to

    \Psi_{PHAT}(\omega) = \frac{1}{|G_{x_0 x_1}(\omega)|}    (6)

This weighting function normalizes the absolute value of all coefficients in the cross spectrum to unity, and uses only the phase information to calculate the cross correlation.
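As a concrete illustration, the GCC-PHAT estimate of equations (2) through (6) can be sketched for discrete, finite-length signals. This is not the thesis's implementation: the function names are mine, a naive O(N^2) DFT is used for clarity, and only integer lags are evaluated.

```python
import cmath

def dft(x):
    """Naive DFT, O(N^2); clear but slow, for illustration only."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def gcc_phat(x0, x1):
    """Estimate the delay (in samples) between x0 and x1.

    Returns the lag d maximizing the PHAT-weighted cross correlation;
    a positive d means x0 is a delayed copy of x1, i.e. x0(n) ~ x1(n - d).
    """
    N = len(x0)
    X0, X1 = dft(x0), dft(x1)
    # cross power spectrum, eq (4), with PHAT weighting, eq (6)
    G = [a * b.conjugate() for a, b in zip(X0, X1)]
    G = [g / abs(g) if abs(g) > 1e-12 else 0.0 for g in G]
    # inverse DFT gives the generalized cross correlation, eq (5)
    R = [sum(G[k] * cmath.exp(2j * cmath.pi * k * n / N)
             for k in range(N)).real / N for n in range(N)]
    lag = max(range(N), key=lambda n: R[n])
    return lag if lag <= N // 2 else lag - N  # wrap large lags to negative delays
```

Note that this sketch only inspects integer lags; the thesis instead maximizes the correlation over a continuous lag (section 2.5), which matters because with closely spaced sensors the true delay can be a fraction of a sample.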

2.3 Angle of arrival

When the time delay of arrival is estimated and the array geometry is known, a direction of arrival can also be estimated. From a given delay, a path can be calculated along which the source is located. It is not possible, using only two sensors, to determine where along the path the source is located. The path is a parabolic curve in two dimensions, as illustrated by the dashed line in figure 2(a). The curve is actually mirrored along the line connecting the two sensors. However, only one half-space is considered here; the source is assumed to be located in front of the sensor array.

In the far field, the parabolic curve approaches a straight line. Assuming the source is always located in the far field, it is possible to approximate the

Figure 2: Path of possible source locations. A source located in the near field results in a parabolic curve of possible source locations (a); in the far field the parabolic path can be approximated by a straight line (b).

parabolic curve with a straight line, as shown in figure 2(b). The angle \theta is the angle of arrival for a distant source.
The angle of arrival can be calculated as

    \theta = \sin^{-1} \left( \frac{\tau c}{d f_s} \right)    (7)

where c is the speed of sound, d is the distance between the two sensors, f_s is the sample rate and \tau is the estimated delay between the two sensors, measured in samples. An estimate of the variance of the estimated angle is [ADBS95]

    V[\hat{\theta}] \approx \frac{V[\hat{\tau}]}{\cos^2 \theta}    (8)
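As a small worked example, equation (7) with the array parameters used later in the thesis (4 cm sensor spacing, 8 kHz sample rate) might look like the following sketch; the function name and the value chosen for c are my own assumptions.

```python
import math

def angle_of_arrival(tau, d=0.04, fs=8000.0, c=343.0):
    """Angle of arrival, eq (7): theta = asin(tau * c / (d * fs)).

    tau is the estimated delay in samples, d the sensor spacing in
    metres, fs the sample rate in Hz and c the speed of sound in m/s.
    Returns the angle in degrees.
    """
    return math.degrees(math.asin(tau * c / (d * fs)))
```

With these values the largest physically possible delay, d * fs / c, is about 0.93 samples, i.e. less than one sample, which illustrates why the delay must be estimated with sub-sample resolution.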

2.4 Multiple sensors

To increase the accuracy of the delay estimate, multiple sensors can be used. Here, the sensors are placed on a line, evenly spaced. Assuming a far-field source, the sensor arrangement and the delays relative to the other sensors are as in figure 3.

The SRP (steered response power) based algorithms steer a beamformer, searching for the maximum power output. The type

Figure 3: Sensor arrangement and delays when using multiple sensors.

of beamformer used is a delay-and-sum beamformer, which delays the output signals from the individual sensors and then sums them together to form the output of the beamformer.

A generalization of the GCC-PHAT is the SRP-PHAT algorithm, defined as

    \hat{\tau} = \arg\max_\tau \sum_{n=0}^{N-2} \sum_{m=n+1}^{N-1} \int_{-\infty}^{\infty} \Psi_{PHAT}(\omega) G_{x_n x_m}(\omega) e^{j \omega \tau (m-n)} \, d\omega    (9)

where \Psi_{PHAT}(\omega) is the weighting function defined in (6).

The SRP-PHAT algorithm maximizes the sum of the cross correlations over all combinations of sensor pairs in the array. As the number of sensors increases, the variance of the estimate decreases. For N sensors, there is a total of \binom{N}{2} pairs of sensors for which the sum of the cross correlations is maximized.

2.5 Optimizing the cross correlation function

The optimization problem presented in (9) generally lacks a closed form solution, so a numerical search method is used. The method used is the Golden section search, described in [LRV01]. The Golden section search is a one dimensional search method that searches for a maximum (or minimum, when minimizing a function) between two end-points.

The first step before optimizing is to determine the interval over which to optimize. The relative delay between two sensors in the array can never be larger than the delay caused by the distance between the two sensors. The largest relative delay occurs when the source is located on the

line connecting the two sensors. Therefore, in (9), it is known that

    \tau \in \left[ -\frac{d f_s}{c}, \frac{d f_s}{c} \right]    (10)

This is also the interval of \tau for which (7) is defined, since the domain of \sin^{-1} is [-1, 1].
Assume the search interval for iteration i is [\alpha_i, \beta_i], where \alpha_i < \beta_i. Two new points, l_i and r_i, are chosen such that \alpha_i < l_i < r_i < \beta_i. The search interval is then updated depending on the function values at the points l_i and r_i. If f(l_i) > f(r_i), the new search interval is [\alpha_{i+1}, \beta_{i+1}] = [\alpha_i, r_i], otherwise [\alpha_{i+1}, \beta_{i+1}] = [l_i, \beta_i].

By keeping the ratio between all points constant for each iteration, the inner points l_i and r_i can be reused in the next iteration, not only as an endpoint for the new search interval, but also as one of the new inner points. Therefore, only a single new point and corresponding function value must be calculated for each iteration. The ratio between the points can be expressed as
    \frac{l_i - \alpha_i}{r_i - \alpha_i} = \frac{r_i - \alpha_i}{\beta_i - \alpha_i}    (11)
The ratio is the Golden ratio, hence the name of the algorithm. The Golden ratio is calculated as

    \rho = \frac{3 - \sqrt{5}}{2} \approx 0.3820    (12)
The algorithm for the Golden section search is shown in algorithm 1. Eligible parameters in the algorithm are the search interval [\alpha, \beta] and the tolerance \epsilon. The algorithm returns the value \tau that maximizes the function f(\tau) over the search interval, with a tolerance of \epsilon units.

For the Golden section search to work, the function being optimized must be unimodal; it must have one, and only one, maximum in the interval being optimized. In general, the cross correlation is not unimodal. However, investigating the cross correlation for real recordings has shown that the cross correlation can, in practice, under the circumstances given in this thesis, be considered unimodal often enough for the Golden section search to be an option. Sometimes the optimization returns a local maximum instead of the global (in the range specified) maximum, but not often enough to notably affect the general performance.

Algorithm 1 The Golden section search algorithm.

Require: \alpha < \beta and \epsilon > 0
Ensure: \tau = \arg\max f(\tau) over [\alpha, \beta], within tolerance \epsilon

 1: l = \alpha + \rho(\beta - \alpha)
 2: r = l + \rho(\beta - l)
 3: fl = f(l)
 4: fr = f(r)
 5: while \beta - \alpha > \epsilon do
 6:   if fl < fr then
 7:     \alpha = l
 8:     l = r
 9:     r = l + \rho(\beta - l)
10:     fl = fr
11:     fr = f(r)
12:   else
13:     \beta = r
14:     r = l
15:     l = \alpha + \rho(\beta - \alpha)
16:     fr = fl
17:     fl = f(l)
18:   end if
19: end while
20: if fl > fr then
21:   \tau = l
22: else
23:   \tau = r
24: end if

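A compact Python version of algorithm 1 might look as follows; it maximizes an arbitrary unimodal function f over [a, b] and reuses one inner point per iteration, as described above. The variable names are mine.

```python
def golden_section_max(f, a, b, tol=1e-6):
    """Golden section search for the maximum of a unimodal f on [a, b].

    Only one new function evaluation is needed per iteration; the other
    inner point is reused from the previous iteration.
    """
    rho = (3 - 5 ** 0.5) / 2           # the ratio from eq (12), ~0.3820
    l = a + rho * (b - a)
    r = l + rho * (b - l)
    fl, fr = f(l), f(r)
    while b - a > tol:
        if fl < fr:                    # maximum lies in [l, b]
            a, l, fl = l, r, fr
            r = l + rho * (b - l)
            fr = f(r)
        else:                          # maximum lies in [a, r]
            b, r, fr = r, l, fl
            l = a + rho * (b - a)
            fl = f(l)
    return l if fl > fr else r
```

For the delay estimation problem, f would be the (interpolated) cross correlation and [a, b] the interval given in (10).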

Figure 4: Uniform DFT analysis filter bank.

3 Filter banks

The generalized cross correlation, described in section 2.2, estimates the inter-sensor delays using the cross power spectrum. The cross power spectrum is calculated as shown in (4). Instead of calculating the discrete Fourier transform of the signals x_0 and x_1 directly, a uniform DFT analysis filter bank is used.

The signal x(n) is decomposed into a set of N subbands by the filter bank. The filter bank consists of a set of bandpass filters derived from a prototype filter. The prototype filter is a lowpass filter whose frequency response is shifted in the frequency domain, making it a bandpass filter. The prototype filter is used to create one bandpass filter for each of the N subbands, with center frequency 2\pi n / N, n = 0 \ldots N-1, for the n:th subband. After filtering, the subband signals are decimated. If the sample rate of the subband signals is decimated by a multiple of the number of subbands, N, an efficient polyphase implementation is possible, as shown in figure 4.
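To make the decomposition concrete, a direct (non-polyphase) uniform DFT analysis filter bank can be sketched as below. The boxcar prototype used in the test is only to keep the example self-contained; a practical prototype would be a properly designed lowpass filter.

```python
import cmath

def analysis_filter_bank(x, h, N):
    """Uniform DFT analysis filter bank, direct form (no polyphase).

    Each subband filter is the prototype lowpass h modulated to the
    center frequency 2*pi*k/N; after filtering, each subband signal
    is decimated by N.
    """
    subbands = []
    for k in range(N):
        # modulate the prototype to subband k's center frequency
        hk = [h[m] * cmath.exp(2j * cmath.pi * k * m / N) for m in range(len(h))]
        # convolve with x, keeping every N:th output sample (decimation)
        yk = []
        for n in range(0, len(x), N):
            yk.append(sum(hk[m] * x[n - m]
                          for m in range(len(hk)) if 0 <= n - m < len(x)))
        subbands.append(yk)
    return subbands
```

A polyphase implementation computes the same subband samples with a single prototype filtering stage and one size-N IFFT, which is the efficiency gain behind the structure in figure 4.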

4 Position estimation

4.1 Source localization problem

From a set of N pairs of sensors \{m_{i0}, m_{i1}\}, i = 0 \ldots N-1, the time delay between the two sensors in a pair, given the positions of the two sensors, m_{i0} and m_{i1}, and the position of the source, s, is

    T(\{m_{i0}, m_{i1}\}, s) = \frac{|s - m_{i0}| - |s - m_{i1}|}{c}    (13)

where c is the speed of sound. For each pair, there is an estimated time delay \hat{\tau}_i between the two sensors, and an estimated variance \hat{\sigma}_i^2. If the delay estimates \hat{\tau}_i are corrupted by uncorrelated, zero-mean gaussian noise, the maximum likelihood estimate of the source location s_{ML} is found by minimizing a least-square error function J_{ML}(s) [BAS97]:

    s_{ML} = \arg\min_s J_{ML}(s)    (14)

where

    J_{ML}(s) = \sum_{i=0}^{N-1} \frac{1}{\hat{\sigma}_i^2} \left[ \hat{\tau}_i - T(\{m_{i0}, m_{i1}\}, s) \right]^2    (15)

4.2 Linear intersection

Minimizing the error function in (14) involves searching for a position s from which the theoretical delays match the measured delays as closely as possible. Instead of using a numerical search method to find the location of the source, a computationally less expensive closed-form solution is used. The algorithm used is based on the Linear intersection algorithm described in [BAS97], modified from three- to two-dimensional intersections.

Once the direction of arrival is calculated for each sensor pair, the intersection of all estimated directions of arrival, together with the sensor positions, can be calculated. Given the position of sensor pair i, m_i, and its direction of arrival, v_i, any point p_i on the line originating from the array location in the direction v_i can be described as

    p_i = m_i + t_i v_i    (16)

where t_i > 0, as shown in figure 5. p_i also describes all possible locations of the source as seen from the sensor pair. By using two pairs, \{m_{i0}, m_{i1}\} and \{m_{j0}, m_{j1}\}, the source location can be found by calculating the intersection of the lines p_i and p_j:

    p_i = p_j \Leftrightarrow m_i + t_i v_i = m_j + t_j v_j
    t_i v_i - t_j v_j = m_j - m_i    (17)

In matrix form, the equation becomes

    V t = \Delta m    (18)

where

    V = \begin{bmatrix} v_i & -v_j \end{bmatrix}, \quad t = \begin{bmatrix} t_i \\ t_j \end{bmatrix}    (19)

and

    \Delta m = m_j - m_i    (20)

Seeking t, the solution is

    t = V^{-1} \Delta m    (21)

and the intersection point can then be calculated as

    s_{ij,LI} = m_i + t_i v_i = m_j + t_j v_j    (22)

When using N > 2 sensor pairs, or more generally, sensor subarrays when multiple sensors are used per pair for increased accuracy, \binom{N}{2} possible intersections can be calculated; one for each combination of two subarrays. Assuming there are at least two subarrays, the final position can be estimated as the average

    s_{LI} = \frac{1}{\binom{N}{2}} \sum_{i=0}^{N-2} \sum_{j=i+1}^{N-1} s_{ij,LI}    (23)

Since no information regarding the propagation delay from the source to a sensor subarray, or between subarrays, is available, problems arise when the source is located near the line connecting two subarrays, or far away from the subarrays compared to the distance between them. In those cases the direction of arrival vectors are almost parallel, and the matrix V in (21) is badly conditioned, or even non-invertible.
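In two dimensions, equations (18) through (22) amount to solving a 2 x 2 linear system. A sketch, with a guard for the near-parallel case discussed above, follows; the function name and the tolerance eps are my own choices.

```python
def linear_intersection(m_i, v_i, m_j, v_j, eps=1e-9):
    """Intersection of the rays p_i = m_i + t_i v_i and p_j = m_j + t_j v_j.

    Solves t_i v_i - t_j v_j = m_j - m_i by Cramer's rule, eqs (18)-(21).
    Returns None when the bearings are nearly parallel and V is badly
    conditioned.
    """
    # determinant of V = [v_i  -v_j]
    det = -v_i[0] * v_j[1] + v_i[1] * v_j[0]
    if abs(det) < eps:
        return None
    bx, by = m_j[0] - m_i[0], m_j[1] - m_i[1]
    t_i = (-bx * v_j[1] + by * v_j[0]) / det
    return (m_i[0] + t_i * v_i[0], m_i[1] + t_i * v_i[1])
```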

5 Track association and filtering

This section describes the algorithm used for tracking sources from individual position estimates. Section 4 describes an algorithm to estimate a position for the source given the time delays between sensors in a sensor array, using several sensor subarrays. The algorithm gives a set of points sampled at a certain time interval. The position estimates are distorted by noise and need to be filtered spatially.

Figure 5: Linear intersection.

5.1 Track association

When multiple sources are being located (for example, two or more talkers having a conversation), simply filtering the samples as they are calculated is not an option. An algorithm to determine which source a sample belongs to must be implemented; only then can samples be filtered properly. The track association algorithm is based on a method described in [SBS97].
A track is a state vector following a source. When a new sample is calculated, one of the currently stored tracks is first associated with it. The track associated with the sample is the nearest track, but the track must also be within a certain distance from the sample.

If no track is good enough to be associated, a new track is created. An association can fail for two main reasons: the sample belongs to a completely new source, or the sample was distorted by so much noise that it fell outside the acceptance region for the correct source. When a new track is created, it is not yet known whether the sample is a new source becoming active, or just a noise-corrupted sample from a current track. Therefore, all new tracks are marked as potential tracks, so that if no new samples fall within the acceptance region within a certain time, it can be assumed the track was created from a noise-corrupted sample, and it will be dropped. However, if more samples start to fall within the acceptance region, it is assumed that the track is indeed tracking an active source, and the track is promoted to an active track.

A track associated with a sample is updated. The sample is added to the list of samples for that track, and eventually filtered to smooth the path formed by the samples.

When a track is not updated with new samples within a certain time, the track is considered abandoned, and it is dropped from the list of potential or active tracks. A completed track is an active track that was dropped. Potential tracks not yet promoted to active tracks are not considered completed tracks when they are dropped, because a potential track is not yet classified as being a real source.
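The nearest-track-within-a-gate rule can be sketched as follows; representing a track by just its latest position and using a fixed gate distance are simplifications of the scheme described above, not the thesis's exact data structures.

```python
def associate(track_positions, sample, gate):
    """Index of the nearest track within distance `gate` of the sample.

    Returns None if no track is close enough, in which case the caller
    would create a new potential track.
    """
    best, best_d2 = None, gate * gate
    for i, (tx, ty) in enumerate(track_positions):
        d2 = (tx - sample[0]) ** 2 + (ty - sample[1]) ** 2
        if d2 <= best_d2:
            best, best_d2 = i, d2
    return best
```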

5.2 Filtering

Filtering is performed using a Kalman filter. The source being tracked is assumed to be a human talking, and since the source can move around, a simple Newtonian motion model is used to model the motions of the talker. Therefore, the state vector for the Kalman filter is

    x_n = \begin{bmatrix} x_n & y_n & \dot{x}_n & \dot{y}_n \end{bmatrix}^T    (24)

where x_n and y_n represent the two-dimensional position of the source, and \dot{x}_n and \dot{y}_n the velocity, at iteration n.
The filter used is a one-step predictor, as described in [Hay02]. The transition matrix F is

    F = \begin{bmatrix} I_2 & T I_2 \\ 0_2 & I_2 \end{bmatrix}    (25)

and the measurement matrix C is

    C = \begin{bmatrix} I_2 & 0_2 \end{bmatrix}    (26)

where I_n is an n \times n identity matrix, 0_n is an n \times n zero matrix and T is the time since the last update of the state vector. The filter is updated at constant time intervals T, so the transition matrix F is also constant, and the inverse of the transition matrix is

    F^{-1} = \begin{bmatrix} I_2 & -T I_2 \\ 0_2 & I_2 \end{bmatrix}    (27)

The correlation matrices for the process and measurement noise, Q_1 and Q_2 respectively, are

    Q_1 = q_1 I_4, \quad Q_2 = q_2 I_2    (28)

where q_1 and q_2 are the variances of the process and measurement noise.

The algorithm for estimating the source's state vector \hat{x}_n at iteration n, given the estimated position samples y_n, is shown in algorithm 2. The initial state vector \hat{x}_0 is the estimated position and velocity of the source at the time the Kalman filter starts tracking the source. The position is estimated from the samples collected before the track was promoted to an active track (see section 5.1), and the velocity is assumed to be zero. The initial predicted state-error correlation matrix is K_0 = 0_4.

Algorithm 2 Kalman filter based on one-step prediction.

1: for n = 1, 2, 3 \ldots do
2:   G_n = F K_n C^H \left[ C K_n C^H + Q_2 \right]^{-1}
3:   \alpha_n = y_n - C \hat{x}_n
4:   \hat{x}_{n+1} = F \hat{x}_n + G_n \alpha_n
5:   K_{n+1} = F \left[ K_n - F^{-1} G_n C K_n \right] F^H + Q_1
6: end for

Instead of iterating through all the samples at once with the for-loop in algorithm 2, each new sample calculated triggers a single pass through the loop. This is necessary for real time filtering, where the filtered result is needed as new samples are calculated.
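A minimal sketch of algorithm 2, using plain nested lists for the matrices; the values of T, q_1 and q_2 below are arbitrary test choices, not the thesis's settings.

```python
def mm(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def madd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def msub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mT(A):
    return [list(col) for col in zip(*A)]

def inv2(A):
    """Inverse of a 2x2 matrix (the innovation covariance in step 2)."""
    d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def kalman_track(measurements, T=0.1, q1=0.01, q2=0.01):
    """Run the one-step predictor of algorithm 2 on a stream of (x, y) samples."""
    F = [[1, 0, T, 0], [0, 1, 0, T], [0, 0, 1, 0], [0, 0, 0, 1]]      # eq (25)
    Fi = [[1, 0, -T, 0], [0, 1, 0, -T], [0, 0, 1, 0], [0, 0, 0, 1]]   # eq (27)
    C = [[1, 0, 0, 0], [0, 1, 0, 0]]                                  # eq (26)
    Q1 = [[q1 * (i == j) for j in range(4)] for i in range(4)]        # eq (28)
    Q2 = [[q2 * (i == j) for j in range(2)] for i in range(2)]
    x = [[0.0], [0.0], [0.0], [0.0]]     # initial state: origin, at rest
    K = [[0.0] * 4 for _ in range(4)]    # K_0 = 0_4
    for y in measurements:
        S = madd(mm(mm(C, K), mT(C)), Q2)        # innovation covariance
        G = mm(mm(mm(F, K), mT(C)), inv2(S))     # Kalman gain, step 2
        a = msub([[y[0]], [y[1]]], mm(C, x))     # innovation, step 3
        x = madd(mm(F, x), mm(G, a))             # predicted state, step 4
        K = madd(mm(mm(F, msub(K, mm(Fi, mm(mm(G, C), K)))), mT(F)), Q1)  # step 5
    return [row[0] for row in x]
```

Feeding the same position repeatedly drives the innovation to zero, so the predicted state converges to that position with zero velocity.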

6 Experiments

6.1 Testing the angle of arrival

The algorithm to estimate the angle of arrival is evaluated using measurements with different types of sound and room environments, and from different angles relative to the sensor array. The three scenarios are:

- Speech in a room with low echo.
- Speech in a room with moderate echo.
- White gaussian noise in a room with low echo.

The speech used is pre-recorded speech of random phrases. The room is of size 4 x 5 m. One wall has an acoustic damper covering it, and the other walls are unblocked, giving a moderate echo. Along the walls are some tables with computer equipment and home entertainment systems, speakers and some chairs. Figure 6 shows a general overview of the room, the placement of the sensor array and the placement of the source at the different angles. The source is placed at four angles: 0°, 22.5°, 45° and 67.5°. Figure 7 shows the same room, but with acoustic dampers placed along the walls around the sensor array to reduce the echo.

The sound is played using a speaker placed at the angles shown in figures 6 and 7, at a distance of 2 m from the array. The sound is played at normal speech level. Noise is present in the form of computer fans and ventilation, and the signal to noise ratio at the sensors is about 15 dB. The sample rate is 8 kHz. The array consists of 6 microphones with an inter-sensor distance of 4 cm.
6.1.1 Bias and variance

Bias is an offset in the estimated parameter compared to the real parameter. Figure 8 shows the estimated angles for the different scenarios. The performance is evaluated as a function of the number of subbands in the DFT filter bank.

White noise is located fairly accurately. As the angle of arrival approaches the edges, and as the reverberation level increases, the bias also increases. By using a high number of subbands and with a source not located at the edge of the sensor array, the bias can be kept below 5°. That is roughly equivalent to an offset of about 2.5 dm at 3 m from the array.

The variance, or the standard deviation, of the estimate is a measure of how much a specific sample generally deviates from the average value. Figure 9 shows the deviation measured at different angles for the different scenarios.

As with the bias, the variance for white noise is very low. For speech, the variance is about the same for low and moderate echo, as long as the source is not located near the edge of the sensor array.

6.2 Testing the localization and tracking

The localization and tracking algorithms are tested in the same room as before. Two scenarios are tested:

- Two fixed talkers having a conversation.
- A single talker moving in a circle.

In both scenarios, the sample rate is 8 kHz and a 512 subband filter bank is used.
6.2.1 Two fixed talkers

The scenario setup is given in figure 10. The distance between the two sensor subarrays is 1.5 m, and the two talkers are located 1.7 m from the arrays.

The scenario simulates two talkers having a conversation. The test consists of three phases. The talkers begin by speaking one at a time for about 20 s each. Then they talk for 5 s each to simulate more rapid changes in the location estimates, and in the last phase they talk at the same time to see how the algorithms handle two simultaneous sources.

Figure 11 shows the result of the evaluation after track association and filtering. Figure 11(a) shows the x and y position components over time. The first two phases pass without problems; the sources are clearly separated and located. In the third phase, the algorithm can find two separate sources and track them independently, although tracks are sometimes lost and recreated. Figure 11(b) shows the positions of the sources as a view from above.
6.2.2 Single moving talker

The setup in this scenario is shown in figure 12. The distance between the sensor subarrays is, as in the previous scenario, 1.5 m. The talker is now moving in a circle, about 1.8 m from the arrays. The result of this evaluation is shown in figure 13, where figure 13(a) shows the x and y position components over time and figure 13(b) the position from above.

7 Real time implementation

The algorithms were first implemented and evaluated in Matlab. When the algorithms were working properly, the Matlab M-code was translated, by hand, to C++. Around the translated code, an interface was implemented for interaction with the user. The program is written for the Windows platform, using the ASIO standard for communication with sound recording equipment. Because everything was thoroughly tested in Matlab, the translation went smoothly. The general structure of the code in Matlab and C++ is similar, so the translation was basically a line-by-line translation.

The main concern in the beginning was the available CPU time. It was later found that this was not really the biggest problem in implementing the algorithms in real time. A standard-equipped Pentium 4 at 1.5 GHz could easily handle 2-3 arrays with 4-6 sensors per array, at sample rates up to 16 kHz, enough to sample speech at good quality, and filter banks with 1024 subbands. As new computers have significantly more computing power, CPU time is not a problem unless the arrays become too large or too numerous.

8 Conclusion and further development

Different algorithms were first evaluated to estimate the angle of arrival. Other than the Steered response power algorithm described in this thesis, the algorithms tried initially were the following.

- Using the cross correlation calculated in the time domain and searching for a peak in the cross correlation.
- Using an LMS filter, where the adaptive filter is used to estimate the delay between a signal from a reference sensor and the other sensors. The slope of the phase response of the filter determines the delay. Ideally, the impulse response of the filter is a delayed δ-impulse, and the phase response is a straight line.
- Estimating the slope of the phase of the cross power spectrum, as described in [ADBS95]. Ideally, only a delay is present, and the cross power spectrum is of the form e^{j\omega\tau}.

Except for the first, using the cross correlation calculated in the time domain, they all work well on synthetic data. The cross correlation calculated in the time domain did not have enough resolution, as the delay could only be estimated in multiples of the sampling period. When real recorded data was used, the LMS filter and the cross power spectrum method were too inaccurate when estimating the slope of the phase.

For speech in reverberant rooms, only the SRP algorithm used in this thesis worked well enough to be used in practice. Together with the PHAT weighting function in the generalized cross correlation, the SRP-PHAT algorithm forms a robust method of estimating the angle of arrival for a sensor array. It is also a good choice for real time applications, as it does not require much computing power compared to what is available in a standard desktop computer.

The filter bank was also a huge improvement compared to only using the DFT. The filter bank forms a time-averaged spectrum, making the important phase information less variant for the inter-sensor delay estimator. The computational complexity of the filter bank is higher, but well within the limits for real time applications, and the improved precision was well worth it.

The linear intersection, a closed-form algorithm, is computationally very efficient. By associating samples with tracks, and spatially filtering the tracks, the localization algorithms are able to quickly locate and track multiple sources; not just alternating sources, but also, to some extent, simultaneous sources.

Further, the algorithms can be improved with smart acoustic detectors and classifiers to classify sounds and locate only certain types of events (or ignore them), such as tracking speech only or locating noise sources. The method for detecting multiple sources can also be improved. The current implementation relies on the two sources being at about the same signal power level at the subarrays.


References

[ADBS95] John E. Adcock, Joseph H. DiBiase, Michael S. Brandstein, and Harvey F. Silverman. Practical issues in the use of a frequency-domain delay estimator for microphone-array applications. January 1995.

[BAS97] Michael S. Brandstein, John E. Adcock, and Harvey F. Silverman. A closed form location estimator for use with room environment microphone arrays. IEEE Transactions on Speech and Audio Processing, 5(1):45-50, January 1997.

[Hay02] Simon Haykin. Adaptive Filter Theory. Prentice Hall, fourth edition, 2002.

[KC76] Charles H. Knapp and G. Clifford Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing, 24(4):320-327, August 1976.

[LRV01] Jan Lundgren, Mikael Rönnqvist, and Peter Värbrand. Linjär och icke-linjär optimering. Studentlitteratur, 2001.

[SBS97] Douglas E. Sturim, Michael S. Brandstein, and Harvey F. Silverman. Tracking multiple talkers using microphone-array measurements. In Proceedings of ICASSP, volume 1, pages 371-374, 1997.


Figure 6: Room with moderate echo. The source is placed 200 cm from the array, at angles θ_0 = 0°, θ_1 = 22.5°, θ_2 = 45° and θ_3 = 67.5°.

Figure 7: Room with low echo. The source placement is the same as in figure 6, with acoustic dampers added around the sensor array.

Figure 8: Bias of estimated angles, plotted against the number of subbands (64 to 2048) for speech with moderate echo, speech with low echo and noise with low echo. (a) Real angle is 0°. (b) Real angle is 22.5°. (c) Real angle is 45°. (d) Real angle is 67.5°.

Figure 9: Standard deviation of estimated angles, plotted against the number of subbands (64 to 2048) for the same three scenarios. (a) Standard deviation at 0°. (b) Standard deviation at 22.5°. (c) Standard deviation at 45°. (d) Standard deviation at 67.5°.

Figure 10: Two speakers having a conversation. The two sensor subarrays are placed 75 cm on either side of the origin (150 cm apart), with speakers A and B in front of the arrays.

Figure 11: Two speakers having a conversation. (a) x and y values as a function of time. (b) x and y values against each other.

Figure 12: Single speaker moving in a circle. The sensor subarrays are placed as in figure 10.

Figure 13: Single talker moving in a circle. (a) x and y values as a function of time. (b) x and y values against each other.
