Sound Source Localization
Mikael Swartling
Master's thesis
Degree of Master of Science in Electrical Engineering
Abstract
The purpose of this thesis is to evaluate and implement algorithms
for robust localization and tracking of moving acoustic sources in real
time using a microphone array. To identify inter-sensor delays, the
generalized cross correlation is used together with a filter bank. From
the inter-sensor delays, position is estimated using a linear intersection
algorithm. Position estimates are associated with tracks, which are filtered by a Kalman filter. Results from two real-room experiments are
presented to demonstrate the localization and tracking performance,
along with a discussion on real time implementation issues.
Contents
1 Introduction
2 Delay estimation
2.1 Signal model
2.2
2.3 Angle of arrival
2.4 Multiple sensors
2.5
3 Filter banks
4 Position estimation
4.1 Source localization problem
4.2 Linear intersection
5
5.1 Track association
5.2 Filtering
6 Experiments
6.1
6.2
6.2.2
List of Figures
Linear intersection.
1 Introduction
2 Delay estimation
2.1 Signal model
Given two spatially separated sensors (in this thesis, the sensors are microphones), the signal received from an acoustic source at one sensor will be
shifted in time relative to the other sensor due to the extra propagation distance
from source to sensor. Figure 1 illustrates this delay where the source is
located in the near and far field, respectively. In the near field case, the
direction of arrival is different for the two sensors. In the far field case, the
directions of arrival can be considered parallel and will therefore be the same
for both sensors.
Assuming the relative attenuation between the two sensors is negligible,
the received signals can be modeled as
\[ x_0(t) = s(t - \tau_0) + n_0(t), \qquad x_1(t) = s(t - \tau_1) + n_1(t) \tag{1} \]
where $s(t)$ is the acoustic source signal, $\tau_0$ and $\tau_1$ are the propagation delays
from the source to the sensors and $n_0(t)$ and $n_1(t)$ are noise signals. The
noise signals received at the sensors are considered mutually uncorrelated and also
uncorrelated with the source signal. The relative delay between the sensors,
$\tau = \tau_1 - \tau_0$, is the delay caused by the extra propagation distance.
The task is to estimate the delay $\tau$ from finite-size blocks of data from
$x_0(t)$ and $x_1(t)$. To track a talker, locate new sources and alternate quickly between
several sources, a method that estimates the delay quickly is required.
2.2
(2)
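The cross correlation $R_{x_0 x_1}(\tau)$ is related to the cross power spectrum $G_{x_0 x_1}(\omega)$
by the Fourier transform as
\[ R_{x_0 x_1}(\tau) = \int_{-\infty}^{\infty} G_{x_0 x_1}(\omega)\, e^{j\omega\tau}\, d\omega \tag{3} \]
(4)
\[ R^{(g)}_{x_0 x_1}(\tau) = \int_{-\infty}^{\infty} \psi(\omega)\, G_{x_0 x_1}(\omega)\, e^{j\omega\tau}\, d\omega \tag{5} \]
\[ \psi(\omega) = \frac{1}{|G_{x_0 x_1}(\omega)|} \tag{6} \]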
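As a minimal sketch of the weighted cross correlation, the following discrete version replaces the integral by a sum over DFT bins and picks the lag with the largest value. The function names, the naive DFT, the test signal and the choice of which spectrum is conjugated are assumptions made for illustration, not taken from the thesis implementation.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform; adequate for short blocks."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def gcc_phat_delay(x0, x1, max_lag):
    """Estimate the delay of x1 relative to x0, in samples, within +/- max_lag."""
    n = len(x0)
    X0, X1 = dft(x0), dft(x1)
    # Cross power spectrum. Conjugating X0 is an assumed convention that makes
    # a positive lag mean "x1 is delayed relative to x0".
    G = [X1[k] * X0[k].conjugate() for k in range(n)]
    # Weighting 1/|G|: keep only the phase of each frequency bin.
    W = [g / abs(g) if abs(g) > 1e-12 else 0j for g in G]
    best_lag, best_value = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Inverse transform evaluated at a single candidate lag.
        value = sum(W[k] * cmath.exp(2j * math.pi * k * lag / n)
                    for k in range(n)).real / n
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# Hypothetical test block: white noise and a copy circularly delayed by 3 samples.
rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(64)]
x1 = [x0[(t - 3) % 64] for t in range(64)]
estimated = gcc_phat_delay(x0, x1, max_lag=8)
```

Because the weighting discards the magnitude of each bin, the correlation of a purely delayed signal collapses to a single sharp peak at the true lag, which is what makes the peak easy to locate.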
2.3 Angle of arrival
When the time delay of arrival is estimated and the array geometry is known,
a direction of arrival can also be estimated. From a given delay, a path can
be calculated along which the source is located. It is not possible, using only
two sensors, to determine where along the path the source is located. Since
every point on the path has the same difference in distance to the two sensors,
the path is a hyperbola in two dimensions, as illustrated by the dashed line
in figure 2(a). The curve is mirrored in the line connecting the
two sensors. However, only one half-space is considered here; the source is
assumed to be located in front of the sensor array.
In the far field, the hyperbola approaches a straight line. Assuming
the source is always located in the far field, it is possible to approximate the
hyperbola with a straight line, as shown in figure 2(b). The angle $\theta$ is
the angle of arrival for a distant source.
Figure 2: Path of possible source locations. A source located in the near field
results in a hyperbola of possible source locations (a), and
in the far field the hyperbola can be approximated by a straight
line (b).
The angle of arrival can be calculated as
\[ \theta = \sin^{-1}\!\left( \frac{c\, \hat{\tau}}{d\, f_s} \right) \tag{7} \]
where $c$ is the speed of sound, $d$ is the distance between the two sensors,
$f_s$ is the sample rate and $\hat{\tau}$ is the estimated delay between the two sensors
measured in samples. An estimate of the variance of the estimated angle is
[ADBS95]
\[ V[\hat{\theta}] \approx \left( \frac{c}{d\, f_s} \right)^{2} \frac{V[\hat{\tau}]}{\cos^{2}\hat{\theta}} \tag{8} \]
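The mapping in (7) from a delay in samples to a far-field angle can be sketched as follows. The function names and the default speed of sound of 343 m/s are assumptions; the variance helper is simply a first-order error propagation through (7), not necessarily the exact expression of [ADBS95]. The example values (4 cm spacing, 8 kHz) are the ones used in the experiments.

```python
import math


def angle_of_arrival(delay_samples, d, fs, c=343.0):
    """Far-field angle of arrival in radians from a delay in samples, as in (7).

    c = 343 m/s is an assumed speed of sound.
    """
    arg = c * delay_samples / (d * fs)
    if not -1.0 <= arg <= 1.0:
        raise ValueError("delay outside the physically possible range")
    return math.asin(arg)


def angle_variance(delay_variance, angle, d, fs, c=343.0):
    """First-order propagation of the delay variance through (7)."""
    return (c / (d * fs)) ** 2 * delay_variance / math.cos(angle) ** 2


d, fs = 0.04, 8000.0                  # 4 cm sensor spacing, 8 kHz sample rate
max_delay = d * fs / 343.0            # largest physically possible delay, in samples
theta = angle_of_arrival(0.5, d, fs)  # angle for a delay of half a sample
```

With these values the largest possible delay is less than one sample, which illustrates why the delay has to be estimated with sub-sample resolution.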
2.4 Multiple sensors
To increase the accuracy of the delay estimate, multiple sensors can be used.
Here, the sensors are placed on a line, evenly spaced. Assuming a far-field
source, the sensor arrangement and the delays relative to the other sensors are as
in figure 3.
Steered response power (SRP) algorithms are based
on steering a beamformer and searching for the maximum power output. The type
[Figure 3: linear array of evenly spaced sensors m0, m1, m2, ..., mN-1.]
1 Z
N
2 N
X
X
(9)
n=0 m=n
2.5
\[ \left[ -\frac{d\, f_s}{c},\ \frac{d\, f_s}{c} \right] \tag{10} \]
This is also the interval of $\tau$ for which (7) is defined, since the domain of
$\sin^{-1}$ is $[-1, 1]$.
Assume the search interval for iteration $i$ is $[\alpha_i, \beta_i]$, where $\alpha_i < \beta_i$. Two
new points, $l_i$ and $r_i$, are chosen such that $\alpha_i < l_i < r_i < \beta_i$. The search
interval is then updated depending on the function values at the points $l_i$ and
$r_i$. If $f(l_i) > f(r_i)$, the new search interval is $[\alpha_{i+1}, \beta_{i+1}] = [\alpha_i, r_i]$, otherwise
$[\alpha_{i+1}, \beta_{i+1}] = [l_i, \beta_i]$.
By keeping the ratio between all points constant for each iteration, the
inner points $l_i$ and $r_i$ can be reused in the next iteration, not only as an
endpoint for the new search interval, but also as one of the new inner points.
Therefore, only a single new point and corresponding function value must be
calculated for each iteration. The ratio between the points can be expressed
as
\[ \frac{l_i - \alpha_i}{r_i - \alpha_i} = \frac{r_i - \alpha_i}{\beta_i - \alpha_i} \tag{11} \]
With the inner points placed symmetrically in the interval, both sides equal
$1 - \rho \approx 0.6180$, the reciprocal of the Golden ratio, hence the name of the
algorithm. The constant $\rho$ is calculated as
\[ \rho = \frac{3 - \sqrt{5}}{2} \approx 0.3820 \tag{12} \]
The algorithm for the Golden section search is shown in algorithm 1.
The parameters of the algorithm are the search interval $[\alpha, \beta]$ and the
tolerance $\varepsilon$. The algorithm returns the value $\tau$ that maximizes the function
$f(\tau)$ over the search interval, with a tolerance of $\varepsilon$ units.
For the Golden section search to work, the function being optimized must
be unimodal; it must have one, and only one, maximum in the interval being
optimized. In general, the cross correlation is not unimodal. However, investigating the cross correlation for real recordings has shown that the cross
correlation can, in practice, under the circumstances given in this thesis, be
considered unimodal often enough for the Golden section search to be an
option. Sometimes the optimization returns a local maximum instead of the
global maximum (in the range specified), but not often enough to notably affect
the general performance.
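Algorithm 1 Golden section search.
$l \leftarrow \alpha + \rho(\beta - \alpha)$
$r \leftarrow l + \rho(\beta - l)$
$f_l \leftarrow f(l)$
$f_r \leftarrow f(r)$
while $\varepsilon < \beta - \alpha$ do
    if $f_l < f_r$ then
        $\alpha \leftarrow l$
        $l \leftarrow r$
        $r \leftarrow l + \rho(\beta - l)$
        $f_l \leftarrow f_r$
        $f_r \leftarrow f(r)$
    else
        $\beta \leftarrow r$
        $r \leftarrow l$
        $l \leftarrow \alpha + \rho(\beta - \alpha)$
        $f_r \leftarrow f_l$
        $f_l \leftarrow f(l)$
    end if
end while
if $f_l > f_r$ then
    $\tau \leftarrow l$
else
    $\tau \leftarrow r$
end if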
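The search above can be sketched in runnable form as follows. This is an illustration under stated assumptions rather than the thesis code: the function name, the evaluation counter and the quadratic test function (a stand-in for a unimodal cross correlation) are hypothetical.

```python
import math

RHO = (3.0 - math.sqrt(5.0)) / 2.0  # approximately 0.3820


def golden_section_max(f, alpha, beta, eps):
    """Return (argument, evaluations) for the maximum of a unimodal f on [alpha, beta]."""
    l = alpha + RHO * (beta - alpha)
    r = beta - RHO * (beta - alpha)
    fl, fr = f(l), f(r)
    evaluations = 2
    while beta - alpha > eps:
        if fl < fr:
            # The maximum lies in [l, beta]; the old r becomes the new l.
            alpha = l
            l, fl = r, fr
            r = beta - RHO * (beta - alpha)
            fr = f(r)
        else:
            # The maximum lies in [alpha, r]; the old l becomes the new r.
            beta = r
            r, fr = l, fl
            l = alpha + RHO * (beta - alpha)
            fl = f(l)
        evaluations += 1  # exactly one new function value per iteration
    return (l if fl > fr else r), evaluations


# Hypothetical unimodal function with its maximum at 0.3.
best, evaluations = golden_section_max(lambda t: -(t - 0.3) ** 2, -1.0, 1.0, 1e-6)
```

Each iteration shrinks the interval by a factor of about 0.618 at the cost of a single function evaluation, so the number of evaluations grows only logarithmically with the required precision.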
[Figure 4: polyphase implementation of the uniform DFT analysis filter bank.]
3 Filter banks
The generalized cross correlation, described in section 2.2, estimates the inter-sensor delays using the cross power spectrum. The cross power spectrum
is calculated as shown in (4). Instead of calculating the discrete Fourier
transform of the signals $x_0$ and $x_1$ directly, a uniform DFT analysis filter
bank is used.
The signal $x(n)$ is decomposed into a set of $N$ subbands by the filter bank.
The filter bank consists of a set of bandpass filters derived from a prototype
filter. The prototype filter is a lowpass filter whose frequency response is
shifted in the frequency domain, making it a bandpass filter. The prototype filter
is used to create one bandpass filter for each of the $N$ subbands, with center
frequency at $2\pi n / N$, $n = 0 \ldots N-1$, for the $n$th subband. After filtering, the
subband signals are decimated. If the sample rate of the subband signals is
decimated by a multiple of the number of subbands, $N$, an efficient polyphase
implementation is possible, as shown in figure 4.
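How the bandpass filters follow from the prototype can be illustrated by modulating the prototype's coefficients with a complex exponential. The sketch below is a direct, non-polyphase illustration; the function names, the 16-tap moving-average prototype and the choice of 8 subbands are assumptions made for the example.

```python
import cmath
import math


def bandpass_filters(prototype, n_subbands):
    """Shift a lowpass prototype to the center frequencies 2*pi*k/N, k = 0..N-1."""
    return [[h * cmath.exp(2j * math.pi * k * n / n_subbands)
             for n, h in enumerate(prototype)]
            for k in range(n_subbands)]


def frequency_response(h, omega):
    """Frequency response of the FIR filter h at the angular frequency omega."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h))


N = 8                          # hypothetical number of subbands
prototype = [1.0 / 16] * 16    # hypothetical lowpass prototype (moving average)
filters = bandpass_filters(prototype, N)

# Gain of each subband filter at its own center frequency.
center_gains = [abs(frequency_response(filters[k], 2 * math.pi * k / N))
                for k in range(N)]
# Gain of subband 2 at the center frequency of subband 3.
leakage = abs(frequency_response(filters[2], 2 * math.pi * 3 / N))
```

Every subband filter has the prototype's DC gain at its own center frequency, which is the frequency shift described above; a practical prototype would be designed for much better stopband attenuation than a moving average.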
4 Position estimation
4.1 Source localization problem
for the two sensors, $m_{i0}$ and $m_{i1}$, and the position of the source, $s$, is
\[ T(\{m_{i0}, m_{i1}\}, s) = \frac{|s - m_{i0}| - |s - m_{i1}|}{c} \tag{13} \]
where $c$ is the speed of sound. For each pair, there is an estimated time
delay $\hat{\tau}_i$ between the two sensors, and an estimated variance $\sigma_i^2$. If the delay
estimates $\hat{\tau}_i$ are corrupted by uncorrelated, zero-mean Gaussian noise, the
maximum likelihood estimate of the source location $\hat{s}_{ML}$ is found by minimizing a least-squares error function $J_{ML}(s)$ [BAS97].
\[ \hat{s}_{ML} = \arg\min_{s} J_{ML}(s) \tag{14} \]
where
\[ J_{ML}(s) = \sum_{i=0}^{N-1} \frac{1}{\sigma_i^2} \left[ \hat{\tau}_i - T(\{m_{i0}, m_{i1}\}, s) \right]^2 \tag{15} \]
4.2 Linear intersection
Minimizing the error function in (14) involves searching for a position $s$ for
which the theoretical delays match the measured
delays as closely as possible. Instead of using a numerical search method to find the location of
the source, a computationally less expensive closed-form solution is used.
The algorithm is based on the linear intersection algorithm described
in [BAS97], modified from three- to two-dimensional intersections.
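For comparison, the numerical search that the closed-form solution avoids can be sketched as a brute-force evaluation of the error function over a grid of candidate positions. The function names, sensor coordinates, variances, grid and the speed of sound of 343 m/s are assumptions chosen for illustration; the sign convention of the theoretical delay follows the order of the sensors in each pair.

```python
import math


def theoretical_delay(m0, m1, s, c=343.0):
    """Delay in seconds between sensors m0 and m1 for a source at s."""
    return (math.dist(s, m0) - math.dist(s, m1)) / c


def j_ml(s, pairs, delays, variances, c=343.0):
    """Variance-weighted squared error between measured and theoretical delays."""
    return sum((tau - theoretical_delay(m0, m1, s, c)) ** 2 / var
               for (m0, m1), tau, var in zip(pairs, delays, variances))


# Two hypothetical sensor pairs on the x-axis and a source in front of them.
pairs = [((-0.85, 0.0), (-0.65, 0.0)), ((0.65, 0.0), (0.85, 0.0))]
source = (0.5, 1.75)
delays = [theoretical_delay(m0, m1, source) for m0, m1 in pairs]  # noise-free
variances = [1e-10, 1e-10]

# Brute-force search over a coarse grid that happens to contain the source.
grid = [(0.25 * ix - 1.0, 0.25 * iy + 0.25) for ix in range(9) for iy in range(12)]
estimate = min(grid, key=lambda s: j_ml(s, pairs, delays, variances))
```

Even this coarse grid needs more than a hundred evaluations of the error function per position estimate, which is the cost a closed-form method sidesteps.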
Once the direction of arrival is calculated for each sensor pair, the intersection of all estimated directions of arrival, together with the sensor position,
can be calculated. Given the position of sensor pair $i$, $m_i$, and its direction
of arrival, $v_i$, any point $p_i$ on the line originating from the array location in
the direction $v_i$ can be described as
\[ p_i = m_i + t_i v_i \tag{16} \]
(17)
(18)
where
\[ V = \begin{bmatrix} v_i & -v_j \end{bmatrix}, \qquad t = \begin{bmatrix} t_i \\ t_j \end{bmatrix} \tag{19} \]
and
\[ \Delta m = m_j - m_i \tag{20} \]
\[ t = V^{-1} \Delta m \tag{21} \]
(22)
sLI =
(23)
N
2
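The pairwise intersection of two lines of the form $p = m + t v$ amounts to solving a 2-by-2 linear system for the two line parameters. A minimal sketch, assuming two-dimensional tuples for positions and directions (the function name and the example geometry are hypothetical):

```python
def intersect(m_i, v_i, m_j, v_j):
    """Intersect the lines m_i + t_i*v_i and m_j + t_j*v_j in two dimensions.

    Solves V t = m_j - m_i with V = [v_i, -v_j] and returns (point, (t_i, t_j)).
    """
    a, b = v_i[0], -v_j[0]
    c, d = v_i[1], -v_j[1]
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("directions are parallel; no unique intersection")
    dx, dy = m_j[0] - m_i[0], m_j[1] - m_i[1]
    # t = V^{-1} (m_j - m_i), using the closed-form inverse of a 2x2 matrix.
    t_i = (d * dx - b * dy) / det
    t_j = (-c * dx + a * dy) / det
    point = (m_i[0] + t_i * v_i[0], m_i[1] + t_i * v_i[1])
    return point, (t_i, t_j)


# Two hypothetical subarrays 1.5 m apart, both pointing at a source at (0.5, 1.7).
point, (t_i, t_j) = intersect((-0.75, 0.0), (1.25, 1.7), (0.75, 0.0), (-0.25, 1.7))
```

Parallel directions make the matrix singular, so a practical implementation has to reject or down-weight such pairs.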
This section describes the algorithm used for tracking sources from individual
position estimates. Section 4 describes an algorithm that estimates a position
for the source from the time delays between sensors in a sensor array,
using several sensor subarrays. The algorithm gives a
set of points sampled at a certain time interval. The position estimates are
distorted by noise and need to be filtered spatially.
Figure 5: Linear intersection.
5.1 Track association
When there are multiple sources being located (for example, two or more talkers having a conversation), simply filtering the samples as they are calculated
is not an option. An algorithm to determine which source a sample belongs
to must be implemented, and only then can samples be filtered properly. The
track association algorithm is based on a method described in [SBS97].
A track is a state vector following a source. When a new sample is
calculated, one of the currently stored tracks is first associated with it. The
track associated with the sample is the nearest track, but the track must also
be within a certain distance from the sample.
If no track is good enough to be associated, a new track is created. An
association can fail for two main reasons: the sample belongs to a
completely new source, or the sample was distorted by so much noise that it fell
outside the acceptance region for the correct source. When a new track is
created, it is not yet known whether the sample comes from a new source becoming active,
or is just a noise-corrupted sample from a current track. Therefore, all new
tracks are marked as potential tracks. If no new samples fall within the
acceptance region within a certain time, it can be assumed that the track was created
from a noise-corrupted sample and it is dropped. However, if more
samples start to fall within the acceptance region, it is assumed that the
track is indeed tracking an active source, and the track is promoted to an
active track.
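The association rules above can be sketched as follows. All names and thresholds (acceptance radius, number of samples needed for promotion, timeout) are hypothetical, and the distance is measured to the most recent sample of each track, which is an assumption; the method in [SBS97] may use a different distance measure.

```python
import math


class Track:
    """A track with its list of samples; new tracks start as potential tracks."""

    def __init__(self, position, time):
        self.samples = [position]
        self.last_update = time
        self.active = False

    @property
    def position(self):
        return self.samples[-1]


def associate(tracks, sample, time, radius=0.5, promote_after=3, timeout=1.0):
    """Associate a sample with the nearest track or create a new potential track."""
    # Drop potential tracks that were not confirmed within the timeout.
    tracks[:] = [t for t in tracks if t.active or time - t.last_update <= timeout]
    nearest = min(tracks, key=lambda t: math.dist(t.position, sample), default=None)
    if nearest is None or math.dist(nearest.position, sample) > radius:
        # No track within the acceptance region: start a new potential track.
        nearest = Track(sample, time)
        tracks.append(nearest)
    else:
        nearest.samples.append(sample)
        nearest.last_update = time
        if len(nearest.samples) >= promote_after:
            nearest.active = True  # enough support: promote to an active track
    return nearest


tracks = []
for time, sample in [(0.0, (0.0, 1.0)), (0.1, (0.1, 1.0)), (0.2, (0.05, 1.05))]:
    talker = associate(tracks, sample, time)
outlier = associate(tracks, (2.0, 1.0), 0.3)    # far away: new potential track
count_with_outlier = len(tracks)
associate(tracks, (0.0, 1.0), 2.0)              # the unconfirmed outlier has expired
```

In this example the three nearby samples promote the first track to active, while the isolated outlier creates a potential track that is dropped once the timeout has passed.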
A track associated with a sample is updated. The sample is added to
the list of samples for that track, and eventually filtered to smooth the path
5.2 Filtering
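\[ \mathbf{x}_n = \begin{bmatrix} x_n & y_n & \dot{x}_n & \dot{y}_n \end{bmatrix}^T \tag{24} \]
\[ C = \begin{bmatrix} I_2 & 0_2 \end{bmatrix} \tag{26} \]
\[ F = \begin{bmatrix} I_2 & T I_2 \\ 0_2 & I_2 \end{bmatrix} \tag{27} \]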
The correlation matrices for the process and measurement noise, $Q_1$ and $Q_2$
respectively, are
\[ Q_1 = q_1 I_4, \qquad Q_2 = q_2 I_2 \tag{28} \]
where $q_1$ and $q_2$ are the variances of the process and measurement noise.
The algorithm for estimating the source's state vector at iteration $n$, $\hat{\mathbf{x}}_n$,
given the estimated position samples, $\mathbf{y}_n$, is shown in algorithm 2. The initial
state vector $\hat{\mathbf{x}}_0$ is the estimated position and velocity of the source at the
time the Kalman filter starts tracking the source. The position is estimated
from the samples collected before the track was promoted to an active track
(see section 5.1) and the velocity is assumed to be zero. The initial predicted
state-error correlation matrix is $K_0 = 0_4$.
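Algorithm 2 Kalman filter based on one-step prediction.
for $n = 1, 2, 3, \ldots$ do
    $G_n = F K_n C^H \left[ C K_n C^H + Q_2 \right]^{-1}$
    $a_n = \mathbf{y}_n - C \hat{\mathbf{x}}_n$
    $\hat{\mathbf{x}}_{n+1} = F \hat{\mathbf{x}}_n + G_n a_n$
    $K_{n+1} = F \left[ K_n - F^{-1} G_n C K_n \right] F^H + Q_1$
end for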
Instead of iterating through all the samples at once with the for-loop in
algorithm 2, each new sample calculated will trigger a single pass in the loop.
This is necessary for real time filtering where the filtered result is needed as
new samples are calculated.
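A single pass of the loop, triggered once per new position sample, can be sketched as a `step` method. This is an illustration under stated assumptions, not the thesis implementation: it assumes a constant-velocity model with sample interval `T`, real-valued matrices (so the Hermitian transpose is an ordinary transpose), and hypothetical class, helper and parameter names. The covariance update is written with the measurement matrix inside the bracket, which is needed for the matrix dimensions to agree.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scaled_identity(n, scale):
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


class KalmanTracker:
    """One-step-prediction Kalman filter for a state [x, y, x', y']."""

    def __init__(self, position, T, q1, q2):
        self.F = [[1.0, 0.0, T, 0.0], [0.0, 1.0, 0.0, T],
                  [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
        # The inverse of this F is known in closed form: negate T.
        self.F_inv = [[1.0, 0.0, -T, 0.0], [0.0, 1.0, 0.0, -T],
                      [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
        self.C = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
        self.Q1 = scaled_identity(4, q1)
        self.Q2 = scaled_identity(2, q2)
        self.x = [[position[0]], [position[1]], [0.0], [0.0]]  # zero initial velocity
        self.K = scaled_identity(4, 0.0)                        # K_0 = 0

    def step(self, y):
        """Process one position sample and return the predicted next position."""
        F, C, K = self.F, self.C, self.K
        Ct = transpose(C)
        S = add(matmul(matmul(C, K), Ct), self.Q2)
        G = matmul(matmul(matmul(F, K), Ct), inverse_2x2(S))    # gain
        a = sub([[y[0]], [y[1]]], matmul(C, self.x))            # innovation
        self.x = add(matmul(F, self.x), matmul(G, a))
        filtered = sub(K, matmul(matmul(matmul(self.F_inv, G), C), K))
        self.K = add(matmul(matmul(F, filtered), transpose(F)), self.Q1)
        return self.x[0][0], self.x[1][0]


# Hypothetical source moving at 0.5 m/s along x, sampled every 0.1 s.
T = 0.1
tracker = KalmanTracker(position=(0.0, 1.0), T=T, q1=0.01, q2=0.01)
for n in range(1, 201):
    predicted = tracker.step((0.5 * n * T, 1.0))
```

Keeping the state and the correlation matrix inside the object means each track can own one tracker and call `step` whenever a sample is associated with it.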
6 Experiments
6.1
The algorithm to estimate the angle of arrival is evaluated using measurements with different types of sound and room environments and from different
angles relative to the sensor array. The three scenarios are:
Speech in a room with low echo.
Speech in a room with moderate echo.
White Gaussian noise in a room with low echo.
The speech used is pre-recorded speech of random phrases. The room is
of size 4 × 5 m. One wall is covered by an acoustic damper, and the
other walls are uncovered, giving a moderate echo. Along the walls
are some tables with computer equipment and home entertainment systems,
speakers and some chairs. Figure 6 shows a general overview of the room, the
placement of the sensor array and the placement of the source at the different angles.
The source is placed at four angles: 0°, 22.5°, 45° and 67.5°. Figure 7 shows
the same room, but with acoustic dampers placed along the walls around the
sensor array to reduce the echo.
The sound is played using a speaker placed at the angles shown in figures 6
and 7, at a distance of 2 m from the array. The sound is played
at normal speech level. Noise is present in the form of computer fans and
ventilation, and the signal-to-noise ratio at the sensors is about 15 dB.
The sample rate is 8 kHz. The array consists of 6 microphones with an
inter-sensor distance of 4 cm.
6.1.1
6.2
The localization and tracking algorithms are tested in the same room as
before. Two scenarios are tested:
Two fixed talkers having a conversation.
Single talker moving in a circle.
In both scenarios, the sample rate is 8 kHz and a 512-subband filter bank
is used.
6.2.1
The scenario setup is given in figure 10. The distance between the two
sensor subarrays is 1.5 m, and the two talkers are located 1.7 m out from the
arrays.
The scenario simulates two talkers having a conversation. The test consists of three phases. The talkers begin by speaking one at a time for about 20 s
each. Then they alternate, talking for 5 s each, to simulate more rapid changes in
the location estimates, and in the last phase they talk at the same time to
show how the algorithms handle two simultaneous sources.
Figure 11 shows the result from the evaluation after track association
and filtering. Figure 11(a) shows the x and y position components over time.
The first two phases pass without problems; the sources are clearly separated
and located. In the third phase, the algorithm can find two separate sources
and can track them independently, although tracks are sometimes lost and
recreated. Figure 11(b) shows the positions of the sources as a view from
above.
6.2.2
The setup in this scenario is shown in figure 12. The distance between the
sensor subarrays is, as in the previous scenario, 1.5 m. The talker is now
moving in a circle, about 1.8 m out from the arrays. The result from this
evaluation is shown in figure 13, where figure 13(a) shows the x and y position
components over time and figure 13(b) the position from above.
translation went smoothly. The general structure of the code in Matlab
and C++ is similar, so the translation was basically a line-by-line translation.
The main concern in the beginning was the available CPU time. It was
later found that it wasn't really the biggest problem in implementing the
algorithms in real time. A standard-equipped Pentium 4 at 1.5 GHz could
easily handle 2-3 arrays with 4-6 sensors per array, at sample rates up to 16
kHz, enough to sample speech at good quality, and filter banks with 1024
subbands. As new computers have significantly more computing power,
CPU time is not a problem unless the arrays become too large or too many.
It is also a good choice for real time applications, as it doesn't require
much computing power compared to what's available in a standard desktop
computer.
The filter bank was also a large improvement compared to only using the
DFT. The filter bank forms a time-averaged spectrum, making the important phase information less variable for the inter-sensor delay estimator. The
computational complexity of the filter bank is higher, but well within the
limits for real time applications, and the improved precision was well worth
it.
The linear intersection, a closed-form algorithm, is computationally very
efficient. By associating samples with tracks, and spatially filtering the
tracks, the localization algorithms are able to quickly locate and track multiple
sources; not just alternating sources, but also, to some extent, simultaneous
sources.
Further, the algorithms can be improved with smart acoustic detectors
and classifiers to classify sounds and locate only certain types of events
(or ignore them), such as tracking speech only or locating noise sources. The
method for detecting multiple sources can also be improved. The current
implementation relies on the two sources being at about the same signal
power level at the subarrays.
References
[ADBS95] John E. Adcock, Joseph H. DiBiase, Michael S. Brandstein, and Harvey F. Silverman. Practical issues in the use of a frequency-domain delay estimator for microphone-array applications, January 1995.
[BAS97] Michael S. Brandstein, John E. Adcock, and Harvey F. Silverman. A closed form location estimator for use with room environment microphone arrays. IEEE Transactions on Speech and Audio Processing, 5(1):45–50, January 1997.
[Hay02] Simon Haykin. Adaptive filter theory. Prentice Hall, fourth edition, 2002.
[KC76] Charles H. Knapp and G. Clifford Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing, 24(4):320–327, August 1976.
[LRV01]
[SBS97] Douglas E. Sturim, Michael S. Brandstein, and Harvey F. Silverman. Tracking multiple talkers using microphone-array measurements. IEEE Transactions on Acoustics, Speech and Signal Processing, 1:371–374, 1997.
[Figure 6: Overview of the room. The source is placed 200 cm from the sensor array at the angles 0°, 22.5°, 45° and 67.5°.]
[Figure 7: The same room with acoustic dampers around the sensor array. The source is placed 200 cm from the sensor array at the angles 0°, 22.5°, 45° and 67.5°.]
[Figure 8: Four panels plotted against the number of subbands (64 to 2048) for the series "Speech, moderate echo", "Speech, low echo" and "Noise, low echo".]
[Figure 9: Four panels plotted against the number of subbands (64 to 2048).]
[Figure 10: Scenario setup with Speaker A and Speaker B, x-axis and y-axis; dimension labels 75 cm, 75 cm and 150 cm.]
[Figure 11: (a) Position, x [m] and Position, y [m] against Time [s]; (b) Position, y [m] against Position, x [m].]
[Figure 12: Scenario setup with x-axis and y-axis; dimension labels 75 cm, 75 cm and 150 cm.]
[Figure 13: (a) Position, x [m] and Position, y [m] against Time [s]; (b) Position, y [m] against Position, x [m].]