Microphone Arrays: A Tutorial
Iain McCowan
April 2001
Abstract
This report presents a tutorial of fundamental array processing and beamforming theory relevant
to microphone array speech processing. A microphone array consists of multiple microphones placed
at different spatial locations. Built upon a knowledge of sound propagation principles, the multiple
inputs can be manipulated to enhance or attenuate signals emanating from particular directions. In
this way, microphone arrays provide a means of enhancing a desired signal in the presence of corrupting
noise sources. Moreover, this enhancement is based purely on knowledge of the source location, and
so microphone array techniques are applicable to a wide variety of noise types. Microphone arrays
have great potential in practical applications of speech processing, due to their ability to provide both
noise robustness and hands-free signal acquisition.
This report has been extracted from my PhD thesis, and can be referenced as:
I. A. McCowan, "Robust Speech Recognition using Microphone Arrays," PhD Thesis, Queensland University of Technology, Australia, 2001.
For a more in-depth discussion of key microphone array processing techniques, the interested reader is referred to:
M. Brandstein and D. Ward (Eds), "Microphone Arrays," Springer, 2001.
1.2 Wave Propagation
Sound waves propagate through fluids as longitudinal waves. The molecules in the fluid move back
and forth in the direction of propagation, producing regions of compression and expansion. By using
Newton’s equations of motion to consider an infinitesimal volume of the fluid, an equation governing the
wave's propagation can be developed. A generalised wave equation for acoustic waves is quite complex, as it depends upon the properties of the fluid; however, assuming an ideal fluid with zero viscosity, the wave equation can be derived as [2]
\nabla^2 x(t, \mathbf{r}) - \frac{1}{c^2} \frac{\partial^2}{\partial t^2} x(t, \mathbf{r}) = 0   (1)
where x(t, r) is a function representing the sound pressure at a point in time and space,
\mathbf{r} = \left[ x \;\; y \;\; z \right]^T   (2)
and ∇2 is the Laplacian operator. The speed of propagation, c, depends upon the pressure and density
of the fluid, and is approximately 330 m/s in air. The wave equation of Equation 1 is known as the
governing equation for a wide range of propagating waves, including electromagnetic waves.
The solution to the differential wave equation can be derived using the method of separation of
variables. The solution is well known and for a monochromatic plane wave is given as [2]
x(t, \mathbf{r}) = A \, e^{j(\omega t - \mathbf{k} \cdot \mathbf{r})}   (3)
where A is the wave amplitude, ω = 2πf is the frequency in radians per second, and the wavenumber
vector k indicates the speed and direction of wave propagation and is given by
\mathbf{k} = \frac{2\pi}{\lambda} \left[ \sin\theta\cos\phi \;\; \sin\theta\sin\phi \;\; \cos\theta \right]^T   (4)
where the wavelength λ is related to c by the simple relation λ = c/f . Alternately, the solution for a
spherical wave can be derived as [2]
x(t, \mathbf{r}) = -\frac{A}{4\pi r} \, e^{j(\omega t - kr)}   (5)
where r = |r| is the radial distance from the source, and k is the scalar wavenumber, given by 2π/λ.
The spherical wave solution shows that the signal amplitude decays at a rate proportional to the distance
from the source. This dependence of the amplitude on the distance has important implications for array
processing algorithms when the source is in the near-field, as will be discussed in later sections. While
sound waves are typically spherical in nature, they may be considered as plane waves at a sufficient
distance from the source, and this approximation is often used to simplify mathematical analysis.
The plane wave solution in Equation 3 is expressed in terms of two variables, time and space. Due to
the well defined propagation of the signal, these two variables are linked by a simple relation, and thus
the solution can be expressed as a function of a single variable. If we formulate the plane wave solution as
x(t, \mathbf{r}) = A \, e^{j\omega(t - \boldsymbol{\beta} \cdot \mathbf{r})}   (6)

where \boldsymbol{\beta} = \mathbf{k}/\omega, and we define a new variable u such that u = t - \boldsymbol{\beta} \cdot \mathbf{r}, then the solution can be expressed as

x(u) = A \, e^{j\omega u}   (7)
For spherical waves, with the substitution u = t − r/c, we have the similar expression
x(u) = -\frac{A}{4\pi r} \, e^{j\omega u}   (8)
Due to the linearity of the wave equation, the monochromatic solution can be expanded to the more
general polychromatic case by considering the solution as a sum or integral of such complex exponentials.
Fourier theory tells us that any function with a convergent Fourier integral can be expressed as a weighted
superposition of complex exponentials. From this we can make the powerful conclusion that any signal
with a valid Fourier transform, irrespective of its shape, satisfies the wave equation.
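As a concrete illustration of these solutions, the following Python sketch (with illustrative values for A, f and the observation point, none of which come from the text) evaluates the plane-wave and spherical-wave solutions of Equations 3 and 5 at a single point, and verifies the single-variable form of Equations 6 and 7.

```python
import numpy as np

# Sketch: evaluate the monochromatic plane-wave and spherical-wave solutions
# (Equations 3 and 5) at one point in time and space. All values illustrative.
c = 330.0                      # speed of sound in air (m/s), as used in the text
f = 1000.0                     # frequency (Hz)
omega = 2 * np.pi * f          # angular frequency (rad/s)
lam = c / f                    # wavelength (m)
A = 1.0                        # wave amplitude

theta, phi = np.pi / 2, np.pi / 4      # propagation direction (spherical angles)
k = (2 * np.pi / lam) * np.array([np.sin(theta) * np.cos(phi),
                                  np.sin(theta) * np.sin(phi),
                                  np.cos(theta)])   # wavenumber vector (Eq. 4)

r = np.array([0.3, 0.0, 0.0])  # observation point (m)
t = 1e-3                       # observation time (s)

# Plane wave (Eq. 3): x(t, r) = A exp(j(wt - k.r))
x_plane = A * np.exp(1j * (omega * t - k @ r))

# Equivalent single-variable form (Eqs. 6-7): u = t - beta.r with beta = k/omega
beta = k / omega
u = t - beta @ r
assert np.isclose(x_plane, A * np.exp(1j * omega * u))

# Spherical wave (Eq. 5): amplitude decays in proportion to distance
rad = np.linalg.norm(r)
x_sph = -A / (4 * np.pi * rad) * np.exp(1j * (omega * t - (2 * np.pi / lam) * rad))
print(abs(x_plane), abs(x_sph))
```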
In this section, we have seen that propagating acoustic signals can be expressed as functions of a
single variable, with time and space linked by a simple relation. In addition, the information in the
signal is preserved as it propagates. These two conclusions imply that, for a band-limited signal, we can
reconstruct the signal over all space and time by either
• temporally sampling the signal at a given location in space, or
• spatially sampling the signal at a given instant of time.
The latter implication is the basis for all aperture and sensor array signal processing techniques.
Other implications from the above wave propagation analysis that are important for array processing
applications are [3]
• The speed of propagation depends on the properties of the medium, and thus is constant for a given
wave type and medium. For the specific case of acoustic waves in air, the speed of propagation is
approximately 330 m/s.
• In general, waves propagate from their source as spherical waves, with the amplitude decaying at a
rate proportional to the distance from the source.
• The superposition principle applies to propagating wave signals, allowing multiple waves to occur
without interaction. To separate these signals, algorithms must be developed to distinguish the
different signals based upon knowledge of their temporal and spatial characteristics.
The above discussion has retained the simplicity of assuming a homogeneous, lossless medium, and
neglecting effects such as dispersion, diffraction, and changes in propagation speed. A thorough analysis
of acoustic field theory can be found in Ziomek [2].
[Figure 1: Propagating wavefronts arriving at a finite aperture; the aperture "sees" only the portion of each wavefront falling across its extent.]

The response of a receiving aperture is characterised by its directivity pattern, the Fourier transform of the aperture function A_R(f, \mathbf{r}),

D_R(f, \boldsymbol{\alpha}) = \int A_R(f, \mathbf{r}) \, e^{j 2\pi \boldsymbol{\alpha} \cdot \mathbf{r}} \, d\mathbf{r}

where

\boldsymbol{\alpha} = f \boldsymbol{\beta} = \frac{1}{\lambda} \left[ \sin\theta\cos\phi \;\; \sin\theta\sin\phi \;\; \cos\theta \right]^T   (13)
is the direction vector of the wave, where the angles θ and φ are as shown in Figure 2. Note that the
frequency dependence in the above equations is implicit in the wavelength term as λ = c/f .
[Figure 2: Spherical coordinate system for a point (r, θ, φ), and a linear aperture of length L lying along one axis, extending from −L/2 to L/2.]
Consider a linear aperture of length L lying along the x-axis, as shown in Figure 2. In this case

\mathbf{r} = \left[ x_a \;\; 0 \;\; 0 \right]^T   (14)
and the directivity pattern simplifies to
D_R(f, \alpha_x) = \int_{-L/2}^{L/2} A_R(f, x_a) \, e^{j 2\pi \alpha_x x_a} \, dx_a   (15)
where
\alpha_x = \frac{\sin\theta \cos\phi}{\lambda}   (16)
If we write the equation as a function of the angles θ and φ we obtain

D_R(f, \theta, \phi) = \int_{-L/2}^{L/2} A_R(f, x_a) \, e^{j \frac{2\pi}{\lambda} \sin\theta \cos\phi \, x_a} \, dx_a   (17)
The above expressions have been developed for plane waves and thus are only valid for the case of far-field
sources. For a linear aperture, a wave source may be considered to come from the far-field of the aperture
if [5]
|\mathbf{r}| > \frac{2L^2}{\lambda}   (18)
For now the far-field assumption serves to simplify the discussion of aperture properties. The details of
the more precise case of near-field sources will be considered later when discussing discrete linear sensor
arrays.
Consider the case of a linear aperture with a uniform, frequency-independent aperture function. The aperture function may be written as

A_R(x_a) = \text{rect}(x_a/L)   (19)

where

\text{rect}(x/L) \triangleq \begin{cases} 1 & |x| \le L/2 \\ 0 & |x| > L/2 \end{cases}   (20)
The resulting directivity pattern is given by

D_R(f, \alpha_x) = \int_{-L/2}^{L/2} e^{j 2\pi \alpha_x x_a} \, dx_a = L \, \text{sinc}(\alpha_x L)   (22)

where

\text{sinc}(x) \triangleq \frac{\sin(\pi x)}{\pi x}   (23)

[Figure 3: The uniform aperture function rect(x_a/L). Figure 4: Magnitude of the resulting directivity pattern, |L sinc(α_x L)|, with nulls at integer multiples of λ/L.]

Since the sinc function is bounded by −1 ≤ sinc(x) ≤ 1, the maximum possible value of the directivity pattern is D_max = L, and the
normalised directivity pattern is given as
D_N(f, \alpha_x) = \frac{D_R(f, \alpha_x)}{D_{max}} = \text{sinc}(\alpha_x L)   (24)
or in terms of the angles θ and φ
D_N(f, \theta, \phi) = \text{sinc}\!\left( \frac{L}{\lambda} \sin\theta \cos\phi \right)   (25)
A common tool for examining the properties of the aperture response is a polar plot of the horizontal
directivity pattern over angle φ, given by
D_N\!\left(f, \frac{\pi}{2}, \phi\right) = \text{sinc}\!\left( \frac{L}{\lambda} \cos\phi \right)   (26)
Polar plots of the horizontal directivity pattern are shown in Figure 5 for different values of L/λ, demon-
strating the beam width’s dependence on this ratio as discussed previously.
Although the directivity pattern given by Equation 22 can theoretically be evaluated for any value of αx, because αx = sin θ cos φ / λ it is practically bounded by −1/λ ≤ αx ≤ 1/λ. This interval is referred to as the visible region of the aperture. To examine the physical significance of key values of αx we consider the horizontal directivity pattern, for which θ = π/2. First, we see that αx = 0 implies that φ = π/2 or φ = 3π/2, corresponding to a source that is situated perpendicular to the aperture axis, referred to as a broadside source. Conversely, αx = ±1/λ implies that φ = 0 or φ = π, corresponding to a source on the same axis as the aperture, termed an endfire source.
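The normalised horizontal directivity pattern of Equation 26 is straightforward to evaluate numerically; the sketch below does so for an illustrative aperture length. Note that numpy's sinc uses the same sin(πx)/(πx) convention as Equation 23.

```python
import numpy as np

# Sketch: horizontal directivity pattern of a uniform linear aperture (Eq. 26),
# D_N(f, pi/2, phi) = sinc((L/lambda) cos(phi)). Parameter values illustrative.
c = 330.0
f = 1000.0
lam = c / f
L = 0.5                                  # aperture length (m)
phi = np.linspace(0.0, 2 * np.pi, 721)   # horizontal angle
D = np.sinc((L / lam) * np.cos(phi))     # normalised pattern (Eq. 26)

# Broadside (phi = 90 deg) lies at the main-lobe maximum; endfire (phi = 0)
# sits at the edge of the visible region.
print(abs(D[np.argmin(np.abs(phi - np.pi / 2))]))   # ~1.0 at broadside
```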
1.4 Linear Sensor Arrays

A discrete sensor array samples a continuous aperture at a finite number of points, and its aperture function can be written as

A(f, x_a) = \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f) \, e_n(f, x_a - x_n)

where w_n(f) is the complex weight for element n, e_n(f, x) is its complex frequency response or element function, and x_n is its spatial position on the x-axis. If we substitute this discrete aperture function into Equation 15 we obtain the far-field directivity pattern as

D(f, \alpha_x) = \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f) \, E_n(f, \alpha_x) \, e^{j 2\pi \alpha_x x_n}
where E_n(f, \alpha_x) is the directivity pattern of element n.

[Figure 5: Polar plots of the horizontal directivity pattern for different values of L/λ.]

[Figure 6: Geometry of a linear sensor array, with elements at positions x_{−2}, ..., x_2 along the axis.]
In the case where all the elements have an identical frequency response (that is, E_n(f, α_x) = E(f, α_x) ∀n), the directivity pattern can be simplified to

D(f, \alpha_x) = E(f, \alpha_x) \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f) \, e^{j 2\pi \alpha_x x_n}   (30)
Equation 30 is the far-field directivity pattern for a linear array of N identical sensors, with arbitrary
inter-element spacing. For the case where all elements are equally spaced by d metres, the directivity
pattern becomes
D(f, \alpha_x) = E(f, \alpha_x) \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f) \, e^{j 2\pi \alpha_x n d}   (31)

or, in terms of the angles θ and φ,

D(f, \theta, \phi) = E(f, \theta, \phi) \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f) \, e^{j \frac{2\pi}{\lambda} n d \sin\theta \cos\phi}   (33)
Equation 33 gives us the directivity pattern for a linear, equally spaced array of identical sensors.
From the equation we see that the directivity pattern depends upon
• the number of array elements N
• the inter-element spacing d, and
• the frequency f .
Recall that a discrete sensor array approximates a continuous aperture. The effective length of a sensor
array is the length of the continuous aperture which it samples, and is given by L = N d. The actual
physical length of the array, as given by the distance between the first and last sensors, is however d(N −1).
Several interesting characteristics of a linear, equally spaced sensor array can be observed by plotting the
directivity pattern for the following scenarios
1. varying number of array elements N (L and f fixed).
2. varying effective array length L = N d (N and f fixed).
3. varying frequency f (N and L fixed).
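These three scenarios can be reproduced numerically. The sketch below evaluates a hedged reading of Equation 33 with identical elements (E = 1) and uniform weights w_n = 1/N; all parameter values are illustrative.

```python
import numpy as np

# Sketch: directivity pattern of a linear, equally spaced array of identical,
# equally weighted sensors, D(f, phi) = (1/N) sum_n exp(j 2*pi/lambda nd cos(phi)).
c = 330.0

def directivity(f, N, d, phi):
    lam = c / f
    n = np.arange(N) - (N - 1) / 2.0          # symmetric element indices
    # One phase term per element; sum across elements for each angle.
    phase = np.exp(1j * (2 * np.pi / lam) * np.outer(np.cos(phi), n * d))
    return np.abs(phase.sum(axis=1)) / N

phi = np.linspace(0, np.pi, 361)
D_N3 = directivity(1000.0, 3, 0.1, phi)       # scenario 1: vary N
D_N10 = directivity(1000.0, 10, 0.1, phi)
D_long = directivity(1000.0, 5, 0.2, phi)     # scenario 2: vary L = N*d
D_hi_f = directivity(3000.0, 5, 0.1, phi)     # scenario 3: vary f
print(D_N3.max(), D_N10.max(), D_long.max(), D_hi_f.max())  # unity at broadside
```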
[Figure 7: Directivity pattern for varying number of array elements (N = 3, 5, 10).]
Figure 8: Directivity pattern for varying effective array length (f =1 kHz, N =5)
[Figure 9: Directivity pattern for varying frequency (400 Hz ≤ f ≤ 3000 Hz; N and L fixed).]
Figure 7 plots the directivity pattern for the first of these scenarios. We observe that the sidelobe level decreases with increasing spatial sampling frequency; that is, the more sensors we use, the lower the sidelobe level. The directivity pattern for the second scenario is shown in Figure 8. The plot shows
that the beam width decreases as the effective array length (and thus the spacing) increases. In fact, the
beam width is inversely proportional to the product f L, as seen in Figure 4. Given that L = N d and that
N is fixed in this case, we see that to vary the beam width we must vary f d. It is more common however
to require a constant beam width, in which case we must ensure that f d remains relatively constant.
We thus see that, for a given frequency, two important characteristics of the array directivity pattern,
namely the beam width and the sidelobe level, are directly determined by the inter-element spacing and
the number of sensors respectively.
For a given array configuration, we note that the beam width will vary as a function of frequency: as the frequency increases, the beam width will decrease. This effect is shown in Figure 9, which plots the horizontal directivity pattern for the third scenario, where the frequency is varied over the range 400 Hz ≤ f ≤ 3000 Hz.
[Figure 10: Polar plots of the horizontal directivity pattern with and without spatial aliasing.]

A sensor array samples a propagating wave field in space, and sampling theory applies in both the temporal and spatial domains. For temporal sampling, the sampling theorem requires that

f_s \ge 2 f_{max}   (34)
where fmax is the maximum frequency component in the signal’s frequency spectrum. Similarly, for
spatial sampling we have the requirement that
f_{xs} = \frac{1}{d} \ge 2 f_{x_{max}}   (35)
where fxs is the spatial sampling frequency in samples per metre and fxmax is the highest spatial frequency
component in the angular spectrum of the signal. The spatial frequency of the signal along the x-axis is given by

f_x = \frac{\sin\theta \cos\phi}{\lambda}   (36)
The maximum value of this ratio naturally occurs when the numerator is maximum and the denominator
minimum. This leads to the relation
f_{x_{max}} = \frac{1}{\lambda_{min}}   (37)

and consequently the requirement that

d < \frac{\lambda_{min}}{2}   (38)
where λmin is the minimum wavelength in the signal of interest. Equation 38 is known as the spatial
sampling theorem, and must be adhered to in order to prevent the occurrence of spatial aliasing in the
directivity pattern of a sensor array. Figure 10 illustrates the effect of spatial aliasing on the polar plot
of the horizontal directivity pattern.
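The spatial sampling theorem translates directly into a design check on the inter-element spacing; a minimal sketch, assuming an illustrative 3 kHz upper frequency, is given below.

```python
import numpy as np

# Sketch: check the spatial sampling theorem (Eq. 38) for a candidate design.
c = 330.0
f_max = 3000.0                # highest frequency of interest (Hz), illustrative
lam_min = c / f_max           # minimum wavelength (m)
d = 0.05                      # proposed inter-element spacing (m)

print("max allowed spacing:", lam_min / 2, "m")   # 0.055 m here
print("aliasing-free:", d < lam_min / 2)          # True for d = 0.05 m
```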
The array gain measures the improvement in signal to noise ratio provided by the array, and is defined as the ratio of the gain to the desired signal over the average gain to the noise,

G_a = \frac{G_d}{G_n}   (39)

where G_d is the gain to the desired signal and G_n is the average gain to all noise sources. The gain to
the desired signal corresponds to the power of the directivity pattern in the direction of arrival, while the
noise gain naturally changes depending on the nature of the noise field.
A diffuse noise field is one in which noise of equal energy propagates in all directions at all times (see Section 2.2). In the case of a diffuse noise field, the array gain is also known as the factor of directivity
and is given by
G_a(f, \theta_0, \phi_0) = \frac{\left| D(f, \theta_0, \phi_0) \right|^2}{\frac{1}{4\pi} \int_0^{2\pi} \int_0^{\pi} \left| D(f, \theta, \phi) \right|^2 \sin\theta \, d\theta \, d\phi}   (40)
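The factor of directivity of Equation 40 can be evaluated by numerical integration over the sphere; the following sketch does so for an equally weighted linear array with illustrative parameters.

```python
import numpy as np

# Sketch: numerically evaluate the factor of directivity (Eq. 40) for an
# equally weighted linear array, with the look direction at broadside.
c, f, N, d = 330.0, 1000.0, 5, 0.1
lam = c / f
n = np.arange(N) - (N - 1) / 2.0

theta = np.linspace(0, np.pi, 181)
phi = np.linspace(0, 2 * np.pi, 361)
TH, PH = np.meshgrid(theta, phi, indexing="ij")

# D(f, theta, phi) = (1/N) sum_n exp(j 2*pi/lambda * n*d * sin(theta)cos(phi))
alpha_x = np.sin(TH) * np.cos(PH) / lam
D = np.exp(1j * 2 * np.pi * np.multiply.outer(alpha_x, n * d)).sum(axis=-1) / N

num = np.abs(D[len(theta) // 2, 90]) ** 2          # |D|^2 toward (90, 90) deg
integrand = (np.abs(D) ** 2) * np.sin(TH)
denom = np.trapz(np.trapz(integrand, phi, axis=1), theta) / (4 * np.pi)
print("directivity factor:", num / denom)
```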
1.5 Near-field Sources

It was stated in Equation 18 that a source may be considered to lie in the far-field of a linear aperture if

|\mathbf{r}| > \frac{2L^2}{\lambda}   (41)
Under this assumption, the wavefronts arriving at the aperture can be considered as plane waves, that
is, the curvature of the wavefront can be neglected. For many practical applications of sensor arrays,
particularly within the context of speech recognition, the above criterion is not satisfied and the signal
source is said to be located within the near-field of the array. The derivation of equivalent near-field
expressions for the general continuous and discrete directivity patterns is quite involved, but for the
purpose of this discussion it is sufficient to consider the horizontal directivity pattern for a linear sensor
array. Indeed, a simple derivation of a near-field expression is possible in this case.
Consider the arrival of planar wavefronts on different elements in a sensor array, as shown in Figure 11.
From the diagram we see that the actual distance traveled by the wave between adjacent sensors is given
by
d′ = d cos φ (42)
More generally, the distance traveled by the wave between the reference sensor n = 0 and the nth sensor
is given by
d′ = nd cos φ (43)
Figure 12 illustrates the arrival of spherical wavefronts on different elements in a sensor array. From
the diagram we see that the actual distance traveled by the wave between the reference sensor and an adjacent sensor is given by

d' = d_1(r, \phi) - d_0(r, \phi)   (44)

and in general

d' = d_n(r, \phi) - d_0(r, \phi)   (45)
where dn (r, φ) is the distance from the source to the nth sensor as a function of the spherical coordinates
of the source (in the horizontal plane) with respect to the reference sensor. Using trigonometric relations,
it can be shown that this distance is given by [7]
d_n(r, \phi) = \left[ r^2 + 2r(x_n - x_0)\cos\phi + (x_n - x_0)^2 \right]^{\frac{1}{2}}   (46)
[Figure 11: Arrival of planar (far-field) wavefronts at a linear sensor array; the wave travels an extra distance d′ = d cos φ between adjacent sensors.]
[Figure 12: Arrival of spherical (near-field) wavefronts at a linear sensor array, with source-to-sensor distances d_0(r, φ) and d_1(r, φ).]
[Figure 13: Directivity pattern for a far-field and a near-field (r = 1 m) source (f = 1 kHz, N = 10, d = 0.1 m).]

Returning to the far-field horizontal directivity pattern of a linear array of equally spaced, identical sensors,

D(f, \phi) = \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f) \, e^{j \frac{2\pi}{\lambda} n d \cos\phi}

we note that the exponential contains the term nd cos φ. We have seen that this corresponds to the distance traveled by the propagating wave between the reference sensor and the nth sensor. Substituting in the equivalent expression for the near-field case we obtain
D'(f, \phi) = \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f) \, e^{j \frac{2\pi}{\lambda} \left( d_n(r,\phi) - d_0(r,\phi) \right)}   (49)
In addition, we recall that for spherical acoustic waves, the amplitude decays at a rate proportional to
the distance traveled. For far-field sources the amplitude differences between sensors can be considered to
be negligible, however, these amplitude differences may be significant for near-field sources. Incorporating
the amplitude dependency into the expression and normalising to give unity amplitude on the reference
sensor we obtain the following expression for the horizontal directivity pattern for near-field sources
D_{nf}(f, \phi) = \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f) \, \frac{d_0(r,\phi)}{d_n(r,\phi)} \, e^{j \frac{2\pi}{\lambda} \left( d_n(r,\phi) - d_0(r,\phi) \right)}   (50)
Figure 13 plots the horizontal directivity pattern for both a far-field source and a near-field source
for the same sensor array for r=1 m, illustrating the dependence of the pattern on the distance to the
source.
If a sensor array is desired to operate in the near-field, the near-field directivity pattern can be made
to match the corresponding far-field directivity pattern by compensating the frequency dependent sensor
18
weights wn (f ). If we replace the far-field weights by the near-field compensated weights
then the near-field directivity pattern will match the far-field directivity pattern obtained using the origi-
nal weights wn (f ). This procedure is referred to as near-field compensation and allows us to approximate
a desired far-field directivity pattern at a given point (r, φ) in the near-field.
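The following sketch evaluates the near-field pattern of Equation 50 and applies near-field compensation at a chosen point; the geometry (the first sensor is taken as reference) and all parameter values are illustrative.

```python
import numpy as np

# Sketch: near-field horizontal directivity (Eq. 50) and near-field
# compensated weights (Eq. 51), with the first sensor as reference.
c, f, N, d = 330.0, 1000.0, 10, 0.1
lam = c / f
x = (np.arange(N) - (N - 1) / 2.0) * d        # sensor x-positions
x0 = x[0]                                     # reference sensor position

def dist(r, phi, xn):
    # Source-to-sensor distance (Eq. 46), relative to the reference sensor.
    return np.sqrt(r**2 + 2 * r * (xn - x0) * np.cos(phi) + (xn - x0) ** 2)

def near_field_pattern(w, r, phi):
    dn = dist(r, phi[:, None], x[None, :])    # (angles, sensors)
    d0 = dn[:, :1]
    amp = d0 / dn                             # spherical spreading (Eq. 50)
    return np.abs((w * amp * np.exp(1j * 2 * np.pi / lam * (dn - d0))).sum(axis=1))

phi = np.linspace(0, np.pi, 361)
w_ff = np.full(N, 1.0 / N)                    # uniform far-field weights

# Near-field compensation at (r, phi') = (1 m, 90 deg): undo the amplitude and
# phase differences so the response at that point matches the far-field design.
r0, phi0 = 1.0, np.pi / 2
dn0 = dist(r0, phi0, x)
w_nf = w_ff * (dn0 / dn0[0]) * np.exp(-1j * 2 * np.pi / lam * (dn0 - dn0[0]))

print(near_field_pattern(w_ff, r0, phi).max(),
      near_field_pattern(w_nf, r0, phi).max())
```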
1.6 Beamforming
We now consider the term w_n(f) in the far-field horizontal directivity pattern of a linear sensor array

D(f, \phi) = \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f) \, e^{j \frac{2\pi}{\lambda} n d \cos\phi}   (52)
Up to this point of the discussion, we have assumed equally weighted sensors in calculating the
directivity patterns, that is
w_n(f) = \frac{1}{N}   (53)
In general, the complex weighting can be expressed in terms of its magnitude and phase components as
w_n(f) = a_n(f) \, e^{j\varphi_n(f)}   (54)
where an (f ) and ϕn (f ) are real, frequency dependent amplitude and phase weights respectively. By
modifying the amplitude weights, an (f ), we can modify the shape of the directivity pattern. Similarly,
by modifying the phase weights, ϕn (f ), we can control the angular location of the response’s main lobe.
Beamforming techniques are algorithms for determining the complex sensor weights wn (f ) in order to
implement a desired shaping and steering of the array directivity pattern.
To illustrate the concept of beam steering, we consider the case where the sensor amplitude weights
an (f ) are set to unity, resulting in the directivity pattern
D(f, \phi) = \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} e^{j\varphi_n(f)} \, e^{j \frac{2\pi}{\lambda} n d \cos\phi}   (55)

If the phase weights are chosen as

\varphi_n(f) = -\frac{2\pi}{\lambda} n d \cos\phi'

for some direction φ′, the directivity pattern becomes

D(f, \phi) = \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} e^{j \frac{2\pi}{\lambda} n d (\cos\phi - \cos\phi')}

[Figure 14: Unsteered and steered directivity patterns (φ′ = 45 degrees, f = 1 kHz, N = 10, d = 0.15 m).]
The effect of such a phase weight on the beam pattern is thus to steer the main lobe of the beam pattern
to the direction cosine αx = α′x, and thus to the directions θ = θ′ and φ = φ′. While the beam pattern remains unchanged apart from the shift along the αx axis, when plotted as a function of angle the beam shape will change, as αx is actually a function of sin θ and cos φ. The horizontal directivity pattern is shown in Figure 14, where the beam pattern has been shifted to φ′ = 45°.
Fourier transform theory tells us that a negative phase shift in the frequency domain corresponds to
a time delay in the time domain [6], and so beam steering can effectively be implemented by applying
time delays to the sensor inputs. Considering only the horizontal plane, we see that the delay for the nth
sensor is given by
\tau_n = \frac{\varphi_n}{2\pi f} = \frac{2\pi f \, n d \cos\phi'}{2\pi f \, c} = \frac{n d \cos\phi'}{c}   (60)
which is seen to be equivalent to the time the plane wave takes to travel between the reference sensor and
the nth sensor. This is the principle of the simplest of all beamforming techniques, known as delay-sum
beamforming, where the time domain sensor inputs are first delayed by τn seconds, and then summed
to give a single array output. While we have seen here that the mathematics of discrete sensor arrays
assures a main lobe of increased gain in the direction of the desired signal, the signal enhancement and
noise reduction provided by the delay-sum beamformer can intuitively be attributed to the constructive
(in phase) interference of the desired propagating wave and the destructive (out of phase) interference
of waves from all other directions. Other more complicated beamforming techniques will be discussed in
detail in the following section.
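As a concrete illustration of the delay-sum principle, the sketch below simulates far-field arrivals with integer-sample delays and then applies the steering delays of Equation 60. A practical implementation would use fractional-delay filters; all signal content and parameters here are synthetic.

```python
import numpy as np

# Sketch: minimal time-domain delay-sum beamformer with integer-sample delays
# (Eq. 60: tau_n = n*d*cos(phi')/c). Illustrative geometry and signals.
c, fs, N, d = 330.0, 16000, 8, 0.05
phi_look = np.deg2rad(45.0)                 # steering direction

rng = np.random.default_rng(0)
t = np.arange(1600) / fs
source = np.sin(2 * np.pi * 500 * t)        # desired signal from phi_look

# Simulate far-field arrival: channel n lags the reference by tau_n seconds.
tau = np.arange(N) * d * np.cos(phi_look) / c
lags = np.round(tau * fs).astype(int)
x = np.stack([np.roll(source, lag) for lag in lags])   # np.roll wraps; fine here
x += 0.3 * rng.standard_normal(x.shape)     # uncorrelated sensor noise

# Delay-sum: advance each channel by its steering delay, then average.
y = np.mean([np.roll(x[n], -lags[n]) for n in range(N)], axis=0)

# Coherent summation recovers the source; noise power drops roughly as 1/N.
print(np.mean((y - source) ** 2), np.mean((x[0] - source) ** 2))
```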
2 Microphone Array Beamforming Techniques
2.1 Introduction
The previous section presented the fundamental theory of sensor arrays, and introduced the concept of
beamforming algorithms. This chapter continues the discussion by presenting the theory of a number of
key microphone array beamforming techniques.
Beamforming techniques can be broadly classified as being either data-independent, or data-dependent.
Data-independent, or fixed, beamformers are so named because their parameters are fixed during op-
eration. Conversely, data-dependent, or adaptive, beamforming techniques continuously update their
parameters based on the received signals.
As different beamforming techniques are appropriate for different noise conditions, the chapter begins
by defining the noise fields encountered in microphone array applications. Following this, the principles of
a number of key beamforming techniques are described in detail. The chapter concludes with a summary
of the beamforming techniques, indicating their advantages, disadvantages and applicability in different
noise conditions.
2.2 Noise Fields

A diffuse noise field is one in which noise of equal energy propagates in all directions simultaneously. Many practical noise environments can be characterised by a diffuse noise field, such as office or
car noise. The coherence between the noise at any two points in a diffuse noise field is a function of the
distance between the sensors, and can be modeled as [10]
\Gamma_{ij}(f) = \text{sinc}\!\left( \frac{2\pi f d_{ij}}{c} \right)   (62)
where dij is the distance between sensors i and j, and the sinc function has been defined in Equation 23.
It can be seen that the coherence approaches unity for closely spaced sensors and decreases sharply with
increasing distance.
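Numerically, the diffuse-field coherence falls off with spacing and frequency as the sketch below shows. Note that with the sin(πx)/(πx) convention of Equation 23 (which numpy's sinc also uses), the widely used closed form sin(2πf d/c)/(2πf d/c) is obtained as sinc(2f d/c); the sketch computes that form directly to avoid convention ambiguity.

```python
import numpy as np

# Sketch: coherence of a diffuse noise field between two sensors (cf. Eq. 62).
c = 330.0
f = np.linspace(1.0, 4000.0, 400)
d_ij = 0.1                                   # sensor spacing (m), illustrative

# np.sinc(x) = sin(pi x)/(pi x), so this equals sin(2*pi*f*d/c)/(2*pi*f*d/c).
gamma = np.sinc(2 * f * d_ij / c)
print(gamma[0], gamma[-1])                   # ~1 at low f, small at high f
```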
2.3 Delay-Sum Beamforming

The simplest beamforming technique applies a phase weight to each channel, as introduced in Section 1.6. If the phase weights are chosen as

\varphi_n(f) = -\frac{2\pi f (n-1) d \cos\phi'}{c}

then the directivity pattern's main lobe will be moved to the direction φ = φ′, as illustrated in Figure 15 for φ′ = 45°. Note that in this chapter we have made a simple modification to the formulae from Chapter 1 in order to change the microphone index range from −(N−1)/2 ≤ n ≤ (N−1)/2 to the more convenient 1 ≤ n ≤ N.
The negative phase shift in the frequency domain can effectively be implemented by applying time
delays to the sensor inputs, where the delay for the nth sensor is given by
\tau_n = \frac{(n-1) d \cos\phi'}{c}   (65)
which is the time the plane wave takes to travel between the reference sensor and the nth sensor.
Delay-sum beamforming is so-named because the time domain sensor inputs are first delayed by
τn seconds, and then summed to give a single array output. Usually, each channel is given an equal
amplitude weighting in the summation, so that the directivity pattern demonstrates unity gain in the
desired direction. This leads to the complex channel weights
w_n(f) = \frac{1}{N} \, e^{-j \frac{2\pi f (n-1) d \cos\phi'}{c}}   (66)
Expressing the array output as the sum of the weighted channels we obtain
y(f) = \frac{1}{N} \sum_{n=1}^{N} x_n(f) \, e^{-j \frac{2\pi f}{c} (n-1) d \cos\phi'}   (67)
Figure 15: Unsteered and steered directivity patterns (φ′ =45 degrees, f =1 kHz, N =10, d=.15 m)
It is often convenient to use matrix algebra to simplify the notation when describing microphone array
techniques. The above equation can be rewritten using matrix notation as
y(f) = \mathbf{w}(f)^T \mathbf{x}(f)   (70)

where the weight vector w(f) and data vector x(f) are defined as
\mathbf{w}(f) = \left[ w_1(f) \; \cdots \; w_n(f) \; \cdots \; w_N(f) \right]^T   (71)

and

\mathbf{x}(f) = \left[ x_1(f) \; \cdots \; x_n(f) \; \cdots \; x_N(f) \right]^T   (72)
where (·)T denotes matrix transpose. A block diagram showing the structure of a general filter-sum
beamformer is given in Figure 16.
[Figure 16: Block diagram of a general filter-sum beamformer.]
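In this matrix notation, a filter-sum beamformer reduces to one inner product per frequency bin. The sketch below applies the delay-sum weights of Equation 66 bin-by-bin to a single STFT frame of random test data; frame length and geometry are illustrative.

```python
import numpy as np

# Sketch: per-frequency filter-sum beamforming, y(f) = w(f)^T x(f) applied
# bin-by-bin to one STFT frame (cf. Eqs. 66, 70-72).
c, fs, N, d = 330.0, 16000, 8, 0.05
phi_look = np.deg2rad(45.0)

frame = np.random.default_rng(1).standard_normal((N, 512))   # one time frame
X = np.fft.rfft(frame, axis=1)                               # (N, bins)
freqs = np.fft.rfftfreq(512, 1.0 / fs)

# Delay-sum weights (Eq. 66): w_n(f) = (1/N) exp(-j 2 pi f (n-1) d cos(phi')/c)
n = np.arange(N)[:, None]
W = np.exp(-1j * 2 * np.pi * freqs[None, :] * (n * d * np.cos(phi_look)) / c) / N

Y = np.sum(W * X, axis=0)            # inner product w^T x for every bin
y = np.fft.irfft(Y, n=512)           # back to the time domain
print(y.shape)
```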
The dependence of the directivity pattern on frequency means that the response characteristics (beam-width, sidelobe level) will only remain constant
for narrow-band signals, where the bandwidth is not a significant proportion of the centre frequency.
Speech, however, is a broad-band signal, meaning that a single linear array design is inadequate if a
frequency invariant beam-pattern is desired.
One simple method of covering broadband signals is to implement the array as a series of sub-arrays,
which are themselves linear arrays with uniform spacing. These sub-arrays are designed to give desired
response characteristics for a given frequency range. Due to the dependencies discussed in Section 1.4.1,
as the frequency increases, a smaller array length is required to maintain constant beam-width. In
addition, to ensure the sidelobe level remains the same across different frequency bands, the number
of elements in each sub-array should remain the same. The sub-arrays are generally implemented in a
nested fashion, such that any given sensor may be used in more than one sub-array. Each sub-array is
restricted to a different frequency range by applying band-pass filters, and the overall broad-band array
output is formed by recombining the outputs of the band-limited sub-arrays. An example of such a
nested sub-array structure for delay-sum beamforming, designed to cover 4 different frequency bands, is
shown in Figure 17. The sub-arrays employ 3, 5, 5 and 5 microphones respectively, but, due to the nested
structure, the 4 sub-arrays can be implemented using a total of 9 microphones.
For a general sub-array broadband beamformer, the beamforming channel filters are band-pass filtered
between the specified upper and lower frequencies for each sub-band. At the output of each channel filter
we have
vs,i (f ) = ws,i (f )xi (f ) (73)
where xi (f ) is the input to channel i of the array, and the subscript s represents the sub-array index.
The output of sub-array s is then given by the sum across channels as

y_s(f) = \sum_{i=1}^{N} v_{s,i}(f)   (74)
where there are N microphones in the array. The summation in each sub-array is shown up to N for
simplicity of notation, although in practice only the channels belonging to each sub-array are used. The
[Figure 17: Nested sub-array structure for delay-sum beamforming, covering 4 frequency bands with a total of 9 microphones.]
overall array output is then calculated as
y(f) = \sum_{s=1}^{S} y_s(f)   (75)
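A minimal sketch of such a nested sub-array beamformer is given below; the microphone assignments and band edges are hypothetical, chosen only to mirror the 3/5/5/5 nesting of Figure 17.

```python
import numpy as np
from scipy.signal import butter, sosfilt

# Sketch of a nested sub-array delay-sum beamformer (Eqs. 73-75): each
# sub-array output is band-limited to its design range and the band outputs
# are summed. All assignments and band edges below are illustrative.
fs = 16000
rng = np.random.default_rng(2)
x = rng.standard_normal((9, fs))               # 9 microphones, 1 s of signal

# Hypothetical nested layout: sub-arrays reuse channels of the 9-element array.
subarrays = [
    dict(mics=[0, 2, 4, 6, 8], band=(300.0, 600.0)),    # widest spacing, low f
    dict(mics=[1, 3, 4, 5, 7], band=(600.0, 1200.0)),
    dict(mics=[2, 3, 4, 5, 6], band=(1200.0, 2400.0)),
    dict(mics=[3, 4, 5], band=(2400.0, 4800.0)),        # narrowest, high f
]

y = np.zeros(x.shape[1])
for sub in subarrays:
    ys = x[sub["mics"]].mean(axis=0)           # delay-sum (broadside steering)
    sos = butter(4, sub["band"], btype="bandpass", fs=fs, output="sos")
    y += sosfilt(sos, ys)                      # band-limit and recombine (Eq. 75)
print(y.shape)
```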
2.4 Superdirective Beamforming

Using the filter-sum weight vector, w(f), and defining the propagation vector as

\mathbf{d}(f) = \left[ 1 \; \cdots \; e^{-j \frac{2\pi f}{c}(n-1) d \cos\phi} \; \cdots \; e^{-j \frac{2\pi f}{c}(N-1) d \cos\phi} \right]^T   (78)

the directivity pattern can be written compactly as

D(f, \phi) = \mathbf{w}(f)^H \mathbf{d}(f)   (79)

where (·)^H denotes matrix transpose conjugate. Expressing the factor of directivity in matrix notation, and noting that w is independent of direction, we obtain

G_a(f, \theta_0, \phi_0) = \frac{\left| \mathbf{w}(f)^H \mathbf{d}(f) \right|^2}{\mathbf{w}(f)^H \left[ \frac{1}{4\pi} \int_0^{2\pi} \int_0^{\pi} \mathbf{d}(f) \, \mathbf{d}(f)^H \sin\theta \, d\theta \, d\phi \right] \mathbf{w}(f)}   (80)
From the discussion in Section 1.4.3 we recall that the factor of directivity is the array gain for a diffuse noise field. The diffuse noise field is characterised by the matrix Γ, which represents the cross-spectral density of the noise between sensors, and which can be written in terms of the far-field propagation vector as

\mathbf{\Gamma}(f) = \frac{1}{4\pi} \int_0^{2\pi} \int_0^{\pi} \mathbf{d}(f) \, \mathbf{d}(f)^H \sin\theta \, d\theta \, d\phi   (81)

so that the array gain in a diffuse noise field becomes

G_a = \frac{\left| \mathbf{w}^H \mathbf{d} \right|^2}{\mathbf{w}^H \mathbf{\Gamma} \mathbf{w}}   (82)

For a general noise field with noise cross-spectral matrix Q, the array gain can be expressed as [13]

G_a = \frac{\left| \mathbf{w}^H \mathbf{d} \right|^2}{\mathbf{w}^H \mathbf{Q} \mathbf{w}}   (83)
Superdirective beamformers aim to calculate the weight vector w that maximises the array gain, that is:

\max_{\mathbf{w}} \frac{\left| \mathbf{w}^H \mathbf{d} \right|^2}{\mathbf{w}^H \mathbf{Q} \mathbf{w}}   (84)
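Under the single constraint w^H d = 1, the regularised solution given later in Equation 95 reduces to a closed form that the sketch below computes for a diffuse noise field; the spacing, frequency and regularisation value are illustrative.

```python
import numpy as np

# Sketch: superdirective weights for a diffuse noise field, using
# w = (Gamma + eps*I)^{-1} d / (d^H (Gamma + eps*I)^{-1} d),
# i.e. Eq. 95 specialised to the single constraint w^H d = 1.
c, f, N, d_sp = 330.0, 500.0, 5, 0.05
pos = np.arange(N) * d_sp
phi_look = 0.0                                 # endfire steering, illustrative

# Far-field propagation vector (Eq. 78)
tau = (pos - pos[0]) * np.cos(phi_look) / c
d_vec = np.exp(-1j * 2 * np.pi * f * tau)

# Diffuse-field noise coherence matrix (entries as in Eq. 62; see the
# convention note there: np.sinc(2 f d / c) = sin(2 pi f d/c)/(2 pi f d/c))
dist = np.abs(pos[:, None] - pos[None, :])
Gamma = np.sinc(2 * f * dist / c)

eps = 1e-2                                     # white-noise-gain regularisation
Gi_d = np.linalg.solve(Gamma + eps * np.eye(N), d_vec)
w = Gi_d / (d_vec.conj() @ Gi_d)               # satisfies w^H d = 1

gain = np.abs(w.conj() @ d_vec) ** 2 / np.real(w.conj() @ Gamma @ w)  # Eq. 83
print(gain)
```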
2.5 Near-field Superdirective Beamforming
Low frequency performance is problematic for conventional beamforming techniques because large wave-
lengths give negligible phase differences between closely spaced sensors, leading to poor directive dis-
crimination. Täger [14] states that delay-weight-sum beamformers can roughly cover the octave band
0.25 < d/λ < 0.5 (where d is the inter-element spacing) before excessive loss of directivity occurs. A
frequency of 100 Hz corresponds to a wavelength of 3.4 m for sound waves, and so to cater for this
frequency range requires that 0.85 m < d < 1.7 m. For a sub-array of 5 elements, this would give an array dimension of 3.4 m < L < 6.8 m, which is impractical for many applications. For example, in the
context of a multimedia workstation, it is desirable that the array dimension does not exceed the monitor
width, which will be approximately 17 inches, or 40 cm. Thus methods providing good low frequency
performance with realistic array dimensions are required.
One such method is a technique proposed by Täger [14, 15], called near-field superdirectivity. As
its name implies, near-field superdirectivity is a modification of the standard superdirective technique
presented in Section 2.4, in which the propagation vector d is replaced by one formulated for a near-field
source.
We recall from Section 1.5 that the near-field directivity pattern can be expressed as
D_{nf}(f, \phi) = \sum_{n=1}^{N} w_n(f) \, \frac{d_1(r,\phi)}{d_n(r,\phi)} \, e^{j \frac{2\pi}{\lambda} \left( d_n(r,\phi) - d_1(r,\phi) \right)}   (90)
If we define the time difference between sensor n and the reference sensor (n = 1) as
\tau_n = \frac{d_n(r,\phi) - d_1(r,\phi)}{c}   (91)
and the amplitude attenuation factor between sensor n and the reference sensor as
\alpha_n = \frac{d_1(r,\phi)}{d_n(r,\phi)}   (92)
then the near-field modified propagation vector can be expressed as
\mathbf{d}_{nf}(f) = \left[ \alpha_1 e^{-j2\pi f \tau_1} \; \cdots \; \alpha_n e^{-j2\pi f \tau_n} \; \cdots \; \alpha_N e^{-j2\pi f \tau_N} \right]^T   (93)
Near-field superdirectivity uses the above near-field propagation vector in the standard superdirective
formulation, while maintaining the assumption of a (far-field) diffuse noise field in the noise cross-spectral
matrix Γ. In this way, as well as providing directional sensitivity, the technique gives a level of discrimi-
nation between the array’s near- and far-fields. Expressed formally we have
\max_{\mathbf{w}} \frac{\left| \mathbf{w}^H \mathbf{d}_{nf} \right|^2}{\mathbf{w}^H \mathbf{\Gamma} \mathbf{w}}   (94)
where Γ was defined in Equation 81 using the far-field propagation vector. Thus, similar to standard
superdirectivity, the solution under a set of linear constraints, \mathbf{C}^H \mathbf{w} = \mathbf{g} (including \mathbf{w}^H \mathbf{d}_{nf} = 1), and a robustness constraint on the white noise gain is given as

\mathbf{w} = \left[ \mathbf{\Gamma} + \epsilon \mathbf{I} \right]^{-1} \mathbf{C} \left( \mathbf{C}^H \left[ \mathbf{\Gamma} + \epsilon \mathbf{I} \right]^{-1} \mathbf{C} \right)^{-1} \mathbf{g}   (95)
Near-field superdirectivity succeeds in achieving greater performance than standard techniques for
near-field sources at low frequencies. This is due to the fact that it takes the amplitude differences into
account as well as the phase differences. While the phase differences are negligible at low frequencies, the
amplitude differences are significant, particularly when the sensors are placed in an endfire configuration
as this maximises the difference in the distance from the source to each microphone. A simple illustration
of the effect of the amplitude compensation is given in Täger [14].
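The following sketch adapts the previous superdirective example to the near-field case by substituting the propagation vector of Equation 93, with an illustrative endfire geometry and source distance.

```python
import numpy as np

# Sketch of near-field superdirectivity: the far-field vector d is replaced by
# the near-field propagation vector d_nf (Eq. 93), while Gamma keeps the
# far-field diffuse model. Geometry (endfire, r = 0.5 m) is illustrative.
c, f, N, d_sp = 330.0, 200.0, 5, 0.05
pos = np.arange(N) * d_sp
r, phi_s = 0.5, 0.0                            # near-field source, endfire

# Source-to-sensor distances (cf. Eq. 46, first sensor as reference)
dn = np.sqrt(r**2 + 2 * r * pos * np.cos(phi_s) + pos**2)
tau = (dn - dn[0]) / c                         # Eq. 91
alpha = dn[0] / dn                             # Eq. 92
d_nf = alpha * np.exp(-1j * 2 * np.pi * f * tau)   # Eq. 93

dist = np.abs(pos[:, None] - pos[None, :])
Gamma = np.sinc(2 * f * dist / c)              # far-field diffuse noise model

eps = 1e-2
Gi_d = np.linalg.solve(Gamma + eps * np.eye(N), d_nf)
w = Gi_d / (d_nf.conj() @ Gi_d)                # Eq. 95 with w^H d_nf = 1
print(np.abs(w.conj() @ d_nf) ** 2 / np.real(w.conj() @ Gamma @ w))
```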
2.6 Generalised Sidelobe Canceler

The Generalised Sidelobe Canceler (GSC) [17] is an adaptive beamforming structure consisting of a fixed upper path and an adaptive lower path, as shown in Figure 18.

[Figure 18: Structure of the Generalised Sidelobe Canceler: the inputs x are time-aligned to give x′, weighted (w) and filtered (h_u) in the upper path, while a blocking matrix B followed by adaptive filters a_1, ..., a_{N−1} forms the lower path output y_a, which is subtracted from the upper path output y_u.]

The upper path is a standard fixed beamformer: the time-aligned channel inputs are weighted and summed to give
y_u'(f) = \mathbf{w}_c(f)^T \mathbf{x}'(f)   (96)
where

\mathbf{w}_c(f) = \left[ w_1(f) \; \cdots \; w_n(f) \; \cdots \; w_N(f) \right]^T   (97)

are the fixed amplitude weights for each of the N channels, and

\mathbf{x}'(f) = \left[ x_1'(f) \; \cdots \; x_n'(f) \; \cdots \; x_N'(f) \right]^T   (98)

are the time-aligned channel inputs. The upper path output is then obtained by applying the fixed filter h_u(f),

y_u(f) = h_u(f) \, y_u'(f)   (99)
The lower path of the structure is the adaptive portion. It consists of two major parts. The first of
these is the blocking matrix, B, whose purpose is to remove the desired signal from the lower path. As
the desired signal is common to all the time-aligned channel inputs, blocking will occur if the rows of the
blocking matrix sum to zero. If x′′ denotes the signals at the output of the blocking matrix, then

\mathbf{x}''(f) = \mathbf{B} \, \mathbf{x}'(f)   (100)

where each row of the blocking matrix sums to zero, and the rows are linearly independent. As x′ can have at most N − 1 linearly independent components, the number of rows in B must be N − 1 or less. The standard Griffiths-Jim blocking matrix is [17]

\mathbf{B} = \begin{bmatrix} 1 & -1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & -1 & 0 & \cdots & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & \cdots & 0 & 1 & -1 & 0 \\ 0 & \cdots & 0 & 0 & 1 & -1 \end{bmatrix}   (101)
Following application of the blocking matrix, x′′ is adaptively filtered and summed to give the lower path output y_a. If we denote the lower path adaptive filters as a, then we have

y_a(f) = \mathbf{a}(f)^T \mathbf{x}''(f)   (102)
Due to the blocking matrix, the lower path output only contains noise signals. The overall system output
is calculated as the difference of the upper and lower path outputs as
y(f ) = yu (f ) − ya (f ) (103)
Because the upper path contains the constrained desired signal estimate, and the lower path only
contains noise and interference terms, finding the set of filter coefficients a which minimise the power in
y is effectively equivalent to finding the linearly constrained minimum variance beamforming solution.
As the signal is constrained in the upper path, the unconstrained LMS algorithm can be used to adapt
the lower path filter coefficients
\mathbf{a}_{k+1}(f) = \mathbf{a}_k(f) + \mu \, y_k(f) \, \mathbf{x}''_k(f)   (104)
where µ is the step size and k is the frame number.
The GSC is a flexible structure due to the separation of the beamformer into a fixed and adaptive
portion, and it is the most widely used adaptive beamformer. In practice, the GSC can cause a degree
of distortion to the desired signal, due to a phenomenon known as signal leakage. Signal leakage occurs
when the blocking matrix fails to remove all of the desired signal from the lower noise canceling path.
This can be particularly problematic for broad-band signals, such as speech, as it is difficult to ensure
perfect signal cancellation across a broad frequency range.
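A minimal single-tap sketch of the GSC is given below: a fixed delay-sum upper path, the Griffiths-Jim blocking matrix of Equation 101, and the LMS update of Equation 104. Real implementations use multi-tap adaptive filters per channel; the signals here are synthetic and assumed already time-aligned.

```python
import numpy as np

# Sketch of the GSC (Eqs. 96-104) with single-tap, frequency-flat filters.
rng = np.random.default_rng(3)
N, T, mu = 4, 20000, 1e-3

s = rng.standard_normal(T)                     # desired signal (already aligned)
noise = rng.standard_normal((N, T))            # channel-specific interference
x_aligned = s[None, :] + noise                 # time-aligned channel inputs x'

# Upper path: fixed delay-sum beamformer
w_c = np.full(N, 1.0 / N)
y_u = w_c @ x_aligned

# Blocking matrix (Eq. 101): rows sum to zero, removing the common signal s
B = np.zeros((N - 1, N))
for i in range(N - 1):
    B[i, i], B[i, i + 1] = 1.0, -1.0
x_pp = B @ x_aligned                           # x'' contains noise only

# Lower path: LMS adaptation of a (Eq. 104), sample-by-sample
a = np.zeros(N - 1)
y = np.zeros(T)
for k in range(T):
    y[k] = y_u[k] - a @ x_pp[:, k]             # overall output (Eq. 103)
    a += mu * y[k] * x_pp[:, k]                # unconstrained LMS step

print(np.mean((y[-2000:] - s[-2000:]) ** 2), np.mean((y_u - s) ** 2))
```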
[Figure 19: Structure of the AMNOR system: beamforming filters h₁ produce the output y(n); filters h₂ are adapted using a fictitious desired signal of variable amplitude A; filters h₃ estimate the resulting signal degradation.]
2.7 AMNOR
While LCMV algorithms are theoretically powerful, they often encounter a number of problems in prac-
tice. Because of the hard constraint of one permissible value for the desired signal’s transfer function,
LCMV techniques can fail to sufficiently reduce the noise level due to the lack of freedom in the choice
of filters. Evaluation of the human auditory system shows that a certain level of distortion in the desired
signal can be tolerated and so in some situations it may be permissible, and even desirable, to allow some
signal distortion in order to achieve better noise reduction.
A technique incorporating such a 'soft' constraint, named the AMNOR (adaptive microphone-array system for noise reduction) system, was proposed by Kaneda [18]. Instead of allowing only one response for
the desired signal, the system adopts a soft constraint that allows a class of responses whose degradation
is less than some pre-determined permissible level.
Figure 19 shows the structure of the AMNOR system. The system is essentially composed of three
filter blocks, h1 , h2 and h3 .
The filter block h1 contains the beamforming filters that are applied to the multi-channel input to
give the system output, y(n). The impulse response of the acoustic path between the source and array
element i is modeled by the filter gi (z). The frequency response of the beamformer to the desired signal
is therefore
F(z) = \sum_{i=1}^{N} h_{1,i}(z) \, g_i(z)   (105)
The second set of filters h2 are adaptively updated to satisfy the criterion of minimum output noise
power for a given level of degradation to the desired signal. The adaptation only occurs during noise-only
periods, during which time a fictitious desired signal is introduced into the system. This fictitious desired
signal is a white noise signal with unity power that is magnified by a variable amplitude factor A. The
fictitious desired signal is filtered by the acoustic path impulse responses gi (z) in order to simulate the
presence of a known desired signal during noise-only periods.
It can be shown [18] that the mean square error in the output is related to the degradation to the
desired signal D1 , the output noise power D2 , and the amplitude of the fictitious desired signal A,
according to
\overline{|e(n)|^2} = A^2 \cdot D_1 + D_2   (106)
In addition, it can be shown that D1 and D2 are monotonically decreasing and increasing functions
of A respectively. This has the powerful implication that the level of signal degradation and output noise
power can be adjusted by varying a single parameter, namely the amplitude of the fictitious desired
signal, A.
The third set of filters h3 are used to estimate the response degradation D1 in order to adapt the
amplitude of the fictitious desired signal to achieve the desired levels of degradation and output noise
power.
Full details of the algorithm are given in Kaneda [18], and further work is presented in Kaneda [19]
and Kataoka et al [20]. The AMNOR technique has the limitations of requiring accurate speech/silence
detection and knowledge of the impulse responses of the acoustic paths between the source and each
microphone. Due to the fixed filters during speech periods, the technique implicitly assumes slowly-
varying noise characteristics. In practice, the acoustic paths are modeled using simple time delays, as for
delay-sum beamforming.
2.8 Post-filtering
In practice, the basic filter-sum beamformer seldom exhibits the level of improvement that the theory
promises and further enhancement is desirable. One method of improving the system performance is to
add a post-filter to the output of the beamformer.
Zelinski [21] proposed a Wiener post-filter formulated using the cross-spectral densities between chan-
nels in a microphone array. Incorporating a post-filter with a beamformer allows the knowledge gained in spatial filtering to also be used for frequency filtering of the signal. By combining spatial and frequency domain enhancement, maximal use is made of the available information about the signal, namely its direction of arrival.
The use of such a post-filter with a filter-sum microphone array was thoroughly investigated by
Marro [22, 23] who demonstrated the mathematical interaction of the post-filter and the beamformer,
and determined an optimal array structure for their combination. A diagram illustrating the system is
presented in Figure 20.
At the output of the channel filters we have the time-aligned channel inputs

v_i(f) = w_i(f) \, x_i(f)   (107)

These signals contain an aligned version of the desired signal plus a noise component,

v_i(f) = s(f) + n_i(f)   (108)
[Figure 20: Filter-sum beamformer with a Wiener post-filter: the channel signals are used to estimate the filter ĥ(f), which is applied to the beamformer output y(f) to give the final output z(f).]

The Wiener post-filter is given by

h(f) = \frac{\Phi_{ss}(f)}{\Phi_{ss}(f) + \Phi_{\bar{n}\bar{n}}(f)}   (109)
where Φss (f ) and Φn̄n̄ (f ) are respectively the auto-spectral density of the desired signal s(f ) and the
noise at the output of the beamformer n̄(f ).
A common problem with Wiener filters is the estimation of the signal and noise auto-spectral densities.
The multi-channel approach provides an interesting solution to this problem. Under the assumptions that
1. The signal arriving at each microphone can be modeled by the sum of the desired signal and noise,
according to Equation 108.
2. The noises ni (n) and desired signal s(n) are uncorrelated.
3. The power spectral density of the noise is the same on each microphone: Φ_{n_i n_i}(f) = Φ_{nn}(f), i = 1, . . . , N.

4. The noises are uncorrelated between different microphones: Φ_{n_i n_j}(f) = 0, i ≠ j.
5. The input signals vi (n) are restored into perfect phase alignment with s(n).
we have

\Phi_{v_i v_i}(f) = \Phi_{ss}(f) + \Phi_{nn}(f)   (110)

and

\Phi_{v_i v_j}(f) = \Phi_{ss}(f), \quad i \neq j   (111)
and by averaging these spectral densities, we can estimate the Wiener filter equation as [22]
\hat{h}(f) = \frac{\sum_{i=1}^{N} |w_i(f)|^2 \; \Re\left\{ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \hat{\Phi}_{v_i v_j}(f) \right\}}{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} w_i(f) \, w_j^*(f) \; \sum_{i=1}^{N} \hat{\Phi}_{v_i v_i}(f)}   (112)
The real operator ℜ{·} is used because Φss (f ) is necessarily real. An incoherent noise field is the ideal
condition for such a post-filter, however a diffuse noise field also provides a reasonable approximation of
the above assumptions for the noise signals on different sensors. For this reason, the post-filter is best
suited to incoherent or diffuse noise. The overall system output is given by

z(f) = \hat{h}(f) \, y(f)   (113)

where y(f) is the beamformer output.
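The post-filter estimate of Equation 112 can be computed directly from the channel auto- and cross-spectra; the sketch below uses single-snapshot estimates on random test data, whereas practical systems average over time.

```python
import numpy as np

# Sketch: estimating the Wiener post-filter of Eq. 112 from the aligned
# channel spectra, under the assumptions listed above. Test data is random.
rng = np.random.default_rng(4)
N, F = 4, 257                                   # channels, frequency bins

S = rng.standard_normal(F) + 1j * rng.standard_normal(F)        # desired signal
V = S[None, :] + 0.8 * (rng.standard_normal((N, F))
                        + 1j * rng.standard_normal((N, F)))     # v_i = s + n_i
w = np.full(N, 1.0 / N)                         # channel weights

# Auto- and cross-spectral density estimates (single snapshot, for brevity)
auto = np.sum(np.abs(V) ** 2, axis=0)           # sum_i Phi_vivi
cross = np.zeros(F)
wsum = 0.0
for i in range(N - 1):
    for j in range(i + 1, N):
        cross += np.real(V[i] * np.conj(V[j]))  # Re{Phi_vivj} estimates Phi_ss
        wsum += np.real(w[i] * np.conj(w[j]))

h = (np.sum(np.abs(w) ** 2) * cross) / (wsum * auto + 1e-12)    # Eq. 112
h = np.clip(h, 0.0, 1.0)                        # keep the gain physically sensible
z = h * (w @ V)                                 # Eq. 113: z(f) = h(f) y(f)
print(h.mean())
```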
In Marro [22], equations are developed for the post-filter transfer function in terms of beamformer
characteristics such as the noise reduction factor, signal to noise ratio and array gain, for the following adverse input conditions:
• the presence of diffuse noise;
• the presence of a coherent noise source;
• a minor fault in the pointing direction of the array; and
• the presence of noise that is correlated with the desired signal.
By investigating the dependence of the post-filter upon these properties, it is shown that a post-filter
enhances the beamformer output in the following ways:
• The post-filter cancels any incoherent noise.
• The post-filter further enhances the beamformer’s rejection of coherent correlated or uncorrelated
noise sources not emanating from the steered direction.
• The post-filter displays robustness to minor errors in the pointing of the array.
In summary, it is found that the effectiveness of such a post-filter follows that of the beamformer:
if the beamformer is effective, the post-filter will further improve the system output. However, in the
case where the beamformer is ineffective, the post-filter, being intrinsically linked to the beamformer
performance, will be similarly ineffective.
Technique                      Advantages                        Disadvantages
Delay-sum                      simplicity                        low frequency performance;
                                                                 narrow-band
Sub-array delay-sum            broad-band                        low frequency performance
Superdirectivity               optimised array gain              assumes diffuse noise
Near-field superdirectivity    optimised array gain;             assumes diffuse noise;
                               near-field sources;               assumes noise in far-field
                               low frequency performance
Generalised Sidelobe           adapts to noise conditions;       low frequency performance;
Canceler                       minimises output noise power      can distort in practice;
                                                                 hard constraint on signal
AMNOR                          adapts to noise conditions;       low frequency performance;
                               minimises output noise power;     complexity;
                               soft constraint on signal;        speech-silence detection;
                               distortion level controlled       some distortion
Post-filtering                 adapts to noise conditions;       can distort signal
                               improves beamformer output
While this table gives a simplistic overview of the different beamforming techniques, it serves to
indicate the characteristics that must be considered when choosing a technique for a given application
and noise conditions. For example, if the noise is approximately diffuse and there are no localised noise
sources, then a superdirective technique is appropriate. If, however, prominent localised noise sources
exist, then an adaptive technique would be advantageous. In applications where it is important to
minimise distortion to the desired signal, fixed techniques are generally better than adaptive techniques.
Also, depending on the location of the desired signal, a technique designed for the near-field may be
required.
References
[1] S. Haykin, Array Signal Processing. Prentice-Hall, 1985.
[2] L. J. Ziomek, Fundamentals of Acoustic Field Theory and Space-Time Signal Processing. CRC Press,
1995.
[3] D. H. Johnson and D. E. Dudgeon, Array Signal Processing: Concepts and Techniques. Prentice
Hall, 1993.
[4] D. C. Moore, “Speech enhancement using microphone arrays,” Master’s thesis, Queensland Univer-
sity of Technology, 2000.
[5] B. D. Steinberg, Principles of Aperture and Array System Design. John Wiley and Sons, 1976.
[6] E. Ifeachor and B. Jervis, Digital Signal Processing: A Practical Approach. Addison-Wesley, 1996.
[7] D. Ward, Theory and Application of Broadband Frequency Invariant Beamforming. PhD thesis,
Australian National University, July 1996.
[8] R. Le Bouquin and G. Faucon, "Using the coherence function for noise reduction," IEE Proceedings, vol. 139, pp. 276-280, June 1992.
[9] D. Templeton and D. Saunders, Acoustic Design. London: Architectural Press, 1987.
[10] J. Bitzer, K. Kammeyer, and K. U. Simmer, “An alternative implementation of the superdirective
beamformer,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics, (New York), pp. 991–994, October 1999.
[11] H. Cox, R. Zeskind, and T. Kooij, “Practical supergain,” IEEE Transactions on Acoustics, Speech
and Signal Processing, vol. ASSP-34, pp. 393–397, June 1986.
[12] R. Taylor and G. Dailey, “The super-directional acoustic sensor,” in Proceedings of OCEANS ’92 -
Mastering the Oceans through Technology, vol. 1, pp. 386–391, 1992.
[13] H. Cox, R. Zeskind, and M. Owen, “Robust adaptive beamforming,” IEEE Transactions on Acous-
tics, Speech and Signal Processing, vol. 35, pp. 1365–1376, October 1987.
[14] W. Täger, “Near field superdirectivity (NFSD),” in Proceedings of ICASSP ’98, pp. 2045–2048, 1998.
[15] W. Tager, Etudes en Traitement d'Antenne pour la Prise de Son. PhD thesis, Universite de Rennes 1, 1998. In French.
[16] O. L. Frost, “An algorithm for linearly constrained adaptive array processing,” Proceedings of the
IEEE, vol. 60, pp. 926–935, August 1972.
[17] L. Griffiths and C. Jim, “An alternative approach to linearly constrained adaptive beamforming,”
IEEE Trans. on Antennas and Propagation, vol. 30(1), pp. 27–34, January 1982.
[18] Y. Kaneda, “Adaptive microphone-array system for noise reduction,” IEEE Transactions on Acous-
tics, Speech and Signal Processing, vol. ASSP-34, pp. 1391–1400, December 1986.
[19] Y. Kaneda, "Directivity characteristics of adaptive microphone-array for noise reduction (AMNOR)," Journal of the Acoustical Society of Japan (E), vol. 12, no. 4, pp. 179-187, 1991.
[20] A. Kataoka and Y. Ichinose, "A microphone-array configuration for AMNOR (adaptive microphone-array system for noise reduction)," Journal of the Acoustical Society of Japan, vol. 11, no. 6, pp. 317-325, 1990.
[21] R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms," in Proceedings of ICASSP-88, vol. 5, pp. 2578-2581, 1988.
[22] C. Marro, Traitements de Dereverberation et de Debruitage pour le Signal de Parole dans des Contextes de Communication Interactive. PhD thesis, Universite de Rennes 1, 1996. In French.
[23] C. Marro, Y. Mahieux, and K. Uwe Simmer, “Analysis of noise reduction and dereverberation
techniques based on microphone arrays with postfiltering,” IEEE Transactions on Speech and Audio
Processing, vol. 6, pp. 240–259, May 1998.