Direction of Arrival Estimation of Reflections From Room Impulse Responses Using a Spherical Microphone Array

Abstract—This paper studies the direction of arrival estimation of reflections in short time windows of room impulse responses measured with a spherical microphone array. Spectral-based methods, such as multiple signal classification (MUSIC) and beamforming, are commonly used in the analysis of spatial room impulse responses. However, the room acoustic reflections are highly correlated or even coherent in a single analysis window, and this imposes limitations on the use of spectral-based methods. Here, we apply maximum likelihood (ML) methods, which are suitable for direction of arrival estimation of coherent reflections. These methods have been developed earlier in the linear space domain, and here we present the ML methods in the context of spherical microphone array processing and room impulse responses. Experiments are conducted with simulated and real data using the em32 Eigenmike. The results show that direction estimation with ML methods is more robust against noise and less biased than MUSIC or beamforming.

Index Terms—Direction of arrival (DOA), room acoustics, spatial room impulse response, spherical microphone arrays.

I. INTRODUCTION

DIRECTION of arrival (DOA) estimation of a sound wave arriving at a microphone array is an essential part of spatial room acoustic analysis and synthesis. The directional information is used together with pressure or energy to describe the sound field [1]–[4] or to reproduce a sound in spatial sound synthesis from a certain direction [5], [6]. Thus, it has a profound impact on how room acoustics are interpreted via the analysis or perceived through the spatial sound synthesis. The increasing number of publications and numerous array designs indicate the importance of spherical microphone array processing [7]–[10]. Therefore, it can be considered one of the most important approaches for spatial sound analysis nowadays. This paper studies the DOA estimation of reflections from a spatial room impulse response captured with a spherical microphone array. In particular, we focus on the case where more than one acoustic reflection is present in an analysis window.

In room impulse responses, the number of reflections arriving at the receiver in a short time window increases with the square of time. This effect is given as the echo density [11, p. 98]. Therefore, after a relatively short time, one ends up in a situation where an analysis time window includes multiple reflections. Exact overlap of the reflections occurs if the path length from the source to the receiver is equal for two or more reflections. This is not an unusual case in room acoustics, but happens already for the first order reflections in a symmetric source-receiver geometry. For example, such an overlap will occur if the microphone and the sound source are located somewhere on the same diagonal or central axis of a rectangular room.

The acoustic reflections are highly correlated or even coherent, especially in a narrow frequency band. The high correlation of the reflections causes problems for spectral-based DOA estimation methods [12], which are commonly applied in spherical microphone array processing [2], [3], [13]–[17]. According to a classification given in [12], the spectral-based methods include the multiple signal classification (MUSIC) method, the estimation of signal parameters via rotational invariance techniques (ESPRIT), and beamforming. These methods require that the source signals are independent, which is not true in the case of highly correlated reflections.

The estimation in the case of correlated signals has been enhanced by smoothing methods in the space domain in the previous decades [18], [19]. These techniques have especially been under research with uniform linear arrays [20]–[24]. Later on, several smoothing methods have also been developed in the context of spherical microphone array processing [2], [3], [16], [17]. These smoothing methods average the array covariance matrix over frequency [2], time [3], or space [16], [17], and require pre-processing before DOA estimation. In frequency smoothing the noise is whitened [2], and in time domain smoothing a stabilizing filter reduces the undesired amplification of noise. Spatial smoothing in the space domain is often implemented by averaging over subarrays formed from the original uniform linear array [12]. In spherical microphone array processing, the division into subarrays is obtained by transforming the spherical microphone array to a uniform linear array [17], [25]. Another approach to obtain spatial smoothing is to form the subarrays via eigenbeam space rotation [15], [16].

The smoothing methods have an apparent disadvantage when applied to room impulse responses. Namely, in each time step and frequency, the room impulse response may have a different

Manuscript received January 30, 2015; revised April 24, 2015; accepted May 26, 2015. Date of publication June 01, 2015; date of current version June 09, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Thushara Abhayapala.
S. Tervo is with the Department of Computer Science, Aalto University, FI-00076 Aalto, Finland (e-mail: sakari.tervo@aalto.fi).
A. Politis is with the Department of Signal Processing and Acoustics, Aalto University, FI-00076 Aalto, Finland (e-mail: archontis.politis@aalto.fi).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASLP.2015.2439573
2329-9290 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
1540 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 10, OCTOBER 2015
response, i.e., the DOA, phase, and amplitude. Averaging over any domain reduces resolution in the respective domain. Averaging over time reduces the temporal resolution of the estimate; averaging over frequency reduces the frequency resolution; and averaging over space reduces the spatial resolution, since the number of microphones per subarray is lower than the number of microphones in the whole array.

Contrary to MUSIC and similar approaches, the maximum likelihood (ML) methods [12], [26] can handle coherent signals without any smoothing. This is possible due to the freedom of selection in the signal and noise model. Consequently, also a coherent signal model can be assumed. This signal model is $K$-dimensional, where $K$ is the number of reflections, since each reflection is modeled separately. If a spatial spectrum is evaluated with a grid of size $G$, where the appropriate sampling and size is defined by the user, the ML DOA estimation function has a dimensionality of $G^K$, whereas the spectral-based methods have a dimensionality of $G$. Therefore, as the number of reflections increases, the search space quickly becomes very large. For restricted computational time, the high dimensionality leads to the use of non-linear optimization algorithms for the ML methods [26]. As a general limitation of the ML methods, the number of reflections that can be estimated must be smaller than the number of microphones $Q$, i.e., $K < Q$ [12].

Analysis of room reflections is often based on the wideband assumption [2], [3], [6]. This assumption states that all frequencies are delayed by the same amount of time. In theory, the wideband assumption holds if the surfaces are large and rigid. The wideband analysis of acoustic reflections can lead to a desired accuracy in the analysis [2], [3] or in the reproduction of the acoustics [6]. However, in the real world, the room impulse responses are always frequency and time dependent. On this basis, studies on acoustics benefit from the frequency band analysis of reflections, since it enables a more accurate description of the room acoustic properties.

In this paper, we study the performance of ML methods and a large sample approximation called weighted subspace fitting (WSF) in the analysis of spatial room impulse responses. The large sample approximation assumes that the number of available measurements is large. The cases where one or more reflections are present in an analysis window in a wide or narrow frequency band are of interest due to the above mentioned features of the room impulse responses. The contribution of this paper is the application of the ML methods to the analysis of room reflections with the spherical microphone array. Throughout the experimental section, the results of the ML methods are compared to MUSIC and beamforming.

II. MODELS

A. Room Impulse Response

A room impulse response is defined as the acoustic response, measured from a source to a microphone in an enclosed space. After the initial excitation, the sound wave propagates through the space and arrives at the receiver via multiple paths. On these paths, the sound wave is altered by several acoustic phenomena on the boundaries, such as reflection, absorption, and diffraction [11, ch. 2]. These acoustic phenomena affect the amplitude and the phase of the traveling sound wave, possibly differently, in each frequency and at each incident angle.

Besides the boundary effects, the sound wave is affected by the propagation distance and attenuation by the medium. Namely, the air absorption depends on the composition of the room air, the distance that the wave has traveled, and frequency [11, p. 147]. Thus, the air absorption alone suggests that the room impulse response is different in each time and frequency, and therefore gives a basis for the frequency dependent analysis. Moreover, if the sound source is assumed to be a point source, the amplitude is attenuated due to the spherical spreading of the sound wave.

The number of reflections per time interval is described by a quantity called the echo density, which is given asymptotically by [11, p. 98]

$$\frac{\mathrm{d}N_\mathrm{r}}{\mathrm{d}t} = 4\pi \frac{c^3 t^2}{V} \quad (1)$$

where $V$ is the volume of the enclosure and $c$ is the speed of sound. According to Kuttruff [11, p. 98], this is valid for any geometry with a homogeneous medium. A shorter time interval $\Delta t$ reduces the number of reflections present in an analysis window. When we are inspecting the impulse response at a single frequency, we limit the time window length to $\Delta t = 2\pi/\omega$, so that at least one period of the wave is observed in the analysis window. Then, the time instant up to which we have $K$ or fewer reflections present in the analysis window at a single frequency is given as:

$$t_{\leq K} = \sqrt{\frac{K V \omega}{8\pi^2 c^3}} \quad (2)$$

where $\omega$ is the angular frequency.

B. Space Domain

We present the location of microphone $q$ in the 3-D Cartesian coordinate system as

$$\mathbf{x}_q = r\,[\sin\theta_q\cos\phi_q,\ \sin\theta_q\sin\phi_q,\ \cos\theta_q]^T$$

where $r$, $\theta$, and $\phi$ are the standard spherical coordinates, with radius $r \geq 0$, inclination $\theta \in [0, \pi]$, and azimuth $\phi \in [0, 2\pi)$. Each acoustic event arriving at the microphone has traveled a path length from the source to the microphone and has a DOA $\boldsymbol{\Omega}_i$ w.r.t. the array origin, which is described in Cartesian coordinates as the unit vector:

$$\boldsymbol{\Omega}_i = [\sin\theta_i\cos\phi_i,\ \sin\theta_i\sin\phi_i,\ \cos\theta_i]^T$$

An acoustic event is considered to be a sound wave which is altered by the acoustic phenomena listed above.

A measured wideband impulse response pressure signal in microphone $q$ for a source signal $S(k)$ is presented in the frequency domain as the sum of all acoustic events

$$P_q(k) = \sum_i H_{q,i}(k)\,S(k) + N_q(k) \quad (3)$$

where $H_{q,i}(k)$ is the frequency response of an acoustic event, and $N_q(k)$ is a noise component, assumed independent and identically distributed for each microphone. Furthermore, $k = \omega/c$ is the wavenumber, where $c$ is the speed of sound and $\omega$ is the angular frequency. In addition, $S(k)$ is the source signal that describes
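The echo-density relations above can be made concrete with a short numeric sketch. The room volume, analysis frequency, and reflection count below are hypothetical values chosen purely for illustration, and $c = 343$ m/s is assumed:

```python
import math

def echo_density(t, V, c=343.0):
    """Asymptotic number of reflections per second at time t,
    dN/dt = 4*pi*c^3*t^2 / V (Kuttruff's echo density)."""
    return 4.0 * math.pi * c**3 * t**2 / V

def t_at_most_K(K, omega, V, c=343.0):
    """Latest time up to which a one-period analysis window
    (dt = 2*pi/omega) contains at most K reflections; follows from
    setting echo_density(t) * dt = K."""
    return math.sqrt(K * V * omega / (8.0 * math.pi**2 * c**3))

V = 8000.0                      # hypothetical 8000 m^3 hall
omega = 2.0 * math.pi * 1000.0  # analysis frequency of 1 kHz
print(round(echo_density(0.1, V)))         # reflections per second at t = 100 ms
print(round(t_at_most_K(4, omega, V), 3))  # seconds until > 4 overlapping reflections
```

The sketch shows how quickly the window fills up: even in a large hall, an analysis window one period long at 1 kHz contains more than a handful of reflections after only a fraction of a second.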
TERVO AND POLITIS: DOA ESTIMATION OF REFLECTIONS FROM ROOM IMPULSE RESPONSES 1541
the frequency response of the source, originally emitted to some direction, and arriving at the microphone from the direction $\boldsymbol{\Omega}_i$. A typical source in room impulse response measurements is a loudspeaker. Note that this impulse response model is general and describes any room impulse response.

The frequency response of the acoustic event is dependent on the time delay $\tau_i$, which describes the time it has taken for the acoustic wave to travel to the microphone, and on the complex amplitude of each acoustic event $\gamma_{q,i}(k)$, i.e.,

$$H_{q,i}(k) = \gamma_{q,i}(k)\,e^{-j\omega\tau_i} \quad (4)$$

where $\omega = kc$.

We further assume that the sources and reflections are in the far field with respect to the array, so that a plane wave model can be applied. Then, according to the plane wave model, the complex amplitude is represented by

$$\gamma_{q,i}(k) = \bar{\gamma}_i(k)\,G_q(k, \boldsymbol{\Omega}_i) \quad (5)$$

where $\bar{\gamma}_i(k)$ is the complex amplitude of the plane wave, which includes all the acoustic phenomena that the wave has encountered before arriving at the microphone, and $G_q(k, \boldsymbol{\Omega}_i)$ is the response of microphone $q$ in the direction $\boldsymbol{\Omega}_i$, assumed to be known a priori. Furthermore, since a homogeneous medium and short time window analysis are assumed, the path length is neglected in the directional analysis. Consequently, the plane wave time delay can be expressed w.r.t. the origin, i.e., the center of the array, as

$$\tau_{q,i} = -\frac{\mathbf{x}_q^T \boldsymbol{\Omega}_i}{c} \quad (6)$$

We collect the per-microphone responses into the vector

$$\mathbf{a}(k, \boldsymbol{\Omega}_i) = \left[ G_1(k,\boldsymbol{\Omega}_i)\,e^{jk\mathbf{x}_1^T\boldsymbol{\Omega}_i},\ \ldots,\ G_Q(k,\boldsymbol{\Omega}_i)\,e^{jk\mathbf{x}_Q^T\boldsymbol{\Omega}_i} \right]^T \quad (7)$$

and call this the steering vector. For example, for an ideal open microphone array $G_q(k, \boldsymbol{\Omega}) = 1$. The plane wave amplitude and the source response are presented as a product

$$s_i(k) = \bar{\gamma}_i(k)\,S(k) \quad (8)$$

which we call the reflection signal. For the compactness of the rest of the paper, the measurements are described in a matrix form. The impulse responses in the $Q$ microphones, i.e., the array input, are described by the vector [26]

$$\mathbf{p}(k) = [P_1(k), \ldots, P_Q(k)]^T = \mathbf{A}(k, \Theta)\,\mathbf{s}(k) + \mathbf{n}(k) \quad (9)$$

where $\mathbf{A}(k, \Theta) = [\mathbf{a}(k,\boldsymbol{\Omega}_1), \ldots, \mathbf{a}(k,\boldsymbol{\Omega}_K)]$, $\mathbf{s}(k) = [s_1(k), \ldots, s_K(k)]^T$, $\mathbf{n}(k)$ is the noise vector, and $\Theta = \{\boldsymbol{\Omega}_1, \ldots, \boldsymbol{\Omega}_K\}$ is the set of all unknown directions of arrival.

The noise and reflection signals are assumed to be zero mean complex Gaussian processes, i.e., [26]

$$E\{\mathbf{s}(k)\,\mathbf{s}^H(k)\} = \mathbf{P} \quad \text{and} \quad E\{\mathbf{n}(k)\,\mathbf{n}^H(k)\} = \sigma^2\mathbf{I}$$

where $E\{\cdot\}$ denotes expectation and $(\cdot)^H$ denotes the Hermitian transpose of a matrix. This leads to the array covariance matrix [26]:

$$\mathbf{R} = E\{\mathbf{p}(k)\,\mathbf{p}^H(k)\} = \mathbf{A}\mathbf{P}\mathbf{A}^H + \sigma^2\mathbf{I} \quad (10)$$

The assumption of a Gaussian reflection signal is not necessarily true in the case of room impulse responses. That is, the reflection at a single frequency in a small time window may be of a deterministic nature rather than random. A model for the deterministic array covariance matrix is given in Section III-E.

C. Spherical Harmonics Domain

In order to apply spherical microphone array processing, the pressure and the array covariance matrix are described in the spherical harmonic (SH) domain. The formulation in this section follows the one given in [3].

The SH domain representation of the pressure for an array with radius $r$ and order $N$ is given by the approximation of the spherical Fourier transform and its inverse [3]:

$$p_{nm}(k) = \sum_{q=1}^{Q} w_q\,P(k, r, \boldsymbol{\Omega}_q)\,[Y_n^m(\boldsymbol{\Omega}_q)]^* \quad \text{and} \quad P(k, r, \boldsymbol{\Omega}) \approx \sum_{n=0}^{N}\sum_{m=-n}^{n} p_{nm}(k)\,Y_n^m(\boldsymbol{\Omega}) \quad (11)$$

where the spherical harmonics $Y_n^m$ are defined through the associated Legendre polynomials. Moreover, $(\cdot)^*$ denotes the complex conjugate, $p_{nm}(k)$ is the SH domain coefficient, $w_q$ are the sampling weights to correct the orthonormality errors, $\boldsymbol{\Omega}_q$ are the microphone coordinates, and $\boldsymbol{\Omega}$ is a steering direction where the inverse spherical Fourier transform is evaluated. The sampling weights are defined by the sampling scheme [9], and the order is defined by the number of microphones (see [9] for details). Throughout this paper we assume uniform sampling, and hence the weights reduce to a constant for all microphones. In addition, we use $(N+1)^2 = 25$ harmonic coefficients, where the harmonic order of the array is $N = 4$, and the radius is $r = 4.2$ cm, due to the applied microphone array. Spatial aliasing for spherical arrays typically occurs when $kr > N$ [9]. The noiseless pressure in the SH domain can be expressed in a matrix form by [3]

$$\mathbf{p}_{nm}(k) = \mathbf{Y}^H(\boldsymbol{\Omega}_{1:Q})\,\mathbf{W}\,\mathbf{p}(k) \quad (14)$$

where $\boldsymbol{\Omega}_{1:Q}$ are the sensor positions in angular coordinates, $\mathbf{W} = \mathrm{diag}(w_1, \ldots, w_Q)$ are the sampling
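To see why coherent reflections are problematic for subspace methods, consider the covariance model $\mathbf{R} = \mathbf{A}\mathbf{P}\mathbf{A}^H + \sigma^2\mathbf{I}$ discussed above: when two reflections are fully coherent, the signal covariance $\mathbf{P}$ has rank one and only a single eigenvalue of $\mathbf{R}$ rises above the noise floor. A minimal numpy sketch with a hypothetical free-field array (not the rigid-sphere Eigenmike processing of this paper; positions and frequency are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 8                                  # hypothetical microphone count
k = 2.0 * np.pi * 4000.0 / 343.0       # wavenumber at 4 kHz

# random microphone positions on a 5 cm sphere (ideal open array, G_q = 1)
x = rng.standard_normal((Q, 3))
x = 0.05 * x / np.linalg.norm(x, axis=1, keepdims=True)

def steering(doa):
    """Plane-wave steering vector with elements exp(j*k*x_q.doa)."""
    return np.exp(1j * k * x @ np.asarray(doa))

A = np.column_stack([steering([1.0, 0.0, 0.0]),   # reflection 1 DOA
                     steering([0.0, 0.0, 1.0])])  # reflection 2 DOA

P_coherent = np.ones((2, 2), dtype=complex)  # fully coherent pair: rank(P) = 1
R = A @ P_coherent @ A.conj().T + 1e-3 * np.eye(Q)

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
n_signal = int(np.sum(eigvals > 10.0 * eigvals[-1]))
print(n_signal)   # one signal eigenvalue, although two reflections are present
```

A method that counts or isolates signal eigenvectors therefore sees only one arrival in this window, which is exactly the failure mode that smoothing, or the ML methods studied here, must address.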
weights, and $\mathbf{p}(k)$ is the array input of Eq. (9). The left hand side of Eq. (14) are the SH coefficients in a vectorized form, i.e.:

$$\mathbf{p}_{nm} = [p_{00},\ p_{1(-1)},\ p_{10},\ p_{11},\ \ldots,\ p_{NN}]^T \quad (15)$$

where $k$ and $r$ are not used for brevity, and the spherical harmonics are expressed in a matrix form

$$\mathbf{Y}(\Theta) = \begin{bmatrix} Y_0^0(\boldsymbol{\Omega}_1) & Y_1^{-1}(\boldsymbol{\Omega}_1) & \cdots & Y_N^N(\boldsymbol{\Omega}_1) \\ \vdots & \vdots & \ddots & \vdots \\ Y_0^0(\boldsymbol{\Omega}_K) & Y_1^{-1}(\boldsymbol{\Omega}_K) & \cdots & Y_N^N(\boldsymbol{\Omega}_K) \end{bmatrix} \quad (16)$$

The array dependent coefficients are represented by a diagonal matrix:

$$\mathbf{B}(kr) = \mathrm{diag}\left(b_0(kr),\ b_1(kr),\ b_1(kr),\ b_1(kr),\ \ldots,\ b_N(kr)\right) \quad (17)$$

where the individual array dependent coefficients in the case of a rigid sphere are given as [3]:

$$b_n(kr) = 4\pi i^n \left( j_n(kr) - \frac{j_n'(kr)}{h_n'(kr)}\,h_n(kr) \right) \quad (18)$$

where $j_n$ and $h_n$ are the spherical Bessel and Hankel functions, respectively, and $(\cdot)'$ denotes the derivative with respect to the argument.

In practice, the SHT and the equalization are implemented with an encoding matrix $\mathbf{E}(k)$, and $\mathbf{a}(k, \boldsymbol{\Omega})$ denotes the array response vector to direction $\boldsymbol{\Omega}$. In the ideal case, Eq. (19) holds exactly. In practice, both the encoding filters and the interpolated array response for an arbitrary $\boldsymbol{\Omega}$ can be computed from measurements of the response at directions $\boldsymbol{\Omega}_g$, uniformly distributed or, e.g., arranged in a regular grid. The encoding filters can then be computed by a weighted regularized least-squares solution from the measurements as in [27], [29]

(21)

where $\mathbf{H}(k)$ are the measured responses of the array to the directions $\boldsymbol{\Omega}_g$ at frequency $k$, $w_g$ are appropriate weights for the measurement grid, and $\beta$ is a regularization parameter set according to the indications in [27], [28]. In this work the weights $w_g$ were computed from the areas of the spherical Voronoi cells of the measurement grid. To obtain an array steering vector at any direction with high accuracy, a spherical interpolation was performed by expanding the steering vector in terms of its measured SH coefficients.

III. METHODS

Due to the nature of room impulse responses and the applications in room acoustics, we are interested in two cases, the wideband analysis of reflections and the analysis of reflections at a single frequency. For the wideband analysis we apply the time domain smoothing algorithm presented in [3], which is also reviewed briefly in this section. In the narrow band analysis the smoothing algorithms will not provide any benefits, since the reflections are coherent; therefore, the analysis is implemented for the frequency domain SH coefficients.

In this paper, we are also interested in the performance in single frequency bins. In the frequency domain estimation, we apply spatial whitening to the covariance matrix estimate and the steering vectors as in [2]. The following formulations in this section are presented for the time domain smoothing, but are equal to the frequency domain smoothing versions if the covariance matrix estimate and the steering vectors are replaced with their respective whitened versions.

A. Time Domain Smoothing

We follow the formulations in [3] in the processing of the SH domain coefficients. The SH domain coefficients are normalized by multiplying Eq. (14) by $\mathbf{B}^{-1}(kr)$ from the left side, which leads to

$$\tilde{\mathbf{p}}_{nm}(k) = \mathbf{Y}^H(\Theta)\,\mathbf{s}(k) + \tilde{\mathbf{n}}_{nm}(k)$$

where $\tilde{(\cdot)}$ denotes the equalized version of a variable. The array covariance matrix in this case is given

$$\tilde{\mathbf{R}} = \mathbf{Y}^H(\Theta)\,\mathbf{P}\,\mathbf{Y}(\Theta) + \tilde{\mathbf{R}}_n \quad (25)$$

where $\tilde{\mathbf{R}}_n$ is the noise covariance matrix. In the estimation, we assume that the time domain version of the noise matrix is independent of time and spatially white, i.e., $\tilde{\mathbf{R}}_n = \tilde{\sigma}^2\mathbf{I}$, where $\tilde{\sigma}^2$ is the equalized variance.

The array covariance matrix estimate for $T$ time instants is given as

$$\hat{\mathbf{R}} = \frac{1}{T}\sum_{t=1}^{T} \tilde{\mathbf{y}}(t)\,\tilde{\mathbf{y}}^H(t) \quad (31)$$

where $\tilde{\mathbf{y}}(t)$ are the equalized SH coefficients in the time domain.

B. The Steering Matrix

The steering vector or matrix, used in all of the methods, is the Hermitian transpose of Eq. (16). For example, in the ML methods and WSF, the matrix has the following form in the case of two reflections:

$$\mathbf{A}(\Theta) = \mathbf{Y}^H(\Theta) = [\mathbf{a}(\boldsymbol{\Omega}_1),\ \mathbf{a}(\boldsymbol{\Omega}_2)] \quad (24)$$

since there are two possible reflections $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$, and the number of harmonic components is $(N+1)^2 = 25$. In contrast, in MUSIC and PWD, the steering is always 1-D and has the form

$$\mathbf{a}(\boldsymbol{\Omega}) = \mathbf{y}^*(\boldsymbol{\Omega}) \quad (33)$$

C. Plane Wave Decomposition

In the above, $\tilde{\mathbf{y}}$ is the noisy version of Eq. (14), and the weighting is given by

$$\mathbf{w}(\boldsymbol{\Omega}) = \frac{\mathbf{a}(\boldsymbol{\Omega})}{\mathbf{a}^H(\boldsymbol{\Omega})\,\mathbf{a}(\boldsymbol{\Omega})}$$

where the first term ensures a unity beamformer in the look direction. This form of PWD does not whiten the noise, and therefore will have a poor performance in the wideband case. Whitening of the noise can be implemented following [2].
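The covariance estimate used by all of the methods is a plain average of snapshot outer products over the time instants. A sketch of this estimator, assuming the equalized SH coefficients are already arranged as rows of a complex array (the shapes and the rank-1 test signal below are illustrative):

```python
import numpy as np

def sample_covariance(Y):
    """R_hat = (1/T) * sum_t y_t y_t^H from T snapshots.
    Y: complex array of shape (T, M), one coefficient vector per row."""
    T = Y.shape[0]
    return Y.T @ Y.conj() / T

rng = np.random.default_rng(1)
T, M = 64, 25     # e.g. 64 time instants, (N+1)^2 = 25 SH coefficients
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)             # fixed spatial signature
s = rng.standard_normal((T, 1)) + 1j * rng.standard_normal((T, 1))   # per-snapshot amplitude
noise = 0.01 * (rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M)))
Y = s * a + noise

R_hat = sample_covariance(Y)
print(R_hat.shape, bool(np.allclose(R_hat, R_hat.conj().T)))  # (25, 25) True
```

The estimate is Hermitian by construction, and with a single dominant spatial signature its largest eigenvalue carries essentially all of the signal energy, which is what the subspace and ML methods below operate on.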
With the time domain smoothing, PWD is implemented here as the output energy of the time-domain beamformer at $\boldsymbol{\Omega}$:

$$P_\mathrm{PWD}(\boldsymbol{\Omega}) = \mathbf{w}^H(\boldsymbol{\Omega})\,\hat{\mathbf{R}}\,\mathbf{w}(\boldsymbol{\Omega}) \quad (34)$$

D. MUSIC

One of the most popular methods for direction estimation with spherical microphone arrays is MUSIC [2], [3]. The spatial spectrum of MUSIC is calculated as

$$P_\mathrm{MUSIC}(\boldsymbol{\Omega}) = \frac{1}{\mathbf{a}^H(\boldsymbol{\Omega})\,\mathbf{U}_n\mathbf{U}_n^H\,\mathbf{a}(\boldsymbol{\Omega})} \quad (35)$$

where $\mathbf{U}_n$ is the array noise matrix from the eigenvalue decomposition of the array covariance matrix estimate $\hat{\mathbf{R}}$. This decomposition follows the form

$$\hat{\mathbf{R}} = \mathbf{U}_s\boldsymbol{\Lambda}_s\mathbf{U}_s^H + \mathbf{U}_n\boldsymbol{\Lambda}_n\mathbf{U}_n^H \quad (36)$$

where the subscripts $s$ and $n$ denote the signal and noise subspaces, respectively, $\boldsymbol{\Lambda}$ is the eigenvalue matrix, and $\mathbf{U}$ includes the right eigenvectors.

MUSIC requires a full rank reflection signal covariance matrix. Therefore, when MUSIC is applied to localize $K$ reflections, the estimated reflection signal covariance matrix should have $K$ eigenvalues deviating from the noise. This assumption is violated if we have too few snapshots of the array covariance matrix or coherent reflections. The snapshots here refer to time domain or frequency domain SH coefficients.

E. Maximum Likelihood Methods

Maximum likelihood methods are generally applied in several estimation tasks. For an overview of maximum likelihood estimation, the reader is referred to [33]. A requirement for an ML method is a signal and noise model. These models are dependent on the parameters that are estimated by the ML method. The problem in ML estimation is then to find the parameters of the model that most likely explain the observed data. In this paper, we apply two ML methods developed earlier in the space domain to the spherical microphone array processing.

1) Stochastic Maximum Likelihood: The first ML method is called stochastic (SML), due to the assumption that the reflection signals are stochastic processes. The array covariance matrix in the case of time domain smoothing takes the form

$$\mathbf{R}(\Theta) = \mathbf{A}(\Theta)\,\mathbf{P}\,\mathbf{A}^H(\Theta) + \sigma^2\mathbf{I}$$

The probability density function of SML for time instant $t$ is given as

$$f(\tilde{\mathbf{y}}(t)) = \frac{1}{\pi^M\,|\mathbf{R}(\Theta)|}\exp\left(-\tilde{\mathbf{y}}^H(t)\,\mathbf{R}^{-1}(\Theta)\,\tilde{\mathbf{y}}(t)\right) \quad (37)$$

where $\tilde{\mathbf{y}}(t)$ are the equalized time domain SH coefficients, $\mathbf{R}(\Theta)$ is the modeled array covariance matrix, $M = (N+1)^2$, and $|\cdot|$ denotes the determinant of a matrix.

As usually in ML methods, the solution is found from the negative log-likelihood, which is given for SML as:

$$l_\mathrm{SML}(\Theta, \mathbf{P}, \sigma^2) = \log|\mathbf{R}(\Theta)| + \mathrm{tr}\{\mathbf{R}^{-1}(\Theta)\,\hat{\mathbf{R}}\} \quad (38)$$

For a fixed $\Theta$, the noise variance and the signal covariance minimizing Eq. (38) are

$$\hat{\sigma}^2(\Theta) = \frac{1}{M-K}\,\mathrm{tr}\{\boldsymbol{\Pi}_{\mathbf{A}}^\perp\,\hat{\mathbf{R}}\} \quad (39)$$

and

$$\hat{\mathbf{P}}(\Theta) = \mathbf{A}^\dagger\left(\hat{\mathbf{R}} - \hat{\sigma}^2(\Theta)\,\mathbf{I}\right)(\mathbf{A}^\dagger)^H \quad (40)$$

respectively. In the above,

$$\mathbf{A}^\dagger = (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H \quad (41)$$

and

$$\boldsymbol{\Pi}_{\mathbf{A}}^\perp = \mathbf{I} - \mathbf{A}\mathbf{A}^\dagger \quad (42)$$

are the pseudo-inverse of $\mathbf{A}$ and the orthogonal projector onto the null space of $\mathbf{A}^H$, respectively. The localization function for SML is given as [26]:

$$\hat{\Theta}_\mathrm{SML} = \arg\min_\Theta\,\log\left|\mathbf{A}(\Theta)\,\hat{\mathbf{P}}(\Theta)\,\mathbf{A}^H(\Theta) + \hat{\sigma}^2(\Theta)\,\mathbf{I}\right| \quad (43)$$

2) Deterministic Maximum Likelihood: The deterministic model makes no assumptions on the signal waveforms. That is, the signal waveforms are treated as unknown deterministic parameters. When the average signal waveform is deduced from the signals, the array covariance matrix is only dependent on the noise term

$$\mathbf{R} = \sigma^2\,\mathbf{I}$$

The likelihood function for DML when several snapshots are available is given as [26]:

$$f = \prod_{t=1}^{T}\frac{1}{(\pi\sigma^2)^M}\exp\left(-\frac{\|\tilde{\mathbf{y}}(t) - \mathbf{A}(\Theta)\,\mathbf{s}(t)\|^2}{\sigma^2}\right) \quad (44)$$

where $\mathbf{s}(t)$ are the reflection signals. From the negative log-likelihood, setting $\Theta$ and $\mathbf{s}(t)$ constant, the variance can be estimated as [26]:

$$\hat{\sigma}^2 = \frac{1}{M}\,\mathrm{tr}\{\boldsymbol{\Pi}_{\mathbf{A}}^\perp\,\hat{\mathbf{R}}\} \quad (45)$$

Using this in the negative log-likelihood leads to a non-linear least-squares problem, from which the localization function and the reflection signal estimates can be written as [26]:

$$\hat{\Theta}_\mathrm{DML} = \arg\min_\Theta\,\mathrm{tr}\{\boldsymbol{\Pi}_{\mathbf{A}}^\perp(\Theta)\,\hat{\mathbf{R}}\} \quad (46)$$

and

$$\hat{\mathbf{s}}(t) = \mathbf{A}^\dagger\,\tilde{\mathbf{y}}(t) \quad (47)$$

respectively.
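The DML localization function reduces to a projector fit: evaluate the trace of the residual projection over candidate direction pairs and keep the minimizer. The sketch below uses a hypothetical free-field plane-wave array rather than the SH-domain steering of the paper, only to show that the criterion prefers the true pair even for fully coherent reflections; all positions and directions are illustrative:

```python
import numpy as np

def proj_perp(A):
    """Orthogonal projector onto the null space of A^H: I - A A^+."""
    return np.eye(A.shape[0]) - A @ np.linalg.pinv(A)

def dml_cost(A, R_hat):
    """Deterministic ML localization function tr{Pi_A_perp R_hat}."""
    return float(np.real(np.trace(proj_perp(A) @ R_hat)))

rng = np.random.default_rng(2)
Q = 8
x = rng.standard_normal((Q, 3))
x = 0.05 * x / np.linalg.norm(x, axis=1, keepdims=True)   # 5 cm open array
k = 2.0 * np.pi * 4000.0 / 343.0
steer = lambda d: np.exp(1j * k * x @ np.asarray(d))

A_true = np.column_stack([steer([1.0, 0.0, 0.0]), steer([0.0, 0.0, 1.0])])
# fully coherent reflection pair plus a little sensor noise
R_hat = A_true @ np.ones((2, 2)) @ A_true.conj().T + 1e-4 * np.eye(Q)

A_wrong = np.column_stack([steer([0.0, 1.0, 0.0]), steer([0.0, -1.0, 0.0])])
print(dml_cost(A_true, R_hat) < dml_cost(A_wrong, R_hat))  # True: coherence is no obstacle
```

Because the steering matrix of the true pair spans the coherent signal component exactly, the residual at the true directions contains only the noise floor; no full-rank signal covariance is required, which is the key advantage over MUSIC in this setting.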
In the incoherent case, i.e., for uncorrelated reflection signals and a large sample size, MUSIC is asymptotically equivalent to DML [26].

3) Weighted Subspace Fitting: Subspace fitting methods are suboptimal approximations of the above maximum likelihood methods. They are of interest since they have a lower computational complexity than the ML methods. In addition, for large sample sizes, and if some conditions are fulfilled, they are asymptotically equivalent to the ML methods [26].

The WSF method is a large sample approximation of the SML method, and its localization function is given as [26]:

$$\hat{\Theta}_\mathrm{WSF} = \arg\min_\Theta\,\mathrm{tr}\{\boldsymbol{\Pi}_{\mathbf{A}}^\perp(\Theta)\,\hat{\mathbf{U}}_s\,\mathbf{W}\,\hat{\mathbf{U}}_s^H\} \quad (48)$$

where $\mathbf{W} = (\hat{\boldsymbol{\Lambda}}_s - \hat{\sigma}^2\mathbf{I})^2\,\hat{\boldsymbol{\Lambda}}_s^{-1}$, $\hat{\boldsymbol{\Lambda}}_s$ is the signal eigenvalue matrix, as previously, and $\hat{\sigma}^2$ is the average of the $M - K$ smallest eigenvalues. This weighting gives the lowest asymptotic error variance, as shown in [26].

F. Cramér-Rao Lower Bound on the Estimation Accuracy

Estimation theory studies the performance of methods by comparing their covariance of the estimation error against a theoretical lower bound. A commonly used bound on the estimation error covariance matrix is the Cramér-Rao lower bound (CRLB) [26], [33].

For the cases studied in this article, the error covariance of an unbiased estimate $\hat{\Theta}$, i.e., $E\{\hat{\Theta}\} = \Theta$, is bounded by

$$E\{(\hat{\Theta} - \Theta)(\hat{\Theta} - \Theta)^T\} \geq \mathbf{C}_\mathrm{CRLB} \quad (49)$$

In the following, $k$ and $\Theta$ are omitted from the notation for compactness. Substituting the SML probability density function of Eq. (37) into the above gives the stochastic CRLB [26]:

$$\mathbf{C}_\mathrm{sto} = \frac{\sigma^2}{2T}\left\{\mathrm{Re}\left[(\mathbf{D}^H\boldsymbol{\Pi}_{\mathbf{A}}^\perp\mathbf{D}) \odot (\mathbf{P}\mathbf{A}^H\mathbf{R}^{-1}\mathbf{A}\mathbf{P})^T\right]\right\}^{-1} \quad (50)$$

where $\mathrm{Re}\{\cdot\}$ is the real part of a complex number, $\odot$ denotes the element-wise product, and

$$\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_K], \qquad \mathbf{d}_i = \frac{\partial\,\mathbf{a}(\boldsymbol{\Omega}_i)}{\partial\,\boldsymbol{\Omega}_i} \quad (51)$$

is the matrix of partial derivatives of the steering vector w.r.t. the elements of $\Theta$. The elements of $\mathbf{D}$ are the partial derivatives of the spherical harmonics w.r.t. inclination $\theta$ and azimuth $\phi$, which are given as:

$$\frac{\partial Y_n^m}{\partial\theta} = n\cot\theta\,Y_n^m - \csc\theta\,\sqrt{\frac{2n+1}{2n-1}\,(n^2 - m^2)}\;Y_{n-1}^m \quad (52)$$

and

$$\frac{\partial Y_n^m}{\partial\phi} = im\,Y_n^m \quad (53)$$

respectively, where $\csc$ is the cosecant function. These partial derivatives are available in the literature and, for example, in MATHEMATICA. In [34], the CRLB for a single stochastic source in the SH domain is presented.

Substituting the DML probability density function of Eq. (44) into Eq. (49) leads to the deterministic CRLB [26]:

$$\mathbf{C}_\mathrm{det} = \frac{\sigma^2}{2T}\left\{\mathrm{Re}\left[(\mathbf{D}^H\boldsymbol{\Pi}_{\mathbf{A}}^\perp\mathbf{D}) \odot \hat{\mathbf{P}}^T\right]\right\}^{-1} \quad (54)$$

G. Detection of Reflections

DOA estimation requires knowledge of the number of reflections $K$. Numerous methods for resolving the number of reflections, i.e., detection, have been proposed, and the reader is referred to [12] for an overview. Some of the detection methods estimate the subspace dimensions from the eigenvalue matrix, for example, by statistically testing how many of the eigenvalues belong to the noise space [26]. For the detection of partially correlated source signals, the best approaches are the model-based approaches, such as the generalized likelihood ratio test and WSF detection [26]. These approaches simultaneously detect the number of reflections $K$ and estimate the DOA. It is therefore expected that these methods should also perform well for the detection of reflection signals, which are correlated or coherent. In this paper, we do not investigate the detection, but assume that $K$ already exists as prior knowledge, as in previous research on this topic [2], [3].

IV. EXPERIMENTS

This section describes simulation and real data experiments with a spherical microphone array. In all cases, we use the em32 Eigenmike® microphone array, which has 32 capsules on the surface of a rigid sphere. A technical description, microphone positions, etc., of the Eigenmike is given, for example, in [35]. The results of the experiments are investigated with the root mean squared error (RMSE), which is compared against the square root of the CRLB. In all the simulation experiments the RMSE is averaged over 100 Monte-Carlo samples.

In the simulations of this paper, the signal-to-noise ratio (SNR) is reported as the space-domain SNR. However, perhaps a more meaningful SNR value is the effective SNR, which can be calculated as the relation between the equalized reflection signal variance and the equalized noise variance, i.e., $\tilde{\sigma}_s^2/\tilde{\sigma}^2$. The noise is simulated as an i.i.d. complex Gaussian random variable in the space domain, with a fixed variance in all the simulated cases of this paper.

A. Search of the Global Minimum/Maximum via Non-Linear Optimization Methods

As discussed previously, the localization functions are highly non-linear, and therefore non-linear optimization methods must be applied if computational time is a requirement. In this paper we use a Newton-type search algorithm to find the minimum of the SML, DML, and WSF localization functions, similarly as in [26], where the Levenberg-Marquardt (LM) technique is used. The LM technique requires the true gradient and Hessian matrices of the localization function. Instead of LM, here we use the Quasi-Newton (QN) method with Broyden-Fletcher-Goldfarb-Shanno
Fig. 2. RMS error for all the methods over 100 Monte-Carlo samples and the CRLB, as the second reflection is delayed, at SNR = 60 dB (panels (a)–(h)).

is lower than the CRLB. The second maximum in the PWD spectrum is produced by the conjugate value of the steering vector that produced the global maximum. Also MUSIC shows similar behavior as PWD, and its estimation is similarly biased. When the separation is increased, the performance of PWD and MUSIC slightly improves. This is due to the fact that the separation is then more than the Rayleigh resolution, and the methods are able to separate the two reflections. However, both the MUSIC and PWD estimates remain biased.

The ML methods and WSF have a lower value than the CRLB in the coherent case. Also this is caused by a bias in the estimation. All of the methods have the global minimum in between the true values, and they get strong evidence for the biased estimate, similarly as PWD above. Thus, the information contained in the second order moments of the array covariance matrix is too coherent for unbiased estimation in this case. First, the reflection signals are coherent. Second, the steering vectors are very similar when the reflections are arriving from directions that are close to each other. To obtain a higher performance in this case, more microphones would be required. For larger separations, the ML methods and WSF follow the CRLB and perform clearly better than PWD or MUSIC.

In the partially correlated cases, the ML methods and WSF have the same performance. MUSIC has a slightly higher RMSE than the ML methods and WSF. That is, the ML methods and WSF outperform MUSIC in the partially correlated case. In the almost incoherent case, MUSIC, the ML methods, and WSF have the same performance. The PWD performance in the partially correlated and almost incoherent cases is similar as for the coherent case.

The CRLB does not predict the RMSE very well when the RMSE is high [37]. That is, the CRLB assumes a low error variance. According to [26], the CRLB and the RMSE are approximately in agreement when the theoretical standard deviation of the DOA estimation of the reflections is less than half the angle separation. This is also approximately true in the above studied cases; in the coherent case, the CRLB for the second reflection is more than half of the angle separation.

C. Two Reflections in a Single Frequency

Next we study the performance of the methods in a single frequency. As explained above, since the processing is applied in the frequency domain, the noise is whitened as in [2]. That is, the whitened frequency domain covariance matrix and steering vectors are used instead of the time domain versions in Eqs. (16) and (31). When analyzing the DOA in a single frequency with a microphone array, one should consider spatial aliasing. As mentioned, spatial aliasing typically occurs in high frequencies, when $kr > N$ [9]. For the applied array this limit is about 4.7 kHz. However, when analyzing single frequencies, the ML methods and WSF are unaffected by the spatial aliasing, since the steering is continuous, i.e., it does not contain any zeros, at all frequencies.

When the analysis is performed in a single frequency, the reflection signal covariance matrix is inevitably coherent, i.e.:

$$\mathbf{P} = \mathbf{s}(k)\,\mathbf{s}^H(k) \quad (55)$$
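The coherence expressed by Eq. (55) is exactly what removes a dimension from the MUSIC signal subspace: the steering vector of the second reflection is then no longer orthogonal to the estimated noise subspace, so the MUSIC denominator no longer dips at the true direction. A toy numpy check with a hypothetical free-field array (not the SH-domain processing of this paper; positions and directions are illustrative):

```python
import numpy as np

def music_null(a, E_n):
    """MUSIC denominator a^H E_n E_n^H a: ~0 when a lies in the signal subspace."""
    v = E_n.conj().T @ a
    return float(np.real(v.conj() @ v))

rng = np.random.default_rng(4)
Q = 8
x = rng.standard_normal((Q, 3))
x = 0.05 * x / np.linalg.norm(x, axis=1, keepdims=True)
k = 2.0 * np.pi * 4000.0 / 343.0
steer = lambda d: np.exp(1j * k * x @ np.asarray(d))
a1, a2 = steer([1.0, 0.0, 0.0]), steer([0.0, 0.0, 1.0])
A = np.column_stack([a1, a2])

def noise_subspace(P):
    R = A @ P @ A.conj().T + 1e-3 * np.eye(Q)
    w, U = np.linalg.eigh(R)      # ascending: the smallest Q - K eigenvectors are noise
    return U[:, :Q - 2]

E_inc = noise_subspace(np.diag([1.0, 1.0]).astype(complex))   # incoherent reflections
E_coh = noise_subspace(np.ones((2, 2), dtype=complex))        # coherent reflections, Eq. (55)

# with coherent reflections, a2 leaks into the estimated "noise" subspace
print(music_null(a2, E_coh) > 100 * music_null(a2, E_inc))    # True
```

This leakage is why the single-frequency experiments rely on the ML methods and WSF rather than on MUSIC.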
Fig. 3. RMSE of the ML methods over 100 Monte-Carlo Samples and CRLB for selected frequencies when dB. Best performance is achieved when
kHz. (a) Hz (b) Hz (c) Hz (d) Hz (e) Hz (f) Hz (g) Hz (h)
Hz (i) Hz (j) Hz (k) Hz (l) Hz (m) Hz (n) Hz (o) Hz
(p) Hz (q) Hz (r) Hz.
and the array covariance matrix estimate becomes

(56)

The selected frequencies for this experiment are the center frequencies of the octave bands , where . It should be noted that the array covariance matrices are not averaged over the octave bands; the performance is examined at single frequencies, as shown in Eq. (56).

The reflections are simulated as wideband reflections with , , , and the SNR is set to 60 dB. Frequency-domain SH coefficients are evaluated with Eq. (24) via the discrete Fourier transform, and the frequency bins that are closest to the frequencies  are used in the analysis. The frequency resolution is  Hz, and the  dB bandwidth for each analyzed frequency is about 21 Hz, due to the aliasing or "spectral leakage" from the windowing.

The results of the experiment for the different frequencies are shown in Fig. 3. The results show that the methods obtain similar performance as in the wideband case when  kHz and  kHz, of which  kHz gives the best performance. The reason why RMSE  CRLB when  kHz,  kHz, and  kHz is the same as in the wideband case, explained above. That is, the difference between the steering vectors for reflections arriving from almost the same angle is small, and the reflection signal covariance matrix is coherent. PWD and MUSIC suffer from the same problems in the single-frequency case as in the wideband case when .

When  Hz,  Hz, and  Hz, the estimation with all methods is biased, as the RMSE is on average about 2 degrees for all values of . We can see from Fig. 3 that for the frequencies  Hz,  Hz, and  Hz, the CRLB is higher than half of the angle separation for all the studied separation angles. Thus, it is expected that the CRLB and RMSE do not meet when the error variance is this high. The poor performance at these frequencies is a consequence of a poor effective SNR. At low frequencies, due to the array geometry, the effective SNR is lower than at higher frequencies.

D. Real Data

1) Measurement setup: To test the methods in real situations, measurements were made in a semi-anechoic room with dimensions  m ×  m ×  m. The octave-band reverberation times, as well as theoretical absorption coefficients calculated with Sabine's equation, are shown in Table I. The room has highly absorptive walls and ceiling, which are treated with 5 cm of mineral wool placed 50 cm in front of a concrete wall to create an air gap, as shown in Fig. 4. The material of the highly reflective floor is linoleum on concrete. Seven reflections were introduced to the measurement by building a reflective corner in the room, also shown in Fig. 4. This corner was built of two reflective projection silver screens of size  m, parallel to the walls of the room. The reflections include the first-, second-, and third-order reflections from the inserted corner and the floor, in addition to the direct sound from the source to the array.

A Genelec 1029A was placed in the source position and an em32 Eigenmike in the receiver position, as shown in Fig. 4. The loudspeaker was facing the em32 microphone array, and the acoustic center of the loudspeaker as well as the center of the microphone array was at 1.3 m height. The applied loudspeaker is flat in the direction of the reflective surface up to 1 kHz, and is attenuated by about 6 dB from 1 kHz to 10 kHz in the directions  w.r.t. the central horizontal plane. Based on tabulated values, it is estimated that the absorption coefficient of the projection silver screens is less than 0.1 for frequencies above 500 Hz, and less than 0.1 for the floor at all frequencies. Due to the size of the reflective surface, the material, and the geometry of the source-array
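The bin selection used in the single-frequency experiment can be sketched as follows: pick the DFT bins closest to the octave-band center frequencies. The sampling rate, window length, and band definition below are our assumptions for illustration, not the paper's values.

```python
import numpy as np

# Sketch: nearest DFT bins to octave-band centre frequencies.
# fs and n_fft are assumed, not the paper's parameters.
fs = 48000                                   # assumed sampling rate [Hz]
n_fft = 4096                                 # assumed window length
df = fs / n_fft                              # DFT frequency resolution [Hz]
centres = 1000.0 * 2.0 ** np.arange(-5, 5)   # 31.25 Hz ... 16 kHz
bins = np.rint(centres / df).astype(int)     # nearest-bin indices
for fc, b in zip(centres, bins):
    # the analysed frequency b*df deviates from fc by at most df/2
    print(f"{fc:8.2f} Hz -> bin {b} ({b * df:.2f} Hz)")
```

Because the window is finite, each selected bin carries energy from a small band around it, which is the "spectral leakage" bandwidth mentioned above.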
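The theoretical absorption coefficients of Table I can in principle be recovered from measured reverberation times by inverting Sabine's formula, T = 0.161 V / (α S). The sketch below uses placeholder room dimensions and reverberation time, not the measured values from the paper.

```python
# Sabine's equation rearranged for the mean absorption coefficient:
# alpha = 0.161 * V / (S * T). All numbers here are placeholders.
def sabine_alpha(volume_m3, surface_m2, rt60_s):
    """Mean absorption coefficient implied by a measured RT60."""
    return 0.161 * volume_m3 / (surface_m2 * rt60_s)

L, W, H = 6.0, 5.0, 3.0                     # hypothetical dimensions [m]
V = L * W * H                               # volume [m^3]
S = 2 * (L * W + L * H + W * H)             # total surface area [m^2]
print(round(sabine_alpha(V, S, 0.3), 3))    # -> 0.383
```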
TERVO AND POLITIS: DOA ESTIMATION OF REFLECTIONS FROM ROOM IMPULSE RESPONSES 1549
[6] S. Tervo, J. Pätynen, and T. Lokki, "Spatial decomposition method for room impulse responses," J. Audio Eng. Soc., vol. 61, no. 1/2, pp. 16–27, Mar. 2013.
[7] T. D. Abhayapala and D. B. Ward, "Theory and design of high order sound field microphones using spherical microphone array," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2002, vol. II, pp. 1949–1953.
[8] J. Meyer and G. Elko, "A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2002, vol. II, pp. 1781–1784.
[9] B. Rafaely, "Analysis and design of spherical microphone arrays," IEEE Trans. Speech Audio Process., vol. 13, no. 1, pp. 135–143, Jan. 2005.
[10] Z. Li and R. Duraiswami, "Flexible and optimal design of spherical microphone arrays for beamforming," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 2, pp. 702–714, Feb. 2007.
[11] H. Kuttruff, Room Acoustics. London, U.K.: Spon Press, 2000.
[12] H. Krim and M. Viberg, "Two decades of array signal processing research: The parametric approach," IEEE Signal Process. Mag., vol. 13, no. 4, pp. 67–94, Jul. 1996.
[13] B. Rafaely, "Plane-wave decomposition of the sound field on a sphere by spherical convolution," J. Acoust. Soc. Amer., vol. 116, no. 4, pp. 2149–2157, 2004.
[14] H. Teutsch, "Wavefield decomposition using microphone arrays and its application to acoustic scene analysis," Ph.D. dissertation, Univ. Erlangen-Nürnberg, Erlangen, Germany, 2005.
[15] E. Mabande, H. Sun, K. Kowalczyk, and W. Kellermann, "Comparison of subspace-based and steered beamformer-based reflection localization methods," in Proc. Eur. Signal Process. Conf., 2011.
[16] H. Sun, H. Teutsch, E. Mabande, and W. Kellermann, "Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2011, pp. 117–120.
[17] C.-I. C. Nilsen, I. Hafizovic, and S. Holm, "Robust 3-D sound source localization using spherical microphone arrays," in Proc. Audio Eng. Soc. Conv. 134, May 2013, paper no. 8904.
[18] B. Ottersten and T. Kailath, "Direction-of-arrival estimation for wide-band signals using the ESPRIT algorithm," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 2, pp. 317–327, 1990.
[19] H. Wang and M. Kaveh, "Coherent signal-subspace processing for the detection and estimation of angles of arrival of multiple wide-band sources," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 4, pp. 823–831, 1985.
[20] H. Hung and M. Kaveh, "Focussing matrices for coherent signal-subspace processing," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 8, pp. 1272–1281, Aug. 1988.
[21] S. U. Pillai and B. H. Kwon, "Forward/backward spatial smoothing techniques for coherent signal identification," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 1, pp. 8–15, Jan. 1989.
[22] C. Qi, Y. Wang, Y. Zhang, and Y. Han, "Spatial difference smoothing for DOA estimation of coherent signals," IEEE Signal Process. Lett., vol. 12, no. 11, pp. 800–802, Jan. 2005.
[23] F.-M. Han and X.-D. Zhang, "An ESPRIT-like algorithm for coherent DOA estimation," IEEE Antennas Wireless Propag. Lett., vol. 4, pp. 443–446, 2005.
[24] Y. Zhang and Z. Ye, "Efficient method of DOA estimation for uncorrelated and coherent signals," IEEE Antennas Wireless Propag. Lett., vol. 7, pp. 799–802, 2008.
[25] I. Hafizovic, C. Nilsen, and S. Holm, "Transformation between uniform linear and spherical microphone arrays with symmetric responses," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1189–1195, May 2012.
[26] B. Ottersten, M. Viberg, P. Stoica, and A. Nehorai, "Exact and large sample maximum likelihood techniques for parameter estimation and detection in array processing," in Radar Array Processing. Berlin, Germany: Springer, 1993, ch. 4.
[27] S. Moreau, J. Daniel, and S. Bertet, "3D sound field recording with higher order ambisonics-objective measurements and validation of spherical microphone," in Proc. Audio Eng. Soc. Conv. 120, paper no. 6857.
[28] C. T. Jin, N. Epain, and A. Parthy, "Design, optimization and evaluation of a dual-radius spherical microphone array," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 1, pp. 193–204, Jan. 2014.
[29] A. Farina, M. Binelli, A. Capra, E. Armelloni, S. Campanini, and A. Amendola, "Recording, simulation and reproduction of spatial soundfields by spatial PCM sampling (SPS)," Int. Seminar Virtual Acoust., vol. 4.1b, p. 14, Nov. 2011.
[30] A. O'Donovan, R. Duraiswami, and D. Zotkin, "Imaging concert hall acoustics using visual and audio cameras," in Proc. Int. Conf. Acoust., Speech, Signal Process., 2008, pp. 5284–5287.
[31] B. Rafaely, "Phase-mode versus delay-and-sum spherical microphone array processing," IEEE Signal Process. Lett., vol. 12, no. 10, pp. 713–716, Oct. 2005.
[32] M. Park and B. Rafaely, "Sound-field analysis by plane-wave decomposition using spherical microphone array," J. Acoust. Soc. Amer., vol. 118, no. 5, pp. 3094–3103, 2005.
[33] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Upper Saddle River, NJ, USA: Prentice-Hall, 1998.
[34] L. Kumar and R. Hegde, "Stochastic Cramér-Rao bound analysis for DOA estimation in spherical harmonics domain," IEEE Signal Process. Lett., vol. 22, no. 8, pp. 1030–1034, Aug. 2015.
[35] mh acoustics, "EM32 Eigenmike microphone array release notes (v17.0)," mh acoustics, Summit, NJ, USA, Tech. Rep., Oct. 2013.
[36] I. Ziskind and M. Wax, "Maximum likelihood localization of multiple sources by alternating projection," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 10, pp. 1553–1560, Oct. 1988.
[37] G. C. Carter, "Coherence and time delay estimation," Proc. IEEE, vol. 75, no. 2, pp. 236–255, Feb. 1987.
[38] A. Farina, "Simultaneous measurement of impulse response and distortion with a swept-sine technique," in Proc. Audio Eng. Soc. Conv. 108, 2000, paper no. 5093.
[39] A. Farina, A. Amendola, A. Capra, and C. Varani, "Spatial analysis of room impulse responses captured with a 32-capsule microphone array," in Proc. Audio Eng. Soc. Conv. 130, 2011, paper no. 8400.

Sakari Tervo was born in Kuopio, Finland, in 1983. He received an M.Sc. degree in audio signal processing from Tampere University of Technology in 2006 and a D.Sc. degree in the field of acoustic signal processing from Aalto University in 2012. He has been a Visiting Researcher at Philips Research, the Netherlands, in 2007, and at the University of York in 2010. Currently, he works as a Post-Doctoral Researcher at Aalto University.

Archontis Politis obtained his M.Eng. degree in civil engineering at Aristotle University of Thessaloniki, Greece, and his M.Sc. degree in sound & vibration studies at ISVR, University of Southampton, UK, in 2006 and 2008, respectively. From 2008 to 2010, he worked as a Graduate Acoustic Consultant at Arup Acoustics, Glasgow, UK, and as a Researcher in a joint collaboration between Arup Acoustics and the Glasgow School of Arts on interactive auralization of architectural spaces using 3D sound techniques. Currently, he is pursuing a doctoral degree in the field of parametric spatial sound recording and reproduction.