
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 10, OCTOBER 2015

Direction of Arrival Estimation of Reflections from Room Impulse Responses Using a Spherical Microphone Array

Sakari Tervo and Archontis Politis

Manuscript received January 30, 2015; revised April 24, 2015; accepted May 26, 2015. Date of publication June 01, 2015; date of current version June 09, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Thushara Abhayapala. S. Tervo is with the Department of Computer Science, Aalto University, FI-00076 Aalto, Finland (e-mail: sakari.tervo@aalto.fi). A. Politis is with the Department of Signal Processing and Acoustics, Aalto University, FI-00076 Aalto, Finland (e-mail: archontis.politis@aalto.fi). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASLP.2015.2439573

2329-9290 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract—This paper studies the direction of arrival estimation of reflections in short time windows of room impulse responses measured with a spherical microphone array. Spectral-based methods, such as multiple signal classification (MUSIC) and beamforming, are commonly used in the analysis of spatial room impulse responses. However, the room acoustic reflections are highly correlated or even coherent in a single analysis window and this imposes limitations on the use of spectral-based methods. Here, we apply maximum likelihood (ML) methods, which are suitable for direction of arrival estimation of coherent reflections. These methods have been earlier developed in the linear space domain and here we present the ML methods in the context of spherical microphone array processing and room impulse responses. Experiments are conducted with simulated and real data using the em32 Eigenmike. The results show that direction estimation with ML methods is more robust against noise and less biased than MUSIC or beamforming.

Index Terms—Direction of arrival (DOA), room acoustics, spatial room impulse response, spherical microphone arrays.

I. INTRODUCTION

DIRECTION of arrival (DOA) estimation of a sound wave arriving at a microphone array is an essential part of spatial room acoustic analysis and synthesis. The directional information is used together with pressure or energy to describe the sound field [1]–[4] or to reproduce a sound in spatial sound synthesis from a certain direction [5], [6]. Thus, it has a profound impact on how room acoustics are interpreted via the analysis or perceived through the spatial sound synthesis. The increasing number of publications and numerous array designs indicate the importance of the spherical microphone array processing [7]–[10]. Therefore, it can be considered as one of the most important approaches for spatial sound analysis nowadays. This paper studies the DOA estimation of reflections from a spatial room impulse response captured with a spherical microphone array. In particular, we focus on the case where more than one acoustic reflection is present in an analysis window.

In room impulse responses, the number of reflections arriving at the receiver in a short time window increases with the square of time. This effect is given as the echo density [11, p. 98]. Therefore, after a relatively short time, one ends up in a situation where an analysis time window includes multiple reflections. Exact overlap of the reflections occurs if the path length from the source to the receiver is equal for two or more reflections. This is not an unusual case in room acoustics, but happens already for the first order reflections in symmetric source-receiver geometry. For example, such an overlap will occur if the microphone and the sound source are located somewhere on the same diagonal or central axis of a rectangular room.

The acoustic reflections are highly correlated or even coherent, especially in a narrow frequency band. The high correlation of the reflections causes problems for spectral-based DOA estimation methods [12], which are commonly applied in the spherical microphone array processing [2], [3], [13]–[17]. According to a classification given in [12], the spectral-based methods include the multiple signal classification (MUSIC) method, the estimation of signal parameters via rotational invariant techniques (ESPRIT), and beamforming. These methods require that the source signals are independent, which is not true in the case of highly correlated reflections.

The estimation in the case of correlated signals has been enhanced by smoothing methods in the space domain in the previous decades [18], [19]. These techniques have been especially under research with uniform linear arrays [20]–[24]. Later on, several smoothing methods have also been developed in the context of spherical microphone array processing [2], [3], [16], [17]. These smoothing methods average the array covariance matrix over frequency [2], time [3], or space [16], [17] and require pre-processing before DOA estimation. In frequency smoothing the noise is whitened [2] and in time domain smoothing a stabilizing filter reduces the undesired amplification of noise. Spatial smoothing in the space domain is often implemented by averaging over subarrays formed from the original uniform linear array [12]. In spherical microphone array processing, the division into subarrays is obtained by transforming the spherical microphone array to a uniform linear array [17], [25]. Another approach to obtain spatial smoothing is to form the subarrays via eigenbeam space rotation [15], [16].

The smoothing methods have an apparent disadvantage when applied to room impulse responses. Namely, in each time step and frequency, the room impulse response may have a different
response, i.e., the DOA, phase, and amplitude. Averaging over any domain reduces resolution in the respective domain. Averaging over time reduces temporal resolution of the estimate; averaging over frequency reduces frequency resolution; and averaging over space reduces spatial resolution since the number of microphones per subarray is lower than the number of microphones in the whole array.

Contrary to MUSIC and similar approaches, the maximum likelihood (ML) methods [12], [26] can handle coherent signals without any smoothing. This is possible due to the freedom of selection in the signal and noise model. Consequently, also a coherent signal model can be assumed. This signal model has one dimension per reflection, since each reflection is modeled separately. If a spatial spectrum is evaluated on a search grid whose sampling and size are defined by the user, the ML DOA estimation function must be evaluated jointly over all reflections, i.e., over the grid size raised to the power of the number of reflections, whereas the spectral-based methods are evaluated over the grid only once. Therefore, as the number of reflections increases, the search space becomes quickly very large. For restricted computational time, the high dimensionality leads to the use of non-linear optimization algorithms for the ML methods [26]. As a general limitation in the ML methods, the number of reflections that can be estimated must be smaller than the number of microphones [12].

Analysis of room reflections is often based on the wideband assumption [2], [3], [6]. This assumption states that frequencies are delayed by the same amount of time. In theory, the wideband assumption holds if the surfaces are large and rigid. The wideband analysis of acoustic reflections can lead to a desired accuracy in the analysis [2], [3] or in the reproduction of the acoustics [6]. However, in the real world, the room impulse responses are always frequency and time dependent. On this basis, studies on acoustics benefit from the frequency band analysis of reflections since it enables a more accurate description of the room acoustic properties.

In this paper, we study the performance of ML methods and a large sample approximation called the weighted subspace fitting (WSF) in the analysis of spatial room impulse responses. The large sample approximation assumes that the number of available measurements is large. The cases where one or more reflections are present in an analysis window in a wide or narrow frequency band are of interest due to the above mentioned features of the room impulse responses. The contribution of this paper is the application of the ML methods to the analysis of room reflections with the spherical microphone array. Throughout the experimental section, the results of the ML methods are compared to MUSIC and beamforming.

II. MODELS

A. Room impulse response

A room impulse response is defined as the acoustic response, measured from a source to a microphone in an enclosed space. After the initial excitation, the sound wave propagates through the space and arrives at the receiver via multiple paths. On these paths, the sound wave is altered by several acoustic phenomena on the boundaries, such as reflection, absorption, and diffraction [11, ch. 2]. These acoustic phenomena affect the amplitude and the phase of the traveling sound wave, possibly differently, in each frequency and at each incident angle.

Besides the boundary effects, the sound wave is affected by the propagation distance and attenuation by the medium. Namely, the air absorption depends on the composition of the room air, the distance that the wave has traveled, and frequency [11, p. 147]. Thus, the air absorption alone suggests that the room impulse response is different in each time and frequency and therefore gives a basis for the frequency dependent analysis. Moreover, if the sound source is assumed to be a point source, the amplitude is attenuated due to the spherical spreading of the sound wave.

The number of reflections per time interval is described by a quantity called the echo density, which is given asymptotically by [11, p. 98]

(1)

where the remaining quantities are the volume of the enclosure and the speed of sound. According to Kuttruff [11, p. 98], this is valid for any geometry with a homogeneous medium. A shorter time interval reduces the number of reflections present in an analysis window. When we are inspecting the impulse response in a single frequency, we limit the time window length to one period of the analyzed frequency, so that at least one period of the wavelength is observed in the analysis window. Then, the time instant up to which a given number of reflections or fewer are present in the analysis window in a single frequency is given as:

(2)

where the remaining quantity is the angular frequency.
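As an illustration of the echo-density argument above, the following sketch evaluates Kuttruff's asymptotic expression dN/dt = 4*pi*c^3*t^2/V and the time up to which at most a chosen number of reflections fall inside a one-period window dt = 2*pi/omega. Since Eqs. (1) and (2) are not reproduced in this copy, the closed-form inversion and all names here are our assumptions.

```python
import numpy as np

def echo_density(t, V, c=343.0):
    """Asymptotic number of reflections per second at time t (Kuttruff),
    assuming dN/dt = 4*pi*c**3*t**2 / V."""
    return 4.0 * np.pi * c**3 * t**2 / V

def reflections_in_window(t, dt, V, c=343.0):
    """Approximate reflection count inside a window of length dt centred at t."""
    return echo_density(t, V, c) * dt

def time_until_k_reflections(K, omega, V, c=343.0):
    """Latest time for which at most K reflections fall in a one-period
    window dt = 2*pi/omega (solves K = echo_density(t) * dt for t)."""
    dt = 2.0 * np.pi / omega
    return np.sqrt(K * V / (4.0 * np.pi * c**3 * dt))

if __name__ == "__main__":
    V = 12.0 * 7.0 * 5.0          # hypothetical room volume in m^3
    f = 1000.0                    # analysis frequency in Hz
    t_k = time_until_k_reflections(K=2, omega=2 * np.pi * f, V=V)
    print(f"<= 2 reflections per one-period window up to ~{t_k * 1e3:.1f} ms")
```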
B. Space Domain

We present the location of each microphone in the 3-D Cartesian coordinate system in terms of the standard spherical coordinates: radius, inclination, and azimuth. Each acoustic event arriving at the microphone has traveled a path length from the source to the microphone and has a DOA w.r.t. the array origin, described in Cartesian coordinates accordingly. An acoustic event is considered to be a sound wave which is altered by the acoustic phenomena listed above.

A measured wideband impulse response pressure signal in a microphone for a source signal is presented in the frequency domain as the sum of all acoustic events

(3)

where each term of the sum is the frequency response of an acoustic event, and an additive noise component is assumed independent and identically distributed for each microphone. Furthermore, the wavenumber is the ratio of the angular frequency and the speed of sound. In addition, the source signal describes
the frequency response of the source, originally emitted to some direction, and arriving to the microphone from the direction of the acoustic event. A typical source in room impulse response measurements is a loudspeaker. Note that this impulse response model is general and describes any room impulse response.

The frequency response of the acoustic event is dependent on the time delay, which describes the time it has taken for the acoustic wave to travel to the microphone, and on the complex amplitude of each acoustic event, i.e.,

(4)

We further assume that the sources and reflections are in the far field with respect to the array, so that a plane wave model can be applied. Then, according to the plane wave model the complex amplitude is represented by

(5)

where the plane wave amplitude includes all the acoustic phenomena that the wave has encountered before arriving to the microphone, and the response of the microphone in the direction of arrival is assumed to be known a priori. Furthermore, since a homogeneous medium and short time window analysis are assumed, the path length is neglected in the directional analysis. Consequently, the plane wave time delay can be expressed w.r.t. the origin, i.e., the center of the array, as

(6)

where the delay is written as an inner product between the microphone position and the normal to the wavefront, the superscript denoting the transpose of a vector or a matrix.

We present the delay term and the microphone array response with

(7)

and call this the steering vector. For example, for an ideal open microphone array the directional response equals one. The plane wave amplitude and the source response are presented as a product

(8)

which we call the reflection signal. For the compactness of the rest of the paper, the measurements are described in a matrix form. The impulse responses in the microphones, i.e., the array input, are described by the vector [26]

(9)

where the argument of the steering matrix is the set of all unknown directions of arrival.
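The far-field model above reduces each acoustic event to a plane wave with a direction-dependent delay at every capsule. The sketch below, with variable names of our choosing and an ideal open array (unit capsule response) assumed, builds the wavefront normal from an inclination/azimuth pair and evaluates the corresponding narrowband steering vector; the sign convention of the delay is one common choice, not necessarily the paper's.

```python
import numpy as np

def unit_vector(inclination, azimuth):
    """Cartesian unit vector for a DOA given as (inclination, azimuth) in radians."""
    return np.array([np.sin(inclination) * np.cos(azimuth),
                     np.sin(inclination) * np.sin(azimuth),
                     np.cos(inclination)])

def plane_wave_delays(mic_xyz, doa, c=343.0):
    """Relative arrival times (s) of a plane wave from direction `doa`
    at microphone positions `mic_xyz` (Q x 3), w.r.t. the array origin.
    Microphones facing the arrival direction receive the wavefront earlier
    (negative delay) with this convention."""
    u = unit_vector(*doa)
    return -(mic_xyz @ u) / c

def steering_vector(mic_xyz, doa, freq, c=343.0):
    """Narrowband steering vector exp(-j*omega*tau) for an ideal open array."""
    tau = plane_wave_delays(mic_xyz, doa, c)
    return np.exp(-1j * 2 * np.pi * freq * tau)

if __name__ == "__main__":
    mics = np.array([[0.042, 0.0, 0.0], [0.0, 0.042, 0.0], [0.0, 0.0, 0.042]])
    v = steering_vector(mics, doa=(np.pi / 2, np.pi / 4), freq=2000.0)
    print(np.round(v, 3))
```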
The noise and reflection signals are assumed to be zero mean complex Gaussian processes, i.e., [26]

where the expectation operator and the Hermitian transpose of a matrix appear. This leads to an array covariance matrix [26]:

(10)

The assumption on the Gaussian reflection signal is not necessarily true in the case of room impulse responses. That is, the reflection at a single frequency in a small time window may be of deterministic nature rather than random. A model for the deterministic array covariance matrix is given in Section III-E.

C. Spherical Harmonics Domain

In order to apply spherical microphone array processing, the pressure and the array covariance matrix are described in the spherical harmonic (SH) domain. The formulation in this section follows the one given in [3].

The SH domain representation of the pressure for an array with a given radius and order is given by the approximation of the spherical Fourier transform and its inverse [3]:

(11)

and

(12)

respectively, where the spherical harmonics of each order and degree are [9]:

(13)

in which the associated Legendre polynomials appear. Moreover, the transform involves the complex conjugate of the spherical harmonics, the SH domain coefficients, the sampling weights that correct the orthonormality errors, the microphone coordinates, and a steering direction where the inverse spherical Fourier transform is evaluated. The sampling weights are defined by the sampling scheme [9], and the order is defined by the number of microphones (see [9] for details). Throughout this paper we assume uniform sampling and hence the weights reduce to a constant. In addition, we use all harmonic coefficients up to the harmonic order of the array, and the radius is 4.2 cm, due to the applied microphone array. Spatial aliasing for spherical arrays typically occurs above a frequency limit determined by the array order and radius [9].
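As a minimal illustration of Eqs. (11) to (13), the sketch below evaluates spherical harmonics with SciPy and approximates the forward spherical Fourier transform for a nearly uniform layout, assuming the constant sampling weights 4*pi/Q mentioned above; function and variable names are ours.

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, inclination, azimuth):
    """Matrix of spherical harmonics Y_n^m at the given directions.

    Rows are directions, columns run over (n, m) with n = 0..order, m = -n..n.
    Note: scipy.special.sph_harm(m, n, azimuth, inclination) takes the
    azimuth first and the polar (inclination) angle second."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            cols.append(sph_harm(m, n, azimuth, inclination))
    return np.stack(cols, axis=-1)

def forward_sht_uniform(pressure, order, mic_inclination, mic_azimuth):
    """Approximate spherical Fourier transform (cf. Eq. (11)) for a nearly
    uniform sampling scheme, assuming weights 4*pi/Q for Q microphones."""
    Y = sh_matrix(order, mic_inclination, mic_azimuth)   # Q x (order+1)^2
    w = 4.0 * np.pi / len(pressure)
    return w * (Y.conj().T @ pressure)                   # (order+1)^2 coefficients

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    incl = np.arccos(rng.uniform(-1, 1, 32))             # 32 pseudo-random directions
    azim = rng.uniform(0, 2 * np.pi, 32)
    p = rng.standard_normal(32) + 1j * rng.standard_normal(32)
    print(forward_sht_uniform(p, order=4, mic_inclination=incl, mic_azimuth=azim).shape)
```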
The noiseless pressure in the SH domain can be expressed in a matrix form by [3]

(14)

where the sensor positions appear in angular coordinates together with the sampling
weights. The left-hand side of Eq. (14) contains the SH coefficients in a vectorized form, i.e.:

(15)

where some arguments are omitted from the notation for brevity and the spherical harmonics are expressed in a matrix form

(16)

The array dependent coefficients are represented by a diagonal matrix:

(17)

where the individual array dependent coefficients in the case of a rigid sphere are given as [3]:

(18)

where the spherical Bessel and Hankel functions and their derivatives w.r.t. the argument appear, respectively.
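The modal coefficients of Eq. (18) for a rigid sphere are commonly written with spherical Bessel and Hankel functions. The sketch below implements that common form under stated assumptions (surface-mounted capsules, Hankel function of the first kind, a particular normalization); it is an illustration, not necessarily the paper's exact expression.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def spherical_hn(n, x, derivative=False):
    """Spherical Hankel function of the first kind (j_n + i*y_n); the choice
    between first and second kind follows the assumed time convention."""
    return spherical_jn(n, x, derivative) + 1j * spherical_yn(n, x, derivative)

def rigid_sphere_bn(n, k, r, a):
    """Modal (array-dependent) coefficient b_n for a rigid sphere of radius a,
    evaluated at radius r (r = a for surface-mounted capsules), in the commonly
    used form 4*pi*i**n * (j_n(kr) - j_n'(ka)/h_n'(ka) * h_n(kr))."""
    jn = spherical_jn(n, k * r)
    jnp_a = spherical_jn(n, k * a, derivative=True)
    hn = spherical_hn(n, k * r)
    hnp_a = spherical_hn(n, k * a, derivative=True)
    return 4.0 * np.pi * (1j ** n) * (jn - (jnp_a / hnp_a) * hn)

if __name__ == "__main__":
    a = 0.042                           # em32-like radius in metres
    k = 2 * np.pi * 2000.0 / 343.0      # wavenumber at 2 kHz
    for n in range(5):
        print(n, rigid_sphere_bn(n, k, a, a))
```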
where the spherical harmonics matrices are evaluated up to
D. Spherical Harmonics with Real Arrays an order determined by the number of measurements. For a reg-
Previous studies utilizing spherical microphone arrays have ular grid we found that an appropriate truncation order is
shown that by applying the theoretical discrete spherical har- approximately defined by . By visually in-
monic transform on recordings of a real array as in Eq. (12) specting the energy of the SH coefficients, we concluded that
results in significant loss of accuracy in the transformed co- coefficients beyond this truncation order are close to the noise
efficients compared to the theoretical result for an ideal array level. Finally, by combining Eq. (21) and (22) we have the in-
with the same specifications as the real one [27], [28]. This is terpolated SH coefficients
mainly due to deviations between the theoretical array model
(23)
and the real one, such as positioning and calibration errors be-
tween microphones and microphone diaphragm effects. How-
ever the same studies indicate that by employing actual cali- III. DIRECTION OF ARRIVAL ESTIMATION
bration measurements of the array in an anechoic chamber, the This section presents five DOA estimation methods. The first
performance of the SHT approaches the theoretical one. two, plane wave decomposition (PWD) and MUSIC, have been
In essence the SHT of Eq. (12) can be represented by a set previously applied in the spherical microphone array processing
of encoding filters, and the equalization with the inverse modal [2], [3], [30], and the latter three, stochastic (S) and determin-
weights can be obtained optimally in a least-squares sense. istic (D) ML methods, and WSF, are applied in the analysis of
This is equivalent to the problem of obtaining a steering vector reflection via spherical microphone array processing for the first
in the spherical domain that is optimal with respect to the mea- time in this paper. As mentioned earlier, WSF is a large sample
sured array properties. Such measurement-based steering vector approximation of the SML method.
is employed in this work when the methods are applied to real The main difference between ML methods and WSF and the
data. The measurement procedure for the array at hand is de-
two previous methods, besides the fact that MUSIC and PWD
tailed in Section IV-C.
are not maximum likelihood methods, is that in ML and WSF
Let us denote this measurement based steering vector as
one can choose the signal and noise model freely. Therefore,
. Ideally this steering vector should be approximately
whereas in PWD, MUSIC, and similar methods one searches
equal to the theoretical steering vector to a plane wave in the
for peaks from spatial spectrum of size , in ML methods
transform domain
one searches for a single maximum (or equivalently a minimum)
(19)
from an estimation function of size . From this it follows
Eq. (14) shows that the theoretical transformed array re- that ML methods very quickly meet the curse of dimensionality
sponse, after equalization, can be factored into the following when increases. Exhaustive search in this higher-dimensional
components space may be computationally prohibitive, but nonlinear opti-
mization algorithms can often find a minimum in a reasonable
amount of time. In Section IV-A, non-linear optimization algo-
(20) rithms are presented for the ML methods and WSF.
Authorized licensed use limited to: TU Ilmenau. Downloaded on June 10,2024 at 08:06:49 UTC from IEEE Xplore. Restrictions apply.
Due to the nature of the room impulse response and the applications in room acoustics, we are interested in two cases, the wideband analysis of reflections and the analysis of reflections in a single frequency. For the wideband analysis we apply the time domain smoothing algorithm presented in [3], which is also reviewed briefly in this section. In the narrow band analysis, the smoothing algorithms will not provide any benefits, since the reflections are coherent; therefore the analysis is implemented for the frequency domain SH coefficients.

A. Time domain smoothing

We follow the formulations in [3] in the processing of the SH domain coefficients. The SH domain coefficients are normalized by multiplying Eq. (14) by the inverse of the array dependent coefficients from the left side, which leads to

(24)

The array covariance matrix in this case is given as

(25)

where the noise covariance matrix

(26)

is dependent on frequency and involves the Frobenius norm. A time domain version is obtained via the inverse Fourier transform. However, the multiplication with the inverse array coefficients leads to undesired noise amplification on certain frequencies [2], [3]. Therefore, in the time domain smoothing we use the same equalization as in [3]:

(27)

where the normalization factor is constant for all frequencies. For the microphone array applied in this paper, the magnitude response of the equalization filter has the shape of a high-pass filter. Because of the equalization, the array covariance matrix in the frequency domain becomes

(28)

where

(29)

and

(30)

denote the equalized versions of the variables. In the estimation, we assume that the time domain version of the noise matrix is independent of time, and spatially white with an equalized variance. The array covariance matrix estimate for a number of time instants is given as

(31)

where the equalized SH coefficients in the time domain serve as the snapshots.

In this paper, we are also interested in the performance in single frequency bins. In the frequency domain estimation, we apply spatial whitening to the covariance matrix estimate and the steering vectors as in [2]. The following formulations in this section are presented for the time domain smoothing, but are equal to the frequency domain smoothing versions if the covariance matrix estimate and the steering vectors are replaced with their respective whitened versions.
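Equation (31) is a plain sample covariance over the windowed, equalized time-domain SH coefficients. A minimal sketch, assuming the snapshots are stored as rows, is given below; the whitened single-frequency variant mentioned above simply replaces these snapshots with frequency-domain coefficients.

```python
import numpy as np

def sample_covariance(snapshots):
    """(1/T) * sum_t x_t x_t^H for snapshot rows x_t, input shape (T, M)."""
    X = np.asarray(snapshots)
    return X.T @ X.conj() / X.shape[0]

def windowed_covariance(sh_time_coeffs, start, length):
    """Covariance estimate from `length` consecutive time-domain SH coefficient
    vectors beginning at sample `start` (cf. Eq. (31))."""
    return sample_covariance(sh_time_coeffs[start:start + length])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    coeffs = rng.standard_normal((1024, 25)) + 1j * rng.standard_normal((1024, 25))
    R = windowed_covariance(coeffs, start=200, length=48)
    print(R.shape, np.allclose(R, R.conj().T))   # Hermitian M x M estimate
```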
B. The Steering Matrix

The steering vector or matrix, used in all the methods, is the Hermitian transpose of Eq. (16). For example, in the ML methods and WSF, the matrix has the following form in the case of two reflections:

since there are two possible reflections and the corresponding number of harmonic components. In contrast, in MUSIC and PWD, the steering is always 1-D and has the form

These matrices illustrate the fundamental difference between the 1-D and 2-D search space, applied for the spectral-based methods and the ML methods and WSF, respectively.

For the ML methods and WSF the localization function has two dimensions per reflection, since each reflection contributes an inclination and an azimuth value. The minimum argument of the localization function for the ML methods and WSF gives the DOA estimates:

(32)

For the spectral-based methods, the localization function is 2-dimensional, since the search direction has 2 dimensions, inclination and azimuth. The DOA estimates for the spectral-based methods are the highest local maxima of the spatial spectrum.

C. Beamforming

Beamforming has been a popular approach for direction estimation for several decades [12]. For spherical microphone arrays, beamforming is studied extensively [10], [13], [31], [32] and one of the most commonly applied beamformers is the plane wave decomposition (PWD) [13]. PWD has the characteristic of maximum rejection of isotropic spatial noise and is given over a frequency range as

(33)

In the above, the noisy version of Eq. (14) is used and the weighting is given by

where the first term ensures a unity beamformer in the look direction. This form of PWD does not whiten the noise, and therefore will have a poor performance in the wideband case. Whitening of the noise can be implemented following [2].
With the time domain smoothing, PWD is implemented here as:

(34)

The output energy of the time-domain beamformer at the look direction is computed from the estimated array covariance matrix, given in Eq. (31).
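For orientation, the following is a generic plane-wave-decomposition energy map in the SH domain: the beamforming weight is the conjugate spherical-harmonic vector of the look direction and the output energy is a quadratic form with the covariance estimate. The exact weighting and normalization of Eq. (34) are not reproduced here, so this is an illustrative stand-in rather than the paper's formulation.

```python
import numpy as np
from scipy.special import sph_harm

def sh_steering(order, inclination, azimuth):
    """Conjugate spherical-harmonic vector for one look direction,
    length (order + 1)**2."""
    y = [sph_harm(m, n, azimuth, inclination)
         for n in range(order + 1) for m in range(-n, n + 1)]
    return np.conj(np.array(y))

def pwd_energy_map(R, order, grid):
    """PWD output energy w^H R w over a list of (inclination, azimuth) directions."""
    out = np.empty(len(grid))
    for i, (incl, azim) in enumerate(grid):
        w = sh_steering(order, incl, azim)
        out[i] = np.real(w.conj() @ R @ w)
    return out

# Usage sketch: evaluate over a coarse grid and pick the largest local maxima
#   grid = [(i, a) for i in np.linspace(0.1, np.pi - 0.1, 36)
#                  for a in np.linspace(0, 2 * np.pi, 72, endpoint=False)]
#   spectrum = pwd_energy_map(R_hat, order=4, grid=grid)
```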

D. MUSIC

One of the most popular methods for direction estimation with spherical microphone arrays is MUSIC [2], [3]. The spatial spectrum of MUSIC is calculated as

(35)

where the array noise subspace matrix from the eigenvalue decomposition of the array covariance matrix estimate is used. This decomposition follows the form

(36)

where the superscripts denote the signal and noise subspaces, respectively, the eigenvalue matrix appears, and the remaining matrix includes the right eigenvectors.

MUSIC requires a full rank reflection signal covariance matrix. Therefore, when MUSIC is applied to localize the reflections, the estimated reflection signal covariance matrix should have a corresponding number of eigenvalues deviating from the noise. This assumption is violated if we have too few snapshots of the array covariance matrix or coherent reflections. The snapshots here refer to time domain or frequency domain SH coefficients.
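A compact sketch of the MUSIC spectrum of Eqs. (35) and (36): eigendecompose the covariance estimate, keep the noise subspace, and invert the projection energy of each candidate steering vector. Shapes and names are our assumptions.

```python
import numpy as np

def music_spectrum(R, steering, num_sources):
    """Spatial spectrum 1 / ||E_n^H a(theta)||^2 from the noise subspace of R.

    R            : (M, M) covariance estimate
    steering     : (M, G) matrix of candidate steering vectors
    num_sources  : assumed number of reflections in the window"""
    eigvals, eigvecs = np.linalg.eigh(R)           # ascending eigenvalues
    En = eigvecs[:, :R.shape[0] - num_sources]     # noise-subspace eigenvectors
    proj = En.conj().T @ steering                  # (M - K, G)
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=0)
```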
E. Maximum Likelihood Methods

Maximum likelihood methods are generally applied in several estimation tasks. For an overview of maximum likelihood estimation, the reader is referred to [33]. A requirement for the ML method is a signal and noise model. These models are dependent on the parameters that are estimated by the ML method. The problem in ML estimation is then to find the parameters of the model that most likely explain the observed data. In this paper, we apply two ML methods developed earlier in the space domain to the spherical microphone array processing.

1) Stochastic Maximum Likelihood: The first ML method is called stochastic (SML), due to the assumption that the reflection signals are stochastic processes. The array covariance matrix in the case of time domain smoothing takes the corresponding model form. The probability density function of SML for a time instant is given as

(37)

where the equalized time domain SH coefficients, the modeled array covariance matrix, and the determinant of a matrix appear. As usual in ML methods, the solution is found from the negative log-likelihood, which is given for SML as:

(38)

Solving the negative log-likelihood w.r.t. the signal covariance and the noise variance (see [26] for details) leads to

(39)

and

(40)

respectively. In the above

(41)

and

(42)

are the pseudo-inverse of the steering matrix and the orthogonal projector onto the null space of its Hermitian transpose, respectively. The localization function for SML is given as [26]:

(43)

2) Deterministic Maximum Likelihood: The deterministic model makes no assumptions on the signal waveforms. That is, the average signal waveform has a deterministic form. When the average signal waveform is deduced from the signals, the array covariance matrix is only dependent on the noise term.

The likelihood function for DML when several snapshots are available is given as [26]:

(44)

where the reflection signals appear as unknowns. From the negative log-likelihood, setting the other parameters constant, the variance can be estimated as [26]:

(45)

Using this in the negative log-likelihood leads to a non-linear least-squares problem, from where the localization function and reflection signal estimates can be written as [26]:

(46)

and

(47)

respectively.
In the incoherent case, i.e., for uncorrelated reflection signals and large sample size, MUSIC is asymptotically equivalent to DML [26].
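The concentrated DML criterion reduces to the trace of the covariance estimate projected onto the orthogonal complement of the steering matrix. A minimal sketch of this standard form (cf. Eqs. (46) and (47)), with a user-supplied steering function, is given below; the parameterization of the angle vector is our choice.

```python
import numpy as np

def dml_cost(angles, covariance, steering_fn):
    """Deterministic ML cost tr{ P_A_perp @ R_hat } for a candidate set of DOAs.

    angles      : flat array (incl_1, azim_1, ..., incl_K, azim_K)
    covariance  : (M, M) array covariance estimate R_hat
    steering_fn : callable (incl, azim) -> steering vector of length M"""
    A = np.column_stack([steering_fn(angles[2 * k], angles[2 * k + 1])
                         for k in range(len(angles) // 2)])
    # orthogonal projector onto the null space of A^H
    P_perp = np.eye(A.shape[0]) - A @ np.linalg.pinv(A)
    return np.real(np.trace(P_perp @ covariance))
```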
3) Weighted Subspace Fitting: Subspace fitting methods are suboptimal approximations of the above maximum likelihood methods. They are of interest since they have a lower computational complexity than the ML methods. In addition, for large sample sizes and if some conditions are fulfilled, they are asymptotically equivalent to ML methods [26].

The WSF method is a large sample approximation of the SML method, and its localization function is given as [26]:

(48)

where the weighting involves the signal eigenvalue matrix, as previously, and the average of the smallest eigenvalues. This weighting gives the lowest asymptotic error variance, as shown in [26].
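A sketch of the WSF criterion of Eq. (48) under the weighting commonly reported for this method: the signal subspace of the covariance estimate is fitted to the candidate steering matrix with the weighting (Lambda_s - sigma^2 I)^2 Lambda_s^{-1}, where sigma^2 is the mean of the discarded eigenvalues. The paper's exact notation is not reproduced; this follows the usual space-domain formulation.

```python
import numpy as np

def wsf_cost(angles, covariance, steering_fn, num_sources):
    """Weighted subspace fitting cost tr{ P_A_perp Es W Es^H }."""
    M = covariance.shape[0]
    eigvals, eigvecs = np.linalg.eigh(covariance)          # ascending order
    Ls = np.diag(eigvals[-num_sources:])                   # signal eigenvalues
    Es = eigvecs[:, -num_sources:]                         # signal subspace
    s2 = np.mean(eigvals[:M - num_sources])                # noise-level estimate
    W = (Ls - s2 * np.eye(num_sources)) ** 2 @ np.linalg.inv(Ls)
    A = np.column_stack([steering_fn(angles[2 * k], angles[2 * k + 1])
                         for k in range(len(angles) // 2)])
    P_perp = np.eye(M) - A @ np.linalg.pinv(A)
    return np.real(np.trace(P_perp @ Es @ W @ Es.conj().T))
```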
F. Cramér-Rao Lower Bound on the estimation accuracy

Estimation theory studies the performance of the methods by comparing their covariance of the estimation error against a theoretical lower bound. A commonly used bound on the estimation covariance matrix error is the Cramér-Rao lower bound (CRLB) [26], [33].

For the cases studied in this article, the error covariance of an unbiased estimate is bound by

(49)

In the following, the frequency and time arguments are omitted from the notation for compactness. Substituting the SML probability density function in Eq. (37) to the above gives the stochastic CRLB [26]:

(50)

where the real part of a complex number appears, and

(51)

is the partial derivative of the steering matrix w.r.t. each element of the DOA vector. The elements of this derivative are the partial derivatives of the spherical harmonics w.r.t. inclination and azimuth, which are given as:

(52)

and

(53)

respectively, where the cosecant function appears. These partial derivatives are available in the literature and, for example, in MATHEMATICA. In [34], the CRLB for a single stochastic source in the SH domain is presented.

Substituting the DML probability density function in Eq. (44) to Eq. (47) leads to the deterministic CRLB [26]:

(54)

G. Detection of Reflections

DOA estimation requires knowledge of the number of reflections. Numerous methods for resolving the number of reflections, i.e., detection, have been proposed, and the reader is referred to [12] for an overview. Some of the detection methods estimate the subspace dimensions from the eigenvalue matrix, for example, by statistically testing how many of the eigenvalues belong to the noise space [26]. For detection of partially correlated source signals, the best approaches are the model-based approaches, such as the generalized likelihood ratio test and WSF detection [26]. These approaches simultaneously detect the number of reflections and estimate the DOA. It is therefore expected that these methods should also perform well for the detection of reflection signals, which are correlated or coherent. In this paper, we do not investigate the detection, but assume that the number of reflections already exists as prior knowledge, as in previous research on this topic [2], [3].

IV. EXPERIMENTS

This section describes simulation and real data experiments with a spherical microphone array. In all cases, we use the em32 Eigenmike® microphone array, which has 32 capsules on the surface of a rigid sphere. A technical description, microphone positions, etc. of the Eigenmike is given for example in [35]. The results of the experiments are investigated with the root mean squared error (RMSE), which is compared against the square root of the CRLB. In all the simulation experiments the RMSE is averaged over 100 Monte-Carlo samples.

In the simulations of this paper, the signal-to-noise ratio (SNR) is reported as the space-domain SNR. However, perhaps a more meaningful SNR value is the effective SNR, which can be calculated as the relation between the equalized reflection signal variance and the equalized noise variance. The noise is simulated as an i.i.d. complex Gaussian random variable in the space domain with a given variance in all the simulated cases of this paper.

A. Search of the global minimum/maximum via non-linear optimization methods

As discussed previously, the localization functions are highly non-linear and therefore non-linear optimization methods must be applied if computational time is a requirement. In this paper we use a Newton-type search algorithm to find the minimum of the SML, DML, and WSF localization functions, similarly as in [26], where the Levenberg-Marquardt (LM) technique is used. The LM technique requires the true gradient and Hessian matrices of the localization function. Instead of LM, here we use the Quasi-Newton (QN) method with the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) algorithm, which estimates the Hessian via finite differences. We chose the QN method since it is readily available as a built-in function in MATLAB. The iteration in the QN method is stopped if the function value or the estimated parameters do not change more than a set tolerance between two consecutive iterations. Moreover, the maximum number of iterations allowed for the QN method is set to 10000.

The QN method requires a reasonable initial guess for the estimated parameters. In the simulations of this paper, the parameters are initialized around the true values in the QN method, by adding a random number to the true values. The random numbers are drawn from a uniform distribution with a variance of 2 degrees and mean 0. In real situations, the true values are not necessarily available and therefore we follow the suggestion in [26], and search the initial value for SML, DML, and WSF via a non-linear optimization algorithm called the alternating projection (AP) algorithm [36]. AP has the property of always converging to a local minimum, which is not necessarily global. However, here, the local minimum found by AP is assumed to be also the global minimum of the applied search grid. We use a fixed number of iterations in the AP algorithm, including the initialization, and a grid which has a 5 degree resolution w.r.t. both azimuth and inclination.

In simulated situations, we use the QN method with the BFGS algorithm to find the maxima of the MUSIC and PWD localization functions and initialize the parameters around the true values. In real situations, we initialize MUSIC and PWD by finding the highest local maxima from their spatial spectra. These spatial spectra are calculated for the same grid that is used above in the AP algorithm.
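The paper runs this search with MATLAB's built-in unconstrained Quasi-Newton minimizer; the sketch below shows an equivalent BFGS refinement step with SciPy. The options are placeholders, since SciPy's gtol tests the gradient rather than successive changes of the function value or parameters.

```python
import numpy as np
from scipy.optimize import minimize

def refine_doas(initial_angles, cost_fn, max_iter=10000, tol=1e-6):
    """Refine a 2K-dimensional DOA vector with a Quasi-Newton (BFGS) search,
    starting from a coarse-grid (e.g. alternating-projection) initial guess."""
    result = minimize(cost_fn,
                      x0=np.asarray(initial_angles, dtype=float),
                      method="BFGS",
                      options={"maxiter": max_iter, "gtol": tol})
    return result.x, result.fun

# Usage sketch with the DML cost shown earlier and a hypothetical steering_fn:
#   theta_hat, cost = refine_doas(theta0,
#                                 lambda th: dml_cost(th, R_hat, steering_fn))
```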
B. Simulation

1) A single wideband reflection: We first simulate a case where one reflection is present in a time window. The sampling frequency and the length of the analysis window are fixed. In addition, the reflection is simulated as a wideband plane wave reflection arriving from a fixed direction and with a given harmonic order. Moreover, the time domain SH coefficients are calculated as shown in Section III-A via the discrete Fourier transform. For a single plane wave, the reflection signal covariance matrix contains the total energy of the equalized reflection signal.

Fig. 1 shows the RMS error for each method over 100 Monte-Carlo samples at different noise variances. Also the theoretical performance in the simulated case, i.e., the CRLB, is shown for the stochastic and deterministic case in Fig. 1. It can be seen that the methods have about equal performance in the case of one reflection. The RMSE of the methods follows the CRLB above a certain noise level. The estimation of the inclination angle has a lower RMSE, as predicted also by the CRLB.

Fig. 1. RMSE over 100 Monte-Carlo samples and CRLB for (a) azimuth and (b) inclination angle against noise variance for a single reflection. The inclination angle has a lower RMSE and CRLB than the azimuth angle.

2) Two wideband reflections: In the second example, we simulate four cases where two reflections arrive during the same time window. Similarly as in [3, Section IV-B], two wideband reflections are simulated of which the second one is delayed by a number of samples. The four simulated delays range from zero, for which the reflection signal covariance matrix approaches coherence, to larger delays for which the signal covariance matrix approaches incoherence; the corresponding reflection signal covariance matrices are obtained for the four delays, respectively. Both of the reflections are simulated as plane waves with a given harmonic order and two DOAs, where the direction of the second plane wave is varied. In these experiments the signal-to-noise ratio is set to 60 dB.

The time domain SH coefficients are evaluated as previously and they are windowed using a rectangular window centered around the first reflection. The array covariance matrix estimate is then averaged over the window samples in the time domain as shown in Eq. (31).

Fig. 2 shows the results for the cases. The first observation from the results is that the performance of PWD is poor in the DOA estimation of multiple reflections as its RMSE increases as the angle separation increases. This is explained by a bias in the estimation. Namely, the global maximum in PWD is found in between the true values and the second maximum is located away from the first one. The two reflections produce a peak in the spatial spectrum of PWD in between the true values of the two directions. Because of the high amount of energy that this maximum has, the RMSE
is lower than the CRLB for small separations in the coherent case. The second maximum in the PWD spectrum is produced by the conjugate value of the steering vector that produced the global maximum. In the coherent case, also MUSIC shows similar behavior as PWD and its estimation is similarly biased. When the separation grows to a few tens of degrees the performance of PWD and MUSIC slightly improves. This is due to the fact that the separation is more than the Rayleigh resolution, and the methods are able to separate the two reflections. However, also in this case, both MUSIC and PWD estimates are biased.

Fig. 2. RMS error for all the methods over 100 Monte-Carlo samples and CRLB against the angle separation, when the second reflection is delayed by the four studied delays and SNR = 60 dB.

The ML methods and WSF have a lower value than the CRLB for small separations in the coherent case. Also this is caused by a bias in the estimation. All of the methods have the global minimum in between the true values and they get strong evidence for the biased estimate, similarly as PWD above. Thus, the information contained in the second order moments of the array covariance matrix is too coherent for unbiased estimation in this case. First, the reflection signals are coherent. Second, the steering vectors are very similar when the reflections are arriving from directions that are close to each other. To obtain a higher performance in this case, more microphones would be required. For larger separations, the ML methods and WSF follow the CRLB and perform clearly better than PWD or MUSIC.

In the partially correlated cases, the ML methods and WSF have the same performance. MUSIC has a slightly higher RMSE than the ML methods and WSF. That is, the ML methods and WSF outperform MUSIC in the partially correlated case. In the almost incoherent case with the largest delay, MUSIC, the ML methods and WSF have the same performance. The PWD performance in that case is similar to the coherent case.

The CRLB does not predict the RMSE very well when the RMSE is high [37]. That is, the CRLB assumes a low error variance. According to [26], the CRLB and RMSE are approximately in agreement when the theoretical standard deviation of the DOA estimation of the reflections is less than half the angle separation. This is also approximately true in the above studied cases. For the smallest separations in the coherent case, the CRLB for the first and the second reflection is more than half of the angle separation.

C. Two Reflections in a Single Frequency

Next we study the performance of the methods in a single frequency. As explained above, since the processing is applied in the frequency domain, the noise is whitened as in [2]. That is, the whitened frequency domain covariance matrix and steering vectors are used instead of the time domain versions in Eqs. (16) and (31). When analyzing DOA in a single frequency with a microphone array, one should consider the spatial aliasing. As mentioned, spatial aliasing typically occurs in high frequencies [9]. For the applied array this limit is about 4.7 kHz. However, when analyzing single frequencies, the ML methods and WSF are unaffected by the spatial aliasing, since the underlying model is continuous, i.e., it does not contain any zeros, and the involved quantities are equal for all frequencies.

When the analysis is performed in a single frequency, the reflection signal covariance matrix is inevitably coherent, i.e.:

(55)
and the array covariance matrix estimate becomes

(56)

Fig. 3. RMSE of the ML methods over 100 Monte-Carlo samples and CRLB for selected octave-band center frequencies when the SNR is 60 dB. The best performance is achieved around 4 kHz.

The selected frequencies for this experiment are the center frequencies of the octave bands. It should be noted that the array covariance matrices are not averaged over the octave bands, but the performance is examined in single frequencies as shown in Eq. (56).

The reflections are simulated as wideband reflections as above, and the SNR is set to 60 dB. The frequency domain SH coefficients are evaluated with Eq. (24) via the discrete Fourier transform, and the frequency bins that are the closest to the selected frequencies are used in the analysis. The bandwidth for each analyzed frequency is about 21 Hz due to the aliasing, or "spectral leakage", from the windowing.

The results of the experiment for different frequencies are shown in Fig. 3. The results show that the methods obtain a similar performance as in the wideband case at the higher analyzed frequencies, of which about 4 kHz gives the best performance. The reason why the RMSE falls below the CRLB at some of these frequencies is the same as in the wideband case, explained above. That is, the difference between the steering vectors for reflections arriving from almost the same angle is small, and the reflection signal covariance matrix is coherent. PWD and MUSIC suffer from the same problems in the case of single frequencies as in the wideband case.

At the lowest analyzed frequencies, the estimation with all methods is biased as the RMSE is on average about 2 degrees for all values of the separation angle. We can see from Fig. 3 that for these low frequencies the CRLB is higher than half of the angle separation with all the studied separation angles. Thus, it is expected that the CRLB and the RMSE do not meet when the error variance is this high. The poor performance on these frequencies is a consequence of a poor effective SNR. In low frequencies, due to the array geometry, the effective SNR is lower than in higher frequencies.

D. Real Data

1) Measurement setup: To test the methods in real situations, measurements were made in a semi-anechoic room. The octave band reverberation times as well as the theoretical absorption coefficients, calculated with Sabine's equation, are shown in Table I. The room has highly absorptive walls and ceiling, which are treated with 5 cm of mineral wool, placed 50 cm in front of a concrete wall to generate an air gap, as shown in Fig. 4. The material of the highly reflective floor is linoleum on concrete. Seven reflections were introduced to the measurement by building a reflective corner in the room, shown also in Fig. 4. This corner was built of two reflective projection silver screens, parallel to the walls of the room. The reflections include the first, second, and third order reflections from the inserted corner and the floor, in addition to the direct sound from the source to the array.

A Genelec 1029A was placed in the source position and the em32 Eigenmike in the receiver position, as shown in Fig. 4. The loudspeaker was facing the em32 microphone array, and the acoustic center of the loudspeaker as well as the center of the microphone array was at 1.3 m height. The applied loudspeaker is flat in the direction of the reflective surface up to 1 kHz, and is attenuated about 6 dB from 1 kHz to 10 kHz in the directions off the central horizontal plane. Based on tabulated values, it is estimated that the absorption coefficient of the projection silver screens is less than 0.1 for frequencies above 500 Hz and less than 0.1 for the floor for all frequencies. Due to the size of the reflective surface, material, and the geometry of the source-array
setup, it is estimated that reflections from the reflective surface are obtained for frequencies above 250 Hz.

Fig. 4. The measurement setup in the semi-anechoic room.

TABLE I: Measured reverberation time of the semi-anechoic room and (Sabine's) theoretical absorption coefficient.

An impulse response was measured with a 10 s long exponential sine-sweep from 1 Hz to 24 kHz [38]. The impulse responses of the 32 em32 Eigenmike microphones as well as the SH time domain coefficients are shown in Fig. 5, normalized in the visualization such that the maximum absolute value is 1. The impulse responses were truncated to 7.2 ms after the direct sound, resulting in a total length of 1024 samples. The truncation was done to include only the reflections that are under study, although, as shown in Fig. 5, there are no visible reflections in the impulse response after the 7th reflection.

Fig. 5. (a) The 32 room impulse responses measured with the em32 Eigenmike and (b) the corresponding real part of the SH coefficients in the time domain, as well as the windowing (---) applied in the experiments. Note that in both (a) and (b) the impulse responses overlap heavily in the visualization.

2) Measurement of the steering vectors: The steering vectors were estimated from a set of anechoic measurements of the em32 Eigenmike. The impulse responses were measured at 48 kHz from 1 Hz to 24 kHz with a 5 s long exponential sine-sweep signal. The measurement grid had 5 and 10 degree resolutions for azimuth and inclination, respectively, leading to a total of 1225 points. Moreover, the distance from the source, a Genelec 8020B loudspeaker, to the em32 microphone array was 2 m. The initial time delay of the direct sound and the internal delay due to buffers etc. was first removed from the responses. Then, an additional direction-dependent sinusoidally varying delay in the measured responses, due to the acoustic center of the array not being exactly fixed while the array was rotated, was compensated in a similar way as described in [28], [39]. Finally, the measurements were truncated to 2048 samples and processed as shown in Section III-D to obtain SH coefficients in the frequency domain for the harmonic order of the array.

3) Processing and results: We study the performance of the methods with two experiments. The first experiment is a wideband analysis using the time smoothing algorithm, as suggested in [3]. The second experiment is a single frequency analysis in the frequency domain. In the wideband evaluation, time domain SH coefficients are windowed using a 48 sample long (1 ms) Hanning window around the time of arrival of each reflection and the direct sound, as shown in Fig. 5(b). The array covariance matrix is then averaged from 48 time domain SH coefficients for each of the eight time windows. A single direction of arrival is estimated for each of the windows. Thus, we set the number of reflections to one for each time window. In the single frequency analysis, we window the time domain impulse responses with two rectangular windows of length 512 samples, tapered in the beginning and end. The windows are selected so that both of the windows include four acoustic events, as shown in Fig. 5(a). The first window includes the direct sound and three reflections and the second one includes four reflections. That is, in the estimation, we use four events for both windows. The array covariance matrix is calculated using single frequency SH coefficients in the frequency domain, as above in the simulations in Section III-C.

The results of the real data experiments are shown for each method in Table II as an average RMSE over the eight DOA estimates. As we can see from the results, there is not much difference in the performance of the methods in the wideband case. This result was also predictable from the first simulation result. That is, when a single reflection is present, all the methods perform equally. The results of the single frequency analysis in Table II show that SML performs clearly worse than DML and WSF. This degradation in the performance is caused by the fact that SML does not converge to the global minimum but to a local minimum. This result is already shown in [26, p. 61], where SML has a worse probability of converging to a global minimum than DML or WSF. PWD and MUSIC perform worse than the other methods, as expected from the simulations. As a conclusion of the experiments, DML and WSF have a similar performance and should thus be preferred in the analysis of multiple reflections, especially in the single frequency analysis. In addition, WSF has an advantage over DML since it is computationally lighter, especially if the amount of data is large, which is typical in room acoustics.

4) Discussion: The results of this paper show that if a single reflection is present in the analysis window, all the studied methods perform well. However, if there are two reflections that are highly correlated, the estimation of PWD and MUSIC is biased and the ML methods are by far a better choice. As
TABLE II V. CONCLUSIONS AND FUTURE WORK


RMSE OF THE METHODS IN THE REAL EXPERIMENT FOR INCLINATION
AND AZIMUTH IN THE REAL EXPERIMENT IN DIFFERENT FREQUENCIES, This paper studied the direction estimation of reflections from
AND FOR WIDEBAND VIA TIME SMOOTHING ALGORITHM spatial room impulse responses. The analysis case of multiple
reflections in a small time window was of special interest. This
case is known to be difficult to methods such as MUSIC or
ESPRIT since the reflections are highly correlated or even co-
herent and these methods rely on the independence of the reflec-
tions. The problem of estimating the DOA of coherent reflec-
tions has not been solved previously. However, the case of par-
tially correlated or incoherent reflections has been solved previ-
ously by smoothing, i.e., averaging, the array covariance matrix
over time [3], frequency [2] or space [16]. A room impulse re-
sponse may have different characteristic at every time instant
and in every frequency. Therefore, smoothing results in a low-
ered resolution with respect to time, frequency, or space, which
impairs the detailed analysis of the room acoustics. This paper
proposed to use ML methods in the DOA estimation of reflec-
tions, where smoothing operations can be avoided and which
can cope with coherent signals.
Simulations and real data experiments were conducted to
study the performance of the direction estimation with PWD
beamformer, MUSIC, two ML methods, a deterministic and
a stochastic one, and a subspace technique, WSF, which is a
large sample approximation of the stochastic ML. The results
show that the ML methods and WSF have better performance
than MUSIC or PWD in the studied case of partially correlated
or coherent reflections. They are able to correctly estimate the
direction of several reflections with a much higher resolution
than MUSIC or PWD. Moreover, an example analysis of a real
shown in Fig. 2, all the methods benefit if smoothing is applied.
situation showed that the ML methods and WSF allow a much
Similarly as with the time-domain smoothing, the performance
more accurate analysis of the room acoustics than MUSIC or
of all the methods would increase if frequency smoothing
PWD, since frequency analysis is possible. Moreover, real data
would be applied.
experiments showed that the stochastic ML performs worse
In the real experiments the reflective wall does not reflect low
than the deterministic one or WSF, due to a lower convergence
frequencies well and the estimation for them is very erroneous,
probability. Therefore, DML or WSF should be preferred in
as shown in Table II. Actually the time window in the analysis
the analysis of reflections. In addition, WSF is computationally
of the lowest frequencies ( Hz) should be longer, since
more light, which makes it the best approach for the analysis of
the low frequencies travel a longer path. They go through the
reflections out of the five methods tested in this paper.
In the real experiments the reflective wall does not reflect low frequencies well, and the estimates for them are very erroneous, as shown in Table II. In fact, the time window used in the analysis of the lowest frequencies ( Hz) should be longer, since the low-frequency components travel a longer path: they pass through the projection screen and the mineral wool and are eventually reflected from the concrete wall. Namely, as the air gap in the wall is 0.45 m, frequencies above about 200 Hz are attenuated in the gap, whereas lower frequencies are reflected back. This can also be observed from the theoretical values in Table I. In addition, as shown in the simulations, the performance of DOA estimation is poor at low frequencies ( Hz). However, even if the windowing were correct for the lowest frequencies, the DOA estimation would still suffer from the same problems as in the simulated low-frequency case. For the analysis of multiple coherent reflections at low frequencies ( Hz), a microphone array with a larger radius should be used.
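The roughly 200 Hz limit quoted above is consistent with a simple quarter-wavelength reading of the 0.45 m air gap; the estimate below is our interpretation and not a formula given in the paper:

$$ f_{\mathrm{lim}} \approx \frac{c}{4d} = \frac{343\ \mathrm{m/s}}{4 \times 0.45\ \mathrm{m}} \approx 190\ \mathrm{Hz}, $$

i.e., only components whose quarter wavelength fits inside the gap are effectively absorbed there, while longer wavelengths are reflected back towards the array.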
As shown here, the performance of DOA estimation of several reflections with a rigid sphere depends on the analyzed frequency. The radius of the sphere affects the accuracy of the estimation at a given frequency, as shown by the CRLB: the larger the radius, the better the performance at low frequencies. With the current array of radius 4.2 cm, the best performance was achieved around 4 kHz. If similar performance were required around, say, 250 Hz, the radius should be about 67.2 cm.
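The 67.2 cm figure follows from scaling the radius inversely with frequency, i.e., from keeping the product of the wavenumber and the array radius, $ka$, unchanged; this is a restatement of the scaling implied by the quoted numbers rather than an equation from the paper:

$$ r_2 = r_1\,\frac{f_1}{f_2} = 4.2\ \mathrm{cm} \times \frac{4000\ \mathrm{Hz}}{250\ \mathrm{Hz}} = 67.2\ \mathrm{cm}. $$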
The results of this paper show that ML methods can be applied to investigate the acoustics of rooms, such as concert halls [4], in more detail than ever before. In addition, the studied ML methods can be applied to obtain a more detailed synthesis of room acoustics with spatial sound reproduction methods.

REFERENCES

[1] J. Merimaa, T. Lokki, T. Peltonen, and M. Karjalainen, “Measurement, analysis, and visualization of directional room responses,” in Proc. 111th Conv. Audio Eng. Soc., New York, NY, USA, Sep. 21–24, 2001, paper 5449.
[2] D. Khaykin and B. Rafaely, “Acoustic analysis by spherical microphone array processing of room impulse responses,” J. Acoust. Soc. Amer., vol. 132, no. 1, pp. 261–270, 2012.
[3] N. Huleihel and B. Rafaely, “Spherical array processing for acoustic analysis using room impulse responses and time-domain smoothing,” J. Acoust. Soc. Amer., vol. 133, no. 6, pp. 3995–4007, 2013.
[4] J. Pätynen, S. Tervo, and T. Lokki, “Analysis of concert hall acoustics via visualizations of time-frequency and spatio-temporal responses,” J. Acoust. Soc. Amer., vol. 133, no. 2, pp. 842–857, Feb. 2013.
[5] J. Merimaa and V. Pulkki, “Spatial impulse response rendering I: Analysis and synthesis,” J. Audio Eng. Soc., vol. 53, no. 12, pp. 1115–1127, 2005.
[6] S. Tervo, J. Pätynen, and T. Lokki, “Spatial decomposition method for room impulse responses,” J. Audio Eng. Soc., vol. 61, no. 1/2, pp. 16–27, Mar. 2013.
[7] T. D. Abhayapala and D. B. Ward, “Theory and design of high order sound field microphones using spherical microphone array,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2002, vol. II, pp. 1949–1953.
[8] J. Meyer and G. Elko, “A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2002, vol. II, pp. 1781–1784.
[9] B. Rafaely, “Analysis and design of spherical microphone arrays,” IEEE Trans. Speech Audio Process., vol. 13, no. 1, pp. 135–143, Jan. 2005.
[10] Z. Li and R. Duraiswami, “Flexible and optimal design of spherical microphone arrays for beamforming,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 2, pp. 702–714, Feb. 2007.
[11] H. Kuttruff, Room Acoustics. London, U.K.: Spon Press, 2000.
[12] H. Krim and M. Viberg, “Two decades of array signal processing research: The parametric approach,” IEEE Signal Process. Mag., vol. 13, no. 4, pp. 67–94, Jul. 1996.
[13] B. Rafaely, “Plane-wave decomposition of the sound field on a sphere by spherical convolution,” J. Acoust. Soc. Amer., vol. 116, no. 4, pp. 2149–2157, 2004.
[14] H. Teutsch, “Wavefield decomposition using microphone arrays and its application to acoustic scene analysis,” Ph.D. dissertation, Univ. Erlangen-Nürnberg, Erlangen, Germany, 2005.
[15] E. Mabande, H. Sun, K. Kowalczyk, and W. Kellermann, “Comparison of subspace-based and steered beamformer-based reflection localization methods,” in Proc. Eur. Signal Process. Conf., 2011.
[16] H. Sun, H. Teutsch, E. Mabande, and W. Kellermann, “Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2011, pp. 117–120.
[17] C.-I. C. Nilsen, I. Hafizovic, and S. Holm, “Robust 3-D sound source localization using spherical microphone arrays,” in Proc. Audio Eng. Soc. Conv. 134, May 2013, paper no. 8904.
[18] B. Ottersten and T. Kailath, “Direction-of-arrival estimation for wide-band signals using the ESPRIT algorithm,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 2, pp. 317–327, 1990.
[19] H. Wang and M. Kaveh, “Coherent signal-subspace processing for the detection and estimation of angles of arrival of multiple wide-band sources,” IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 4, pp. 823–831, 1985.
[20] H. Hung and M. Kaveh, “Focussing matrices for coherent signal-subspace processing,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 8, pp. 1272–1281, Aug. 1988.
[21] S. U. Pillai and B. H. Kwon, “Forward/backward spatial smoothing techniques for coherent signal identification,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 1, pp. 8–15, Jan. 1989.
[22] C. Qi, Y. Wang, Y. Zhang, and Y. Han, “Spatial difference smoothing for DOA estimation of coherent signals,” IEEE Signal Process. Lett., vol. 12, no. 11, pp. 800–802, Jan. 2005.
[23] F.-M. Han and X.-D. Zhang, “An ESPRIT-like algorithm for coherent DOA estimation,” IEEE Antennas Wireless Propag. Lett., vol. 4, pp. 443–446, 2005.
[24] Y. Zhang and Z. Ye, “Efficient method of DOA estimation for uncorrelated and coherent signals,” IEEE Antennas Wireless Propag. Lett., vol. 7, pp. 799–802, 2008.
[25] I. Hafizovic, C. Nilsen, and S. Holm, “Transformation between uniform linear and spherical microphone arrays with symmetric responses,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1189–1195, May 2012.
[26] B. Ottersten, M. Viberg, P. Stoica, and A. Nehorai, “Exact and large sample maximum likelihood techniques for parameter estimation and detection in array processing,” in Radar Array Processing. Berlin, Germany: Springer, 1993, ch. 4.
[27] S. Moreau, J. Daniel, and S. Bertet, “3D sound field recording with higher order ambisonics: Objective measurements and validation of spherical microphone,” in Proc. Audio Eng. Soc. Conv. 120, paper no. 6857.
[28] C. T. Jin, N. Epain, and A. Parthy, “Design, optimization and evaluation of a dual-radius spherical microphone array,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 1, pp. 193–204, Jan. 2014.
[29] A. Farina, M. Binelli, A. Capra, E. Armelloni, S. Campanini, and A. Amendola, “Recording, simulation and reproduction of spatial soundfields by spatial PCM sampling (SPS),” Int. Seminar Virtual Acoust., vol. 4.1b, p. 14, Nov. 2011.
[30] A. O’Donovan, R. Duraiswami, and D. Zotkin, “Imaging concert hall acoustics using visual and audio cameras,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 2008, pp. 5284–5287.
[31] B. Rafaely, “Phase-mode versus delay-and-sum spherical microphone array processing,” IEEE Signal Process. Lett., vol. 12, no. 10, pp. 713–716, Oct. 2005.
[32] M. Park and B. Rafaely, “Sound-field analysis by plane-wave decomposition using spherical microphone array,” J. Acoust. Soc. Amer., vol. 118, no. 5, pp. 3094–3103, 2005.
[33] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Upper Saddle River, NJ, USA: Prentice-Hall, 1998.
[34] L. Kumar and R. Hegde, “Stochastic Cramér-Rao bound analysis for DOA estimation in spherical harmonics domain,” IEEE Signal Process. Lett., vol. 22, no. 8, pp. 1030–1034, Aug. 2015.
[35] mh acoustics, “EM32 Eigenmike microphone array release notes (v17.0),” mh acoustics, Summit, NJ, USA, Tech. Rep., Oct. 2013.
[36] I. Ziskind and M. Wax, “Maximum likelihood localization of multiple sources by alternating projection,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 10, pp. 1553–1560, Oct. 1988.
[37] G. C. Carter, “Coherence and time delay estimation,” Proc. IEEE, vol. 75, no. 2, pp. 236–255, Feb. 1987.
[38] A. Farina, “Simultaneous measurement of impulse response and distortion with a swept-sine technique,” in Proc. Audio Eng. Soc. Conv. 108, 2000, paper no. 5093.
[39] A. Farina, A. Amendola, A. Capra, and C. Varani, “Spatial analysis of room impulse responses captured with a 32-capsule microphone array,” in Proc. Audio Eng. Soc. Conv. 130, 2011, paper no. 8400.

Sakari Tervo was born in Kuopio, Finland, in 1983. He received an M.Sc. degree in audio signal processing from Tampere University of Technology in 2006 and a D.Sc. degree in the field of acoustic signal processing from Aalto University in 2012. He has been a Visiting Researcher at Philips Research, the Netherlands, in 2007, and at the University of York in 2010. Currently, he works as a Post-Doctoral Researcher at Aalto University.

Archontis Politis obtained his M.Eng. degree in civil engineering at Aristotle University of Thessaloniki, Greece, and his M.Sc. degree in sound & vibration studies at ISVR, University of Southampton, U.K., in 2006 and 2008, respectively. From 2008 to 2010, he worked as a Graduate Acoustic Consultant at Arup Acoustics, Glasgow, U.K., and as a Researcher in a joint collaboration between Arup Acoustics and the Glasgow School of Art, on interactive auralization of architectural spaces using 3D sound techniques. Currently, he is pursuing a doctoral degree in the field of parametric spatial sound recording and reproduction.