Immersive Audio Signal Processing
Sunil Bharitkar
Audyssey Laboratories, Inc. and
University of Southern California
Los Angeles, CA, USA
Chris Kyriakakis
University of Southern California
Los Angeles, CA, USA
Sunil Bharitkar
Dept. of Electrical Eng.-Systems
University of Southern California
Los Angeles, CA 90089-2564, and
Audyssey Laboratories, Inc.
350 S. Figueroa Street, Ste. 196
Los Angeles, CA 90071
[email protected]

Chris Kyriakakis
Dept. of Electrical Eng.-Systems and
Integrated Media Systems Center (IMSC)
University of Southern California
Los Angeles, CA 90089-2564
[email protected]
To Aai and Baba
This book is the result of several years of research in acoustics, digital signal process-
ing (DSP), and psychoacoustics, conducted in the Immersive Audio Laboratory at the
University of Southern California’s (USC) Viterbi School of Engineering, the Signal
and Image Processing Institute at USC, and Audyssey Laboratories. The authors’
association began five years ago when Sunil Bharitkar joined Chris Kyriakakis’ re-
search group as a PhD candidate.
The title Immersive Audio Signal Processing refers to the fact that signal pro-
cessing algorithms that take into account human perception and room acoustics, as
described in this book, can greatly enhance the experience of immersion for listeners.
The topics covered are of widespread interest in both consumer and professional au-
dio, and have not been previously presented comprehensively in an audio processing
textbook.
Besides the basics of DSP and psychoacoustics, this book contains the latest
results in audio processing for audio synthesis and rendering, multichannel room
equalization, audio selective signal cancellation, signal processing for audio appli-
cations, surround sound synthesis and processing, and the incorporation of psychoa-
coustics in audio signal processing algorithms.
Chapter 1, “Foundations of Digital Signal Processing for Audio,” includes con-
cepts from signals and linear systems, analog–to–digital and digital–to–analog con-
version, convolution, digital filtering concepts, sampling rate alteration, and transfer
function representations (viz., z-transforms, Fourier transforms, bilinear transforms).
Chapter 2, “Filter Design Techniques for Audio Processing,” introduces the de-
sign of various filters such as FIR, IIR, parametric, and shelving filters.
Chapter 3, “Introduction to Acoustics and Auditory Perception,” introduces the
theory and physics behind sound propagation in enclosed environments, room acous-
tics, reverberation time, the decibel scale, loudspeaker and room responses, and
stimuli for measuring room responses (e.g., logarithmic chirp, maximum length sequences). We also briefly discuss some relevant topics in psychoacoustics, such as
loudness perception and frequency selectivity.
In Chapter 4, "Immersive Audio Synthesis and Rendering," we present techniques that automatically generate the multiple microphone signals needed for multichannel rendering, without requiring recordings from multiple real microphones. We also present techniques for spatial audio playback over loudspeakers. It is assumed that readers have sufficient knowledge of head-related transfer functions (HRTFs); adequate references are provided at the end of the book for interested readers.
Chapter 5, “Multiple Position Room Response Equalization for Real-Time Ap-
plications,” provides the necessary theory behind equalization of room acoustics for
immersive audio playback. Theoretical analysis and examples for single listener and
multiple listener equalization are provided. Traditional techniques of single position
equalization using FIR and IIR filters are introduced. Subsequently, a multiple lis-
tener equalization technique employing a pattern recognition technique is presented.
For real-time implementations, warping for designing lower filter orders is intro-
duced. The motivation for the pattern recognition approach can be seen through a
statistical analysis and visual interpretation of the clustering phenomena through the
Sammon map algorithm. The Sammon map also permits a visual display of room
response variations as well as a multiple listener equalization performance measure.
The influence of reverberation on room equalization is also discussed. Results from
a typical home theater setup are presented in the chapter.
Chapter 6, “Practical Considerations for Multichannel Equalization,” discusses
distortions due to phase effects, and presents algorithms that minimize the effect
of phase distortions. Selecting proper choices of bass management filters, crossover
frequencies, as well as all-pass coefficients and time-delay adjustments that affect
crossover region response are presented.
Chapter 7, “Robustness of Equalization to Displacement Effects: Part I,” explores
robustness analysis (viz., mismatch between listener positions during playback and
microphone position during room response measurement) in room equalization for
frequencies above the Schroeder frequency.
Chapter 8, “Robustness of Equalization to Displacement Effects: Part II,” ex-
plores robustness analysis in room equalization for low frequencies.
Chapter 9, "Selective Audio Signal Cancellation," presents a signal processing-based approach for canceling audio signals at predetermined positions. This is important, for example, in automobile environments for creating a spatial zone of silence.
The material in this book is primarily intended for practicing engineers, scientists, and researchers in the field. It is also suitable for a semester course at the upper-level undergraduate and graduate level. A basic knowledge of signal processing and linear system theory is assumed, although relevant topics are presented early
on in this book. References to supplemental information are given at the end of the
book.
Several individuals provided technical comments and insight on a preliminary version of the manuscript. We would like to acknowledge and thank the following individuals: Dr. Randy Cole of Texas Instruments, Prof. Tomlinson Holman of the University of Southern California, and Prof. Stephan Weiss of the University of Southampton. Ana Bozicevic and Vaishali Damle at Springer encouraged the authors to produce the manuscript and make the book a reality, and we are thankful for their valuable assistance during the process. We would also like to thank Elizabeth Loew for the production of this volume. Thanks also go to the people at Audyssey Laboratories, in particular Philip Hilmes and Michael Solomon, for their support during the preparation of this manuscript.
We invite you to join us on this exciting journey, where signal processing, acous-
tics, and auditory perception have merged to create a truly immersive experience.
Part I
The content presented in this chapter includes relevant topics in digital signal pro-
cessing such as the mathematical foundations of signal processing (viz., convolution,
sampling theory, etc.), basics of linear and time-invariant (LTI) systems, minimum-
phase and all-pass systems, sampling and reconstruction of signals, discrete time
Fourier transform (DTFT), discrete Fourier transform (DFT), z-transform, bilinear
transform, and linear-phase finite impulse response (FIR) filters.
A discrete time signal x(n) is obtained by sampling a continuous time signal xc(t),

x(n) = xc(nTs),  −∞ < n < ∞    (1.1)

where the continuous time signal xc(t) is sampled with a sampling period Ts which
is the inverse of the sampling frequency fs . Typical sampling frequencies used in
audio processing applications include 32 kHz, 44.1 kHz, 48 kHz, 64 kHz, 96 kHz,
and 192 kHz.
Some examples of discrete time signals include:
(i) The Kronecker delta function, shown in Fig. 1.2 and defined by

x(n) = δ(n) = { 1,  n = 0
              { 0,  n ≠ 0    (1.2)
(iv) The special case when α = e^{jω0} and A = |A|e^{jφ}; the resulting complex exponential sequence, shown in Fig. 1.5 (in the equivalent continuous form), is given by

x(n) = Aα^n = |A|e^{j(ω0 n + φ)} = |A| cos(ω0 n + φ) + j|A| sin(ω0 n + φ)
Fig. 1.5. Real and imaginary part of a complex exponential sequence with φ = 0.25π and
ω0 = 0.1π.
Operations performed by a DSP system are based on the premise that the system satisfies the properties of linearity and time-invariance. Specifically, from the theory of linear systems, if T{.} denotes the transformation performed by a linear system (i.e., y(n) = T{x(n)}), then the input and output of a linear system satisfy the following properties of additivity and homogeneity, respectively,

T{x1(n) + x2(n)} = T{x1(n)} + T{x2(n)}
T{ax(n)} = aT{x(n)}
With a Kronecker delta function, δ(n), applied as an input to a linear system, the
output of the linear system is defined as an impulse response h(n) that completely
characterizes the linear system as shown in Fig. 1.6. The class of DSP systems that exhibit both linearity and time-invariance (LTI) is extremely important for designing immersive audio signal processing systems. Thus, if the input x(n) is represented as a series of delayed impulses as
x(n) = Σ_{k=−∞}^{∞} x(k)δ(n − k)    (1.7)
then the output y(n) can be expressed as the well-known convolution formula where
h(n) is the impulse response,
y(n) = T{ Σ_{k=−∞}^{∞} x(k)δ(n − k) }
     = Σ_{k=−∞}^{∞} x(k)T{δ(n − k)}
     = Σ_{k=−∞}^{∞} x(k)h(n − k)
     = Σ_{k=−∞}^{∞} h(k)x(n − k)
     = x(n) ⊗ h(n)    (1.8)
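The convolution sum (1.8) is straightforward to verify numerically. The sketch below (in Python/NumPy, used here for illustration; the two short sequences are arbitrary choices, not from the text) evaluates the sum directly and compares it against a library convolution:

```python
import numpy as np

# Illustration of the convolution sum (1.8): y(n) = sum_k x(k) h(n - k)
x = np.array([1.0, 0.5, 0.25, 0.125])   # input x(n)
h = np.array([1.0, -0.5])               # impulse response h(n)

# Direct evaluation of the convolution sum
y_direct = np.zeros(len(x) + len(h) - 1)
for n in range(len(y_direct)):
    for k in range(len(x)):
        if 0 <= n - k < len(h):
            y_direct[n] += x[k] * h[n - k]

# NumPy's built-in convolution gives the same result
y_np = np.convolve(x, h)                # [1, 0, 0, 0, -0.0625]
assert np.allclose(y_direct, y_np)
```

Note that the fourth line of (1.8) expresses the commutativity of convolution: swapping `x` and `h` in `np.convolve` yields the identical output.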
h(n) = (1/(M1 + M2 + 1)) Σ_{k=−M1}^{M2} δ(n − k)    (1.10)
The input and output signals from any LTI system can be found through various
methods. For simple impulse responses (as given in the above examples), substitut-
ing δ(n) with x(n) and h(n) with y(n) provides the input and output signal descrip-
tion of the LTI system. For example, the input and output signal description for the
ARMA system can be written as
Σ_{i=N1}^{N2} a_i y(n − i) = Σ_{k=M1}^{M2} b_k x(n − k)    (1.13)
A second method for finding the output from an LTI system involves the con-
volution formula (1.8) (along with any graphical plotting). For example, if x(n) =
α1n u(n) and h(n) = α2n u(n) are two right-sided sequences (where |α1 | < 1 and
|α2 | < 1 and u(n) is the step function), then with (1.8),
y(n) = Σ_{k=−∞}^{∞} α1^k u(k) α2^{n−k} u(n − k) = Σ_{k=0}^{n} α1^k α2^{n−k},  n ≥ 0    (1.14)

which can be evaluated in closed form using the geometric series sum

Σ_{k=N1}^{N2} α^k = (α^{N1} − α^{N2+1})/(1 − α)    (1.15)
Fig. 1.7. The convolution of two exponentially decaying right-sided sequences (viz., (1.14))
with α1 = 0.3 and α2 = 0.6.
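The closed form behind Fig. 1.7 can be checked numerically. Below is a small NumPy sketch (the values α1 = 0.3 and α2 = 0.6 follow the figure) comparing the convolution of the two exponentials against the geometric-series result obtained from (1.15):

```python
import numpy as np

a1, a2 = 0.3, 0.6
n = np.arange(40)
x = a1 ** n          # x(n) = a1^n u(n)
h = a2 ** n          # h(n) = a2^n u(n)

# Numerical convolution, truncated to the first len(n) samples
y_num = np.convolve(x, h)[: len(n)]

# Closed form from the geometric-series sum (1.15):
# y(n) = (a2^(n+1) - a1^(n+1)) / (a2 - a1)
y_closed = (a2 ** (n + 1) - a1 ** (n + 1)) / (a2 - a1)

assert np.allclose(y_num, y_closed)
```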
If the input to an LTI system is the complex exponential sequence x(n) = e^{jωn}, then from (1.8) the output is

y(n) = Σ_{k=−∞}^{∞} h(k)e^{jω(n−k)} = e^{jωn} Σ_{k=−∞}^{∞} h(k)e^{−jωk} = e^{jωn}H(e^{jω})
where H(e^{jω}) = H_R(e^{jω}) + jH_I(e^{jω}) = |H(e^{jω})|e^{j∠H(e^{jω})} is the discrete time Fourier transform of the system and characterizes the LTI system along a real frequency axis ω expressed in radians. The real frequency variable ω is related to the
analog frequency Ω = 2πf (f is in Hz), as shown in a subsequent section on sam-
pling, through the relation ω = ΩTs where the sampling frequency fs and the sam-
pling period Ts are related by Ts = 1/fs .
An important property of the discrete time Fourier transform or frequency re-
sponse, H(ejω ), is that it is periodic with a period of 2π, as
H(e^{jω}) = Σ_{k=−∞}^{∞} h(k)e^{−jωk} = Σ_{k=−∞}^{∞} h(k)e^{−j(ω+2π)k} = H(e^{j(ω+2π)}).
A periodic discrete time frequency response of a low-pass filter, with a cutoff fre-
quency ωc , is shown in Fig. 1.8.
Thus, the discrete time forward and inverse Fourier transforms can be expressed
as
h(n) ←→ H(e^{jω})    (1.17)

H(e^{jω}) = Σ_{k=−∞}^{∞} h(k)e^{−jωk}    (1.18)

h(n) = (1/2π) ∫_{2π} H(e^{jω})e^{jωn} dω    (1.19)
The properties and theorems on Fourier transforms can be found in several texts
(e.g., [2, 3]).
If the transfer function of an LTI system is expressed in the factored form

H(e^{jω}) = (b0/a0) Π_{k=1}^{M} (1 − c_k e^{−jω}) / Π_{k=1}^{N} (1 − d_k e^{−jω})    (1.20)

and is stable (i.e., |d_k| < 1 ∀k), then the magnitude response of the transfer function in decibel scale (dB scale) can be expressed as

|H(e^{jω})|² = H(e^{jω})H*(e^{jω})
            = |b0/a0|² Π_{k=1}^{M} (1 − c_k e^{−jω})(1 − c_k* e^{jω}) / Π_{k=1}^{N} (1 − d_k e^{−jω})(1 − d_k* e^{jω})    (1.21)

|H(e^{jω})|(dB) = 20 log10 |b0/a0| + 20 Σ_{k=1}^{M} log10 |1 − c_k e^{−jω}| − 20 Σ_{k=1}^{N} log10 |1 − d_k e^{−jω}|

whereas the phase response is

∠H(e^{jω}) = arg[b0/a0] + Σ_{k=1}^{M} arg[1 − c_k e^{−jω}] − Σ_{k=1}^{N} arg[1 − d_k e^{−jω}]    (1.22)
The group delay is defined as

grd[H(e^{jω})] = −(∂/∂ω)∠H(e^{jω})    (1.23)
              = −Σ_{k=1}^{M} (∂/∂ω) arg[1 − c_k e^{−jω}] + Σ_{k=1}^{N} (∂/∂ω) arg[1 − d_k e^{−jω}]
The numerator roots ck and the denominator roots dk of the transfer function H(ejω )
(1.20) are called the zeros and poles of the transfer function, respectively.
Because H(e^{jω}) = H(e^{j(ω+2π)}), the phase of each of the terms in the phase response (1.22) is ambiguous (it is only determined up to integer multiples of 2π). A correct phase response can be obtained by taking the principal value ARG[H(e^{jω})] (which lies between −π and π), computed by any computer subroutine (e.g., the angle command in MATLAB) or the arctangent function on a calculator, and adding 2πr(ω) [4]. Thus,

arg[H(e^{jω})] = ARG[H(e^{jω})] + 2πr(ω)    (1.24)

The unwrap command in MATLAB computes this continuous (unwrapped) angle from the principal value. For example, a fourth-order (N = 4) low-pass Butterworth transfer function used for audio bass management, with a cutoff frequency ωc, can be expressed as
H(e^{jω}) = Π_{k=0}^{N/2−1} (b0,k + b1,k e^{−jω} + b2,k e^{−j2ω}) / (a0,k + a1,k e^{−jω} + a2,k e^{−j2ω})    (1.25)

b0,k = b2,k = K²
b1,k = 2K²
a0,k = 1 + 2K cos(π(2k + 1)/2N) + K²
a1,k = 2(K² − 1)
a2,k = 1 − 2K cos(π(2k + 1)/2N) + K²
where K = tan(ωc /2) = tan(πfc /fs ). The magnitude response, principal phase,
and unwrapped phase, for the fourth-order Butterworth low-pass filter with cutoff
frequency fc = 80 Hz and fs = 48 kHz, as shown in Fig. 1.9, reveal the 2π phase
rotation in principal value (viz., Fig. 1.9(b)).
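The biquad-cascade form (1.25) is easy to evaluate numerically. The following sketch (Python/NumPy; the function names are ours, not from the text) builds the coefficient sets for N = 4, fc = 80 Hz, fs = 48 kHz and confirms the expected −3 dB gain at the cutoff:

```python
import numpy as np

def butterworth_lp_sos(N, fc, fs):
    """Biquad coefficients of an N-th order (N even) low-pass
    Butterworth via the closed form of Eq. (1.25)."""
    K = np.tan(np.pi * fc / fs)
    sos = []
    for k in range(N // 2):
        c = np.cos(np.pi * (2 * k + 1) / (2 * N))
        b = np.array([K**2, 2 * K**2, K**2])
        a = np.array([1 + 2 * K * c + K**2, 2 * (K**2 - 1), 1 - 2 * K * c + K**2])
        sos.append((b, a))
    return sos

def mag_db(sos, f, fs):
    """Magnitude (dB) of the biquad cascade at frequency f."""
    z = np.exp(-1j * 2 * np.pi * f / fs)   # e^{-j omega}
    H = 1.0
    for b, a in sos:
        H *= (b[0] + b[1] * z + b[2] * z**2) / (a[0] + a[1] * z + a[2] * z**2)
    return 20 * np.log10(abs(H))

sos = butterworth_lp_sos(4, fc=80.0, fs=48000.0)
print(mag_db(sos, 80.0, 48000.0))    # ~ -3.01 dB at the cutoff
print(mag_db(sos, 0.001, 48000.0))   # ~ 0 dB deep in the pass-band
```

Because K = tan(πfc/fs) prewarps the cutoff, the −3.01 dB point lands exactly at fc despite the frequency warping of the bilinear transform.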
Minimum-Phase Systems
From system theory, if H(e^{jω}) is assumed to correspond to a causal¹ and stable system, then the magnitude of all its poles is less than unity [2]. For certain classes of
problems it is important to constrain the inverse of the transfer function, H(ejω ),
to be causal and stable. Hence if H(ejω ) is causal and stable, then if the inverse,
1/H(ejω ), too is constrained to be causal and stable it must be true that the magni-
tude of all of the zeros (i.e., the roots of the numerator polynomial) of H(ejω ) must
be less than unity. Classes of systems that satisfy this property (where the transfer
function, as well as its inverse, is causal and stable) are called minimum-phase sys-
tems [2] and the transfer function is usually denoted by Hmin (ejω ).
¹ A causal system is one for which the output signal depends on the present value and/or the past values of the input signal, i.e., y(n) = f[x(n), x(n − 1), . . . , x(n − p)], where p ≥ 0.
Fig. 1.9. (a) Magnitude response of the fourth-order Butterworth low-pass filter; (b) principal
value of the phase; (c) unwrapped phase.
All-Pass Systems
An all-pass system or transfer function is one whose magnitude response is flat for
all frequencies. A first-order all-pass transfer function can be expressed as
Hap(e^{jω}) = (e^{−jω} − λ*)/(1 − λe^{−jω}) = e^{−jω}(1 − λ*e^{jω})/(1 − λe^{−jω})    (1.26)

where the roots of the numerator and denominator (viz., 1/λ* and λ, respectively) are conjugate reciprocals of each other. Thus,

|Hap(e^{jω})|² = Hap(e^{jω})Hap*(e^{jω})
              = [e^{−jω}(1 − λ*e^{jω})/(1 − λe^{−jω})] · [e^{jω}(1 − λe^{−jω})/(1 − λ*e^{jω})] = 1    (1.27)
A generalized all-pass transfer function, providing a real time-domain response, can be expressed as [2],

Hap(e^{jω}) = Π_{i=1}^{N_real} (e^{−jω} − d_i)/(1 − d_i e^{−jω}) · Π_{k=1}^{N_complex} [(e^{−jω} − g_k*)(e^{−jω} − g_k)] / [(1 − g_k e^{−jω})(1 − g_k* e^{−jω})]    (1.28)
Fig. 1.10. (a) Magnitude response of a second order real response all-pass filter; (b) principal
value of the phase; (c) unwrapped phase.
The magnitude and phase response for a second-order all-pass transfer function
with complex poles (r = 0.2865 and θ = 0.1625π) is shown in Fig. 1.10.
As shown in a subsequent chapter, using an all-pass filter in cascade with other
filters allows the overall phase response of a system to approximate a desired phase
response which is useful to correct for phase interactions between the subwoofer and
satellite speaker responses in a multichannel sound playback system.
Any transfer function can be decomposed as the product

H(e^{jω}) = Hmin(e^{jω})Hap(e^{jω})    (1.30)

Specifically, (1.30) specifies that any transfer function having poles and/or zeros, some of whose magnitudes are greater than unity, can be decomposed as a product of two transfer functions. The minimum-phase transfer function Hmin(e^{jω}) includes poles and zeros whose magnitude is less than unity, whereas the all-pass transfer function Hap(e^{jω}) includes poles and zeros that are conjugate reciprocals of each other (i.e., if λ is a zero then 1/λ* is a pole of the all-pass transfer function).
As shown in the next chapter, traditional filter design techniques do not consider the phase response during the design of filters. This can degrade the shape and quality of the filtered signal, especially in the relevant frequency regions, due to the phase distortion induced by a nonlinear phase response. Thus, in many cases, it is desirable that the phase response of the filter be kept a linear function of frequency ω (or, equivalently, that the group delay be kept constant).
A classic example of a linear-phase system is the delay h(n) = δ(n − k), which delays the input signal x(n) by k samples. Its frequency response is H(e^{jω}) = e^{−jωk}, so that

∠H(e^{jω}) = −kω
grd[H(e^{jω})] = k
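This constant group delay is easy to verify numerically; the sketch below assumes SciPy's signal module is available:

```python
import numpy as np
from scipy.signal import group_delay

k = 3
h = np.zeros(k + 1)
h[k] = 1.0                       # h(n) = delta(n - k): a pure k-sample delay

w, gd = group_delay((h, [1.0]), w=512)
assert np.allclose(gd, k)        # grd[H] = k at every frequency
```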
More details on designing linear phase filters are given in the next chapter.
whereas the inverse z-transform can be expressed in terms of the contour integral

x(n) = (1/2πj) ∮_C X(z)z^{n−1} dz    (1.34)
A principal motivation for using this transform is that the Fourier transform does
not converge for all discrete time signals, or sequences, and a generalization via the
z-transform encompasses a broader class of signals. Furthermore, powerful complex
variable techniques can be used to analyze signals when using the z-transform.
Being a complex variable, the poles and zeros of the resulting system function
can be depicted on a two-dimensional complex z-plane of Fig. 1.11. Because the
z-transform is related to the Fourier transform through the transformation z = ejω ,
then the Fourier transform exists on the unit circle depicted in Fig. 1.11.
The region of convergence (ROC) is defined to be the set of values on the com-
plex z-plane where the z-transform converges.
Some examples of using z-transforms for determining system functions corresponding to discrete time signals are given below.

(i) x(n) = δ(n − nd) ⟹ X(z) = Σ_{n=−∞}^{∞} δ(n − nd)z^{−n} = z^{−nd}, where the ROC is the entire z-plane.

(ii) x(n) = a^n u(n) ⟹ X(z) = Σ_{n=0}^{∞} a^n z^{−n} = Σ_{n=0}^{∞} (az^{−1})^n = 1/(1 − az^{−1}), where the ROC is the region |z| > |a| exterior to the dotted circle in Fig. 1.12 (where |a| = 0.65 and the ROC includes the unit circle).

(iii) x(n) = (1/3)^n for n ≥ 0 and x(n) = 2^n for n < 0, for which

X(z) = Σ_{n=−∞}^{−1} (2z^{−1})^n + Σ_{n=0}^{∞} ((1/3)z^{−1})^n
Fig. 1.12. The ROC in the complex z-plane for the sequence x(n) = an u(n) with a = 0.65.
= Σ_{n=1}^{∞} (2z^{−1})^{−n} + Σ_{n=0}^{∞} ((1/3)z^{−1})^n
= Σ_{n=1}^{∞} ((1/2)z)^n + Σ_{n=0}^{∞} ((1/3)z^{−1})^n
= ((1/2)z)/(1 − (1/2)z) + 1/(1 − (1/3)z^{−1})

where the ROC is the intersection of the regions in the complex z-plane defined by |z| < 2 and |z| > 1/3, that is, 1/3 < |z| < 2.
Again, several properties of the z-transform, the determination of a time domain
signal from the z-transform using various techniques (e.g., residue theorem), and
theory behind the z-transform can be found in several texts including [2] and [4].
Fig. 1.14. (a) Fourier transform of a bandlimited signal x(n) with limiting frequency Ωc ; (b)
periodicity of the Fourier transform of the signal x(n) upon ideal sampling.
simply low-pass filtering the baseband spectrum of Xs (jΩ) with a cutoff frequency
Ωc and inverse Fourier transforming the result. For recovering the signal x(t), as can be seen from Fig. 1.14(b), it is required that Ωs − Ωc > Ωc, or Ωs > 2Ωc, to prevent an aliased signal recovery. This condition is called the Nyquist condition, Ωc is called the Nyquist frequency, and 2Ωc is called the Nyquist rate.
Subsequently, the frequency response of the discrete time signal from the sam-
pled signal can be obtained from (1.36) by using the continuous time Fourier trans-
form relation,2
Xs(jΩ) = Σ_{k=−∞}^{∞} x(kTs)e^{−jkTsΩ}    (1.38)

With x(n) = x(t)|_{t=nTs} = x(nTs), the discrete time Fourier transform is

X(e^{jω}) = Σ_{n=−∞}^{∞} x(n)e^{−jωn} = Σ_{n=−∞}^{∞} x(nTs)e^{−jωn}    (1.39)
is mathematically described by

xr(t) = Σ_{n=−∞}^{∞} x(nTs)hr(t − nTs) = Σ_{n=−∞}^{∞} x(n)hr(t − nTs)    (1.41)
because the sinc function is unity at time index zero and is zero at other discrete
time indices as shown in Fig. 1.15 for Ts = 1. At other noninteger time values, the
sinc filter acts as an interpolator by performing interpolation between the impulses
of xs (t) to form the continuous time signal xr (t).
As shown in the previous section, a discrete time sequence can be obtained by sam-
pling a continuous time signal x(t) with a sampling frequency fs = 1/Ts , and the
subsequent sequence can be expressed as x(n) = x(t)|t=nTs = x(nTs ). In many
situations, it is necessary to reduce the sampling rate or frequency by an integer
³ The function sin(πx)/(πx) is referred to as the sinc function.
Fig. 1.16. (a) x(t) having response X(jΩ) being bandlimited such that −π/D < ΩTs < π/D; (b) X(e^{jω}); (c) Xd(e^{jω}) with D = 2.
amount.⁴ Thus, in order to reduce the sampling rate by an amount D, the discrete time sequence is obtained by using a period Ts′ such that Ts′ = DTs ⇒ fs′ = fs/D, or xd(n) = x(nTs′) = x(nDTs). The signal xd(n) is called a downsampled or decimated signal, which is obtained from x(n) by reducing the sampling rate by a factor of D.
In order for xd (n) to be free of aliasing error, the continuous time signal x(t)
shown in Fig. 1.16(a), from which x(n) is obtained, must be bandlimited a priori
such that −π/D < ΩTs < π/D or the original sampling rate should be at least D
times the Nyquist rate.
The Fourier expression for the decimated signal xd(n) is

Xd(e^{jω}) = (1/(DTs)) Σ_{k=−∞}^{∞} X( j(ω/(DTs)) − j(2πk/(DTs)) )    (1.43)
⁴ In audio applications, there are several sampling frequencies in use, including 32 kHz, 44.1 kHz, 48 kHz, 64 kHz, 96 kHz, 128 kHz, and 192 kHz, and in many instances it is required that the sampling rate be reduced by an integer amount, such as from 96 kHz to 48 kHz.
In this case, the sampling rate increase is reflected by altering the sampling period such that Ts′ = Ts/L ⇒ fs′ = Lfs, or xi(n) = x(nTs′) = x(nTs/L) = x(n/L), n = 0, ±L, ±2L, . . . . The signal xi(n) is called an interpolated signal, and is obtained from x(n) by increasing the sampling rate by a factor of L. To obtain the interpolated signal, the first step involves an expander stage [5, 6] that generates a signal xe(n) such that

xe(n) = { x(n/L),  n = 0, ±L, ±2L, . . .
        { 0,       otherwise

which can equivalently be written as

xe(n) = Σ_{k=−∞}^{∞} x(k)δ(n − kL)    (1.44)

with Fourier transform

Xe(e^{jω}) = Σ_{k=−∞}^{∞} x(k)e^{−jωkL} = X(e^{jωL})    (1.45)
Fig. 1.18 shows the spectrum of a bandlimited continuous time signal x(t) along
with the expanded signal spectrum and the interpolated spectrum. As is evident from
Fig. 1.18(c), the expander introduces L − 1 copies of the continuous time spectrum
between −π and π. Subsequently, an ideal low-pass interpolation filter, having a
Fig. 1.18. (a) x(t) having response X(jΩ); (b) X(ejω ); (c) extraction of baseband expanded
and interpolated spectrum of X(ejω ) with L = 2.
cutoff frequency of π/L and a gain of L (shown by dotted lines in Fig. 1.18(c)),
extracts the baseband discrete time interpolated spectrum of Xe (jΩ).
A block diagram employing an L-fold expander (depicted by an upwards arrow)
and the interpolation filter Hi (ejω ) is shown in Fig. 1.19.
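The expander-plus-interpolation-filter structure of Fig. 1.19 can be sketched as follows (Python with SciPy; the FIR length and test signal are illustrative choices). The low-pass has cutoff π/L and gain L, as described above:

```python
import numpy as np
from scipy.signal import firwin, lfilter

L = 2
x = np.sin(2 * np.pi * 0.05 * np.arange(512))   # a slowly varying input

# Expander stage (1.44): insert L-1 zeros between successive samples
xe = np.zeros(L * len(x))
xe[::L] = x

# Ideal interpolation filter approximated by an FIR low-pass
# with cutoff pi/L and gain L
h_i = L * firwin(numtaps=101, cutoff=1.0 / L)
xi = lfilter(h_i, 1.0, xe)                      # interpolated signal
```

The gain of L compensates for the factor-of-L energy spread introduced by zero insertion, so the interpolated sine keeps its original amplitude.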
The relation between the DFT and the discrete time Fourier transform (DTFT) is

X(k) = X(e^{jω})|_{ω=2πk/N},  k = 0, 1, . . . , N − 1    (1.49)

Equation (1.49) basically states that the DFT is obtained by uniformly, or equally, sampling the DTFT (i.e., uniform sampling along the unit circle in the complex z-plane).
An important property of the DFT is the circular shift of an aperiodic signal, where any delay applied to a signal constitutes a circular shift of the signal. The appropriate relation between the DFT and the m-sample circularly shifted sequence is

x((n − m))_N ←→ e^{−j(2πk/N)m} X(k)    (1.50)

Another important property is circular convolution,

x3(n) = x1(n) ⊛ x2(n) = Σ_{m=0}^{N−1} x1(m)x2((n − m))_N,  n = 0, . . . , N − 1

x3(n) ←→ X3(k) = X1(k)X2(k)    (1.51)
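Property (1.51) can be verified directly with the FFT; a minimal NumPy sketch comparing the direct circular convolution with the inverse DFT of the product of DFTs:

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(N), rng.standard_normal(N)

# Direct N-point circular convolution: x3(n) = sum_m x1(m) x2((n - m) mod N)
x3_direct = np.array([sum(x1[m] * x2[(n - m) % N] for m in range(N))
                      for n in range(N)])

# DFT property (1.51): X3(k) = X1(k) X2(k)
x3_dft = np.real(np.fft.ifft(np.fft.fft(x1) * np.fft.fft(x2)))

assert np.allclose(x3_direct, x3_dft)
```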
The bilinear transform maps the analog variable s to the digital variable z through

s = (2/Td) (1 − z^{−1})/(1 + z^{−1})

with the inverse mapping

z = (1 + (Td/2)s)/(1 − (Td/2)s)

so that the digital filter is obtained from the analog prototype Hc(s) as

H(z) = Hc( (2/Td) (1 − z^{−1})/(1 + z^{−1}) )    (1.52)
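As a small numerical illustration of (1.52), the sketch below applies SciPy's bilinear routine to a hypothetical first-order analog low-pass Hc(s) = ωc/(s + ωc), with Td = 1/fs:

```python
import numpy as np
from scipy.signal import bilinear, freqz

fs = 48000.0
wc = 2 * np.pi * 1000.0                  # 1 kHz analog cutoff (rad/s)

# Analog first-order low-pass Hc(s) = wc / (s + wc), mapped with
# s = 2 fs (1 - z^-1)/(1 + z^-1), i.e., Td = 1/fs in (1.52)
b, a = bilinear([wc], [1.0, wc], fs=fs)

w, H = freqz(b, a, worN=[0.0, np.pi])    # evaluate at omega = 0 and omega = pi
assert np.isclose(abs(H[0]), 1.0)        # s = 0 maps to z = 1: unity DC gain kept
assert abs(H[1]) < 1e-9                  # s = inf maps to z = -1: a zero at pi
```

The mapping compresses the entire analog frequency axis onto −π < ω < π, which is why the analog zero at s = ∞ appears at ω = π in the digital filter.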
1.7 Summary
In this chapter we have presented the fundamental prerequisites in digital signal pro-
cessing such as convolution, sampling theory, basics of linear and time-invariant
(LTI) systems, minimum-phase and all-pass systems, sampling and reconstruction of
signals, discrete time Fourier transform (DTFT), discrete Fourier transform (DFT),
z-transform, bilinear transform, and linear-phase finite impulse response (FIR) filters.
2 Filter Design for Audio Applications
In this chapter we present a summary of various approaches for finite impulse re-
sponse (FIR) and infinite duration impulse response (IIR) filter designs.
The desired response can be specified in the frequency domain (viz., Hd(e^{jω})) or in the time domain (e.g., hd(n) = δ(n − nd)). For example, a low-pass filter specification is

Hd(e^{jω}) = { 1,  ω ∈ [0, ωc]
             { 0,  ω ∈ [ωs, π]    (2.1)
The domains [0, ωc ], (ωc , ωs ), [ωs , π] are called the pass-band, transition-band,
and stop-band, respectively, and are specified by their tolerance parameters. Exam-
ples of tolerance parameters include allowable ripples δp and δs which describe the
pass-band amplitude variance Ap and stop-band attenuation As .
Alternatively, if the signal waveform needs to be preserved, then the phase response of the desired response is specified with a linearity constraint,

∠Hd(e^{jω}) = −nd ω    (2.2)
The specifications of (2.1) and (2.2) of the low-pass filter, for example, can also be
written in terms of a frequency weighting approximation such as
where the accuracy of the amplitude of the selected filter, H(ejω ), in the pass-band
domain, Xp and stop-band domain, Xs , is controlled by the frequency weighting
function, W (ejω ).
Thus, according to (2.4) the frequency weighted approximating error function,
E(ejω ), can be written as E(ejω ) = W (ejω )(|H(ejω )|−Hd (ejω )), with Hd (ejω ) =
0 on the stop-band domain Xs .
Other widely used error criteria are:
• Minimax error, where ε, the maximum error in a particular frequency band, is minimized. Specifically, ε = max_{ω∈X} |E(e^{jω})|.
• Minimization of the Lp norm, where the quantity Jp = ∫_{ω∈X} |E(e^{jω})|^p dω is minimized for p > 0. As p → ∞, the solution that minimizes the integral approaches the minimax solution. The classic case is the L2 norm, where p = 2.
• Maximally flat approximation, which is obtained by means of a Taylor series
expansion of the desired response at a particular frequency point.
• Combination of any of the above approximating schemes.
There are many advantages of using FIR filters (over their IIR counterparts) which
include [7] linear-phase constraint design, computationally efficient realizations, sta-
ble designs free of limit cycle oscillations when implemented on finite-wordlength
systems, arbitrary specification-based designs, low output noise due to multiplica-
tion roundoff errors, and low sensitivity to variations in the filter coefficients. The main disadvantage is that extremely narrow or stringent transition bands require a larger filter length, thereby increasing the computational requirements; this cost can, however, be reduced through fast convolution algorithms and multiplier-efficient realizations.
There are four types of causal linear phase responses of finite duration, or finite impulse response (FIR), filters.

1. Type 1 linear phase filter of length M + 1 (M even, constant group delay M/2, β = {0, π}) having finite duration response h(n) = h(M − n), and frequency response H(e^{jω}) = e^{−jωM/2} Σ_{k=0}^{M/2} a_k cos(kω) with a0 = h(M/2) and a_k = 2h((M/2) − k), 1 ≤ k ≤ M/2. Type 1 filters are used to design low-pass, high-pass, and band-pass filters.

2. Type 2 linear phase filter of length M + 1 (M odd, a delay M/2 corresponding to an integer plus one-half, β = {0, π}) having finite duration response h(n) = h(M − n), and frequency response H(e^{jω}) = e^{−jωM/2} Σ_{k=1}^{(M+1)/2} b_k cos(ω(k − (1/2))) with b_k = 2h((M + 1)/2 − k), 1 ≤ k ≤ (M + 1)/2. Type 2 filters have a zero at z = −1 (i.e., ω = π) and hence cannot be used for designing high-pass filters.

3. Type 3 linear phase filter of length M + 1 (M even, a delay M/2, β = {π/2, 3π/2}) having finite duration response h(n) = −h(M − n), and frequency response H(e^{jω}) = je^{−jωM/2} Σ_{k=1}^{M/2} c_k sin(kω) with c_k = 2h((M/2) − k), 1 ≤ k ≤ M/2. Type 3 filters have zeros at z = 1 and z = −1 and hence cannot be used for designing a low-pass or a high-pass filter.

4. Type 4 linear phase filter of length M + 1 (M odd, M/2 being an integer plus one-half, β = {π/2, 3π/2}) having finite duration response h(n) = −h(M − n), and frequency response H(e^{jω}) = je^{−jωM/2} Σ_{k=1}^{(M+1)/2} d_k sin(ω(k − (1/2))) with d_k = 2h((M + 1)/2 − k), 1 ≤ k ≤ (M + 1)/2. Type 4 filters have a zero at z = 1 and hence cannot be used in the design of a low-pass filter.
For simplicity in notation in subsequent sections, the general linear-phase filter
frequency response can be described in the following functional form,
H(e^{jω}) = Σ_n t_n ψ(ω, n)    (2.5)
where the trigonometric function ψ(·, ·) is a symbolic description for the sin or cos
term in the four types of linear phase filters described above.
The design of linear-phase FIR filters, depending on the zero locations of these
filters, is shown in [2]. As in the case of the decomposition of a general transfer func-
tion into minimum-phase and all-pass components, any linear-phase system function
can also be decomposed into a product of three terms comprising: (i) a minimum-
phase function, (ii) a maximum-phase function (where all of the poles and zeros have
magnitude strictly greater than unity), and (iii) a function comprising zeros having
strictly unit magnitude.
J2 = Σ_{k=1}^{K} [W(e^{jωk})(|H(e^{jωk})| − Hd(e^{jωk}))]²    (2.7)

which can be written compactly as

J2 = e^T e    (2.9)

where

e = Xt − d    (2.10)
t = (t0, t1, . . . , tM)^T
d = (W(ω1)ψ(ω1, 0)Hd(e^{jω1}), . . . , W(ωK)ψ(ωK, 0)Hd(e^{jωK}))^T    (2.12)
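A minimal sketch of this weighted least-squares design (Python/NumPy; the cutoff, transition band, and basis size are illustrative choices, and a Type 1 cosine basis ψ(ω, n) = cos(nω) is assumed):

```python
import numpy as np

# Least-squares fit of the zero-phase amplitude of a Type 1 linear-phase FIR:
# A(w) = sum_n t_n cos(n w), matched to a desired low-pass response.
M2 = 16                                    # number of cosine terms
K = 256
w = np.linspace(0, np.pi, K)
Hd = np.where(w <= 0.3 * np.pi, 1.0, 0.0)  # desired response, cutoff 0.3*pi
# Unit weight in pass- and stop-band, zero weight in the transition band
W = np.where((w <= 0.3 * np.pi) | (w >= 0.45 * np.pi), 1.0, 0.0)

# Weighted design matrix: X[k, n] = W(w_k) cos(n w_k)
X = W[:, None] * np.cos(np.outer(w, np.arange(M2 + 1)))
d = W * Hd

# Solve min_t ||X t - d||^2, i.e., J2 = e^T e with e = X t - d, Eqs. (2.9)-(2.10)
t, *_ = np.linalg.lstsq(X, d, rcond=None)

A = np.cos(np.outer(w, np.arange(M2 + 1))) @ t   # resulting amplitude response
```

The coefficients `t` map back to the filter taps through the Type 1 symmetry relations given earlier (a0 = h(M/2), ak = 2h(M/2 − k)).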
In many instances the FIR (or IIR) filters designed can be of very long duration, which increases the computational requirements for implementing such filters beyond what is typically available in real-time DSP devices. One approach, then, is to limit the duration of the filter without significantly affecting its resulting performance. There are several window functions that limit the signal duration and achieve a tradeoff between the main-lobe width and side-lobe amplitudes.
Fig. 2.1. (a) Impulse response of a rectangular window function hr (n) with N = 10; (b)
magnitude response of the rectangular window.
A direct truncation of a signal x(n) with a rectangular window filter hr(n) gives a shortened duration signal xr(n),

xr(n) = hr(n)x(n)

hr(n) = { 1,  |n| ≤ N
        { 0,  |n| > N    (2.14)
The time domain and magnitude response for this window are shown in Fig. 2.3.
The Hamming window is given by hHm(n) = 0.54(1 + 0.8519 cos(2πn/(2N + 1))) and its frequency response, HHm(e^{jω}), is expressed in relation to the frequency
Fig. 2.2. (a) Impulse response of a Bartlett window function with N = 10; (b) magnitude
response of the Bartlett window.
Fig. 2.3. (a) Impulse response of a Hann window function with N = 10; (b) magnitude
response of the Hann window.
Fig. 2.4. (a) Impulse response of a Hamming window function with N = 10; (b) magnitude
response of the Hamming window.
The time domain and magnitude response for this window are shown in Fig. 2.5.
The Kaiser window is given by hK(n) = I0[β(1 − [(n − α)/α]²)^{0.5}]/I0(β), 0 ≤ n ≤ N, where α = N/2 and I0 represents the zeroth-order modified Bessel function of the first kind. The main-lobe width and side-lobe levels can be adjusted by varying the length (N + 1) and β. The parameter β performs a tapering operation, with high values of β achieving a sharp taper. In the extreme case, where β = 0, the Kaiser window
becomes the rectangular window hR (n). Figure 2.6(a) shows the Kaiser window
response for various values of β, whereas Figure 2.6(b) shows the corresponding
magnitude response which shows a lower side-lobe level but increasing main-lobe
width as β increases in value. Fig. 2.7 shows the magnitude responses of the Kaiser
window filter, as a function of the filter or window length, N , with β = 6. As is
Fig. 2.5. (a) Impulse response of a Blackman window function with N = 10; (b) magnitude
response of the Blackman window.
Fig. 2.6. (a) The effect of β on the shape of the Kaiser window; (b) the magnitude response of
the Kaiser window for the various values of β.
Fig. 2.7. The effect of N on the magnitude response of the Kaiser window with β = 6.
evident, as the filter length increases the main-lobe width decreases, while the side-lobe level is governed primarily by β. Kaiser determined empirically that, to achieve a specified stop-band ripple amplitude A = −20 log10 δs, the value of β can be set as
β = 0.1102(A − 8.7), A > 50
β = 0.5842(A − 21)^0.4 + 0.07886(A − 21), 21 ≤ A ≤ 50 (2.15)
β = 0, A < 21
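Kaiser's rule (2.15) and the window definition above are easy to sketch in code. The following is a minimal pure-Python illustration (the function names are our own, and the zeroth-order modified Bessel function I0 is approximated by a truncated power series):

```python
import math

def kaiser_beta(A):
    """Kaiser's empirical rule (2.15): shape parameter beta for a desired
    stop-band attenuation A = -20*log10(delta_s) in dB."""
    if A > 50:
        return 0.1102 * (A - 8.7)
    elif A >= 21:
        return 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21)
    return 0.0  # A < 21: rectangular window

def bessel_i0(x, terms=25):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    return sum((x / 2) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))

def kaiser_window(N, beta):
    """Kaiser window of length N+1: h[n] = I0(beta*sqrt(1-((n-a)/a)^2))/I0(beta)."""
    alpha = N / 2
    return [bessel_i0(beta * math.sqrt(1 - ((n - alpha) / alpha) ** 2)) / bessel_i0(beta)
            for n in range(N + 1)]

# Window for a 60 dB stop-band specification
w = kaiser_window(32, kaiser_beta(60.0))
```

The window peaks at 1 at its center and tapers symmetrically toward the ends, with β = 0 reducing it to the rectangular window.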
Adaptive filtering is found widely in applications involving radar, sonar, acoustic sig-
nal processing (noise and echo cancellation, source localization, etc.), speech com-
pression, and others. The advantage of adaptive filters is their ability to track a nonstationary environment in real time by optimizing their internal parameters. Haykin [8] provides a number of references to various types of adaptive filters and their applications. Among these, the most popular is the least mean square (LMS) algorithm of Widrow and Hoff [13], which is explained in this section.
An adaptive FIR filter (also called a transversal or tapped delay line filter) structure, shown in Fig. 2.8, differs from a fixed FIR filter in that the filter coefficients Wk = (w0(k), w1(k), . . . , wN−1(k))^T are varied with time as a function of the filter inputs X(n) = [x(n), x(n − 1), . . . , x(n − N + 1)] and an approximation error signal e(n) = d(n) − y(n). The filter coefficients are adapted such that the mean square error

Jm(n) = E{e²(n)} ≈ (1/m) Σ_{k=0}^{m−1} e²(n − k)

(where E{·} is the statistical expectation operator) is minimized. For the well-known LMS method of [13], the instantaneous error J1(n), i.e., m = 1, is minimized.
The LMS filter adaptation equations, for a complex input signal vector X(n) =
[x(n), x(n − 1), . . . , x(n − N + 1)], are expressed as
W(n) = W(n − 1) + µe(n − 1)X∗ (n − 1) (2.16)
where µ is the adaptation rate that controls the rate of convergence to the solution,
and the superscript ∗ denotes complex conjugation. Details on the convergence and
steady-state performance of adaptive filters (e.g., based on LMS, the recursive least
squares or RLS error criteria) can be found in various texts and articles including [8,
126]. Other variations include the filtered-X, frequency domain, and block adaptive
filters.
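As a concrete illustration of (2.16) for real-valued signals, the following is a minimal system-identification sketch (the three-tap "unknown" system, the step size µ = 0.05, and all names are our own choices, not anything prescribed in the text):

```python
import random

def lms_identify(x, d, N, mu):
    """LMS adaptation as in (2.16), for a length-N FIR filter with real signals."""
    w = [0.0] * N
    for n in range(N - 1, len(x)):
        xvec = [x[n - k] for k in range(N)]           # [x(n), x(n-1), ..., x(n-N+1)]
        y = sum(wk * xk for wk, xk in zip(w, xvec))   # filter output y(n)
        e = d[n] - y                                  # approximation error e(n)
        w = [wk + mu * e * xk for wk, xk in zip(w, xvec)]
    return w

# Identify a hypothetical unknown three-tap system from its input/output data
random.seed(0)
h_true = [0.5, -0.3, 0.2]
x = [random.gauss(0, 1) for _ in range(4000)]
d = [sum(h_true[k] * x[n - k] for k in range(3) if n - k >= 0) for n in range(len(x))]
w_hat = lms_identify(x, d, N=3, mu=0.05)
```

With a white input and no measurement noise, the adapted weights converge close to the coefficients of the unknown system.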
H(z) = (b0 + b1 z^−1 + b2 z^−2 + · · · + b_{M−1} z^−(M−1)) / (a0 + a1 z^−1 + a2 z^−2 + · · · + a_{N−1} z^−(N−1))

h(n) = (1/a0) [ −Σ_{k=1}^{N−1} a_k h(n − k) + Σ_{k=0}^{M−1} b_k x(n − k) ] (2.17)
Generally IIR filters can approximate specific frequency responses with shorter or-
ders than an FIR (especially where sharp and narrow transition bands are required),
but have associated problems, including (i) converging to a stable design; (ii) higher computational complexity, requiring nonlinear optimization to converge to a valid solution; and (iii) numerical problems in computing the equivalent polynomial that defines the transfer function if multiple-order poles are densely packed near the unit circle. Furthermore, IIR filters cannot be designed to have linear phase, unlike FIR filters, but by cascading an all-pass filter the phase can be approximately linearized in a particular band of the frequency response.
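The recursion in (2.17) maps directly to a direct-form implementation. A minimal sketch (our own function name, assuming a0 ≠ 0 and zero initial conditions):

```python
def iir_filter(x, b, a):
    """Direct-form realization of the recursion (2.17):
    output(n) = (1/a0) * ( sum_k b_k x(n-k) - sum_k a_k output(n-k) )."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

# One-pole low-pass H(z) = 1 / (1 - 0.9 z^-1); its impulse response is 0.9^n
h = iir_filter([1.0] + [0.0] * 9, b=[1.0], a=[1.0, -0.9])
```

Feeding in a unit impulse recovers the (infinite, here truncated) impulse response of the filter, which is what distinguishes the IIR structure from an FIR tapped delay line.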
Fig. 2.9. Magnitude response of typically used bass management filters in consumer electron-
ics, and the recombined response (the sum of the low-pass and high-pass frequency responses).
Type-1 Chebyshev IIR filters exhibit equiripple error in the pass-band and a monotonically decreasing response in the stop-band. A low-pass Nth-order Chebyshev IIR filter is specified by its squared magnitude response, where the magnitude response oscillates between 1 and 1/(1 + ε²) in the pass-band, in which it has a total of N local maxima and minima:

|Hcheby,1(e^jω)|² = 1 / (1 + ε² T_N²(ω/ωP)) (2.20)

where T_N(x) is the Nth-order Chebyshev polynomial. For nonnegative integers k, the kth-order Chebyshev polynomial is expressed as

T_k(x) = cos(k cos^−1 x), |x| ≤ 1
T_k(x) = cosh(k cosh^−1 x), |x| ≥ 1 (2.21)
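Definitions (2.20) and (2.21) can be sketched as follows (a small illustration with our own function names; for the cosh branch we restrict to x ≥ 1, which suffices for nonnegative frequency ratios ω/ωP):

```python
import math

def chebyshev_T(k, x):
    """k-th order Chebyshev polynomial via (2.21), for x >= -1."""
    if abs(x) <= 1:
        return math.cos(k * math.acos(x))
    return math.cosh(k * math.acosh(x))  # x > 1 branch (nonnegative frequencies)

def cheby1_mag_sq(omega, N, eps, omega_p):
    """Squared magnitude response (2.20) of a Type-1 Chebyshev low-pass filter."""
    t = chebyshev_T(N, omega / omega_p)
    return 1.0 / (1.0 + eps ** 2 * t ** 2)
```

Inside the pass-band T_N² stays between 0 and 1, so the squared magnitude oscillates between 1/(1 + ε²) and 1, exactly the equiripple behavior described above; beyond ωP the cosh branch grows rapidly, producing the monotone stop-band decay.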
A Type-1 low-pass Chebyshev filter, with pass-band frequency of 1 kHz, stop-band
frequency of 2 kHz, and attenuation of 60 dB, is shown in Fig. 2.10.
Type-2 Chebyshev IIR filters exhibit equiripple error in the stop-band, and the response decays monotonically in the pass-band. The squared magnitude response is expressed by

|Hcheby,2(e^jω)|² = 1 / (1 + ε² [T_N(ωs/ωP) / T_N(ωs/ω)]²) (2.22)
Fig. 2.10. (a) Magnitude response, between 20 Hz and 1500 Hz, of Type-1 Chebyshev low-
pass filter of order 7 having pass-band frequency of 1000 Hz, stop-band frequency of 2000
Hz, and attenuation of 60 dB; (b) magnitude response of the filter in (a), between 1000 Hz and
20,000 Hz.
Fig. 2.11. (a) Magnitude response, between 20 Hz and 1500 Hz, of Type-2 Chebyshev low-
pass filter of order 7 having pass-band frequency of 1000 Hz, stop-band frequency of 2000
Hz, and attenuation of 60 dB; (b) magnitude response of the filter in (a), between 1000 Hz and
20,000 Hz.
Fig. 2.12. Magnitude response of an N = 5-order elliptic filter having pass-band frequency
of 200 Hz and stop-band frequency of 300 Hz, with stop-band attenuation of 60 dB.
Elliptic filters exhibit equiripple magnitude response in both the pass-band and the stop-band. For a specific filter order N, pass-band ripple ε, and maximum stop-band amplitude 1/A, the elliptic filter provides the fastest transition from pass-band to stop-band. This feature is advantageous for low-pass responses requiring a rapid transition, and such filters can be designed for low-order transversal or direct form two implementations. The magnitude response of an Nth-order low-pass elliptic filter can be written as

|H(e^jω)|² = 1 / (1 + ε² F_N²(ω/ωP))

F_N(ω) = γ² (ω1² − ω²)(ω3² − ω²) · · · (ω_{2N−1}² − ω²) / [(1 − ω1²ω²)(1 − ω3²ω²) · · · (1 − ω_{2N−1}²ω²)], N even

F_N(ω) = γ² ω(ω2² − ω²)(ω4² − ω²) · · · (ω_{2N}² − ω²) / [(1 − ω2²ω²)(1 − ω4²ω²) · · · (1 − ω_{2N}²ω²)], N odd (2.23)
Commonly used IIR filters in audio applications include the second-order parametric filter, for designing filters with specific gain and bandwidth (or Q value), and the shelving filter. Both can be expressed as a second-order section

H(z) = (b0 + b1 z^−1 + b2 z^−2) / (a0 + a1 z^−1 + a2 z^−2) (2.24)

and the coefficients ai and bj for the various filters are given below.
Figures 2.13 and 2.14 show examples of the magnitude response for low-frequency boost and cut shelving filters, respectively, for a 48 kHz sampling rate with fc = 200 Hz (Ωc = 400π) and G = 10 dB.
Parametric Filters
Parametric filters are specified in terms of the gain G (g = 10^(G/20)), center frequency fc, and the Q value, which is inversely related to the bandwidth of the filter. The
equations characterizing the second-order parametric filter for a sampling frequency
of fs are
ωc = 2πfc/fs
β = (2ωc/Q) + ωc² + 4
b0 = [(2gωc/Q) + ωc² + 4]/β
b1 = (2ωc² − 8)/β
IIR filters based on the autoregressive (AR) or autoregressive and moving average
(ARMA) process are determined based on the second-order statistics of the input
data. These filters are widely used for spectral modeling and the denominator poly-
nomial for the AR process (or the numerator and denominator polynomials for the
ARMA process) are generated through an optimization process that minimizes an
error norm.
One of the popular AR processes, yielding an all-pole IIR filter, is the linear
predictive coding (LPC) filter or model. The LPC model is widely used in speech
recognition applications [10]: (i) it provides an excellent all-pole vocal tract spectral
envelope model for a speech signal; (ii) the filter is minimum-phase, is analytically
tractable, and straightforward to implement in software or hardware; and (iii) the
model works well in speech recognition applications.
The LPC or all-pole filter of order P is characterized by the polynomial coeffi-
cients {ak , k = 1, . . . , P } with a0 = 1. Specifically, a signal x(n) at time index n
can be modeled with an all-pole filter of the form
Fig. 2.16. Modeling performance of the LPC for differing filter orders.
H(z) = 1 / (1 + Σ_{k=1}^{P} a_k z^−k) (2.30)

with the prediction error

e(n) = x(n) + Σ_{k=1}^{P} a_k x(n − k) (2.32)
To determine {a_k, k = 1, . . . , P}, the error signal power, E = Σ_{n=0}^{N−1} |e(n)|², is minimized with respect to the filter coefficients, N being the duration of x(n). Thus, the filter coefficients are determined by setting the gradient of the error function E to zero:

∂E/∂a_k = 0, ∀k (2.33)
This yields the normal equations

Σ_{l=1}^{P} a_l r_x(k, l) = −r_x(k, 0), k = 1, 2, . . . , P (2.34)

where r_x(k, l) denotes the correlation of the signal x(n) at various lags. Specifically,

r_x(k, l) = Σ_{n=0}^{N−1} x(n − l) x*(n − k) (2.35)

In matrix form,

R_x a = −r_x (2.36)
The above system of equations can be solved through the autocorrelation method or the covariance method [10]. The autocorrelation method is popular because the autocorrelation matrix, comprising the autocorrelations r_x(k, l) at various lags, is a Toeplitz matrix (i.e., a symmetric matrix whose elements are equal along each diagonal). Such a system can be solved through well-established procedures such as the Durbin algorithm.
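Because the matrix in (2.36) is Toeplitz, the normal equations (2.34) can be solved in O(P²) operations. The sketch below is our own minimal implementation of the Durbin recursion, exercised on a synthetic AR(2) signal whose true coefficients are known:

```python
import random

def autocorr(x, P):
    """Biased autocorrelation estimates r(0), ..., r(P)."""
    N = len(x)
    return [sum(x[n] * x[n - k] for n in range(k, N)) for k in range(P + 1)]

def levinson_durbin(r, P):
    """Solve the Toeplitz normal equations (2.34) for the LPC coefficients."""
    a = [1.0] + [0.0] * P
    E = r[0]
    for i in range(1, P + 1):
        k = -(r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / E  # reflection coeff
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a = a_new
        E *= (1 - k * k)  # prediction error power update
    return a, E

# Synthetic AR(2) process: x(n) = 0.75 x(n-1) - 0.5 x(n-2) + w(n)
random.seed(1)
x = [0.0, 0.0]
for _ in range(20000):
    x.append(0.75 * x[-1] - 0.5 * x[-2] + random.gauss(0, 1))
a, E = levinson_durbin(autocorr(x[2:], 2), 2)
```

For this process the model coefficients should come out close to a1 ≈ −0.75 and a2 ≈ 0.5, matching the all-pole form (2.30).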
An example of the modeling performance of the LPC approach is shown in Fig. 2.16, where the solid line depicts the response to be modeled by the LPC. The order of the LPC is p = 128 and p = 256, and it can be seen that the model provides a very good approximation at higher frequencies. Unfortunately, for such low filter orders, necessary for real-time implementations, the low-frequency performance is not satisfactory. In Chapter 6, we present a technique widely used for improving the low-frequency modeling performance of the LPC algorithm.
2.4 Summary
In this chapter we have presented various filter design techniques including FIR, IIR,
parametric, shelving, and all-pole filters using second-order statistical information
for signal modeling.
Part II
This chapter introduces the theory behind sound propagation in enclosed environ-
ments, room acoustics, reverberation time, and the decibel scale. Also included are
basics of loudspeakers and microphone acoustics and responses, room impulse re-
sponses, and stimuli for measuring loudspeaker and room responses. We conclude
the chapter with a brief discussion on the structure of the ear, and some relevant
concepts such as loudness perception and frequency selectivity.
The spectrum of interest for a signal from a sound source, that is affected by
room acoustics, is frequency dependent and is a function of the type of source. For
example, human speech comprises a fundamental frequency located between 50 Hz and 350 Hz, equal to the frequency of vibration of the vocal cords, in addition to harmonics that extend up to about 3500 Hz. Musical instruments range
from 16 Hz to about 15 kHz. Above 10 kHz, the attenuation of the signal in air is
so large that the influence of a room on high-frequency sound components can be
neglected [11], whereas below 50 Hz the wavelength of sound is so large that sound
propagation analysis using geometrical considerations is almost of no use. Thus, the
frequency range of relevance to room acoustics extends from 50 Hz to 10 kHz.
Finally, the directionality (or the intensity of sound as a function of direction) of a sound source will vary with the type of source.

The time-dependent pressure function p(r|r0, t) can be found through the inverse Fourier transform of (3.2).
Fourier inverse of (3.2).
Finally, the sound pressure level at a distance r, with p̃ representing the root mean square pressure (viz., p̃ = (E{p²})^(1/2) = [(1/t) ∫_t p² dτ]^(1/2), where E{·} is the statistical expectation operator), can be expressed as

SPL = 20 log10 (p̃/p̃ref) dB (3.7)

where p̃ref is an internationally standardized reference root mean square pressure with a value of 2 × 10^−5 N/m².
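As a worked example of (3.7) (a small sketch; the function names are our own):

```python
import math

P_REF = 2e-5  # standardized reference rms pressure, N/m^2

def spl_db(p_rms):
    """Sound pressure level per (3.7): SPL = 20 log10(p_rms / p_ref) dB."""
    return 20.0 * math.log10(p_rms / P_REF)

# An rms pressure of 0.2 N/m^2 is 10^4 times the reference, i.e., 80 dB SPL
level = spl_db(0.2)
```

At the reference pressure itself the level is 0 dB SPL, and every factor of ten in rms pressure adds 20 dB.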
pω(q_l) = jQωρ0 Σ_{nx=0}^{Nx−1} Σ_{ny=0}^{Ny−1} Σ_{nz=0}^{Nz−1} [p_n(q_l) p_n(q_o)] / [K_n (k² − k_n²)] (3.8)

n = (nx, ny, nz); k = ω/c; q_l = (x_l, y_l, z_l)

k_n = π [ (nx/Lx)² + (ny/Ly)² + (nz/Lz)² ]^(1/2)

∫_V p_n(q_l) p_m(q_l) dV = K_n for n = m, and 0 for n ≠ m

where the k_n are referred to as the eigenvalues, the eigenfunctions p_n(q_l) can be assumed to be orthogonal to each other under certain conditions, and the point source is at q_o. The modal equations in (3.8) are valid for wavelengths λ > (1/3) min[Lx, Ly, Lz] [12]. At these low frequencies only a few standing waves are excited, so that the series terms in (3.8) converge quickly.
For a rectangular enclosure with dimensions (Lx, Ly, Lz) and q_o = (0, 0, 0), the eigenfunctions p_n(q_l) in (3.8) are

p_n(q_l) = cos(nxπx_l/Lx) cos(nyπy_l/Ly) cos(nzπz_l/Lz)

p_n(q_o) = 1

K_n = ∫_0^Lx cos²(nxπx/Lx) dx ∫_0^Ly cos²(nyπy/Ly) dy ∫_0^Lz cos²(nzπz/Lz) dz = LxLyLz/8 = V/8 (3.9)
Each of the terms in the series expansion can be considered to excite a resonant
frequency of about fn = ωn /2π = c/λn Hz, with a specific amplitude and phase as
determined by the numerator and denominator terms of (3.8). Because the different
terms in the series expansion can be considered mutually independent, the central
limit theorem can be applied to the real and imaginary parts of pω (q l ), according
to which both quantities can be considered to be random variables obeying a nearly
Gaussian distribution. Thus, according to the theory of probability, |pω (q l )| follows
the Rayleigh distribution. If z denotes |pω(q_l)|, then the probability of finding a sound pressure amplitude between z and z + dz is given by

P(z)dz = (π/2) z e^(−πz²/4) dz (3.10)
Thus, in essence, the distribution of the sound pressure amplitude is independent of
the type of room, volume, or its acoustical properties. The probability distribution is
shown in Fig. 3.1.
The eigenfunction distribution in the z = 0 plane, for a room of dimension 6 m
×6 m ×6 m, and tangential mode (nx , ny , nz ) = (3, 2, 0) is shown in Fig. 3.2.
Finally, the time domain sound pressure, p(r, t), can be found through the Fourier transform using

p(r, t) = ∫_{−∞}^{∞} pω(r) e^(−jωt) dω (3.11)
Rooms such as concert halls, theaters, and irregularly shaped rooms deviate from the rectangular shape assumed by wave theory due to the presence of pillars, columns, balconies, and other irregularities. As such, the methods of wave theory cannot be readily applied, since the boundary conditions are difficult to formulate.
Fig. 3.1. The sound pressure amplitude density function in a room excited by a sinusoidal
tone.
Fig. 3.2. The eigenfunction distribution, for a tangential mode (3,2,0) over a room of dimen-
sions 6 m ×6 m ×6 m.
By using the expression cos(x) = (e^jx + e^−jx)/2 in (3.9), the eigenfunction equation can be written as a sum covering the expansion over the eight possible sign combinations in the exponent. Each of the components, multiplied by the time-dependent exponent e^jωt, represents a plane wave making angles αx, αy, and αz with the x, y, and z axes, respectively, where

cos(αx) : cos(αy) : cos(αz) = ±nx/Lx : ±ny/Ly : ±nz/Lz (3.13)
If any one of the angles is 90 degrees (i.e., the cosine of that angle is zero), then the resulting wave represents a tangential mode: a wave traveling in the plane orthogonal to the axis making the 90 degree angle. For example, if αz = π/2, the resulting wave travels in the plane defined by the x-axis and y-axis. If any two of the angles are 90 degrees, the resulting wave is called an axial mode, traveling parallel to the remaining axis. If none of the angles is 90 degrees (i.e., all of the cosine terms are nonzero), the resulting wave represents an oblique mode.
The eigenfrequencies, fn, for the enclosure are related to the eigenvalues, kn, as

fn = (c/2π) kn (3.14)
Without going into the derivations (see [12, 11] for details), it can be shown that the number of eigenfrequencies, Nf, up to a limiting frequency, f, in an enclosed rectangular space (S being the total surface area and L the total edge length of the enclosure) is

Nf = (4π/3) V (f/c)³ + (π/4) S (f/c)² + (L/8)(f/c) (3.15)

and the density of eigenfrequencies per unit frequency is

dNf/df = 4πV f²/c³ + (π/2) S f/c² + L/(8c) (3.16)
The lowest ten eigenfrequencies (including degenerate cases) for a room with dimensions 4 m × 3 m × 2 m are shown in Table 3.1.

fn (Hz)    nx  ny  nz
42.875      1   0   0
57.167      0   1   0
71.458      1   1   0
85.75       0   0   1
85.75       2   0   0
95.871      1   0   1
103.06      0   1   1
103.06      2   1   0
111.62      1   1   1
114.33      0   2   0

Table 3.1. The ten lowest eigenfrequencies for a room of dimension 4 m × 3 m × 2 m.
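The entries of Table 3.1 follow directly from (3.14) with kn as given in (3.8). The sketch below (our own code, assuming c = 343 m/s, which reproduces the tabulated values) enumerates low-order modes and sorts them by frequency:

```python
import itertools

def eigenfrequencies(Lx, Ly, Lz, count, c=343.0, n_max=6):
    """Eigenfrequencies f_n = (c/2pi)*k_n of a rigid-walled rectangular room,
    with k_n = pi*sqrt((nx/Lx)^2 + (ny/Ly)^2 + (nz/Lz)^2)."""
    freqs = []
    for nx, ny, nz in itertools.product(range(n_max + 1), repeat=3):
        if (nx, ny, nz) == (0, 0, 0):
            continue  # skip the trivial DC "mode"
        f = (c / 2.0) * ((nx / Lx) ** 2 + (ny / Ly) ** 2 + (nz / Lz) ** 2) ** 0.5
        freqs.append((f, (nx, ny, nz)))
    return sorted(freqs)[:count]

modes = eigenfrequencies(4.0, 3.0, 2.0, 10)
```

The first entry is the (1, 0, 0) axial mode at 343/(2 · 4) = 42.875 Hz, and the degenerate pair at 85.75 Hz ((0, 0, 1) and (2, 0, 0)) appears exactly as in the table.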
If a = Σ_i αi Si denotes the total absorption of the room (where αi and Si are the absorption coefficient and surface area of wall i, respectively), then the rate of change of total acoustic energy in the room can be expressed through the following conservation rule,

(d/dt) [4V I(t)/c] = Π(t) − a I(t) (3.17)

where c is the speed of sound in the medium.
The solution to (3.17) can be written as

I(t) = (c/4V) e^(−act/4V) ∫_{−∞}^{t} Π(τ) e^(acτ/4V) dτ (3.18)
If the sound power Π(t) fluctuates slowly relative to the time constant 4V/ac, then the intensity I(t) will be approximately proportional to Π(t):

I(t) ≈ Π(t)/a (3.19)

Intensity Level ≈ 10 log10 (Π(t)/a) + 90 dB above 10^−16 watt/cm²

if Π(t) is in ergs per second and a is in square centimeters.
In the event that the sound power Π(t) fluctuates in a time short compared to the time constant 4V/ac, the intensity will not follow the fluctuations of Π(t), and if the sound is shut off suddenly at time t = 0, the subsequent intensity can be expressed using (3.17) as

I(t) = I0 e^(−act/4V) (3.20)

Intensity Level = 10 log10 I0 + 90 − 4.34(act/4V) (dB)

Thus, upon turning off the source, the intensity level drops off linearly at a rate of 4.34ac/4V dB per second.
The reverberation time of the room, which characterizes the time over which the energy of reflections arriving from walls or boundary surfaces is non-negligible, is defined as the time it takes for the intensity level to drop by 60 dB after the source is switched off. Thus, if the dimensions of the room are measured in meters, the reverberation time T60 is given by

T60 = 60 · 4V/(4.34ac) = 0.161 V / (Σ_i αi Si) (3.21)
Subsequently, the result from (3.22) is converted to dB scale and the following ex-
pression is used for computing T60 ,
T60 = 60 (∆L/∆t)^−1 (3.23)
where ∆L/∆t is in dB/second. Frequently, the slope of the decay curve is determined between −5 dB and −35 dB relative to the steady-state level. In addition, the integration interval used in (3.22) is important in practice. Of course, an upper limit of integration of ∞ is not possible in real-world applications, so a finite integration interval is chosen. Care should be taken that the integration interval is neither too long, since the decay curve will then have a tail that limits the useful dynamic range, nor too short, since that would cause a downward bend of the curve.
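The backward-integration procedure and the slope fit between −5 dB and −35 dB can be sketched as follows (our own minimal implementation, verified here on a synthetic exponential decay whose T60 is known by construction):

```python
import math

def schroeder_t60(h, fs, hi_db=-5.0, lo_db=-35.0):
    """Estimate T60 from an impulse response h by Schroeder backward integration,
    fitting the decay slope between hi_db and lo_db re steady state, as in (3.23)."""
    # Backward-integrated energy decay curve, in dB re its initial value
    tail = 0.0
    edc = [0.0] * len(h)
    for n in range(len(h) - 1, -1, -1):
        tail += h[n] ** 2
        edc[n] = tail
    db = [10.0 * math.log10(e / edc[0]) for e in edc]
    pts = [(n / fs, d) for n, d in enumerate(db) if lo_db <= d <= hi_db]
    # Least-squares slope dL/dt in dB/s over the selected evaluation range
    tbar = sum(t for t, _ in pts) / len(pts)
    dbar = sum(d for _, d in pts) / len(pts)
    slope = (sum((t - tbar) * (d - dbar) for t, d in pts)
             / sum((t - tbar) ** 2 for t, _ in pts))
    return 60.0 / -slope

# Synthetic exponential decay with a known T60 of 0.4 s
fs, T60 = 8000, 0.4
h = [math.exp(-6.91 * n / fs / T60) for n in range(int(1.5 * fs * T60))]
est = schroeder_t60(h, fs)
```

For the synthetic decay the estimate lands on 0.4 s; for measured room responses the choice of upper integration limit matters, exactly as discussed above.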
Figure 3.3 shows a room impulse response recorded in a room using a loud-
speaker A, whereas the measured T60 , based on Fig. 3.4 (and using the measured
length L = 8192 samples of the room response as the upper limit for integration), is
found to be approximately 0.25 seconds. The effect of the upper limit of integration
² Recall from Section 1.1.4 that the impulse response is the output of a linear system when the input is δ(n). In practice, room responses are measured by applying a broadband signal to the room (such as a logarithmic chirp, noise-type sequences, etc.) and recording the result through a microphone. More information is provided in a subsequent section.
corresponding to 0.0625L is shown in Fig. 3.5, whereas the upper limit of integration
is 0.5L in Fig. 3.6.
Fig. 3.4. The energy decay curve based on the Schroeder integrated impulse response technique for loudspeaker A.

Fig. 3.5. The energy decay curve based on using 0.0625L as an upper limit of integration.

Figure 3.7 shows a room impulse response recorded in a room using a loudspeaker B, at a different position, whereas the measured T60, based on Fig. 3.8 (and using the measured length L of the room response as the upper limit for integration), is again found to be approximately 0.25 seconds, showing reasonable independence from the type of loudspeaker used in measuring the room response and the position where the response was measured.
Fig. 3.6. The energy decay curve based on using 0.5L as an upper limit of integration.

Finally, the effect of large reverberation is that it degrades the quality of audio signals such as speech. Thus, to maintain high speech quality in rooms, one can design the reverberation time to be small by increasing the absorption a of the room. However, this contradicts the requirement that the transient intensity (3.19) be kept high. Thus, a compromise between these two opposing requirements is needed when designing rooms.
Fig. 3.8. The energy decay curve based on the Schroeder integrated impulse response tech-
nique for loudspeaker B.
The direct field component for sound pressure, pf,d,i , of a plane wave, at far field
listener location i for a sound source of frequency f located at i0 can be expressed
as [12]
where pf,d (i|i0 ) is the direct component sound pressure amplitude, Sf is the source
strength, k = 2π/λ is the wavenumber, c = λf is the speed of sound (343 m/s) and
ρ is the density of the medium (1.25 kg/m3 at sea level).
The normalized correlation function [100] which expresses a statistical relation
between sound pressures, of reverberant components, at separate locations i and j,
is given by
where Rij is the separation between the two locations i and j relative to an origin,
and E{.} is the expectation operator.
The reverberant-field mean square pressure is defined as

E{p_{f,rev,i} p*_{f,rev,i}} = 4cρΠa(1 − ᾱ) / (S ᾱ) (3.27)

where Πa is the power of the acoustic source, ᾱ is the average absorption coefficient of the surfaces in the room, and S is the surface area of the room.
The assumption of a statistical description for reverberant fields in rooms is justified if the following conditions are fulfilled [16]: (i) the linear dimensions of the room are large relative to the wavelength; (ii) the average spacing of the resonance frequencies is smaller than one-third of their bandwidth (this condition is fulfilled in rectangular rooms at frequencies above the Schroeder frequency, fs = 2000 √(T60/V) Hz, where T60 is the reverberation time in seconds and V is the volume in m³); and (iii) both source and microphone are in the interior of the room, at least a half-wavelength away from the walls.
Furthermore, under the conditions in [16], the direct and reverberant sound pressures are uncorrelated.
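Condition (ii) above gives a simple numerical check (a small sketch; the function name is our own):

```python
import math

def schroeder_frequency(t60, volume):
    """Schroeder frequency f_s = 2000 * sqrt(T60 / V) Hz, with T60 in seconds
    and V in m^3, above which a statistical reverberant-field description holds."""
    return 2000.0 * math.sqrt(t60 / volume)

# e.g., a 200 m^3 room with T60 = 0.5 s
fs_room = schroeder_frequency(0.5, 200.0)
```

For this example the Schroeder frequency is 100 Hz, so a statistical treatment of the reverberant field is reasonable above that frequency while modal analysis is needed below it.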
3.6 Measurement of Loudspeaker and Room Responses
The MLS-based method for finding the impulse response is based on cross-correlating
a measured signal with a pseudo-random (or deterministic) sequence. The motivation
for this approach is explained through the following derivation. Let x(t) be a station-
ary sound signal having autocorrelation φxx (t), which is applied to the room with
response h(t) through a loudspeaker. Then the signal received at the microphone is
Fig. 3.9. (a) Room impulse response; (b) zoomed version of the response showing direct, early
reflections, and reverberation.
y(t) = ∫_{−∞}^{∞} x(t − t′) h(t′) dt′ (3.28)

Forming the cross-correlation, φyx(τ), between the received signal y(t) and the transmitted signal x(t), we have

φyx(τ) = lim_{T0→∞} (1/T0) ∫_{−T0/2}^{T0/2} ∫_{−∞}^{∞} x(t + τ − t′) h(t′) x(t) dt′ dt
       = ∫_{−∞}^{∞} φxx(τ − t′) h(t′) dt′ (3.29)
For an MLS s(k) of period L, the sequence values satisfy

Σ_{k=0}^{L−1} s(k) = −1

and the periodic autocorrelation is

φss(k) = (1/L) Σ_{n=0}^{L−1} s(n) s(n + k) = 1 for k = 0, L, 2L, . . . , and −1/L otherwise (3.30)

The measured signal is the circular convolution

y(n) = s(n) ⊗ h(n) = Σ_{p=0}^{L−1} s(n − p) h(p)

and cross-correlating it with the sequence recovers the response:

φyx(k) = (1/L) Σ_{n=0}^{L−1} s(n) y(n + k)
      = (1/L) Σ_{p=0}^{L−1} Σ_{n=0}^{L−1} s(n) s(n + k − p) h(p)
      = Σ_{p=0}^{L−1} φss(k − p) h(p) = h(k) − (1/L) Σ_{p=1}^{L−1} h(k − p) (3.31)
The first term in (3.31) is the recovered response, whereas the second term represents a small DC component that vanishes for a sufficiently large value of L.

³ Signals satisfying such autocorrelation functions are referred to as white noise signals.
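The derivation above can be verified numerically. The sketch below generates an MLS of period L = 127 with a Fibonacci linear feedback shift register (the tap choice (7, 6), corresponding to x⁷ + x⁶ + 1, is one standard primitive polynomial; all other names and the four-tap test response are our own), then recovers the response via (3.31):

```python
def mls(m=7, taps=(7, 6)):
    """Maximum-length sequence of period L = 2^m - 1 from a Fibonacci LFSR.
    taps=(7, 6) corresponds to the primitive polynomial x^7 + x^6 + 1."""
    reg = [1] * m
    seq = []
    for _ in range(2 ** m - 1):
        fb = reg[taps[0] - 1] ^ reg[taps[1] - 1]
        seq.append(1 - 2 * reg[-1])   # map bit {0, 1} -> {+1, -1}
        reg = [fb] + reg[:-1]
    return seq

def circ_xcorr_recover(s, h_true):
    """Simulate y = s (x) h (circular) and recover h via (3.31):
    phi_yx(k) = (1/L) * sum_n s(n) y(n+k)."""
    L = len(s)
    y = [sum(s[(n - p) % L] * h_true[p] for p in range(len(h_true))) for n in range(L)]
    return [sum(s[n] * y[(n + k) % L] for n in range(L)) / L for k in range(len(h_true))]

s = mls()
h_true = [1.0, 0.5, -0.25, 0.125]
h_rec = circ_xcorr_recover(s, h_true)
```

The sequence sums to −1 over one period, matching the MLS property above, and the recovered taps match the true ones to within the O(1/L) bias of the second term in (3.31).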
Fig. 3.10. (a) Time domain response of linear sweep; (b) magnitude response of the linear
sweep.
Fig. 3.11. (a) Time domain response of logarithmic sweep; (b) magnitude response of the log
sweep.
Furthermore, it has been shown [21] that in the presence of nonwhite noise the MLS and IRS methods for measuring room impulse responses are the most accurate, whereas in quiet environments the logarithmic sine-sweep method is the most appropriate choice of excitation signal.
3.7 Psychoacoustics
The perception of sound is an important area, and recent systems employing audio compression techniques use principles from auditory perception, or psychoacoustics, to design lower bit-rate systems without significantly sacrificing audio quality. Likewise, it is a natural extension that certain properties of human auditory perception (e.g., frequency selectivity) be exploited to design efficient systems, such as room equalization systems, which aim at minimizing the detrimental effects of room acoustics.
Sound vibrations in the ear canal are transmitted to the tympanic membrane, and in
turn are transmitted through the articulations of the ossicles to the attachment of the
foot plate of the stapes on the membrane of the oval window. The ossicles amplify
the vibrations of sound and in turn pass them on to the fluid-filled inner ear.
The cochlea, which is the snail-shaped structure, and the semicircular canals con-
stitute the inner ear. The cochlea, enclosing three fluid-filled chambers, is encased in
the temporal bone with two membranous surfaces exposed at its base (viz., the oval
window and the round window). The foot plate of the stapes adheres to the oval
window, transmitting sound vibrations into the cochlea. Two of the three cochlear
chambers are contiguous at the apex. Inward deflections of the oval window caused
by the foot plate of the stapes compress the fluid in the scala vestibuli; this compres-
sion wave travels along the coils of the cochlea in the scala vestibuli to the apex,
then travels back down the coils in the scala tympani. The round window serves as
a pressure-relief vent, bulging outward with inward deflections of the oval window.
The third cochlear chamber, the scala media or cochlear duct, is positioned between
the scala vestibuli and scala tympani. Pressure waves from sound traveling up the
scala vestibuli and back down the scala tympani produce a shearing force on the hair
cells of the organ of Corti in the cochlear duct. Within the cochlea, hair cell sensitiv-
ity to frequencies progresses from high frequencies at the base to low frequencies at
the apex. The cells in the single row of inner hair cells passively respond to deflec-
tions of sound-induced pressure waves. Thus, space (or distance) along the cochlea is
mapped to the excitation or resonant frequency, and hence the cochlea can be viewed
as an auditory filtering device responsible for selective frequency amplification or
attenuation depending on the frequency content of the source sound.
Fig. 3.13. Equal loudness contours from Robinson and Dadson [23].
The contour for 0 phon (or 0 dB SPL of a 1 kHz tone) represents the minimum
audible field contour and represents on an average (among listeners) the absolute
lower limit of human hearing at various frequencies. Thus, for example, at 0 phon,
human hearing is most sensitive at frequencies between 3 kHz and 5 kHz as these
represent the lowest part of the 0 phon curve. Furthermore, for example, the sound
pressure level (SPL) has to be increased by as much as 75 dB at 20 Hz in order for
a 20 Hz tone to sound equally as loud as the 1 kHz tone at 0 phon loudness level
(or 0 dB SPL). In addition, the rate of growth of loudness with level is much greater at low frequencies than at middle frequencies: for example, going from 0 phon to 90 phon, the SPL at 20 Hz increases by only 50 dB, whereas at 1 kHz the SPL increases by 90 dB. Because a smaller increase in SPL at low frequencies produces the same change in loudness level, loudness grows faster with level at low frequencies. This can be observed when human voices are played back at high levels via loudspeakers, making them sound "boomy," as the ear becomes relatively more sensitive to lower frequencies at higher intensities.
Various SPL meters attempt to give an approximate measure of the loudness of complex tones. Such meters contain weighting networks (e.g., A, B, C, and RLB) that weight the intensities computed in third-octave frequency bands with the appropriate weighting curves before summing across frequencies.
The A weighting is based on a 30 phon equal loudness contour for measuring
complex sounds having relatively low sound levels, the B weighting is used for in-
termediate levels and approximates the 70 phon contour, whereas the C weighting is
for relatively high sound levels and approximates the 100 phon contour. The three
weighting network contours are shown in Fig. 3.14. Thus, if a level is specified as 105 dBC, then the inverse C weighting (i.e., the inverse of the contour shown in Fig. 3.14) is used for computing the SPL.
In the previous section, the equal loudness contours, which are a function of the loudness level in phons, were presented. Stevens [25] presented data that derive scales relating the physical magnitude of sounds to their subjective loudness. In this process, the subject is asked to adjust the level of a test tone until it has a specified loudness, either in absolute terms or relative to a standard (e.g., twice as loud, half as loud, etc.). Stevens derived a closed-form expression that relates loudness L to intensity I through a constant k as

L = k I^0.3 (3.37)
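Equation (3.37) implies the familiar rule of thumb that a 10 dB increase in intensity roughly doubles perceived loudness, since the constant k cancels when forming a loudness ratio (a small sketch; the function name is our own):

```python
def loudness_ratio(delta_db):
    """Loudness ratio implied by Stevens' law (3.37), L = k * I^0.3,
    for an intensity change of delta_db decibels (k cancels in the ratio)."""
    return (10.0 ** (delta_db / 10.0)) ** 0.3

ratio_10db = loudness_ratio(10.0)  # 10^(0.3) ~ 2: roughly a doubling of loudness
```

A 0 dB change leaves loudness unchanged, while each further 10 dB of intensity multiplies the predicted loudness by about two.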
The detection of tones (such as those represented by the absolute threshold of hearing or the equal loudness contours) also depends on the duration of the stimulus tone.
Fig. 3.16. The detectability of a 1 kHz tone as a function of the tone duration in milliseconds.
The relation between the duration of the tone, t, and the threshold intensity, I, required for detection can be expressed as [26]

(I − I_L) × t = k (3.38)

where I_L is the threshold intensity for a long-duration tone and k is a constant. The breakdown of this relation for very short tones indicates that the ear can integrate energy over a fairly narrow frequency range, and this range is exceeded for short-duration signals.
The peripheral ear acts as a bank of band-pass filters due to the space-to-frequency
transformation induced by the basilar membrane [26]. These filters are known as
auditory filters and have been studied by several researchers [28, 29, 30, 31] and
are conceptualized to have either a rectangular or triangular shape with a simplifying assumption of symmetry around the center frequency of the auditory filter.
The shape and bandwidth of these filters can be estimated, for example, through
the notched noise approach [32] where the width of the notch of a band-stop noise
spectrum is varied. Figure 3.17 shows a symmetric auditory filter which is cen-
tered on a sinusoidal tone with frequency f0 and a band-stop noise spectrum with a
notch of width 2∆f . By increasing the width of the notch, less noise passes through
the auditory filter and hence the threshold required to detect the sinusoidal tone of
frequency f0 decreases. Decreasing the notch width allows more noise energy through
the auditory filter, making the sinusoidal tone harder to detect and thus raising the
threshold.
The filter is parameterized in terms of the equivalent rectangular bandwidth
(ERB), which can be expressed as a function of the filter center frequency f0 (expressed
in kHz) as

ERB(f0) = 24.7 (4.37 f0 + 1) Hz    (3.39)
Fig. 3.17. Estimation of the auditory filter shape or bandwidth with the notched noise ap-
proach.
The threshold for detecting a tone in noise increases at first as the noise bandwidth
increases, but then flattens out beyond a critical bandwidth, such that any additional
increase in noise will not affect detectability of the sinusoidal tone. Fletcher referred
to the bandwidth CB at which the signal threshold ceased to increase as the critical
bandwidth; the scale built from successive critical bands is known as the Bark scale.
The critical bandwidth can be modeled, with f0 again in kHz, as

CB(f0) = 25 + 75 (1 + 1.4 f0^2)^{0.69} Hz    (3.40)
Fig. 3.19. The critical band model obtained from Eq. (3.40).
72 3 Introduction to Acoustics and Auditory Perception
Figure 3.20 shows the differences between the ERB and critical band filter band-
widths as a function of center frequency. It is evident that the ERB-based auditory
filter models have better frequency resolution at lower frequencies than the critical
band-based auditory filter models, whereas the differences are generally not substan-
tial at higher frequencies.
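The comparison in Fig. 3.20 can be reproduced numerically with commonly quoted models for the two curves; the sketch below assumes the Glasberg–Moore ERB expression and Zwicker's critical-band formula, which may differ in detail from the models used for the figure:

```python
def erb_hz(f_khz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore model), f in kHz."""
    return 24.7 * (4.37 * f_khz + 1.0)

def critical_bw_hz(f_khz):
    """Critical bandwidth in Hz (Zwicker's model), f in kHz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# At low center frequencies the ERB is much narrower (finer resolution);
# the relative gap shrinks toward higher frequencies.
for f in (0.1, 0.5, 1.0, 4.0):
    print(f, round(erb_hz(f), 1), round(critical_bw_hz(f), 1))
```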
3.8 Summary
In this chapter we have presented the fundamentals of acoustics and sound propaga-
tion in rooms, including reverberation time and its measurement. We have also
presented the concept of a room response and the popular stimulus signals used for
measuring room impulse responses. Finally, concepts from psychoacoustics relating
to perception of sound were presented.
Part III

4
Immersive Audio Synthesis and Rendering Over Loudspeakers

4.1 Introduction
Multichannel sound systems such as those used in movie or music reproduction in
5.1 channel surround sound systems or new formats such as 10.2 channel immersive
audio require many more tracks for content production than the number of audio
channels used in reproduction. This has been true since the early days of monophonic
and two-channel stereo recordings that used multiple microphone signals to create
the final one- or two-channel mixes.
In music recording there are several constraints that dictate the use of multiple
microphones. These include the sound pressure level of various instruments, the ef-
fects of room acoustics and reverberation, the spectral content of the sound source,
the spatial distribution of the sound sources in the space, and the desired perspective
that will be rendered over the loudspeaker system.
As a result it is not uncommon to find that tens of microphones may be used
to capture a realistic musical performance that will be rendered in surround sound.
Some of these are placed close to instruments or performers and others farther away
so as to capture the interaction of the sound source with the environment.
Despite the emergence of new consumer formats that support multiple audio
channels for music, the growth of content has been slow. In this chapter we de-
scribe methods that can be used to automatically generate the multiple microphone
signals needed for a multichannel rendering without having to record using multiple
real microphones, which we refer to as immersive audio synthesis. The applications
of such virtual microphones can be found both in the conversion of older recordings
to today's 5.1 channel formats and in the upconversion of today's 5.1 channel content
to future multichannel formats that will inevitably consist of more channels for more
realistic reproduction.
© 2000 IEEE. Reprinted, with permission, from C. Kyriakakis and A. Mouchtaris, “Virtual
microphones for multichannel audio applications”, Proc. IEEE Conf. on Multimedia
and Expo, 1:11–14.
rithm, and the normalized frequency domain adaptive filter (NFDAF) LMS inverse
algorithm [55, 54]. The authors wish to acknowledge Dr. Athanasios Mouchtaris
whose PhD dissertation at the USC Immersive Audio Laboratory formed the basis
for much of the work described regarding synthesis and Dr. Jong-Soong Lim whose
PhD dissertation formed the basis for much of the work on rendering.
The required filter V , which can be used to synthesize the microphone signal m2
from the reference signal m1 , is
V = V2 / V1 = A1 / A2    (4.4)
One way to ensure that each virtual microphone filter V is both stable and com-
putationally efficient, is to use linear prediction analysis to design a stable all-pole
filter. With linear prediction, a certain number of past samples in each signal time
domain record are linearly combined to provide an estimate of future samples. For
example, if at time n in the signal m1 , q past samples are considered then the estimate
of the signal at time n can be written as
m1^{lp}(n) = ∑_{k=1}^{q} a(k) m1(n − k)    (4.5)
The prediction error of this process from the actual microphone signal is then
E{ m1(n − k) [ m1(n) − ∑_{i=1}^{q} a(i) m1(n − i) ] } = 0,   k = 1, . . . , q    (4.10)

r(−k) = ∑_{i=1}^{q} a(i) r(i − k),   k = 1, . . . , q    (4.11)
in which r(n) is the autocorrelation function of m1 (n). Equation (4.11) makes use of
the fact that the process m1 is wide-sense stationary in the block under consideration.
Finally, because the autocorrelation function is symmetric, and the absolute value of
i − k is in the interval [0, q − 1] (4.11) can be rewritten in matrix form as
[ r(0)      r(1)      . . .  r(q − 1) ] [ a(1) ]   [ r(1) ]
[ r(1)      r(0)      . . .  r(q − 2) ] [ a(2) ] = [ r(2) ]
[  .          .        .        .     ] [  .   ]   [  .   ]
[ r(q − 1)  r(q − 2)  . . .  r(0)     ] [ a(q) ]   [ r(q) ]    (4.12)
The coefficients a(i) of the virtual microphone filter can be found from the above
equation by inverting the correlation matrix R. This can be performed very efficiently
using a recursive method such as the Levinson–Durbin algorithm because of the Toeplitz
form of the correlation matrix R and under the assumption of ergodicity.
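The Levinson–Durbin recursion solves the Toeplitz system (4.12) in O(q²) operations instead of the O(q³) of a general matrix inverse. A minimal pure-Python sketch (the AR(1) autocorrelation used to exercise it is illustrative):

```python
def levinson_durbin(r, q):
    """Solve the order-q normal equations for the LP coefficients a(1..q).

    r : autocorrelation values r[0..q] of the signal block.
    Returns (a, err): a[i-1] is a(i) in (4.5) and err is the final
    prediction error power.
    """
    a = []
    err = r[0]
    for i in range(q):
        # reflection coefficient for the order-(i+1) predictor
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        # order-update: combine the lower-order predictor with its reversal
        a = [a[j] - k * a[i - 1 - j] for j in range(i)] + [k]
        err *= 1.0 - k * k
    return a, err

# An AR(1) process with pole 0.9 has autocorrelation proportional to 0.9**m;
# the recursion recovers a(1) = 0.9 and (near-)zero higher-order coefficients.
r = [0.9 ** m for m in range(4)]
a, err = levinson_durbin(r, 3)
print(a, err)
```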
80 4 Immersive Audio Synthesis and Rendering Over Loudspeakers
The methods described in the previous section must be applied in blocks of data of
the two microphone signal processes m1 and m2 . A set of experiments was con-
ducted to subjectively verify the validity of these methods. Signal block lengths of
100,000 samples were chosen because the recordings, sampled at 48 kHz, were
obtained in a hall with a reverberation time of 2 s. Experiments
were performed with various orders of filters A1 and A2 to obtain an understanding
of the tradeoffs between performance and computational efficiency. Relatively high
orders were required to synthesize a signal m2 from m1 with an acceptable error
between m2p (the reproduced process) and m2 (the actual microphone signal). The
error was assessed through blind A/B/X listening evaluations. An order of 10,000 co-
efficients for both the numerator and denominator of V resulted in an error between
the original and synthesized signals that was not detectable by the listeners. The per-
formance of the filter was also evaluated by synthesizing blocks from a section of the
signal different from the one that was used for designing the filter. Again, the A/B/X
evaluation showed that for orders higher than 10,000 the synthesized signal was in-
distinguishable from the original. Although such high order filters are impractical for
real-time applications, the performance of this method is an indication that the model
is valid and therefore worthy of further investigation to achieve filter optimization.
In addition to listening evaluations, a mathematical measure of the distance be-
tween the synthesized and the original processes can be found. This measure can be
used during the optimization process in order to achieve good performance and at
the same time minimize the number of coefficients. The difficulty in defining such a
measure is that it must also be psychoacoustically valid. This problem has been ad-
dressed in speech processing in which measures such as the log spectral distance and
the Itakura distance are used [47]. In the case presented here, the spectral character-
istics of long sequences must be compared with spectra that contain a large number
of peaks and dips that are narrow enough to be imperceptible to the human ear. To
approximately match the spectral resolution of the human ear 1/3 octave smoothing
was performed [26] followed by a comparison of the resulting smoothed spectral
cues. The results are shown in Fig. 4.1, in which the error between the spectra of
the original (measured) microphone signal and the synthesized signal is shown.
The two spectra are practically indistinguishable below 10 kHz. Although the error
increases somewhat at higher frequencies, the listening evaluations show that this is
not perceptually significant.
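The 1/3-octave smoothing step can be sketched as a power average over a window of ±1/6 octave around each frequency bin (the log-spaced grid and test spectrum below are illustrative; the smoothing used for Fig. 4.1 may differ in detail):

```python
def third_octave_smooth(freqs, mags):
    """Power-average a magnitude spectrum over +/- 1/6 octave around each bin."""
    lo, hi = 2.0 ** (-1.0 / 6.0), 2.0 ** (1.0 / 6.0)
    out = []
    for f in freqs:
        band = [m * m for fk, m in zip(freqs, mags) if f * lo <= fk <= f * hi]
        out.append((sum(band) / len(band)) ** 0.5)
    return out

# A single narrow dip in an otherwise flat spectrum is largely averaged away,
# mimicking the ear's limited spectral resolution.
freqs = [100.0 * 1.01 ** k for k in range(300)]  # log-spaced frequency grid
mags = [1.0] * 300
mags[150] = 0.1                                  # one narrow (inaudible) dip
smoothed = third_octave_smooth(freqs, mags)
print(min(mags), round(min(smoothed), 3))
```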
The method described in the previous section is appropriate for the synthesis of
microphones placed far from the source that capture mostly reverberant sound in
the recording environment. However, it is common practice in music recording to
also use microphones that are placed very close to individual instruments. Synthe-
sis of such virtual microphone signals requires a different approach because these
4.2 Immersive Audio Synthesis 81
Fig. 4.1. Magnitude response error between the approximating and approximated spectra.
signals exhibit quite different spectral characteristics compared to the reference mi-
crophones. These microphones are used, for example, near the tympani or the wood-
winds in classical music so that these instruments can be emphasized in the multi-
channel mix during certain passages. The signal in these microphones is typically
not reverberant because of their proximity to the instruments.
As suggested in the previous section, this problem can be classified as a system
identification problem. The most important consideration in this case is that it is not
theoretically possible to design a generic time-invariant filter that will be suitable for
any recording. Such a filter would have to vary with the temporal characteristics of
the frequency response of the signal. The response is closely related to the joint time
and frequency properties of the reference microphone signals.
The approach that we followed for recreating these virtual microphones is based
on a method used for synthesizing percussive instrument sounds [48]. Thus, the
method described here is applicable only for microphones located near percussion
instruments. According to [48], it is possible to synthesize percussive sounds in a
natural way, by following an excitation/filter model. The excitation part corresponds
to the interaction between the exciter and the resonating body of the instrument and
lasts until the structure reaches a steady vibration, and the resonance part corresponds
to the free vibration of the instrument body. The resonance part can be easily de-
scribed from the frequency response of the instrument using several modeling meth-
ods (e.g., the AR modeling method that was described in the previous paragraph).
Then, the excitation part can be derived by filtering the instrument’s response with
the inverse of the resonance filter. The excitation part is independent of the frequen-
cies and decays of the harmonics of the instrument at a given time (after the in-
strument has reached a steady vibration) so it can be used for synthesizing different
sounds by using an appropriate resonance filter. Therefore, it is possible to derive an
excitation signal from a recording that contains only the instrument we wish to en-
hance and then filter it with the resonance filter at a given time point of the reference
recording in order to enhance the instrument at that particular time point. It is im-
portant to mention that the recreated instrument does not contain any reverberation
if the excitation part was derived from a recording that did not originally contain any
reverberation.
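The excitation/filter decomposition can be illustrated with a toy all-pole model: filter the recording with the inverse of the resonance filter to obtain the excitation, then drive a different resonance with the same excitation. The one-pole "resonances" and the idealized strike below are illustrative stand-ins for AR models fitted to real tympani recordings:

```python
def all_pole_filter(x, a):
    """y(n) = x(n) + sum_i a[i] * y(n-1-i): excitation x through resonance a."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for i, ai in enumerate(a):
            if n - 1 - i >= 0:
                acc += ai * y[n - 1 - i]
        y.append(acc)
    return y

def inverse_filter(y, a):
    """Exact inverse of all_pole_filter: recovers the excitation from y."""
    x = []
    for n, yn in enumerate(y):
        acc = yn
        for i, ai in enumerate(a):
            if n - 1 - i >= 0:
                acc -= ai * y[n - 1 - i]
        x.append(acc)
    return x

strike = [1.0] + [0.0] * 19                     # idealized impulsive excitation
recorded = all_pole_filter(strike, [0.9])       # "recorded" instrument, pole 0.9
excitation = inverse_filter(recorded, [0.9])    # recovers the strike exactly
resynth = all_pole_filter(excitation, [0.7])    # same strike, new resonance
print(recorded[:3], resynth[:3])
```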
The above analysis has been successfully tested with tympani sounds. A poten-
tial drawback of this method is that the excitation part depends on the way that the
instrument was struck so it is possible that more than one excitation signal might be
required for the same instrument. Also, for the case of the tympani sounds, it is not
an easy task to define a procedure for finding the exact time points that the tympani
was struck, that is, the points when the enhancement procedure should take place.
Solutions to overcome these drawbacks are under investigation.
The methods described above are effective for synthesizing signals in virtual mi-
crophones that are placed at a distance from the sound source (e.g., orchestra) and
therefore, contain more reverberation. The IIR filtering solution was proposed ex-
actly for addressing the long reverberation-time problem, which meant long impulse
responses for the filters to be designed. On the other hand, signals from microphones
located close to individual sources (e.g., spot microphones near a particular musical
instrument) do not contain very much reverberation. A completely different prob-
lem arises when trying to synthesize such signals. Placing such microphones near
individual sources with varying spectral characteristics results in signals whose fre-
quency content will depend highly on the microphone positions.
In order to synthesize signals in such closely placed microphones it is necessary
to identify the frequency bands that need to be amplified or attenuated for each mi-
crophone. This can be easily achieved when the reference microphone is relatively
far from the orchestra, so that we can consider that all frequency bands were equally
weighted during the recording. In order to generate a reference signal from such
a distant microphone that can be used to synthesize signals in the nonreverberant
microphones it is necessary to find some method for dereverberating the reference
signal.
One complication with this approach is that we do not know the filter that trans-
forms the signal from the orchestra to the reference microphone. We are investigating
methods for estimating these filters based on a technique for blind channel identifi-
cation using cross-spectrum analysis [49]. The idea is to use the two closely spaced
microphones in the center (hanging above the conductor’s head) as two different ob-
servations of the same signal processed by two different channels (the path from the
orchestra to each of the two microphones). The Pozidis and Petropulu algorithm uses
the phase of the cross-spectrum of the two observations and allows us to estimate the
two channels. Further assumptions, though, need to be made in order to have a unique
solution to the problem. The most important is that the two channels are assumed to
be of finite length. In general, however, they can be nonminimum phase, which is
a desired property. All the required assumptions are discussed in [49] and their im-
plications for the specific problem examined here are currently under investigation.
After the channels have been identified, the recordings can be equalized using the
estimated filters. These filters can be nonminimum phase, as explained earlier, so a
method for equalizing nonminimum phase channels must be used. Several methods
exist for this problem; see, for example, [50]. The result will be not only a derever-
berated signal but an equalized signal, ideally equal to the signal that the microphone
would record in an anechoic environment. That signal could then be used as the seed
to generate virtual microphone signals that would result in multichannel mixes sim-
ulating various recording venues.
E=T·W·X=X (4.14)
Fig. 4.2. Geometry and signal paths from input binaural signals to ears that show the ipsilateral
signal paths (T1 , T4 ) and contralateral signal paths (T2 , T3 ).
In (4.14), the listener’s ear signals and input binaural signals are E = [EL ER ]T ,
X = [XL XR ]T , respectively. The rendering system transfer function matrix T is
T = [ T1  T2
      T3  T4 ]    (4.15)
To obtain optimum performance and deliver the desired signal to each ear, the
matrix product of transfer function matrix T and crosstalk canceling weight vector
matrix W should be the identity matrix
T · W = [ 1  0
          0  1 ]    (4.16)
Therefore the generalized rendering filter weight matrix W requires four weight
vectors to produce the desired signal at the ears of a single listener.
The weight vector matrix W described above can be implemented using the least
mean squares adaptive inverse algorithm [51]. Matrix equations (4.14) and (4.16)
must be modified based on the adaptive inverse algorithm for multiple channels as
follows [44, 45, 52]
E = T · W · X = T · [ W1  W3
                      W2  W4 ] · X = X    (4.17)
The desired result is to find W so that it cancels the crosstalk signals perfectly.
Then the signals E arriving at the ears are exactly the same as the input binaural
signals X. Equation (4.17) can be written as
E = [ T1 W1 + T2 W2   T1 W3 + T2 W4
      T3 W1 + T4 W2   T3 W3 + T4 W4 ] · X

  = [ T1 W1 XL + T2 W2 XL + T1 W3 XR + T2 W4 XR
      T3 W1 XL + T4 W2 XL + T3 W3 XR + T4 W4 XR ]    (4.18)
Fig. 4.3. LMS block diagram for the estimation of the crosstalk cancellation filter with
d1(n) = XL(n − m) and d2(n) = XR(n − m) for the left and right channels, respectively.
Using the time domain LMS adaptive algorithm, the weight vectors are updated
as follows,
Wi(n + 1) = Wi(n) + µ(−∇̂i(n)),   i = 1, . . . , 4    (4.20)
The positive scalar step size µ controls the convergence rate and steady-state
performance of the algorithm. The gradient estimate, ∇̂(n), is simply the derivative
of e²(n) with respect to W(n) [56]. Therefore gradient estimates in the time domain
can be found as
∇̂i(n) = −2{ e1(n)[Ti(n) ∗ XL(n)] + e2(n)[Ti+2(n) ∗ XL(n)] },   i = 1, 2
∇̂i(n) = −2{ e1(n)[Ti−2(n) ∗ XR(n)] + e2(n)[Ti(n) ∗ XR(n)] },   i = 3, 4    (4.21)
in which all input binaural signals and transfer functions are time domain sequences.
The output error is given by
ei (n) = di (n) − yi (n)
= Xi (n − m) − {[W1 (n) ∗ T2i−1 (n) + W2 (n) ∗ T2i (n)] ∗ XL (n)
+ [W3 (n) ∗ T2i−1 (n) + W4 (n) ∗ T2i (n)] ∗ XR (n)} i = 1, 2 (4.22)
in which Xi (n) is
Xi(n) = { XL(n),  i = 1
        { XR(n),  i = 2    (4.23)
Figure 4.3 shows that di (n) could simply be a pure delay, say of m samples,
which will assist in the equalization of the minimum phase components of the trans-
fer function matrix in (4.18). The inclusion of an appropriate modeling delay sig-
nificantly reduces the mean square error produced by the equalization process. The
filter length, as well as the delay m, can be selected based on the minimization of the
mean squared error. This method can be used either offline or in real-time according
to the location of the virtual sound source and the position of the listener’s head. The
weight vectors of the crosstalk canceller can be chosen to be either an FIR or an IIR
filter.
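The adaptive inverse idea, including the role of the modeling delay m, can be sketched for a single channel (a simplified illustration of an update of the form (4.20), not the full 2×2 crosstalk canceller; the channel coefficients, filter length, delay, and step size are all illustrative):

```python
import random

def lms_inverse(channel, w_len, delay, mu, n_iter, seed=0):
    """Adapt w so that (w * channel)(n) approximates delta(n - delay) via LMS."""
    rng = random.Random(seed)
    w = [0.0] * w_len
    x_hist = [0.0] * max(len(channel), delay + 1)  # input history
    s_hist = [0.0] * w_len                         # channel-output history
    sq_errs = []
    for _ in range(n_iter):
        x_hist = [rng.uniform(-1.0, 1.0)] + x_hist[:-1]
        s = sum(c * x_hist[i] for i, c in enumerate(channel))  # through channel
        s_hist = [s] + s_hist[:-1]
        y = sum(wi * si for wi, si in zip(w, s_hist))          # through inverse
        e = x_hist[delay] - y            # desired output is the delayed input
        w = [wi + mu * e * si for wi, si in zip(w, s_hist)]    # LMS update
        sq_errs.append(e * e)
    return w, sq_errs

w, sq_errs = lms_inverse(channel=[1.0, 0.5], w_len=32, delay=4,
                         mu=0.02, n_iter=20000)
print(sum(sq_errs[-1000:]) / 1000.0)  # small residual MSE after convergence
```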
Fig. 4.4. Frequency domain adaptive LMS inverse algorithm block diagrams using overlap-
save method for the estimation of crosstalk canceller weighting vectors based on Fig. 4.3
(i = 1, 2).
Frequency domain implementations of the LMS adaptive inverse filter have several
advantages over time domain implementations that include improved convergence
speed and reduced computational complexity. In practical implementations of fre-
quency domain LMS adaptive filters, the input power varies dramatically over the
different frequency bins. To overcome this, the frequency domain adaptive filter
(FDAF) LMS inverse algorithm [59] can be used to estimate the input power in each
frequency bin. The power estimate can be included directly in the frequency domain
LMS algorithm [55]. The adaptive inverse filter algorithm shown in Fig. 4.3 is mod-
ified in the frequency domain using the overlap-save method FDAF LMS inverse
algorithm [54], which is shown in Fig. 4.4.
The general form of FDAF LMS algorithms can be expressed as

W(k + 1) = W(k) + µ(k) S^H(k) E(k)    (4.24)

in which the superscript H denotes the complex conjugate transpose. The time-varying
matrix µ(k) is diagonal and contains the step sizes µ_l(k). Generally, each step size is
varied according to the signal power in frequency bin l. In the crosstalk canceller
implementation described here
Wi(k + 1) = Wi(k) + µ × F{ F^{−1}[ (Si^H(k)/Pi(k)) · E1(k) + (S_{4+i}^H(k)/P_{4+i}(k)) · E2(k) ] },
i = 1, . . . , 4    (4.25)
Fig. 4.5. Geometry for four loudspeakers and two listeners with the ipsilateral signal paths
(solid lines) and contralateral undesired signal paths (dotted lines).
E=T·S=T·W·X (4.27)
in which the ear signal matrix E is E = [EL1 ER1 EL2 ER2]^T, the loudspeaker
signal matrix S is S = [S1 S2 S3 S4]^T, and the input binaural signal matrix is
X = [XL XR]^T.
From (4.28), the signal paths T1 , T6 , T11 , and T16 are ipsilateral signal paths,
and T2 , T3 , . . . , T15 are undesired contralateral crosstalk signal paths. If the product
of matrices T, W, and X can be simplified as (4.29) for the same side rendered
image, the crosstalk cancellation filter W will be the optimum inverse control filter.
Therefore the desired binaural input signals XL and XR can be delivered to the ears
of each listener without crosstalk.
E = T · W · X = [ 1  0          [ XL
                  0  1      X =   XR
                  1  0            XL
                  0  1 ]          XR ]    (4.29)
E = GX (4.30)
G = [ T1 W1 + T2 W2 + T3 W3 + T4 W4        T1 W5 + T2 W6 + T3 W7 + T4 W8
      T5 W1 + T6 W2 + T7 W3 + T8 W4        T5 W5 + T6 W6 + T7 W7 + T8 W8
      T9 W1 + T10 W2 + T11 W3 + T12 W4     T9 W5 + T10 W6 + T11 W7 + T12 W8
      T13 W1 + T14 W2 + T15 W3 + T16 W4    T13 W5 + T14 W6 + T15 W7 + T16 W8 ].
Fig. 4.6. Geometry and transfer functions for four loudspeakers and two listeners (general
nonsymmetric case).
4.3 Immersive Audio Rendering 89
Fig. 4.7. LMS block diagrams for the estimation of crosstalk canceller weighting vectors in
the general nonsymmetric case, with di(n) = XL(n − m) for i = 1, 3 and di(n) = XR(n − m)
for i = 2, 4.
= [ XL XR XL XR ]^T    (4.31)
In (4.33), all input binaural signals and transfer functions are sample sequences
in the time domain. The output error is given by
in which Xi (n) is
Xi(n) = { XL(n),  i = 1, 3
        { XR(n),  i = 2, 4    (4.35)
Fig. 4.8. Frequency domain adaptive LMS inverse algorithm block diagrams using overlap-
save method for the estimation of crosstalk canceller weighting vectors based on Fig. 4.7
(i = 1, . . . , 4).
yi(n) = [W1(n) ∗ T4i−3(n) + W2(n) ∗ T4i−2(n) + W3(n) ∗ T4i−1(n) + W4(n) ∗ T4i(n)] ∗ XL(n)
      + [W5(n) ∗ T4i−3(n) + W6(n) ∗ T4i−2(n) + W7(n) ∗ T4i−1(n) + W8(n) ∗ T4i(n)] ∗ XR(n),
i = 1, . . . , 4    (4.36)
Figure 4.8 shows the block diagram of the frequency domain adaptive LMS
inverse algorithm for the general nonsymmetric case.
By using the weight vector adaptation algorithm in (4.24),
Wi(k + 1) = Wi(k) + µ × F{ F^{−1}[ (Si^H(k)/Pi(k)) · E1(k) + (S_{8+i}^H(k)/P_{8+i}(k)) · E2(k)
+ (S_{16+i}^H(k)/P_{16+i}(k)) · E3(k) + (S_{24+i}^H(k)/P_{24+i}(k)) · E4(k) ] },
i = 1, . . . , 8    (4.37)
In (4.37), Si(k) and Pi(k) are defined as in the single listener case described above.
Symmetric Case
Each of the two listeners in this configuration is seated at the center line of each loud-
speaker pair. This implies that several of the HRTFs in this geometry are identical
due to symmetry (assuming that no other factors such as room acoustics influence
the system). Therefore the 16 HRTFs from T1 to T16 can be reduced to just 6 HRTFs
Fig. 4.9. Geometry and transfer functions for four loudspeakers and two listeners with sym-
metric geometry.
in which Aij denotes an element of the adjugate matrix. Based on symmetry, A11 = A44,
A12 = A43, A13 = A42, A14 = A41, A21 = A34, A22 = A33, A23 = A32, and A24 = A31.
Therefore, (4.41) becomes
W = (1/det(Ts)) [ A11 + A24   A14 + A21     [ W1  W3
                  A12 + A23   A13 + A22   =   W2  W4
                  A13 + A22   A12 + A23       W4  W2
                  A14 + A21   A11 + A24 ]     W3  W1 ]    (4.42)
E = Ts · W · X
  = [ (T1 XL + T4 XR)  (T2 XL + T3 XR)  (T4 XL + T1 XR)  (T3 XL + T2 XR)     [ W1
      (T2 XL + T8 XR)  (T1 XL + T7 XR)  (T8 XL + T2 XR)  (T7 XL + T1 XR)   ·   W2
      (T8 XL + T2 XR)  (T7 XL + T1 XR)  (T2 XL + T8 XR)  (T1 XL + T7 XR)       W3
      (T4 XL + T1 XR)  (T3 XL + T2 XR)  (T1 XL + T4 XR)  (T2 XL + T3 XR) ]     W4 ]    (4.43)
Figure 4.10 shows a block diagram for generating weight vectors using LMS
algorithms with the symmetry property.
The weight vectors of the crosstalk canceller are updated based on the LMS
adaptive algorithm
Fig. 4.10. LMS block diagrams for the estimation of crosstalk canceller weighting vectors for
the symmetric case.
Fig. 4.11. Frequency domain adaptive LMS inverse algorithm block diagrams using overlap-
save method for the estimation of crosstalk canceller weighting vectors based on Fig. 4.10
(i = 1, . . . , 4).
Wi(n + 1) = Wi(n) + µ × (−∇̂i(n)),   i = 1, . . . , 4    (4.44)
In (4.44), the convergence rate of the LMS adaptive algorithm is controlled by the
step size µ. The gradient estimate, ∇̂(n), is simply the derivative of e²(n) with respect
to W(n). Therefore gradient estimates in the time domain can be written as
∇̂i(n) = −2[ e1(n)C1i(n) + e2(n)C2i(n) + e3(n)C3i(n) + e4(n)C4i(n) ],
i = 1, . . . , 4    (4.45)
in which Cij is an element of the ith row and jth column in matrix C
C = [ (T1 XL + T4 XR)  (T2 XL + T3 XR)  (T4 XL + T1 XR)  (T3 XL + T2 XR)
      (T2 XL + T8 XR)  (T1 XL + T7 XR)  (T8 XL + T2 XR)  (T7 XL + T1 XR)
      (T8 XL + T2 XR)  (T7 XL + T1 XR)  (T2 XL + T8 XR)  (T1 XL + T7 XR)
      (T4 XL + T1 XR)  (T3 XL + T2 XR)  (T1 XL + T4 XR)  (T2 XL + T3 XR) ]    (4.46)
In (4.43), all input binaural signals and transfer functions are sample sequences in
the time domain. The output error is shown in Fig. 4.10. Figure 4.11 shows the block
diagram of the frequency domain adaptive LMS inverse algorithm for the symmetric
case.
By using the weight vector adaptation algorithm in (4.24) we find
Wi(k + 1) = Wi(k) + µ × F{ F^{−1}[ (Si^H(k)/Pi(k)) · E1(k) + (S_{4+i}^H(k)/P_{4+i}(k)) · E2(k)
+ (S_{8+i}^H(k)/P_{8+i}(k)) · E3(k) + (S_{12+i}^H(k)/P_{12+i}(k)) · E4(k) ] },
i = 1, . . . , 4    (4.47)
Fig. 4.12. Frequency response of crosstalk canceller adapted in the time domain. (a) Magni-
tude response of T1 (ω)W1 (ω) + T2 (ω)W2 (ω); (b) magnitude response of T1 (ω)W3 (ω) +
T2 (ω)W4 (ω); (c) phase response of T1 (ω)W1 (ω) + T2 (ω)W2 (ω).
Fig. 4.13. Frequency response of crosstalk canceller adapted in the frequency domain.
(a) Magnitude response of T1 (ω)W1 (ω) + T2 (ω)W2 (ω); (b) magnitude response of
T1 (ω)W3 (ω) + T2 (ω)W4 (ω); (c) phase response of T1 (ω)W1 (ω) + T2 (ω)W2 (ω).
of the contralateral signal is at least 20 dB below the ipsilateral signal in the same
range. Figure 4.13 presents the result of the normalized frequency domain adaptive
filter inverse algorithm.
The magnitude response of the ipsilateral signal is about 0 dB in the frequency
range between 200 Hz and 10 kHz with linear phase. It has almost the same mag-
nitude response as Fig. 4.12. However, the magnitude response of the contralateral
signal is suppressed more than 40 dB below the ipsilateral signal.
The experiments in this case were conducted as shown in Fig. 4.5. The measured
HRTFs were 256 taps long at a sampling rate of 44.1 kHz. A random noise signal,
band-limited to the frequency band between 200 Hz and 10 kHz and sampled at
44.1 kHz, was used as the input to the adaptive LMS algorithm. The filter
coefficients were obtained using LMS in the time and frequency domain as described
above. The performance of the rendering filter for the general nonsymmetric case was
measured based on the matrix equation (4.18). The desired magnitude and phase
response in the frequency domain should satisfy
|M| = [ |A11|  |A12|     [ 1  0
        |A21|  |A22|   =   0  1
        |A31|  |A32|       1  0
        |A41|  |A42| ]     0  1 ],   200 Hz ≤ f ≤ 10 kHz    (4.48)
where A11 = T1 W1 +T2 W2 +T3 W3 +T4 W4 , A12 = T1 W5 +T2 W6 +T3 W7 +T4 W8 ,
A21 = T5 W1 + T6 W2 + T7 W3 + T8 W4 , A22 = T5 W5 + T6 W6 + T7 W7 + T8 W8 ,
A31 = T9 W1 +T10 W2 +T11 W3 +T12 W4 , A32 = T9 W5 +T10 W6 +T11 W7 +T12 W8 ,
A41 = T13 W1 + T14 W2 + T15 W3 + T16 W4 , and A42 = T13 W5 + T14 W6 + T15 W7 +
T16 W8 . Furthermore, all transfer functions (HRTFs) and weight vectors (crosstalk
canceller coefficients) are in the frequency domain. Define Mij as the element at
the ith row and jth column of the magnitude matrix |M| above. In this matrix, the
desired magnitude responses of the ipsilateral and contralateral signals are 1 and 0,
respectively. The phase response is
∠(P) = [ ∠A11  ∠A12     [ e^{−jnω}  X
         ∠A21  ∠A22   =   X          e^{−jnω}
         ∠A31  ∠A32       e^{−jnω}  X
         ∠A41  ∠A42 ]     X          e^{−jnω} ],   200 Hz ≤ f ≤ 10 kHz    (4.49)
in which X in the matrix indicates “don’t care” because of its small magnitude re-
sponse. For optimum performance, the ipsilateral signals in (4.49) should have linear
phase in the frequency band between 200 Hz and 10 kHz so that there is no phase
distortion. Let Pij denote the element at the ith row and jth column in the phase
matrix P. Simulation results for (4.48), obtained with LMS adaptation in the time
domain, are shown in the frequency domain in Fig. 4.14.
It can be seen that the frequency response of the ipsilateral signal in equation
(4.48) is very close to 0 dB in the frequency range between 200 Hz and 10 kHz with
linear phase. Therefore the desired ipsilateral signal (input binaural signal) reaches
the ear from the same-side loudspeaker without distortion as desired. The magnitude
response of the undesired contralateral signal is suppressed between 20 dB and 40
dB relative to the ipsilateral signal in the same frequency range. The corresponding
results for adaptation performed in the frequency domain are shown in Fig. 4.15.
The frequency response of the ipsilateral signal is nearly identical to the response
in Fig. 4.14. The magnitude response of the contralateral signal is suppressed around
40 dB relative to the ipsilateral signal in the same frequency range.
4.3.4 Summary
Fig. 4.14. Frequency response where the weight vectors were obtained based on the LMS
algorithm in the time domain.
Fig. 4.15. Frequency response where the weight vectors were obtained based on the LMS
algorithm in the frequency domain.
the case of two listeners. The methods presented here can also be scaled to more
listeners.
5
Multiple Position Room Response Equalization
© 2006 IEEE. Reprinted, with permission, from S. Bharitkar and C. Kyriakakis, “Visu-
alization of multiple listener room acoustic equalization with the Sammon map”, IEEE
Trans. on Speech and Audio Proc., (in press).
5.1 Introduction
An acoustic enclosure can be modeled as a linear system whose behavior is char-
acterized by a response, known as the impulse response, h(n); n ∈ {0, 1, 2, . . . }.
When the enclosure is a room the impulse response is known as the room impulse
response with a frequency response, H(ejω ). Generally, H(ejω ) is also referred to
as the room transfer function (RTF). The impulse response yields a complete de-
scription of the changes a sound signal undergoes when it travels from a source to
a receiver (microphone/listener) via a direct path and multipath reflections due to
the presence of reflecting walls and objects. By its very definition the room impulse
response is obtained at a receiver (e.g., a microphone) located at a predetermined
position in a room, after the room is excited by a broadband source signal such as
the MLS or the logarithmic chirp signal (described in chapter 3).
It is well established that room responses change with source and receiver loca-
tions in a room [11, 63]. Other reasons for minor variations in the room responses are
due to changes in the room, such as opening/closing of doors and windows. When
these minor variations are ignored, a room response can be uniquely defined by a set of spatial coordinates, l_i \triangleq (x_i, y_i, z_i). It is assumed that the source is at the origin and the receiver i is at the three spatial coordinates, x_i, y_i, and z_i, relative to the source in the room.
When an audio signal is transmitted in a room, the signal is distorted by the
presence of reflecting boundaries. One scheme to minimize this distortion, from a
source to a specific position, is to introduce an equalizing filter that is an inverse of
the room impulse response measured between the source and the listening position.
This equalizing filter is applied to the source signal before transmitting it in a room. If
h_{eq}(n) is the equalizing filter for room response h(n), then, for perfect equalization, h_{eq}(n) \otimes h(n) = \delta(n); where \otimes is the convolution operator and \delta(n) = 1 for n = 0, 0 for n \neq 0, is the Kronecker delta function. However, two problems arise due to this
approach: (i) the room response is not necessarily invertible (i.e., it is not minimum
phase), and (ii) designing an equalizing filter for a specific position will introduce
a poor equalization performance at other positions in a room. In other words, the
multiple point equalization cannot be achieved by an equalizing filter that is designed
for equalizing the response at only one location.
A classic multiple location equalization technique is to average the room re-
sponses and invert the resulting minimum-phase part to form the equalizing filter.
Elliott and Nelson [64] propose a least squares method for designing an equalization
filter for a sound reproduction system by adjusting the filter coefficients to mini-
mize the sum of the squares of the errors between the equalized signals at multi-
ple points in a room and the delayed version of an electrical signal applied to a
loudspeaker. In [65], Mourjopoulos proposes a technique of using a spatial equal-
ization library, based on the position of a listener, for equalizing the response at the
listener position. The library is formed via vector quantization of room responses.
Miyoshi and Kaneda [67] present an “exact” equalization of multiple point room
responses. Their argument is based on the MINT (multiple-input/multiple-output in-
verse theorem) which requires that the multiple room responses have uncommon
zeros among them. A multiple point equalization algorithm using common acousti-
cal poles is demonstrated by Haneda et al. [68]. Fundamentally, the aforementioned
multiple point equalization algorithms are based on a linear least squares approach.
Weiss et al. [62] proposed an efficient and effective multirate signal processing-based
approach for performing equalization at the listeners’ ear positions.
Thus, the main objective of room equalization is the formation of an inverse fil-
ter, heq (n), that compensates for the effects of the loudspeaker and room that cause
sound quality degradation at a listener position. In other words, the goal is to satisfy
heq (n) ⊗ h(n) = δ(n), where ⊗ denotes the convolution operator and δ(n) is the
Kronecker delta function. Because it is well established that room responses change
with source (i.e., loudspeaker) and listener locations in a room [11, 63], clearly, due
to the variations in the impulse responses, between positions, equalization has to
be done simultaneously such that the goal is satisfied at all listening positions. In
practice an ideal delta function is not achievable with low filter orders as room re-
sponses are nonminimum-phase. Furthermore, from a psychoacoustic standpoint, a
target curve, such as a low-pass filter having a reasonably high cutoff frequency is
generally applied to the equalization filter (and hence the equalized response) to pre-
vent the played-back audio from sounding exceedingly “bright”. An example of a
low-pass cutoff frequency is the frequency where the loudspeaker begins its high-
frequency roll-off in the magnitude response. Additionally, the target curve may also
be customized according to the size and/or the reverberation time of the room. A
high-pass filter may also be applied to the equalized response, depending on the
loudspeaker size and characteristics (e.g., a satellite channel loudspeaker), in order to
minimize distortions at low frequencies. Examples of environments where multiple
listener room response equalization is used are in home theater (e.g., a multichannel
5.1 system), automobile, movie theaters, and the like.
5.2 Background
To understand the effects of single location equalization on other locations, consider
a simple first-order specular room reflection model as follows (with the assumption
that the response at the desired location for equalization is invertible). Let the impulse
responses, h1 (n) and h2 (n), from a source to two positions 1 and 2 be represented
as

h_1(n) = \delta(n) + \alpha_2\,\delta(n-1)   (5.1)
h_2(n) = \delta(n) + \beta_2\,\delta(n-1)   (5.2)

where \alpha_2 and \beta_2 are the coefficients of the first-order reflection arriving at positions 1 and 2, respectively. This first-order reflection model is valid, for example, when the two positions are located along the same radius from a source, and each position has a differently absorbing neighboring wall with negligible higher-order reflections from each wall.
For simplicity, the absorption due to air and the propagation delay n_d in samples (n_d \approx f_s r/c; r is the distance, f_s is the sampling rate, and c is the speed of sound) are neglected. The equalizing filter obtained by inverting the response at position 1 is h_{eq}(n) = \sum_{k=0}^{\infty} (-\alpha_2)^k \delta(n-k), and the response at position 1 is perfectly equalized because h_{eq}(n) \otimes h_1(n) = \delta(n). However, the equalized response at position 2 can be easily shown to be

h_{eq}(n) \otimes h_2(n) = \delta(n) + (\beta_2 - \alpha_2)(-\alpha_2)^{n-1}\,u(n-1)   (5.3)
where u(n) = 1, n ≥ 0 is the discrete step function. There are two objective mea-
sures of equalization performance for position 2: (i) frequency domain error function
(used subsequently in the chapter), and (ii) time domain error function. The time do-
main error function is easy to compute for the present problem, and is defined as
J = \frac{1}{I}\sum_{n=0}^{I-1} e^2(n) = \frac{1}{I}\sum_{n=0}^{I-1}\left(\delta(n) - h_{eq}(n) \otimes h_2(n)\right)^2 = \frac{(\alpha_2 - \beta_2)^2}{I}\sum_{n=1}^{I-1}(-\alpha_2)^{(2n-2)}   (5.4)
Clearly, the response at position 2 is unequalized because J > 0. A plot of the error
as a function of the distance |α2 − β2 | between the two coefficients, α2 and β2 , that
differentiate the two responses is shown in Fig. 5.1. Hence, the error is reduced at
position 2 if a good equalizer is designed that accounts for the changes in the room
response due to variations in the source and listening positions.
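This behavior is easy to verify numerically. The sketch below assumes the first-order model above with a truncated ideal inverse of h1; the coefficient values are illustrative:

```python
import numpy as np

def first_order_eq_error(alpha2, beta2, I=64):
    """Equalize h1(n) = delta(n) + alpha2*delta(n-1) exactly, then measure
    the residual time-domain error J at position 2 (per (5.4))."""
    n = np.arange(I)
    heq = (-alpha2) ** n                       # truncated inverse of h1
    h2 = np.zeros(I); h2[0], h2[1] = 1.0, beta2
    eq2 = np.convolve(heq, h2)[:I]             # equalized response at position 2
    delta = np.zeros(I); delta[0] = 1.0
    return np.mean((delta - eq2) ** 2)         # time-domain error J

# closed form from (5.4): ((alpha2-beta2)^2 / I) * sum_{n=1}^{I-1} alpha2^(2n-2)
alpha2, beta2, I = 0.6, 0.3, 64
J_closed = (alpha2 - beta2) ** 2 / I * sum(alpha2 ** (2 * n - 2) for n in range(1, I))
assert abs(first_order_eq_error(alpha2, beta2, I) - J_closed) < 1e-9
```

As expected, the error vanishes as beta2 approaches alpha2 (identical responses) and grows with the separation |alpha2 − beta2|.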
phase distortions can audibly degrade speech. By using the concept of matched fil-
tering they are able to objectively minimize the total equalization error (magnitude
and phase).
Some recent literature on spectral modeling using psychoacoustically motivated
filters for single-position equalization can be found in [71, 72, 73].
Fig. 5.1. Equalization error at position 2 as a function of the "separation" between the responses at positions 1 and 2.
J = \sum_{k=0}^{M-p}\left(\delta(k - p) - y(k)\right)^2   (5.5)
where p is the modeling delay, M is the duration of the room response h(k), and
y(k) = h(k) ⊗ hi (k) (⊗ denotes the linear convolution operator, and hi (k) is the
causal and finite duration inverse filter of h(k)).
Elliott and Nelson [64] propose a method for designing an equalization filter for
a sound reproduction system by adjusting the filter coefficients to minimize the sum
of the squares of the errors between the equalized responses at multiple points in a
room and the delayed version of an electrical signal. Basically, the objective function
is expressed as a square of the instantaneous error signal, where the error signal is the
difference between a delayed replica of the electrical signal which is supplied as an
input to a channel with given room response, and an output signal. The disadvantage
of this approach is the relatively limited equalization performance due to the equal
weighting provided to all the responses when designing the equalization filter.
Haneda et al. [68] propose a room response model, the CAPZ model (common
acoustical pole and zero model), that is causal and stable. The authors suggest that
there exist common poles in a room transfer function (i.e., the Fourier transform of
the room impulse response) irrespective of the measurement position of the room
response within a room. A multiple-point equalization filter comprising the common
acoustical poles is then determined via a linear least squares method.
The RMS spatial averaging method is used widely due to its simplicity for com-
puting the equalization filter and the spatial average of measured responses is given
by:
H_{avg}(e^{j\omega}) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} |H_i(e^{j\omega})|^2}   (5.6)

H_{eq}(e^{j\omega}) = H_{avg}^{-1}(e^{j\omega})
where N is the number of listening positions, with responses Hi (ejω ), that are to be
equalized.
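As a sketch, (5.6) can be computed directly from measured impulse responses; the function and array names are illustrative, and a small regularization constant guards the inversion:

```python
import numpy as np

def rms_average_equalizer(responses, nfft=8192, eps=1e-8):
    """Compute |H_avg| per (5.6) and the inverse magnitude |H_eq| = 1/|H_avg|.

    responses: list of N room impulse responses (1-D arrays).
    Returns (H_avg, H_eq) as magnitude spectra of length nfft//2 + 1.
    """
    mags2 = [np.abs(np.fft.rfft(h, nfft)) ** 2 for h in responses]
    H_avg = np.sqrt(np.mean(mags2, axis=0))    # RMS over the N positions
    H_eq = 1.0 / np.maximum(H_avg, eps)        # regularized inversion
    return H_avg, H_eq

# sanity check: with identical responses the average equals each magnitude
h = np.r_[1.0, 0.5, 0.25]
H_avg, H_eq = rms_average_equalizer([h, h, h], nfft=64)
assert np.allclose(H_avg, np.abs(np.fft.rfft(h, 64)))
assert np.allclose(H_avg * H_eq, 1.0)
```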
The following section presents a novel pattern recognition technique for grouping
responses having “similar” acoustical structure and subsequently forming a general-
ized representation that models the variations of responses between positions. The
generalized representation is then used for designing an equalization filter.
5.5 Designing Equalizing Filters Using Pattern Recognition
Fig. 5.2. Motivation for using fuzzy c-means clustering for room acoustic equalization.
\mu_i(h_k) = \left[\sum_{j=1}^{c}\frac{d_{ik}^2}{d_{jk}^2}\right]^{-1} = \frac{1}{\sum_{j=1}^{c} d_{ik}^2/d_{jk}^2};\qquad d_{ik}^2 = \|h_k - \hat{h}_i^*\|^2,\quad i = 1, 2, \ldots, c;\; k = 1, 2, \ldots, M   (5.7)

where \hat{h}_i^* denotes the ith cluster room response centroid. An iterative optimization
procedure proposed by Bezdek [76] was used for determining the quantities in (5.7).
Care was taken to ensure that the minimum phase room responses were used to form
the centroids so as to avoid undesirable time and frequency domain effects, due to
incoherent linear combination, resulting from using the excess phase parts.
Once the centroids are formed from minimum-phase responses, they are com-
bined to form a single final prototype. One approach to do this is by using the fol-
lowing model,
h_{final} = \frac{\sum_{j=1}^{c}\left(\sum_{k=1}^{M}(\mu_j(h_k))^2\right)\hat{h}_j^*}{\sum_{j=1}^{c}\left(\sum_{k=1}^{M}(\mu_j(h_k))^2\right)}   (5.8)
The final prototype (5.8) is formed from a nonuniform weighting of the cluster membership functions. Specifically, the "heavier" the weight of a cluster j, in terms of the fuzzy membership functions \sum_{k=1}^{M}(\mu_j(h_k))^2, the larger is the contribution of the corresponding centroid \hat{h}_j^* in the formation of the prototype and the subsequent multiple position equalization filter.
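A minimal sketch of the membership computation (5.7) and the prototype formation (5.8), using the squared memberships implied above; the toy two-dimensional vectors stand in for (much longer) room responses:

```python
import numpy as np

def fcm_memberships(data, centroids):
    """Memberships per (5.7): mu_i(h_k) = 1 / sum_j (d_ik^2 / d_jk^2)."""
    # d2[i, k] = squared distance between centroid i and response k
    d2 = ((centroids[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)                 # guard exact centroid hits
    return 1.0 / (d2 * (1.0 / d2).sum(axis=0))  # = 1 / sum_j d_ik^2/d_jk^2

def final_prototype(mu, centroids):
    """Prototype per (5.8): centroids weighted by sum_k mu_j(h_k)^2."""
    w = (mu ** 2).sum(axis=1)                  # per-cluster weight
    return (w[:, None] * centroids).sum(axis=0) / w.sum()

data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
cents = np.array([[0.05, 0.0], [1.0, 1.0]])
mu = fcm_memberships(data, cents)
assert np.allclose(mu.sum(axis=0), 1.0)        # memberships sum to 1 per sample
proto = final_prototype(mu, cents)
```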
The multiple listener equalization filter can subsequently be obtained by determining the minimum-phase component, h_{min,final}, of the final prototype h_{final} = h_{min,final} \otimes h_{ap,final} (h_{ap,final} is the all-pass component), where the minimum-phase sequence h_{min,final} is obtained from the cepstrum of h_{final}. It is noted that
the final prototype, h_{final}, need not be minimum phase because a linear combination of minimum-phase signals need not be minimum phase.
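The minimum-phase component can be extracted with the standard real-cepstrum folding (homomorphic) method; a sketch, assuming an even FFT size well above the response length:

```python
import numpy as np

def minimum_phase(h, nfft=None):
    """Minimum-phase sequence with the same magnitude spectrum as h,
    obtained by folding the real cepstrum. nfft is assumed even."""
    n = nfft or 4 * len(h)
    # real cepstrum: the log-magnitude discards the excess (all-pass) phase
    ceps = np.fft.ifft(np.log(np.maximum(np.abs(np.fft.fft(h, n)), 1e-12))).real
    fold = np.zeros(n)
    fold[0] = ceps[0]                     # keep c[0]
    fold[1:n // 2] = 2 * ceps[1:n // 2]   # double the causal part
    fold[n // 2] = ceps[n // 2]           # Nyquist bin
    return np.fft.ifft(np.exp(np.fft.fft(fold))).real[:len(h)]

# a maximum-phase zero (at z = 2) is reflected inside the unit circle (z = 1/2)
h = np.array([1.0, -2.0])
hmin = minimum_phase(h, nfft=4096)
assert abs(hmin[0] - 2.0) < 1e-2 and abs(hmin[1] + 1.0) < 1e-2
```

The returned sequence preserves |H(e^{jω})| while moving all zeros inside the unit circle, so it is safely invertible.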
One approach to determine the optimal number of clusters c∗ , based on given data,
is to use a validity index κ. One example of a validity index, that is popular in the
pattern recognition literature, is the Xie–Beni cluster validity index κXB [78]. This
index is expressed as
\kappa_{XB} = \frac{1}{N\beta}\sum_{j=1}^{c}\sum_{k=1}^{M}(\mu_j(h_k))^2\,\|\hat{h}_j^* - h_k\|_2^2   (5.9)

\beta = \min_{i \neq j}\|\hat{h}_i^* - \hat{h}_j^*\|_2^2
The term included with the double summation is simply the objective function
used in fuzzy c-means clustering, whereas the denominator term β analyzes the inter-
cluster centroid distances. The larger this distance, the better is the cluster separation
and thus the lower is the Xie–Beni index.
Thus, the clustering process involves (i) choosing the number of clusters, c, ini-
tially to be 2; (ii) performing fuzzy clustering and determining the centroid positions
according to Eq. (5.7); (iii) determining κXB Eq. (5.9); (iv) increasing the number
of clusters by unity and performing steps (ii) to (iv) until c = M − 1; and (v) plot-
ting κXB as a function of the number of clusters, where the minima of this plot will
provide the optimal number of clusters, c^*, according to this index. Typically, \kappa_{XB} returns the optimal number of clusters c^* \ll M for applications involving very large data sets [79]. In such cases the plot of \kappa_{XB} versus c increases beyond c^* and then monotonically decreases towards c = M − 1. The prototype is then formed from the centroids for c^* via (5.8).
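Given the data, centroids, and memberships from a clustering run, the index (5.9) takes only a few lines; the hard 0/1 memberships in the example are purely illustrative:

```python
import numpy as np

def xie_beni(data, centroids, mu):
    """Xie-Beni validity index per (5.9): fuzzy objective over N*beta,
    where beta is the smallest squared inter-centroid distance."""
    d2 = ((centroids[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    objective = ((mu ** 2) * d2).sum()
    c = len(centroids)
    beta = min(((centroids[i] - centroids[j]) ** 2).sum()
               for i in range(c) for j in range(c) if i != j)
    return objective / (len(data) * beta)

# two tight, well-separated clusters give a small index
data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
cents = np.array([[0.05, 0.0], [5.05, 5.0]])
mu = np.where(((cents[:, None, :] - data[None, :, :]) ** 2).sum(2) < 1.0, 1.0, 0.0)
assert xie_beni(data, cents, mu) < 0.01
```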
It is to be noted that the equalization filter computed simply by using the fuzzy c-
means approach is generally high-order (i.e., at most the length of the room impulse
response). Thus, a technique for mapping the large filter lengths to a smaller length
is introduced in the subsequent section.
Linear predictive coding (LPC) [80, 81] is used widely for modeling speech spectra
with a fairly small number of parameters. It can also be used for modeling room
responses in order to form low-order equalization filters.
In addition, in order to obtain a better fit of a low-order model to a room response,
especially in the low-frequency region of the room response spectrum, the concept
of warping was introduced by Oppenheim et al. in [82]. Warping involves the use
of a chain of all-pass blocks, D1 (z), instead of conventional delay elements z −1 , as
shown in Fig. 5.3. With the all-pass filter

D_1(z) = \frac{z^{-1} - \lambda}{1 - \lambda z^{-1}}   (5.10)

the frequency axis is warped. The resolution plot for different warping coefficients, \lambda, is shown in Fig. 5.5. It can be
seen that the warping to the Bark scale for λ = 0.77 gives a “balanced” mapping be-
cause it provides a good resolution at low frequencies while retaining the resolution
at mid and high frequencies (e.g., compare with λ = 0.99). Some recent literature
on spectral modeling using warping can be found in [71, 72, 73].
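The frequency mapping produced by the all-pass of (5.10) can be examined directly from its phase; the sketch below uses λ = 0.77:

```python
import numpy as np

def warped_frequency(omega, lam):
    """Warped frequency axis: (negative) phase of D1(e^{j*omega}) from (5.10)."""
    z1 = np.exp(-1j * omega)                       # z^{-1} on the unit circle
    return -np.angle((z1 - lam) / (1.0 - lam * z1))

omega = np.linspace(1e-3, np.pi - 1e-3, 512)
w_bark = warped_frequency(omega, 0.77)
# positive lambda expands the low-frequency region of the warped axis ...
assert np.all(w_bark > omega)
# ... while the mapping remains monotonic over (0, pi)
assert np.all(np.diff(w_bark) > 0)
```

Near ω = 0 the slope of the mapping is (1 + λ)/(1 − λ) ≈ 7.7 for λ = 0.77, which is exactly the increased low-frequency resolution exploited by warped modeling.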
The general system-level approach for determining the cluster-based multiple
listener equalization filter is shown in Fig. 5.6. Specifically, the room responses are
initially warped to the psychoacoustical Bark scale. As later shown, the Xie–Beni
cluster validity index gives an indication of the number of clusters that are generated
for the given data set (particularly for the case where the number of data samples,
M , is relatively small). Subsequently, the number of clusters is used for performing
clustering in order to determine the cluster centroids and prototype, respectively. The
minimum-phase part of the prototype, having length N/2, is then parameterized
by a low-order model, such as the LPC, for realizable implementation. The inverse
filter is then found from the LPC coefficients, and the reverse step of unwarping is
performed to obtain the filter in the linear domain. The equalization performance
can then be assessed by inspecting the equalized responses along a log frequency
domain.
High-dimensional data, such as room responses, do not readily lend themselves to direct visual analysis. Thus, feature extraction and dimensionality reduction tools have become important in pattern recognition and exploratory data analysis [85]. Dimen-
sionality reduction can be done by generating a lower-dimensional data set, from a
higher-dimensional data set, in a manner to preserve the distinguishing characteris-
tics of the higher-dimensional data set. The goal of dimensionality reduction then is
to allow data analysis by assessment of the clustering tendency, if any, in order to
identify the number of clusters.
The Sammon map is used as a tool for visualizing the relationship between room
responses measured at multiple positions. This depiction of the room responses can
also aid in the design of an equalization filter for multiple position equalization.
Subsequently, the performance of this multiple-listener equalization filter, in terms
of the uniformity of the equalized responses, can also be evaluated using the Sammon
map. This chapter expands on the work from the last chapter, by comparing the Xie–
Beni cluster validity index with the Sammon map, to show that the Sammon map can
be used for determining clusters of room responses when the number of responses is
relatively small as in the present case.
Fig. 5.6. System for determining the multiple listener equalization filter based on perceptual pattern recognition.
r_j(l+1) = r_j(l) - \alpha\,\Delta_j(l), \qquad \Delta_j(l) = \frac{\partial E(l)/\partial r_j(l)}{\left|\partial^2 E(l)/\partial r_j(l)^2\right|}

\frac{\partial E(l)}{\partial r_j(l)} = -\frac{2}{\varphi}\sum_{p \neq j}\frac{(d_{pj}^{*} - d_{pj})}{d_{pj}\,d_{pj}^{*}}\,(r_j(l) - r_p(l))

\frac{\partial^2 E(l)}{\partial r_j(l)^2} = -\frac{2}{\varphi}\sum_{p \neq j}\frac{1}{d_{pj}\,d_{pj}^{*}}\left[(d_{pj}^{*} - d_{pj}) - \frac{(r_p(l) - r_j(l))^2}{d_{pj}}\left(1 + \frac{d_{pj}^{*} - d_{pj}}{d_{pj}}\right)\right]

where d_{pj}^{*} and d_{pj} denote the distances between responses p and j in the original space and on the 2-D map, respectively, and \varphi = \sum_{p<j} d_{pj}^{*} is the normalizing constant.
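A compact sketch of the Sammon iteration; for simplicity it uses plain gradient descent on the stress E with a fixed step α rather than the Newton-type step above, and the toy data are illustrative:

```python
import numpy as np

def sammon(X, iters=500, alpha=0.3, seed=0):
    """Project rows of X to 2-D while preserving pairwise distances (Sammon, 1969)."""
    rng = np.random.default_rng(seed)
    M = len(X)
    dstar = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))   # input distances
    phi = dstar[np.triu_indices(M, 1)].sum()                    # normalizing constant
    r = rng.standard_normal((M, 2)) * 1e-2                      # random 2-D start
    for _ in range(iters):
        diff = r[:, None] - r[None, :]                          # r_j - r_p
        d = np.sqrt((diff ** 2).sum(-1) + 1e-12) + np.eye(M)    # avoid divide-by-0
        w = (dstar - d) / (d * np.where(dstar == 0, 1, dstar))  # per-pair weight
        np.fill_diagonal(w, 0.0)
        grad = -(2.0 / phi) * (w[:, :, None] * diff).sum(axis=1)
        r -= alpha * grad                                       # descend the stress
    return r

# two tight pairs, far apart: the map should keep pair members close
X = np.array([[0.0, 0, 0], [0.1, 0, 0], [3.0, 3, 3], [3.1, 3, 3]])
r = sammon(X)
d01 = np.linalg.norm(r[0] - r[1]); d02 = np.linalg.norm(r[0] - r[2])
assert d01 < d02
```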
5.8 Results
A listening arrangement is shown in Fig. 5.7, where the microphone locations for
measuring the room responses were at the center of the listener head at each posi-
tion. The distance from the loudspeaker to the listener 2 position was about 7 meters,
whereas the average intermicrophone distance in the listener arrangement was about
1 meter. The room was roughly of dimensions 10 m × 20 m × 6 m. The result-
ing measurement was captured by an omnidirectional flat response microphone and
deconvolved by the SYSid system. The chirp was transmitted around 30 times and
the resulting measurements were averaged to get a higher SNR. The omnidirectional
microphone had a substantially flat magnitude response (viz., the Countryman ISO-
MAX B6 Lavalier microphone). The loudspeaker was a center channel speaker from
a typical commercially available home theater speaker system.
First, the Sammon map is applied to the six psychoacoustically warped responses
for visualizing the responses on a 2-D plane. One of the goals of this step is to see if
the map captures any perceptual grouping between the responses. Subsequently, the fuzzy c-means clustering algorithm is applied to the M = 6 psychoacoustically warped room responses (each response being a vector of length 8192) and
the optimal number of clusters using the Xie–Beni index (Eq. (5.9) in the previous
chapter) is determined. As shown, the Sammon map, when compared to the Xie–
Beni cluster validity index, gives a clear indication of the number of clusters that are
generated for the given data set.
1 Sammon [86] called this a magic factor and recommended α to be ≈ 0.3 or 0.4.
Fig. 5.7. The experimental setup for measuring M = 6 acoustical room responses at six
positions in a reverberant room.
Figure 5.8 shows the responses at the six listener positions in the time domain.
Clearly, there are significant differences in the responses. Firstly, it can be visually
observed that there is a certain similarity in the time of arrival of the direct path com-
ponent of the responses at positions 1, 2, and 3. Also, noticeable is the path delay
difference between the responses at positions 4, 5, and 6 in relation to the responses
in positions 1, 2, and 3. Figure 5.9 shows the corresponding 1/3 octave smoothed
magnitude responses along a linear frequency axis in Hertz. Figure 5.10 shows the
corresponding 1/3 octave smoothed magnitude responses in the Bark domain. Specif-
ically, the x-axis (viz., the Bark axis) was computed using the expression provided in
[84],
z = 13 \tan^{-1}(0.76f/1000) + 3.5 \tan^{-1}\left[(f/7500)^2\right]   (5.15)
where f is the frequency in Hz. A comparison between the plots of Figs. 5.9 and 5.10 shows the transformation effected by the mapping of (5.15), in the sense that low frequencies occupy a larger portion of the axis, effectively "stretching" the magnitude response at low frequencies.
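Equation (5.15) in code form (a one-line helper; note that 0.76f/1000 = 0.00076f):

```python
import numpy as np

def hz_to_bark(f):
    """Critical-band rate per (5.15): z = 13*atan(0.00076 f) + 3.5*atan((f/7500)^2)."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

assert hz_to_bark(0) == 0.0
assert 8.0 < hz_to_bark(1000) < 9.0    # roughly 8.5 Bark at 1 kHz
```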
The Sammon map for the warped responses is shown in Fig. 5.11. The map
shows the relative proximity of the responses at positions 1, 2, and 3 that could be
identified as a group. Table 5.1 below shows all symmetric distances, d_{ij} = \|r_i - r_j\|_2, between the warped responses i, j, as computed through the Sammon map on
the 2-D plane (i.e., the ith row and jth column element is the distance between the
Sammon mapped coordinates corresponding to the ith and jth position).
From Fig. 5.11 and Table 5.1, a dominant perceptual grouping of responses 1, 2,
and 3 can be seen on the map. This can be confirmed from the distance metrics in Table 5.1, especially where the distances between responses 1, 2, and 3 are significantly
Fig. 5.8. The time domain responses at the six listener positions for the setup of Fig. 5.7.
Fig. 5.9. The corresponding 1/3 octave smoothed magnitude responses along the linear fre-
quency axis obtained from Fig. 5.8.
Fig. 5.10. The 1/3 octave smoothed magnitude responses of Fig. 5.9 along the Bark axis.
Fig. 5.11. The Sammon map for the warped impulse responses.
Table 5.1. The symmetric distances, d_{ij} = \|r_i - r_j\|_2, between the warped responses i, j, as computed through the Sammon map on the 2-D plane
Pos 1 2 3 4 5 6
1 0 2.0422 2.1559 2.8549 2.5985 3.0975
2 — 0 2.1774 4.6755 2.9005 4.9133
3 — — 0 3.3448 4.4522 5.1077
4 — — — 0 5.1052 3.4204
5 — — — — 0 3.2431
6 — — — — — 0
lower than the distances between these and the remaining responses. Also, note from
the last column in the table, the response from position 6 is close to responses from
positions 4 and 5 (i.e., distances of 3.42 and 3.24, respectively), whereas responses
from positions 4 and 5 are significantly far apart from each other (distance of 5.1052).
Thus, there are at least three clusters, where the first cluster is formed dominantly of
responses from positions 1, 2, and 3; the second cluster having, dominantly, response
from position 4 and the third cluster having, dominantly, response from position 5.
Finally, response from position 6 could be grouped with at least the distinct clus-
ters having response from positions 4 and 5. In essence, this clustering represents a
grouping of signals using the psychoacoustic scale.
Figure 5.12 shows the plot of the Xie–Beni cluster validity index as a function
of the number of clusters for the warped responses. From the plot it may be inter-
preted that the optimal number of clusters is 5, even though there is no clear minimum at c^* = 5; the index does not increase beyond this c^* and then decrease monotonically towards c = M − 1 = 5. This inconclusive result was partly resolved by observing the
combined plot comprising the numerator double-sum term and the denominator sep-
aration term, β, in the Xie–Beni index term. Figure 5.13 shows that the largest sep-
aration between cluster centroids is obtained at c = 3 for a reasonably small error,
thereby indicating that c∗ = 3 is a reasonable choice for the number of clusters.
Thus, this procedure validated the results from the Sammon map (viz., Fig. 5.11 and
Table 5.1).
The membership functions, µj (hk ), j = 1, 2, 3; k = 1, . . . , 6 are shown in Table
5.2 (Ci corresponds to cluster i). From the membership function table (viz., Table
5.2) it can be seen that there is a high degree of similarity among the warped re-
sponses in positions 1, 2, and 3 as they are largely clustered in cluster 1, whereas
positions 4 and 5 are dissimilar to the members of cluster 1 and each other (as they
are clustered in clusters 3 and 2, respectively, and have a low membership in other
clusters). Again, the warped response at position 6 has a similarity to members in all the
three clusters (which is also predicted through the Sammon map of Fig. 5.11 and the
table relating distances on the Sammon map).
In the next step, equalization is performed according to Fig. 5.6, with c^* = 3 and LPC order p = 512, and the equalization performance results are depicted using the Sammon map. An inherent goal in this step is to view the equalization performance
Fig. 5.12. The Xie–Beni cluster validity index as a function of the number of clusters for the
warped responses.
results visually, and demonstrate that the uniformity and similarity of the magnitude
responses (unequalized and equalized) can be shown on a 2-D plane using the map.
The LPC order p = 512 was selected as it gave the best results upon equalization; furthermore, this filter order is practically realizable (viz., the equivalent FIR-length filter can be easily implemented in various commercially available audio processors).
Fig. 5.13. The numerator (viz., the objective function) and denominator term (viz., the separation term) of the Xie–Beni cluster validity index of Fig. 5.12.
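For the LPC step, the inverse (equalization) filter is the prediction-error filter A(z) obtained from the autocorrelation of the minimum-phase prototype via the Levinson–Durbin recursion; a low-order sketch (the chapter uses p = 512):

```python
import numpy as np

def lpc(h, p):
    """All-pole model of h via autocorrelation + Levinson-Durbin.

    Returns a = [1, a1, ..., ap]; 1/A(z) approximates the spectrum of h,
    so the FIR filter A(z) itself serves as the (whitening) inverse filter.
    """
    r = np.correlate(h, h, "full")[len(h) - 1:len(h) + p]   # r[0..p]
    a = np.zeros(p + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a_prev = a.copy()
        a[1:i + 1] = a_prev[1:i + 1] + k * a_prev[i - 1::-1]
        err *= (1.0 - k * k)
    return a

# AR(1) check: h(n) = 0.5^n should yield a = [1, -0.5] at order p = 1
a = lpc(0.5 ** np.arange(64), 1)
assert abs(a[1] + 0.5) < 1e-6
```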
Figure 5.14 shows the unequalized magnitude responses (of Fig. 5.9) in the
log frequency axis, and Fig. 5.15 depicts the equalized magnitude responses using
c∗ = 3 clusters. Clearly, substantial equalization is achieved at all of the six listener
positions as can be seen by comparing Fig. 5.15 with Fig. 5.14. The equalized mag-
nitude responses were then processed by subtracting the individual means, computed
between 80 Hz and 10 kHz (which is typically the region of interest for equalization
in the room of given size), to give the zero mean equalized magnitude responses.
Under ideal equalization, all of the magnitude responses would be 0 dB between 80
Hz and 10 kHz. Hence, upon applying the Sammon map, all of the ideal equalized
responses would be located at the origin of the Sammon map. Any deviation away
from 0 dB would show up on the map as a displacement away from the origin. If the equalized responses were uniform in distribution, then they would appear in a tight circle about the origin in the 2-D plane after applying the Sammon map.
Fig. 5.14. The 1/3 octave smoothed unequalized magnitude responses of Fig. 5.9 (shown in the log frequency domain).
Fig. 5.15. The 1/3 octave smoothed equalized magnitude responses (shown in the log frequency domain for better depiction of performance at low frequencies) using c^* = 3 clusters.
Now, applying the Sammon map algorithm to the original magnitude responses
of Fig. 5.9, between 80 Hz and 10 kHz, results in Fig. 5.16. Specifically, the re-
sponses in a 2-D plane for different positions show significant non-uniformity as
these are not located equidistant from the origin. Applying the mean corrected and
equalized responses to the Sammon map algorithm gives the distribution of the
equalized responses on a 2-D plane as shown in Fig. 5.17.
Comparing Fig. 5.16 with Fig. 5.17 shows an improved uniformity among the
responses as many of the responses lie at approximately the same distance from
the origin. Specifically, from Fig. 5.17 it is evident that the distances of equalized
responses 1, 2, 4, and 5 are close to each other from the origin, thereby reflecting
a larger uniformity between these responses. Furthermore, the standard deviation of
the distances of the equalized responses is much smaller than that of the unequalized
responses (viz., 4.64 as opposed to 11.06) indicating a better similarity between the
equalized responses. The improved similarity of the equalized magnitude responses
1, 2, 4, and 5 can be checked by visually comparing the equalized responses in Fig.
5.15. Also, it can be seen that the equalized magnitude responses 3 and 6 are quite
a bit different from each other, and from equalized responses 1, 2, 4, and 5, and this
reflects in the Sammon map as points 3 and 6 substantially offset from a circular
distribution.
The room impulse response, p(t, X, X'), for the image model [61] with loudspeaker at X = (x, y, z) and microphone at X' = (x', y', z') and room dimensions L = (L_x, L_y, L_z) (with walls having absorption coefficient \alpha = 1 - \beta^2) is given as

p(t, X, X') = \sum_{p=0}^{1}\sum_{r=-\infty}^{\infty} \beta_{x_1}^{|n-q|}\beta_{x_2}^{|n|}\,\beta_{y_1}^{|l-j|}\beta_{y_2}^{|l|}\,\beta_{z_1}^{|m-k|}\beta_{z_2}^{|m|}\;\frac{\delta[t - (|R_p + R_r|/c)]}{4\pi|R_p + R_r|}   (5.16)

p = (q, j, k)
R_p = (x - x' + 2qx', y - y' + 2jy', z - z' + 2kz')
r = (n, l, m)
R_r = 2(nL_x, lL_y, mL_z)
The room image model (5.16), thus, can be simulated for different reverberation times by adjusting the reflection coefficients (viz., the \beta s), because the Schroeder reverberation time T_{60} is related to the absorption coefficients by the equation T_{60} = 0.161V/\sum_i S_i \alpha_i (V is the room volume and S_i is the surface area of wall i).
Hence, the robustness to reverberation, of different equalization techniques, can
be modeled by varying the reflection coefficients in the image model.
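The T60 relation can be used to choose reflection coefficients for the image model; the sketch below assumes a uniform absorption coefficient on all six walls, reuses the 10 m × 20 m × 6 m dimensions quoted earlier, and picks an illustrative target T60:

```python
def t60_sabine(V, wall_areas, alphas):
    """Sabine reverberation time: T60 = 0.161 V / sum_i S_i alpha_i."""
    return 0.161 * V / sum(S * a for S, a in zip(wall_areas, alphas))

def beta_for_t60(t60_target, V, wall_areas):
    """Uniform reflection coefficient beta = sqrt(1 - alpha) hitting t60_target."""
    alpha = 0.161 * V / (t60_target * sum(wall_areas))
    return (1.0 - alpha) ** 0.5

Lx, Ly, Lz = 10.0, 20.0, 6.0                    # room dimensions in meters
V = Lx * Ly * Lz
S = [Ly * Lz, Ly * Lz, Lx * Lz, Lx * Lz, Lx * Ly, Lx * Ly]
beta = beta_for_t60(0.5, V, S)                  # target T60 = 0.5 s (illustrative)
alpha = 1.0 - beta ** 2
assert abs(t60_sabine(V, S, [alpha] * 6) - 0.5) < 1e-9
```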
The RMS average filter (as used traditionally for movie theater equalization) is ob-
tained as
H_{avg}(e^{j\omega}) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}|H_i(e^{j\omega})|^2}   (5.17)

H_{eq}(e^{j\omega}) = H_{avg}^{-1}(e^{j\omega})
where |Hi (ejω )| is the magnitude response at position i. To obtain lower-order filters,
we used the approach as shown in Fig. 5.6 (but using RMS averaging instead of fuzzy
c-means prototype formation).
5.9.3 Results
We have compared the pattern recognition and warping-based method to the RMS
averaging and warping-based method, for multiposition equalization, to determine
their robustness to reverberation variations. Ideally, it is required that the equalization
performance does not degrade significantly, when the reverberation time increases,
for (i) a fixed room, and (ii) fixed positions of the listeners in a room. The room
image model allows ease in simulating changes in responses (due to changes in re-
verberation times) thereby allowing the equalization performance of these methods
to be compared.
To quantify the equalization performance, we used the well-known spectral devi-
ation measure, σE , which indicates the degree of flatness of the spectrum. The lower
the measure, the better is the performance. The performance measure is defined as
\sigma_E = \sqrt{\frac{1}{P}\sum_{i=0}^{P-1}\left(10\log_{10}|E(e^{j\omega_i})| - B\right)^2}   (5.18)

B = \frac{1}{P}\sum_{i=0}^{P-1} 10\log_{10}|E(e^{j\omega_i})|

|E(e^{j\omega})| = |H(e^{j\omega})||H_{eq}(e^{j\omega})|
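The measure (5.18) in code form (a sketch over sampled magnitude values; eps guards the logarithm):

```python
import numpy as np

def spectral_deviation(E_mag, eps=1e-12):
    """Standard deviation (in dB) of 10*log10|E| about its mean B, per (5.18)."""
    L = 10.0 * np.log10(np.maximum(np.asarray(E_mag, dtype=float), eps))
    B = L.mean()                        # mean level B
    return np.sqrt(((L - B) ** 2).mean())

assert spectral_deviation(np.ones(128)) == 0.0   # flat spectrum: zero deviation
assert spectral_deviation([1.0, 10.0]) == 5.0    # 0 dB and 10 dB -> std of 5 dB
```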
5.10 Summary
In this chapter, some background was presented on various single position and multiple position equalization techniques, including the importance of performing multiple position equalization over single position equalization. Also presented was a pattern recog-
nition method of performing simultaneous multiple listener equalization with low fil-
ter orders, and some comparisons between the RMS averaging equalization method
and the pattern recognition equalization method in terms of reverberation robustness.
A technique for visualizing room impulse responses and simultaneous multiple lis-
tener equalization performance using the Sammon map was also presented. The map
is able to display results obtained through clustering algorithms such as the fuzzy
c-means method. Specifically, distances of signals in multidimensional spaces are
mapped onto distances in two dimensions, thereby displaying the clustering behav-
ior of the proposed clustering scheme. Upon determining the equalization filter from
the final prototype, the resulting equalization performance can be determined from
the size and shape (viz., circular shape indicates uniform equalization performance
at all locations) of the equalization map.
6
Practical Considerations for Multichannel
Equalization
© 2004 AES. Reprinted, with permission, from S. Bharitkar, "Phase equalization for multi-channel loudspeaker-room responses", Proc. of AES 117th Convention, (preprint 6272).
© 2005 AES. Reprinted, with permission, from S. Bharitkar and C. Kyriakakis, "Comparison between time delay based and nonuniform phase based equalization for multichannel loudspeaker-room responses," Proc. of AES 119th Convention, (preprint 6607).
6.1 Introduction
A room is an acoustic enclosure that can be modeled as a linear system whose be-
havior at a particular listening position is characterized by an impulse response,
h(n); n ∈ {0, 1, 2, . . . } with an associated frequency response or room trans-
fer function H(ejω ). The impulse response yields a complete description of the
changes a sound signal undergoes when it travels from a source to a receiver (micro-
phone/listener). The signal at a listening position consists of direct path components,
discrete reflections that arrive a few milliseconds after the direct sound, as well as a
reverberant field component.
A typical 5.1 system and its system-level description are shown in Figs. 6.1 and
6.2, respectively, where the satellites (left, center, right, left surround, and right sur-
round speakers) are positioned surrounding the listener and the subwoofer may be
placed in the corner or near the edges of a wall. The high-pass (satellite) and low-
pass (subwoofer) bass management filters, with squared magnitude responses

|H^hp_{bm,ωc}(ω)|² = 1 − 1/(1 + (ω/ωc)⁴),   |H^lp_{bm,ωc}(ω)|² = 1/(1 + (ω/ωc)⁸),

are second-order Butterworth high-pass (12 dB/octave roll-off) and fourth-order
Butterworth low-pass (24 dB/octave roll-off) filters, respectively, and are designed
with a crossover frequency ωc (i.e., the intersection of the corresponding −3 dB
points) corresponding to 80 Hz. Alternatively, the fourth-order Butterworth can be
implemented as a cascade of two second-order Butterworth filters, which modifies
the magnitude response slightly around the crossover region. If
the satellite response rolls off at a second-order rate, then the resulting response ob-
tained through complex summation has a flat magnitude response in the crossover
region. The analysis and techniques presented in this chapter can be modified in a
straightforward manner to include cascaded Butterworth or any other bass manage-
ment filter. Examples of other crossover networks that split the signal energy between
the subwoofer and the satellites, according to predetermined crossover frequency and
slopes, can be found in [91, 92, 93]. The magnitude responses of the individual bass
management filters as well as the magnitude of the recombined response (i.e., the
magnitude of the complex sum of the filter frequency responses), are shown in Fig.
6.3. If the satellite response is smooth and rolls off at a second-order Butterworth
rate, then the complex summation yields a flat magnitude response in the audio signal
pass-band. In real rooms, the resulting magnitude response from the bass manage-
ment filter set, combined with the loudspeaker and room responses, will exhibit
substantial variations in the crossover region. This effect can be mitigated by proper
selection of the crossover frequency (and/or the bass management filter orders). In
essence, the bass management filter parameter selection should be such that the sub-
woofer and the satellite channel output be substantially distortion-free with minimal
variations in the crossover region.
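As a quick numerical check, the following sketch (plain Python; the function names are ours, and the ideal squared-magnitude Butterworth forms quoted above are assumed) verifies the −3 dB crossover point and the 12 and 24 dB-per-octave roll-off rates of the two bass management filters:

```python
import math

def hp_bm_mag_sq(f, fc):
    # Second-order Butterworth high-pass (satellite): |H|^2 = 1 - 1/(1 + (f/fc)^4)
    return 1.0 - 1.0 / (1.0 + (f / fc) ** 4)

def lp_bm_mag_sq(f, fc):
    # Fourth-order Butterworth low-pass (subwoofer): |H|^2 = 1/(1 + (f/fc)^8)
    return 1.0 / (1.0 + (f / fc) ** 8)

def db(x):
    return 10.0 * math.log10(x)

fc = 80.0  # crossover frequency in Hz

# Both filters are -3 dB at the crossover frequency (|H|^2 = 1/2).
print(db(hp_bm_mag_sq(fc, fc)), db(lp_bm_mag_sq(fc, fc)))  # ~ -3.01, -3.01

# Asymptotic roll-off: ~12 dB/octave for the high-pass (well below fc),
# ~24 dB/octave for the low-pass (well above fc).
hp_octave = db(hp_bm_mag_sq(fc / 8, fc)) - db(hp_bm_mag_sq(fc / 16, fc))
lp_octave = db(lp_bm_mag_sq(fc * 16, fc)) - db(lp_bm_mag_sq(fc * 8, fc))
print(round(hp_octave, 2), round(lp_octave, 2))
```
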
The acoustical block diagram for a subwoofer channel and a satellite channel is
shown in Fig. 6.4, where Hsub (ω) and Hsat (ω) are the acoustical loudspeaker and
room responses at a listening position. The resulting net acoustic transfer function,
H(ω), and magnitude response, |H(ω)|², can be written as

H(ω) = H^hp_{bm,ωc}(ω) H_sat(ω) + H^lp_{bm,ωc}(ω) H_sub(ω)

|H(ω)|² = |A(ω)|² + |B(ω)|² + Γ(ω)

|A(ω)|² = |H^hp_{bm,ωc}(ω)|² |H_sat(ω)|²

|B(ω)|² = |H^lp_{bm,ωc}(ω)|² |H_sub(ω)|²    (6.1)

Γ(ω) = 2|A(ω)||B(ω)| cos(φ_sub(ω) + φ^lp_{bm,ωc}(ω) − φ_sat(ω) − φ^hp_{bm,ωc}(ω))
Fig. 6.2. System-level description of the 5.1 multichannel system of Fig. 6.1.
Fig. 6.3. Magnitude response of the industry standard bass management filters and the recom-
bined response.
Fig. 6.4. Block diagram for the combined acoustical response at a position.
where φ^hp_{bm,ωc}(ω) and φ^lp_{bm,ωc}(ω) are the phase responses of the bass management
filters, whereas φ_sub(ω) and φ_sat(ω) are the phase responses of the subwoofer-and-room
and satellite-and-room responses, respectively.
However, many loudspeaker systems in a real room interact with the room, giving
rise to standing-wave phenomena that manifest as significant variations in the
magnitude response measured between a loudspeaker and a microphone position. As
can be readily observed from (6.1), with an incorrect crossover frequency choice the
phase interactions will show up in the magnitude response as a region with a broad
spectral notch indicating a substantial attenuation of sound around the crossover re-
gion. In this chapter, we show that a correct choice of the crossover frequency can
substantially improve the combined magnitude response around the crossover region.
Fig. 6.5. (a) Magnitude response of the subwoofer measured in a reverberant room; (b) mag-
nitude response of the satellite measured in the same room.
As an example, individual subwoofer and satellite (in this case a center channel)
frequency responses (1/3rd octave smoothed), as measured in a room at a sampling
frequency of 48 kHz with a reverberation time T60 ≈ .75 sec, are shown in Figs.
6.5 (a) and (b), respectively. Clearly, the satellite is capable of playing audio below
100 Hz (up to about 40 Hz), whereas the subwoofer is most efficient and generally
used for audio playback at frequencies less than 200 Hz. For example, as shown in
Fig. 6.6, the resulting magnitude response, according to (6.1), obtained by summing
the impulse responses, has a severe spectral notch for a crossover frequency ωc cor-
responding to 60 Hz. This has been verified through real measurements in which the
subwoofer and the satellite channels were excited with a broadband stimulus (e.g.,
a log-chirp signal) and the net response was subsequently deconvolved from the
measured signal.
Although room equalization has been widely used to solve problems in the mag-
nitude response, the equalization filters do not necessarily solve the problems around
the crossover frequency. In fact, many of these filters are minimum-phase and as such
may do little to influence the result around the crossover. As shown in this chapter,
automatic selection of a proper crossover frequency through an objective function al-
lows the magnitude response to be flattened in the crossover region. All-pass-based
optimization can overcome any additional limitations.
6.2 Objective Function-Based Crossover Frequency Selection

The flatness of the net magnitude response around the crossover region can be
quantified with a spectral deviation measure,

σ_H(ω_c) = sqrt{ (1/P) Σ_{i=0}^{P−1} (10 log_{10}|H(ω_i)| − Δ)² }    (6.2)

where Δ = (1/P) Σ_{i=0}^{P−1} 10 log_{10}|H(ω_i)|, |H(ω_i)| can be found from (6.1), and P is
the number of frequency points selected around the crossover region. Specifically,
the smaller the σH (ωc ) value, the flatter is the magnitude response.
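The measure can be transcribed almost directly; in the sketch below the function name and the small flooring constant are our additions, not from the text:

```python
import math

def spectral_deviation(mags, eps=1e-12):
    """Spectral deviation of Eq. (6.2): the standard deviation of the dB
    magnitude over the P frequency points in the crossover region."""
    db = [10.0 * math.log10(max(m, eps)) for m in mags]
    delta = sum(db) / len(db)          # mean dB level over the region
    return math.sqrt(sum((d - delta) ** 2 for d in db) / len(db))

# A perfectly flat response has zero deviation; a response with a notch does not.
print(spectral_deviation([1.0, 1.0, 1.0, 1.0]))       # 0.0
print(spectral_deviation([1.0, 0.1, 1.0, 1.0]) > 0)   # True
```
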
For real-time applications, a typical home theater receiver includes a selectable
(either by a user or automatically as shown in this chapter) finite integer set of
crossover frequencies. For example, typical home theater receivers have selectable
crossover frequencies, in 10 Hz increments, from 20 Hz through 150 Hz (i.e.,
Ω = [20 Hz, 30 Hz, 40 Hz, . . . , 150 Hz]). Thus, although a near-optimal solution
ωc∗ can be found through a gradient descent optimization process by minimizing the
spectral deviation measure with respect to ωc (viz., setting ∂σ_H(ω_c)/∂ω_c = 0 at
ω_c = ω_c∗), this is unnecessarily complicated. Clearly, the choice of the crossover frequency is limited to
Fig. 6.6. Magnitude of the net response obtained from using a crossover frequency of 60 Hz.
Fig. 6.7. Plots of the resulting magnitude response for crossover frequencies: (a) 50 Hz, (b) 60
Hz, (c) 70 Hz, (d) 80 Hz, (e) 90 Hz, (f) 100 Hz, (g) 110 Hz, (h) 120 Hz, (i) 130 Hz.
this finite set of integers (viz., as given in Ω); hence a simpler yet effective means
of selecting a proper crossover frequency is to characterize the effect of each of
the selectable integer crossover frequencies on the magnitude response in the
crossover region.
Figure 6.7 shows the resulting magnitude responses, as obtained via (6.1), for
different integer choices of the crossover frequencies from 50 Hz through 130 Hz.
The corresponding spectral deviation values, as a function of the crossover frequency,
for the crossover region around the crossover frequencies are shown in Fig. 6.8.
Comparing the results of Fig. 6.8 with the plots in Fig. 6.7, it can be seen that
the spectral deviation measure accurately models the performance in the crossover
region for a given choice of crossover frequency. The best crossover frequency is
then that which minimizes the spectral deviation measure, in the crossover region,
over the integer set of crossover frequencies. Specifically, in this example, 120 Hz
provided the best choice for the crossover frequency, as it gave the smallest σ_H(ω_c).
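The selection procedure can be sketched as follows. This is illustrative Python, not the authors' code: the synthetic subwoofer and satellite responses, the 8 ms delay, and all function names are our assumptions, standing in for measured responses:

```python
import cmath, math

def bm_hp(f, fc):
    # Second-order Butterworth high-pass prototype at s = j f/fc.
    s = 1j * f / fc
    return s * s / (s * s + math.sqrt(2.0) * s + 1.0)

def bm_lp(f, fc):
    # Fourth-order Butterworth low-pass prototype (normalized polynomial).
    s = 1j * f / fc
    return 1.0 / (s**4 + 2.6131259 * s**3 + 3.4142136 * s**2 + 2.6131259 * s + 1.0)

def sigma_h(h_sub, h_sat, fc, freqs):
    # Net response per (6.1), then its spectral deviation per (6.2) in dB.
    db = []
    for f in freqs:
        h = bm_hp(f, fc) * h_sat(f) + bm_lp(f, fc) * h_sub(f)
        db.append(20.0 * math.log10(max(abs(h), 1e-12)))
    delta = sum(db) / len(db)
    return math.sqrt(sum((d - delta) ** 2 for d in db) / len(db))

# Toy "measured" responses: the subwoofer carries an extra propagation delay,
# so its phase rotates against the satellite's near the crossover.
h_sat = lambda f: 1.0
h_sub = lambda f: cmath.exp(-1j * 2.0 * math.pi * f * 0.008)  # ~8 ms delay

omega = list(range(20, 151, 10))                 # selectable crossovers (Hz)
freqs = [40.0 + 4.0 * i for i in range(41)]      # 40-200 Hz evaluation grid
sigmas = {fc: sigma_h(h_sub, h_sat, fc, freqs) for fc in omega}
best_fc = min(sigmas, key=sigmas.get)
print(best_fc, round(sigmas[best_fc], 2))
```

On measured responses, `h_sub` and `h_sat` would simply be replaced by the complex frequency responses deconvolved from the log-chirp measurements.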
where H̃sat (ejω ) and H̃sub (ejω ) are bass managed satellite and subwoofer channel
room responses measured at a listening position in a room, and A† (ejω ) is the com-
plex conjugate of A(ejω ). The phase responses of the subwoofer and the satellite are
Fig. 6.9. Combined subwoofer satellite response at a particular listening position in a rever-
berant room.
given by φ_sub(ω) and φ_sat(ω), respectively. Furthermore, H̃_sat(e^{jω}) and H̃_sub(e^{jω})
may be expressed as

H̃_sat(e^{jω}) = BM_sat(e^{jω}) H_sat(e^{jω}),   H̃_sub(e^{jω}) = BM_sub(e^{jω}) H_sub(e^{jω})

where BM_sat(e^{jω}) and BM_sub(e^{jω}) are the bass management IIR filters,
whereas H_sat(e^{jω}) and H_sub(e^{jω}) are the full-range satellite and subwoofer re-
sponses, respectively.
The influence of phase on the net magnitude response is via the additive term
Λ(ejω ) = 2|H̃sub (ejω )||H̃sat (ejω )| cos(φsub (ω) − φsat (ω)). This term influences
the combined magnitude response, generally, in a detrimental manner when it adds
incoherently to the magnitude response sum of the satellite and the subwoofer.
Specifically, when φsub (ω) = φsat (ω) + kπ, k = 1, 3, . . . , the resulting mag-
nitude response is actually the difference between the magnitude responses of the
subwoofer and the satellite thereby, possibly, introducing a spectral notch around the
crossover frequency. For example, Fig. 6.9 shows an exemplary combined subwoofer
center channel response in a room with reverberation time of about 0.75 seconds.
Clearly, a large spectral notch is observed around the crossover, and one of the rea-
sons for the introduction of this notch is the additive term Λ(ejω ) which adds inco-
herently to the magnitude response sum. Figure 6.10 is a third octave smoothed mag-
nitude response corresponding to Fig. 6.9, whereas Fig. 6.11 shows the effect of the
Λ(ejω ) term clearly exhibiting an inhibitory effect around the crossover region due to
the phase interaction between the subwoofer and the satellite speaker response at the
listening position. The cosine of the phase difference (viz., (φsub (ω)−φsat (ω))) that
causes the inhibition to the net magnitude response, is shown in Fig. 6.12. Clearly, in-
telligently controlling this Λ(ejω ) term will allow improved net magnitude response
around the crossover.
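The role of the cross term can be reproduced with two complex numbers standing in for the bass managed subwoofer and satellite responses at a single frequency (the magnitudes 1.0 and 0.8 below are illustrative only):

```python
import cmath, math

def combined_mag_sq(a_mag, b_mag, phase_diff):
    # |A + B|^2 = |A|^2 + |B|^2 + 2|A||B| cos(phase difference)
    return a_mag**2 + b_mag**2 + 2.0 * a_mag * b_mag * math.cos(phase_diff)

# Coherent addition (phase difference 0): the magnitudes reinforce.
coherent = combined_mag_sq(1.0, 0.8, 0.0)        # (1 + 0.8)^2 = 3.24
# Incoherent addition (phase difference pi): a deep notch appears.
incoherent = combined_mag_sq(1.0, 0.8, math.pi)  # (1 - 0.8)^2 = 0.04

# Cross-check against direct complex summation.
direct = abs(1.0 + 0.8 * cmath.exp(1j * math.pi)) ** 2
print(coherent, incoherent, direct)
```
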
In this brief digression, we explain some of the results obtained in Section 6.2.
Specifically, we demonstrate that an appropriate crossover frequency enables
coherent addition of the phase interaction term Γ(ω) with the |A(ω)|² and |B(ω)|²
terms in Eq. (6.1).
For example, Fig. 6.13(b) shows the Γ (ω) term for crossover frequency ωc corre-
sponding to 60 Hz. Clearly, this term is negative and will contribute to an incoherent
addition in (6.1) around the crossover region (marked by arrows). In contrast, by se-
lecting the crossover frequency to be 100 Hz, the Γ (ω), as shown in Fig. 6.13(a), is
positive around the crossover region. This results in a coherent addition around the
crossover region. These complex addition results are clearly reflected in the plots of
Figs. 6.7(b) and (f), as well as the σ_H(ω_c) values at 60 Hz and 100 Hz in Fig. 6.8.
Fig. 6.10. The 1/3 octave smoothed combined magnitude response of Fig. 6.9.
6.4 Phase Equalization with All-Pass Filters
Fig. 6.12. Plot of the cosine of the phase difference that contributes to the incoherent addition
around the crossover.
A second-order all-pass filter section can be expressed as

A(z) = [(z^{−1} − z_i†)(z^{−1} − z_i)] / [(1 − z_i z^{−1})(1 − z_i† z^{−1})] |_{z=e^{jω}}    (6.6)
where zi = ri ejθi is a pole of radius ri and angle θi ∈ [0, 2π). Figure 6.14 shows the
unwrapped phase (viz., arg(Ap (z))) for different ri and θi = 0.25π, whereas Fig.
6.15 shows the group delay plots for the same radii. As can be observed, the closer
the pole is to the unit circle the larger is the group delay (i.e., the larger is the phase
change with respect to frequency). One of the main advantages of an all-pass filter
is that the magnitude response is unity at all frequencies, thereby not changing the
magnitude response of any filter that is cascaded with an all-pass filter.
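These properties are easy to verify numerically. In the sketch below (our function names; group delay is estimated by a central difference rather than analytically), the all-pass of (6.6) has unit magnitude at every frequency, and its group delay near θ grows as the pole radius approaches the unit circle:

```python
import cmath, math

def allpass2(w, r, theta):
    # Second-order all-pass section of Eq. (6.6) with pole z_i = r e^{j theta},
    # evaluated on the unit circle z = e^{jw}.
    p = r * cmath.exp(1j * theta)
    zi = cmath.exp(-1j * w)  # z^{-1}
    num = (zi - p.conjugate()) * (zi - p)
    den = (1.0 - p * zi) * (1.0 - p.conjugate() * zi)
    return num / den

def group_delay(w, r, theta, dw=1e-6):
    # Numerical group delay -d(arg A)/dw, with phase-wrap handling.
    d = cmath.phase(allpass2(w + dw, r, theta)) - cmath.phase(allpass2(w - dw, r, theta))
    d = (d + math.pi) % (2.0 * math.pi) - math.pi
    return -d / (2.0 * dw)

theta = 0.25 * math.pi
# Unity magnitude at every frequency, for any stable pole radius.
for r in (0.5, 0.9):
    for w in (0.1, theta, 1.5):
        assert abs(abs(allpass2(w, r, theta)) - 1.0) < 1e-9

# The closer the pole to the unit circle, the larger the group delay near theta.
gd5 = group_delay(theta, 0.5, theta)
gd9 = group_delay(theta, 0.9, theta)
print(round(gd5, 2), round(gd9, 2))
```
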
Fig. 6.13. Γ (ω) term from (6.1) for crossover frequency (a) 100 Hz, (b) 60 Hz.
|H(e^{jω})|² = |H̃_sub(e^{jω})|² + |H̃_sat(e^{jω})|²
+ 2|H̃_sub(e^{jω})||H̃_sat(e^{jω})| cos(φ_sub(ω) − φ_sat(ω) − φ_{A_M}(ω))    (6.7)
where

A_M(e^{jω}) = ∏_{k=1}^{M} [(e^{−jω} − r_k e^{−jθ_k})(e^{−jω} − r_k e^{jθ_k})] / [(1 − r_k e^{jθ_k} e^{−jω})(1 − r_k e^{−jθ_k} e^{−jω})]

φ_{A_M}(ω) = Σ_{k=1}^{M} φ^{(k)}_{A_M}(ω)    (6.8)

φ^{(i)}_{A_M}(ω) = −2ω − 2 tan^{−1}[r_i sin(ω − θ_i) / (1 − r_i cos(ω − θ_i))]
− 2 tan^{−1}[r_i sin(ω + θ_i) / (1 − r_i cos(ω + θ_i))]
and Λ_F(e^{jω}) = 2|H̃_sub(e^{jω})||H̃_sat(e^{jω})| cos(φ_sub(ω) − φ_sat(ω) − φ_{A_M}(ω)). Thus,
to minimize the inhibitory effect of the Λ term (or, in effect, cause it to add coherently
to |H̃_sub(ω)|² + |H̃_sat(ω)|²), in the example above, one can define an average square
error function (or objective function) for minimization as

J(n) = (1/N) Σ_{l=1}^{N} W(ω_l) (φ_sub(ω_l) − φ_sat(ω_l) − φ_{A_M}(ω_l))²    (6.9)
Fig. 6.14. Plot of the unwrapped phase of a second-order all-pass filter for different values of
the pole magnitude for θ = 0.25π.
Fig. 6.15. Plot of the group delay of a second-order all-pass filter for different values of the
pole magnitude for θ = 0.25π.
The gradient descent updates for the all-pass parameters use the gradients

∇_{r_i} J(n) = Σ_{l=1}^{N} W(ω_l) E(φ(ω_l)) (−1) ∂φ_{A_M}(ω_l)/∂r_i(n)

∇_{θ_i} J(n) = Σ_{l=1}^{N} W(ω_l) E(φ(ω_l)) (−1) ∂φ_{A_M}(ω_l)/∂θ_i(n)

E(φ(ω)) = φ_sub(ω) − φ_sat(ω) − φ_{A_M}(ω)    (6.11)
where

∂φ_{A_M}(ω)/∂r_i(n) = − 2 sin(ω_l − θ_i(n)) / [r_i²(n) − 2 r_i(n) cos(ω_l − θ_i(n)) + 1]
− 2 sin(ω_l + θ_i(n)) / [r_i²(n) − 2 r_i(n) cos(ω_l + θ_i(n)) + 1]    (6.12)

and

∂φ_{A_M}(ω)/∂θ_i(n) = − 2 r_i(n)[r_i(n) − cos(ω_l − θ_i(n))] / [r_i²(n) − 2 r_i(n) cos(ω_l − θ_i(n)) + 1]
− 2 r_i(n)[r_i(n) − cos(ω_l + θ_i(n))] / [r_i²(n) − 2 r_i(n) cos(ω_l + θ_i(n)) + 1]    (6.13)
During the update process, care was taken to ensure that |ri (n)| < 1 to guarantee
stability. This was done by including a condition where the ri element that exceeded
unity would be randomized. Clearly, this could increase the convergence time, and
hence in the future other methods may be investigated to minimize the number of
iterations for determining the solution.
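The adaptation loop can be sketched as below. This is an illustrative reimplementation, not the authors' code: the target phase difference is synthetic (generated from an all-pass with known parameters), the gradients are taken by finite differences for brevity instead of Eqs. (6.12)-(6.13), an accept/reject step-size rule replaces a fixed step, and the pole radius is re-randomized if it leaves the unit circle, as described above:

```python
import math, random

def phi_am(w, sections):
    # Cascade all-pass phase of Eq. (6.8): sum of second-order section phases.
    total = 0.0
    for r, th in sections:
        total += (-2.0 * w
                  - 2.0 * math.atan(r * math.sin(w - th) / (1.0 - r * math.cos(w - th)))
                  - 2.0 * math.atan(r * math.sin(w + th) / (1.0 - r * math.cos(w + th))))
    return total

def cost(r, th, grid, target):
    # Objective (6.9) with unity weighting W over the crossover-region grid.
    return sum((target(w) - phi_am(w, [(r, th)])) ** 2 for w in grid) / len(grid)

random.seed(1)
grid = [0.2 + 0.05 * i for i in range(17)]    # rad/sample grid near the crossover
target = lambda w: phi_am(w, [(0.7, 0.6)])    # synthetic "phase difference" to match

r, th, mu = 0.3, 1.2, 0.02                    # initial pole and step size (M = 1)
J0 = J = cost(r, th, grid, target)
for _ in range(400):
    # Finite-difference gradients of J with respect to r and theta.
    gr = (cost(r + 1e-6, th, grid, target) - J) / 1e-6
    gt = (cost(r, th + 1e-6, grid, target) - J) / 1e-6
    r_new, th_new = r - mu * gr, th - mu * gt
    if abs(r_new) >= 1.0:                     # keep the all-pass stable
        r_new = random.uniform(0.0, 0.99)
    J_new = cost(r_new, th_new, grid, target)
    if J_new < J:                             # accept and grow the step ...
        r, th, J, mu = r_new, th_new, J_new, mu * 1.1
    else:                                     # ... or reject and shrink it
        mu *= 0.5
print(round(J0, 3), round(J, 6), round(r, 3), round(th, 3))
```
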
6.4.3 Results
For the combined subwoofer center channel response shown in Fig. 6.9, the r_i and θ_i
with M = 9 were adapted to obtain reasonable minimization of J(n). Furthermore, the
frequency dependent weighting function, W(ω_l), for the above example was chosen
as unity for frequencies between 60 Hz and 125 Hz. The reason for this choice of
weighting terms can be readily seen from the domain of the Λ(ejω ) term of Fig. 6.11
and/or the domain of the “suckout” in Fig. 6.10.
The original phase difference function (φsub (ω) − φsat (ω))2 is plotted in Fig.
6.16 and the cosine term, cos(φsub (ω) − φsat (ω)) that adds incoherently is shown in
Fig. 6.12. Clearly, minimizing the phase difference (using the all-pass cascade in the
satellite channel) around the crossover region will minimize the spectral notch. The
resulting all-pass filtered phase difference function, (φsub (ω)−φsat (ω)−φAM (ω))2 ,
from the adaptation of ri (n) and θi (n) is shown in Fig. 6.17 thereby demonstrating
the minimization of the phase difference around the crossover. The resulting all-pass
filtered term, ΛF (ω), is shown in Fig. 6.18. Comparing Figs. 6.11 and 6.18, it can be
seen that the inhibition turns to an excitation to the net magnitude response around
the crossover region. Finally, Fig. 6.19 shows the resulting combined magnitude re-
sponse with the cascade all-pass filter in the satellite channel, and Fig. 6.20 shows
the third octave smoothed version of Fig. 6.19. A superimposed plot, comprising Fig.
6.20 and the original combined response of Fig. 6.10 is depicted in Fig. 6.21. Clearly,
an improvement of about 7 dB around the crossover can be seen.
6.5 Objective Function-Based Bass Management Filter Parameter Optimization

responses, (iii) designing an equalization filter for each channel loudspeaker (viz.,
warping and LPC based, as shown in Fig. 6.22), and (iv) applying individual bass
management filters to each of the equalization filters in a multichannel audio system.
Quantitatively, as an example, the net subwoofer and satellite response at a lis-
tening position, as shown by Fig. 6.23, can be expressed as
|H(e^{jω})|² = |H_sub(e^{jω}) H̃_sub^{−1}(e^{jω}) + H_sat(e^{jω}) H̃_sat^{−1}(e^{jω})|²    (6.14)

= |H_sub(e^{jω}) H̃_sub^{−1}(e^{jω})|² + |H_sat(e^{jω}) H̃_sat^{−1}(e^{jω})|²
+ 2|H_sub(e^{jω}) H̃_sub^{−1}(e^{jω})| |H_sat(e^{jω}) H̃_sat^{−1}(e^{jω})| cos(φ̃_sub(ω) − φ̃_sat(ω))    (6.15)

where φ̃_sub(ω) and φ̃_sat(ω) denote the phase responses of the equalized subwoofer
and satellite channels, and
where the “hat” above the frequency and phase responses of the subwoofer and
satellite equalization filters represents an approximation due to the lower-order
spectral modeling via LPC. As is evident from Eqs. (6.14) and (6.15), the crossover-region
response of |H(e^{jω})|² can be further optimized through a proper choice of the bass
management filter parameters (ω_c, N, M).
Fig. 6.16. Plot of the phase difference, (φ_sub(ω) − φ_sat(ω))², as a function of frequency.
Fig. 6.17. Plot of the all-pass filtered phase difference, (φ_sub(ω) − φ_sat(ω) − φ_{A_M}(ω))², as
a function of frequency. Observe the reduction around the crossover (≈ 80 Hz).
Fig. 6.18. The influence of the all-pass filtered function Λ_F(e^{jω}) on the combined magnitude
response.
Fig. 6.19. The combined magnitude response of the subwoofer and the satellite with a cascade
of all-pass filters.
Fig. 6.20. The 1/3 octave smoothed combined magnitude response of Fig. 6.19.
Fig. 6.21. A superimposed plot of the original subwoofer and satellite combined magnitude
response and the all-pass filter-based combined magnitude response, demonstrating about a
7 dB improvement around the crossover region (≈ 80 Hz).
With the bass management filter orders N (high-pass) and M (low-pass) also treated
as free parameters, the spectral deviation measure becomes

σ_H(ω_c, N, M) = sqrt{ (1/P) Σ_{i=0}^{P−1} (10 log_{10}|H(e^{jω_i})| − Δ)² }    (6.16)

where Δ = (1/P) Σ_{i=0}^{P−1} 10 log_{10}|H(e^{jω_i})|, |H(e^{jω_i})| can be found from Eq. (6.14),
and P is the number of frequency points selected around the crossover region. Specif-
ically, the smaller the σ_H(ω_c, N, M) value, the flatter is the magnitude response around the
crossover region.
Fig. 6.23. Simplified block diagram for the net subwoofer and satellite channel signal at a
listener location.
6.5.1 Results
The full-range subwoofer and satellite responses of Fig. 6.24 were used for obtaining
the corresponding equalization filters with the warping and LPC modeling method
of Fig. 6.22.
The integer bass management parameters to be applied to the equalization fil-
ters, Ĥ_sat^{−1}(e^{jω}) and Ĥ_sub^{−1}(e^{jω}), were selected from the following intervals: ω_c ∈
{50, . . . , 150} Hz, N ∈ {1, . . . , 5}, M ∈ {1, . . . , 4}. Subsequently, for a particular combination of
(ω_c, N, M), the bass managed equalization filters, H̃_sat^{−1}(e^{jω}) and H̃_sub^{−1}(e^{jω}), were
applied to H_sub(e^{jω}) and H_sat(e^{jω}) to yield the equalized responses H_sub(e^{jω})
H̃_sub^{−1}(e^{jω}) and H_sat(e^{jω}) H̃_sat^{−1}(e^{jω}). The equalized subwoofer response was 1/3 octave
smoothed and level matched with the 1/3 octave smoothed equalized satellite
response. The resulting complex frequency response, H(ejω ), and the correspond-
ing magnitude response were obtained. At this point, the spectral deviation measure
σH (ωc , N, M ) was determined, for the given choice of (ωc , N, M ), in the crossover
region (chosen to be the frequency range of 40 Hz and 200 Hz given the choice of
ω_c). Finally, the best choice of bass management filter parameter set, (ω_c∗, N∗, M∗),
is that set which minimizes the spectral deviation measure in the crossover re-
gion. Specifically,

(ω_c∗, N∗, M∗) = arg min_{(ω_c, N, M)} σ_H(ω_c, N, M)    (6.17)
Fig. 6.24. (a) 1/3 octave smoothed magnitude response of the subwoofer-based LRTF mea-
sured in a reverberant room; (b) 1/3 octave smoothed magnitude response of the satellite-based
LRTF measured in the same room.
The lowest spectral deviation measure, σ_H∗(ω_c, N, M), was obtained for
(ω_c∗, N∗, M∗) corresponding to (60 Hz, 3, 4). Observing the approximate 18
dB/octave natural full-range decay rate (below 60 Hz) of the satellite in Fig. 6.24(b), it is
evident that this choice of N = 3 (i.e., an 18 dB/octave roll-off applied to the satellite speaker
equalized response) will not cause the satellite speaker to be distorted. If necessary, in
the event that N is not sufficiently high, the parameter set with the next larger σ_H(ω_c′, N′, M′) can always
be selected such that N′ > N. Of course, other signal-limiting mechanisms may be
employed in conjunction with the proposed approach, and these are beyond the scope
of this chapter. For this choice of the bass management filter parameters (i.e., (60 Hz,
3, 4)), the net magnitude response |H(ejω )|2 (in dB) is shown in Fig. 6.25. Clearly,
the variation in the crossover region (viz., 40 Hz through 200 Hz) is negligible, and
this is reflected by the smallest value found, σ_H∗(ω_c, N, M) = 0.45. Thus, the
parameter set (60 Hz, 3, 4) forms the correct choice for the bass management filters
for the room responses of Figs. 6.24.
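The grid search over (ω_c, N, M) can be sketched as follows. This is illustrative Python, not the authors' implementation: variable-order analog Butterworth prototypes stand in for the bass management filters, and the toy "equalized" responses and all function names are our assumptions:

```python
import cmath, math

def butter_poles(n):
    # Left-half-plane poles of the analog Butterworth prototype of order n.
    return [cmath.exp(1j * math.pi * (2.0 * k + n - 1.0) / (2.0 * n))
            for k in range(1, n + 1)]

def bw_lp(f, fc, n):
    s = 1j * f / fc
    den = 1.0
    for p in butter_poles(n):
        den *= (s - p)
    return 1.0 / den

def bw_hp(f, fc, n):
    s = 1j * f / fc
    num, den = s ** n, 1.0
    for p in butter_poles(n):
        den *= (s - p)
    return num / den

def sigma_h(h_sub, h_sat, fc, n, m, freqs):
    # Net response with order-n high-pass / order-m low-pass, then its deviation.
    db = []
    for f in freqs:
        h = bw_hp(f, fc, n) * h_sat(f) + bw_lp(f, fc, m) * h_sub(f)
        db.append(20.0 * math.log10(max(abs(h), 1e-12)))
    delta = sum(db) / len(db)
    return math.sqrt(sum((d - delta) ** 2 for d in db) / len(db))

# Toy equalized responses: a satellite with a natural second-order roll-off
# below 40 Hz, and a subwoofer with a small propagation delay.
h_sat = lambda f: (1j * f / 40.0) ** 2 / ((1j * f / 40.0) ** 2 + 1.4142 * (1j * f / 40.0) + 1.0)
h_sub = lambda f: cmath.exp(-1j * 2.0 * math.pi * f * 0.004)

freqs = [40.0 + 4.0 * i for i in range(41)]      # 40-200 Hz crossover region
best = min(((fc, n, m) for fc in range(50, 151, 10)
            for n in range(1, 6) for m in range(1, 5)),
           key=lambda c: sigma_h(h_sub, h_sat, *c, freqs))
print(best, round(sigma_h(h_sub, h_sat, *best, freqs), 3))
```
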
Further examples, as provided in Figs. 6.26 to 6.28, show the net magnitude re-
sponse |H(ejω )|2 for different choices of (ωc , N, M ) that produce a larger
σH (ωc , N, M ). As can be seen, these “nonoptimal” integer choices of the bass man-
agement filter parameters, as determined from the spectral deviation measure, cause
significant variations in the magnitude response in the crossover region.
Fig. 6.25. Net magnitude response |H(e^{jω})|² (dB) for (60 Hz, 3, 4) with σ_H∗(0.0025π, 3, 4) = 0.45.
Fig. 6.26. Net magnitude response |H(e^{jω})|² (dB) for (50 Hz, 4, 4) with σ_H(0.0021π, 4, 4) = 0.61.
6.6 Multiposition Bass Management Filter Parameter Optimization

For multiple listening positions, an average spectral deviation measure can be
formed as

σ_H^{avg} = (1/L) Σ_{j=1}^{L} σ_{H_j}(ω_c, N, M)    (6.18)

where L is the total number of positions equalized during the multiposition equal-
ization step.
In a nutshell, the bass management parameters can be optimized by the fol-
lowing steps: (i) perform multiple position equalization on raw responses (i.e., re-
sponses without any bass management applied to them); (ii) apply the candidate bass
management filters, parameterized by (ω_c, N, M) with ω_c ∈ {40, . . . , 200} Hz,
N ∈ {2, . . . , 4}, M ∈ {3, . . . , 5}, to the equalized subwoofer and satellite responses; (iii) per-
form subwoofer and satellite level setting using bandlimited noise and perceptual
C-weighting; (iv) determine the average spectral deviation measure, σ_H^{avg}, after per-
forming 1/3 octave smoothing for each of the net (i.e., combined subwoofer and
satellite) responses in the range of 40 Hz to 250 Hz; and (v) select the (ω_c, N, M) that
minimizes σ_H^{avg}.
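Steps (iv) and (v) can be sketched once candidate net responses are in hand; in the sketch below, the toy dB values and the function names are ours:

```python
import math

def spectral_deviation_db(db_points):
    # sigma_H at one position: std dev of the smoothed net dB response.
    delta = sum(db_points) / len(db_points)
    return math.sqrt(sum((d - delta) ** 2 for d in db_points) / len(db_points))

def avg_spectral_deviation(db_per_position):
    # sigma_H^avg of Eq. (6.18): average over the L equalized positions.
    sigmas = [spectral_deviation_db(p) for p in db_per_position]
    return sum(sigmas) / len(sigmas)

# Toy candidate comparison: net dB responses (40-250 Hz region) at L = 2
# positions for two hypothetical (wc, N, M) candidates.
candidate_a = [[0.0, -0.5, 0.2, 0.1], [0.3, 0.0, -0.2, 0.1]]    # fairly flat
candidate_b = [[0.0, -6.0, -8.0, 0.5], [0.2, -5.0, -7.5, 0.3]]  # notched
better = min((candidate_a, candidate_b), key=avg_spectral_deviation)
print(better is candidate_a)  # True: the flatter candidate wins
```
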
6.6.1 Results
Fig. 6.27. Net magnitude response |H(e^{jω})|² (dB) for (130 Hz, 2, 1) with σ_H(0.0054π, 2, 1) = 1.56.
Fig. 6.28. Net magnitude response |H(e^{jω})|² (dB) for (90 Hz, 5, 3) with σ_H(0.0037π, 5, 3) = 2.52.
Fig. 6.29. An example of full-range subwoofer and satellite responses measured at four differ-
ent positions in a room with T60 ≈ 0.75 s.
Fig. 6.30. The equalized and bass management filters parameter optimized responses where
the lighter (thin) curve corresponds to all parameter optimization and the thick or darker curve
corresponds to crossover frequency optimization over multiple positions.
determining the parameter set, (ω_c, N, M), that minimizes σ_H^{avg} are described in the
preceding section.
preceding section. As a comparison, we also present the results of performing only
crossover frequency optimization [96], but for multiple positions, using the average
spectral deviation measure. The resulting equalized plots are shown in Fig. 6.30.
Comparing the results of the full parameter optimization (lighter curve) with
those of the crossover frequency optimization over multiple positions, it can be seen that
all-parameter optimization flattens the magnitude response around the crossover region (viz., 40 Hz
through 250 Hz). Specifically, for example, in position 2 a lower Q (i.e., broad) notch
around the crossover region obtained through crossover frequency optimization is
transformed to a high Q (i.e., narrow width) notch by all-parameter optimization.
In addition, as shown via position 4, a very broad and high amplitude undesirable
peak in the magnitude response, obtained from crossover frequency optimization, is
reduced in amplitude and narrowed through all-parameter optimization (red curve).
In fact, Toole and Olive [97] have demonstrated that, based on steady-state mea-
surements, low-Q resonances producing broad peaks in the measurements are more
easily heard than high-Q resonances producing narrow peaks of similar amplitude.
The crossover frequency optimization resulted in a crossover at 90 Hz, with the mini-
mum of the average spectral deviation measure, σ_H^{avg}, being 0.98. The all-parameter
optimization resulted in the parameter set (ω_c, N, M) corresponding to (80 Hz, 4, 5),
with σ_H^{avg} for this parameter set being minimum at 0.89. Also, a comparison between
Fig. 6.31. The equalized and bass management filters parameter optimized responses where
the thin curve corresponds to all-parameter optimization and the thick curve corresponds to
unequalized responses with the standard bass management filters (80 Hz, 4, 5).
un-equalized responses with bass management set to the standard of (80 Hz, 4, 5)
and the all-parameter optimized and equalized responses is shown in Fig. 6.31.
Fig. 6.32. System representation of time delay technique to correct crossover region response.
σ_{|H|}(e^{jω}) = sqrt{ (1/D) Σ_{i=P_1}^{P_2} (10 log_{10}|H(e^{jω_i})| − Δ)² }    (6.21)

where Δ = (1/D) Σ_{i=P_1}^{P_2} 10 log_{10}|H(e^{jω_i})|, D = (P_2 − P_1 + 1) is the number of
frequency points selected around the crossover region, and |H(e^{jω_i})| can be found
from (6.20). For the present simulations, because the bass management filters
that were selected had a crossover around 80 Hz, P1 and P2 were selected to be bin
numbers corresponding to 40 Hz and 200 Hz, respectively (viz., for an 8192 length
response and a sampling rate of 48 kHz, the bin numbers corresponded to about 6
and 34 for 40 Hz and 200 Hz, respectively).
Thus, the process for selecting the best time delay n_d∗ is: (i) set n_d = 0 (this may
be relative to any delay used for time-aligning speakers such that the
relative delays between signals from the various channels to a listening position are ap-
proximately zero); (ii) level match the subwoofer and satellite; (iii) determine (6.19);
(iv) determine (6.21); (v) set n_d = n_d + 1; (vi) repeat (ii) through (v) while n_d < N_d; and
(vii) select n_d∗ = arg min_{n_d} σ_{|H|}(e^{jω}).
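The delay search (i)-(vii) can be sketched as below. This is illustrative Python: the toy responses, the 42-sample subwoofer delay, and the bin grid are our assumptions; level matching is trivial here because both toy responses have unit magnitude:

```python
import cmath, math

FS = 48000.0
BINS = [FS * k / 8192.0 for k in range(7, 35)]   # ~40-200 Hz on an 8192-pt grid

def sigma_for_delay(nd, h_sub, h_sat):
    # Delay the satellite by nd samples and evaluate the deviation of (6.21).
    db = []
    for f in BINS:
        h = h_sub(f) + h_sat(f) * cmath.exp(-1j * 2.0 * math.pi * f * nd / FS)
        db.append(20.0 * math.log10(max(abs(h), 1e-12)))
    delta = sum(db) / len(db)
    return math.sqrt(sum((d - delta) ** 2 for d in db) / len(db))

# Toy level-matched responses: the subwoofer path carries a 42-sample delay.
h_sub = lambda f: cmath.exp(-1j * 2.0 * math.pi * f * 42.0 / FS)
h_sat = lambda f: 1.0

best_nd = min(range(0, 200), key=lambda nd: sigma_for_delay(nd, h_sub, h_sat))
print(best_nd)  # 42: delaying the satellite to match removes the notch
```
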
Care should be taken to ensure that (i) the delay nd is not large enough to cause
a perceptible delay between audio and video frames, and (ii) the relative delays be-
tween channels should not be large enough to cause any imaging problems. Further-
more, if nd is set relative to time-aligned delays, then the termination condition can
be set as Md < nd < Nd (with Md < 0 and Nd > 0) where small negative de-
lays in each channel are allowed, as long as they are not large enough relative to
delays in other channels, to influence imaging detrimentally. In this chapter we have
selected M_d = 0 and N_d = 200, which roughly translates to about a 4 ms delay at a
48 kHz sampling rate, and results are always presented for the case of one loudspeaker
and a subwoofer. Future work (as explained below) is in the direction of
joint crossover and time delay optimization, so as to have minimal time delay offsets
between channels in a multichannel system.
Furthermore, this technique can be easily adapted to multiposition crossover cor-
rection (results of which are presented subsequently) by defining and optimizing over
an average spectral deviation measure given as
σ_{|H|}^{avg}(e^{jω}) = (1/L) Σ_{j=1}^{L} σ_{|H_j|}(e^{jω})    (6.22)
where L is the total number of positions and σ|Hj | (ejω ) is the narrowband spectral
deviation measure at position j. Additionally, this technique can be cascaded with the
automatic crossover frequency finding method described in [98] (i.e., in conjunction
with a room equalization algorithm).
Figure 6.33 shows the full-range satellite and subwoofer response at a listening po-
sition, whereas Fig. 6.34 compares the bass managed response (dash-dot line), with
crossover at 60 Hz, with the spectral deviation-based time delay corrected crossover
region response. The optimal time delay found was n∗d = 142 samples at 48 kHz.
Figure 6.35 compares the correction being done in the crossover region using the
automatic time delay and spectral deviation-based technique but for a crossover of
70 Hz for the same speaker set and same position as that of Fig. 6.33. The optimal
time delay was 88 samples; a minimal 10 Hz difference in the crossover thus reduced
the delay required to correct the crossover region response by 54 samples. One possibility
is that this is because the suckout for the 70 Hz case was less deep than in the 60 Hz
case, and hence a smaller time delay sufficed for effective crossover correction. This
is further validated by selecting the crossover to be 90 Hz and observing the time
delay correction required. Figure 6.36 shows that a small amount of crossover region
response correction is achieved, for crossover of 90 Hz, given the optimal time delay
n∗d = 40 samples. This time delay is a further reduction of 48 samples over the 70
Hz case as the crossover region response is further optimized when the crossover
frequency is selected at 90 Hz. From the three crossover frequencies for the center-
sub case, 90 Hz is the best crossover frequency as it gives the least amount of suckout
in the crossover region and hence requires a very small time delay of just 40 samples.
Accordingly, it can be inferred that the time delay offsets between channels in
a multichannel setup can be kept at a minimum, but still provide crossover region
correction, by either of the following techniques, (i) first performing a crossover
frequency search by the crossover finding method to improve the crossover region
response for each channel loudspeaker and subwoofer response and then applying a
relatively smaller time delay correction to each satellite channel to further improve
the crossover response, or (ii) performing a multidimensional search for the best
choice of time delay and the crossover, simultaneously, using the spectral deviation
measure, so as to keep the time delay offsets between channels at a minimum.
Figure 6.37 shows the full-range subwoofer and left surround responses at a lis-
tening position, whereas Fig. 6.38 compares the bass managed response (dash-dot
line), with crossover at 120 Hz, with the spectral deviation-based time delay cor-
rected crossover region response. The optimal time delay n∗d was 73 samples at
48 kHz.
6.8 Summary
In this chapter we presented results that show the effect of a proper choice of
crossover frequency for improving low-frequency performance. Additional param-
eter optimization of the bass management filters is shown to yield improved per-
formance. Comparison between the results from crossover and all-parameter opti-
mization, of the bass management filters, for multiposition equalization is presented.
As was shown, cascading an all-pass filter can provide further improvements to the
equalization result in the crossover region. Alternatively, time delay adjustments can
be made in each loudspeaker channel to correct the crossover region response.
Fig. 6.33. The individual full-range subwoofer and a center channel magnitude response mea-
sured at a listening position in a reverberant room.
154 6 Practical Considerations for Multichannel Equalization
Fig. 6.34. The bass managed combined response as well as time delay and spectral deviation
measure-based corrected crossover response (crossover frequency = 60 Hz).
Fig. 6.35. The bass managed combined response as well as time delay and spectral deviation
measure-based corrected crossover response (crossover frequency = 70 Hz).
Fig. 6.36. The bass managed combined response as well as time delay and spectral deviation
measure-based corrected crossover response for the left surround (crossover frequency = 90
Hz).
Fig. 6.37. The individual full-range subwoofer and a left surround channel magnitude response
measured at a listening position in a reverberant room.
Fig. 6.38. The bass managed combined response as well as time delay and spectral deviation
measure-based corrected crossover response for the left surround (crossover frequency = 120
Hz).
7
Robustness of Equalization to Displacement Effects:
Part I
7.1 Introduction
A typical room is an acoustic enclosure that can be modeled as a linear system whose
behavior at a particular listening position is characterized by an impulse response.
The impulse response yields a complete description of the changes a sound signal
undergoes when it travels from a source to a receiver (microphone/listener). The sig-
nal at the receiver consists of direct path components, discrete reflections that arrive
a few milliseconds after the direct sound, as well as a reverberant field component. In
addition, it is well established that room responses change with source and receiver
locations in a room [11, 63].
© 2004 ASA. Reprinted, with permission, from S. Bharitkar and C. Kyriakakis, "Robustness of spatial average equalization: A statistical reverberation model approach", J. Acoust. Soc. Amer., 116:3491.
Fig. 7.1. Examples of room acoustical responses, having the direct and reverberant components, measured at two positions a few feet apart in a room.
Specifically, the time of arrival of the direct and multipath reflections and the en-
ergy of the reverberant component will vary from position to position. In other words,
a room response at position i, pf,i , can be expressed as pf,i = pf,d,i + pf,rev,i ;
whereas the room response at position j, pf,j , can be expressed as pf,j = pf,d,j +
pf,rev,j where pf,d,j is the frequency response for the direct path component, and
pf,rev,j is the response for the multipath component. An example of time domain re-
sponses at two positions, displaced a few feet apart from each other, in a room with
reverberation time of about 0.25 seconds, is shown in Fig. 7.1 along with the direct
component, early reflections, and late reverberant components. Figure 7.2 shows the
corresponding frequency response from 20 Hz to 20 kHz.
One of the goals in equalization is to minimize the spectral deviations (viz., cor-
recting the peaks and dips) found in the magnitude response through an equaliza-
tion filter. This correction of the room response significantly improves the quality of
sound played back through a loudspeaker system. In essence, the resulting system
formed from the combination of the equalization filter and the room response should
have a perceptually flat frequency response.
One of the important considerations is that the equalization filter has to be de-
signed such that the spectral deviations in the magnitude response (e.g., Fig. 7.2) are
minimized simultaneously for all listeners in the environment. Simultaneous equal-
ization is an important consideration because listening has evolved into a group ex-
perience (e.g., as in home theaters, movie theaters, and concert halls). An example
of performing only a single position equalization (by designing an inverse filter for
position 1) is shown in Fig. 7.3. The top plot shows the equalization result at position
Fig. 7.2. Magnitude responses of room responses of Fig. 7.1 showing different spectral devia-
tions (from flat) at the two listener positions.
Fig. 7.3. Magnitude responses, upon single position equalization, of responses of Fig. 7.2.
Specifically, the equalization filter is designed to correct for deviations at position 1, but the
equalized response at position 2 is degraded.
Fig. 7.4. Magnitude responses, upon spatial average equalization, of responses of Fig. 7.2.
Specifically, the equalization filter is designed to correct for deviations, on an average, at
positions 1 and 2.
1 (which shows a flat response under ideal filter design).1 However, the equalization
performance is degraded at position 2 with the use of this single position filter as can
be seen in the lower plot. For example, comparing Figs. 7.2 and 7.3, it can be seen
that the response around 50 Hz at position 2, after single position equalization, is at
least 7 dB below the response before equalization.
One method for providing simultaneous multiple listener equalization is spa-
tially averaging the measured room responses at different positions, for a given
loudspeaker, and stably inverting the result. The microphones are positioned, during
measurements, at the expected center of a listener’s head. An example of performing
spatial average equalization is shown in Fig. 7.4. Clearly, the spectral deviations are
significantly minimized for both positions through the spatial average equalization
filter.2
Although spatial average equalization is aimed at achieving uniform frequency
response coverage for all listeners, its performance is often limited due to (i) mis-
match between microphone measurement location and actual location for the center
of the listener head, or (ii) variations in listener locations (e.g., head movements).
1 In practice, a low-pass filter with a large cutoff frequency (e.g., 10 kHz), depending on the direct-to-reverberant energy, is applied to the equalization filter to prevent audio from sounding bright.
2 The filter was a finite impulse response filter of duration 8192 samples.

In this chapter, we present a method for evaluating the robustness of spatial
averaging-based equalization, due to the introduction of variations in room responses
(generated either through (i) or (ii)), for rectangular listener arrangements relative to
a fixed sound source. The proposed approach uses a statistical description for the
reverberant field in the responses (viz., via the normalized correlation functions) in
a rectangular listener configuration for a rectangular room.3 A similar approach is
followed in [70] for determining variations in performance. However, this was done
with a single position equalization in mind and is focused for microphone array ap-
plications (e.g., sound source localization). Talantzis and Ward [99] used a similar
analysis for understanding the effect of source displacements, but this analysis was
also presented for a microphone array setup without spatial average equalization.
The advantage of the proposed approach is that (i) it is based on established theory
of the statistical nature of reverberant sound fields [16]; (ii) it can be applied to
a large frequency range above the Schroeder frequency, for typical size rooms, unlike
modal equations, which are valid for low frequencies having wavelengths greater than
1/3 min[Lx, Ly, Lz] [12]; and (iii) the computational complexity, due to the approximations, is low.
In the next section we introduce background necessary for the development of
the robustness analysis. Specifically, an introduction is provided for the determinis-
tic direct component, and the statistical reverberant field correlations. Subsequently
we present the mismatch measure for analyzing the effects of mismatch between
microphone (during measurement of room responses) and listener position (during
playback) with a spatial average equalizer. Additionally, convergence analysis of the
equalization mismatch error, for spatial average equalization, is presented at the end
of the section. Results based on simulations for typical rectangular listener arrange-
ments relative to a fixed source are presented for a rectangular configuration as this
is fairly common in large environments (e.g., movie theaters, concert halls) as well
as in typical home theater setups. The analysis can be extended to arbitrary listening
configurations.
7.2 Room Acoustics for Simple Sources

For a simple (monopole) source at location i_0, the direct path component of the room response at location i can be written as

p_{f,d}(i|i_0) = (jkcρ S_f / 4πR) e^{−jkR}    (7.1)

R² = |i − i_0|²    (7.2)

where p_{f,d}(i|i_0) is the direct component sound pressure amplitude, S_f is the source
strength, k = 2π/λ is the wavenumber, c = λf is the speed of sound (343 m/s), and
ρ is the density of the medium (1.25 kg/m³ at sea level).
The normalized correlation function [100], which expresses a statistical relation
between sound pressures of reverberant components at separate locations i and j,
is given by

E{p_{f,rev,i} p*_{f,rev,j}} / sqrt( E{|p_{f,rev,i}|²} E{|p_{f,rev,j}|²} ) = sin(kR_ij) / (kR_ij)    (7.3)

where R_ij is the separation between the two locations i and j relative to an origin,
and E{.} is the expectation operator.
The reverberant field mean square pressure is defined as

E{p_{f,rev,i} p*_{f,rev,i}} = 4cρΠ_a(1 − ᾱ) / (S ᾱ)    (7.4)

where Π_a is the power of the acoustic source, ᾱ is the average absorption coefficient
of the surfaces in the room, and S is the surface area of the room.
The assumption of a statistical description (as given in (7.3), (7.4)) for reverberant fields in rooms is justified if the following conditions are fulfilled [16]: (1) the linear
dimensions of the room must be large relative to the wavelength; (2) the average spacing
of the resonance frequencies must be smaller than one-third of their bandwidth
(this condition is fulfilled in rectangular rooms at frequencies above the Schroeder
frequency, f_s = 2000 sqrt(T60/V) Hz, where T60 is the reverberation time in seconds and V
is the volume in m³); (3) both source and microphone are in the interior of the room,
at least a half-wavelength away from the walls.
Furthermore, under the conditions in [16], the direct and reverberant sound pres-
sures are uncorrelated.
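The Schroeder frequency condition above is a one-line computation; a minimal sketch (the constant 2000 assumes T60 in seconds and V in cubic meters):

```python
import math

def schroeder_frequency(t60, volume):
    """f_s = 2000 * sqrt(T60 / V); T60 in seconds, V in m^3."""
    return 2000.0 * math.sqrt(t60 / volume)

# Illustrative example: a 100 m^3 room with T60 = 0.5 s.
f_s = schroeder_frequency(0.5, 100.0)
print(round(f_s, 1))  # -> 141.4 (Hz); statistical description valid above this
```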
7.3 Mismatch Analysis for Spatial Average Equalization

A performance function, W̄_f, that is used for analyzing the effects of mismatch, for
spatial average equalization, of room responses is given as

W̄_f = (1/N) Σ_{i=1}^{N} ε_{f,i}(r);    ε_{f,i}(r) = E{ | p̃_f(r) p̄_f^{−1} − p_{f,i} p̄_f^{−1} |² }    (7.5)
In (7.5), ε_{f,i}(r) represents the equalization error in the r-neighborhood of the
equalized location i having response p_{f,i} (the r-neighborhood is defined as all points
at a distance of r from location i). The neighboring response, at a distance r from
location i, is denoted by p̃_f(r), whereas the spatial average equalization response is
denoted by p̄_f. Thus, response p̃_f(r) is the response corresponding to the displaced
center of head position of the listener (viz., with a displacement of r). To get an intermediate equalization error measure, ε_{f,i}(r), the expectation is performed over all
neighboring locations at a distance r from the equalized location i. Furthermore, the
final performance function W̄_f is the average of all the equalization errors, ε_{f,i}(r),
in the vicinity of the N equalized locations. In essence, the displacement (distance)
r can be interpreted as a "mismatch parameter," because a room response measured
at displacement r will be different from the response measured at a nominal location i.
For simplicity, in our analysis, we assume variations in responses due to dis-
placements (or mismatch) in a horizontal plane (i.e., the x and y plane). The analysis,
presented in this chapter, can be extended to include displacements on a spherical
surface. Thus, (7.5) can be simplified to yield
ε_{f,i}(r) = E{ | p̃_f(r) N / Σ_{j=1}^{N} p_{f,j}  −  p_{f,i} N / Σ_{j=1}^{N} p_{f,j} |² }    (7.6)
An approximate simplification for (7.5) can be done by using the Taylor series
expansion [101]. Accordingly, if g is a function of random variables, x_i, with average
values E{x_i} = x̄_i, then g(x_1, x_2, . . . , x_n) = g(x) can be expressed as g(x) =
g(x̄) + Σ_{i=1}^{n} g_i(x̄)(x_i − x̄_i) + g(x̂), where g(x̂) is a function of order 2 (i.e., all its
partial derivatives up to the first order vanish at (x̄_1, x̄_2, . . . , x̄_n)). Thus, to a zeroth
order of approximation, E{g(x)} ≈ g(x̄).
Hence, an approximation for (7.6) is given as

ε_{f,i}(r) ≈ N² E{ p̃_f(r)p̃_f(r)* − p̃_f(r)p*_{f,i} − p̃_f(r)* p_{f,i} + p_{f,i}p*_{f,i} } / Σ_j Σ_k E{p_{f,j} p*_{f,k}}    (7.7)
In summary, (7.8) is obtained by using (7.1) and knowing that the reverberant and
direct field components of sound pressure are uncorrelated, (7.9) is derived in [12,
p. 311], (7.10) is determined by using (7.2) and (7.9), and (7.11) is determined from
(7.3) and (7.4). In (7.12), which is the cosine law, θ_jk is the angle, subtended at the
source at i_0, between locations j and k.
Thus, the denominator term in (7.7) is

Σ_j Σ_k E{p_{f,j} p*_{f,k}} = Σ_j Σ_k [ (Π_a cρ / (4πR_j R_k)) e^{jk(R_j − R_k)} + (4cρΠ_a(1 − ᾱ) / (S ᾱ)) (sin kR_jk / kR_jk) ]    (7.13)
The reverberant field correlation for the second term in the numerator of (7.7) can be
found using (7.3), and is

E{p̃_{f,rev}(r) p*_{f,rev,i}} = (4cρΠ_a(1 − ᾱ) / (S ᾱ)) (sin kr / kr)    (7.20)
The third numerator term in (7.7) can be found in a similar manner as compared to
the derivation for (7.19) and (7.20).
The last term in the numerator of (7.7) is computed to yield

E{p_{f,i} p*_{f,i}} = Π_a ρc / (4πR_i²) + 4ρcΠ_a(1 − ᾱ) / (S ᾱ)    (7.21)
Equation (7.21) can be obtained by substituting j = k = i in (7.10) and (7.11),
respectively. Substituting the computed results into (7.7), and simplifying by can-
celing certain common terms in the numerator and the denominator, the resulting
equalization error due to displacements (viz., mismatch in responses) is
ε_{f,i}(r) ≈ (N² / ψ_1) [ (1/(8πR_i r)) log|(R_i + r)/(R_i − r)| + 2ψ_2 + 1/(2ψ_3) − (1/ψ_3 + 2ψ_2)(sin kr / kr) ]    (7.22)

where

ψ_1 = Σ_j Σ_l [ (1/(4πR_j R_l)) e^{jk(R_j − R_l)} + ψ_2 (sin kR_jl / kR_jl) ]
ψ_2 = 4(1 − ᾱ) / (S ᾱ)
ψ_3 = 2πR_i²
R_jl = sqrt( R_j² + R_l² − 2R_j R_l cos θ_jl )
Finally, substituting (7.22) into (7.5) yields the necessary equation for W̄f .
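A quick numerical sanity check of one reading of the closed-form error in (7.22): as r → 0 the direct spreading term tends to 1/(2ψ_3) and sin(kr)/kr tends to 1, so the bracket cancels and the error vanishes (no mismatch, no error). The parameter values below are illustrative assumptions, and ψ_1 is set to 1 to inspect only the shape of the curve.

```python
import numpy as np

def eps_fi(r, k, R_i, psi1, psi2, psi3, N=6):
    """Equalization error near listener i, one reading of Eq. (7.22):
    a direct-field spreading term plus reverberant terms with sinc(kr) decay."""
    direct = np.log(np.abs((R_i + r) / (R_i - r))) / (8 * np.pi * R_i * r)
    sinc = np.sin(k * r) / (k * r)
    return (N**2 / psi1) * (direct + 2 * psi2 + 1 / (2 * psi3)
                            - (1 / psi3 + 2 * psi2) * sinc)

# Hypothetical parameters: f = 500 Hz, listener at R_i = 3 m,
# alpha_bar = 0.3, room surface S = 384 m^2 (illustrative values).
k = 2 * np.pi * 500 / 343.0
R_i, S, alpha = 3.0, 384.0, 0.3
psi2 = 4 * (1 - alpha) / (S * alpha)
psi3 = 2 * np.pi * R_i**2
psi1 = 1.0  # the double sum over listener pairs; set to 1 to inspect the shape

print(eps_fi(1e-5, k, R_i, psi1, psi2, psi3))   # essentially 0 at r -> 0
print(eps_fi(0.35, k, R_i, psi1, psi2, psi3))   # positive error once displaced
```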
7.4 Results
We simulated Eq. (7.22) for frequencies above the Schroeder frequency fs = 77 Hz
(i.e., T60 = 0.7 sec, V = 8 m ×8 m ×8 m).
In this setup, we simulated a rectangular arrangement of six microphones, with
a source in the front of the arrangement. Specifically, microphones 1 and 3 were at a
distance of 3 m from the source, microphone 2 was at 2.121 m, microphones 4 and
6 were at 4.743 m, and microphone 5 was at 4.242 m. The angles θ1k in (7.12) were
(45, 90, 18.5, 45, 71.62) degrees for (k = 2, . . . , 6), respectively. Thus, the distances
of the listeners from the source are such that R6 = R4 > R5 > R1 = R3 > R2 .
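The quoted distances and angles are consistent with a two-row listener grid on a 2.121 m pitch with the source on the center line; the coordinates below are one such layout (an assumption, reconstructed from the distances, not stated in the text), verified with the law of cosines:

```python
import math

# One grid layout consistent with the quoted distances: two rows of three
# listeners on a 2.121 m pitch, source at the origin on the center line.
s = 3.0 / math.sqrt(2.0)          # 2.121 m grid spacing
src = (0.0, 0.0)
mics = {1: (-s, s), 2: (0.0, s), 3: (s, s),
        4: (-s, 2 * s), 5: (0.0, 2 * s), 6: (s, 2 * s)}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

R = {i: dist(src, m) for i, m in mics.items()}
print({i: round(r, 3) for i, r in R.items()})
# approx. R1 = R3 = 3.0, R2 = 2.121, R4 = R6 = 4.743, R5 = 4.243 (m)

def angle_at_source(i, j):
    """Angle subtended at the source between listeners i and j (degrees)."""
    a, b = mics[i], mics[j]
    cos_t = (a[0] * b[0] + a[1] * b[1]) / (R[i] * R[j])
    return math.degrees(math.acos(cos_t))

print([round(angle_at_source(1, k), 2) for k in (2, 3, 4, 5, 6)])
# approx. theta_1k = (45, 90, 18.43, 45, 71.57) degrees, matching the text
```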
The equalization error, ε_{f,i}(r), results are depicted for different listeners in Figs.
7.5 to 7.8 for four frequencies (f = 500 Hz, f = 1 kHz, f = 5 kHz, and f = 10
kHz) as a function of r/λ, where the mismatch parameter 0 ≤ r ≤ 0.7 m (r/λ = 0 corresponds to the no-mismatch condition). Specifically, only the results for listeners 1 and 2
are shown in the top panels because listener 3 is positioned symmetrically to listener 1
relative to the source and listener 2 (hence the results of listeners 1 and 3 are identical).
Similarly, only the results for listeners 4 and 5 are shown in the bottom panels.
We observe the following.
1. It can be seen that the steady-state equalization error at listener 2 is higher than
that at listener 1 (top panel). This follows from Eq. (7.23) (because R1 = R3 > R2 ).
Similar results can be predicted for the equalization errors for listeners 4 and 5 (this
is not immediately obvious in the bottom panels, because R4 is close to R5 ).
Fig. 7.5. ε_{f,i}(r) for the listeners at different distances from the source, r/λ = 0 corresponds to the optimal position, f = 500 Hz.

Fig. 7.6. ε_{f,i}(r) for the listeners at different distances from the source, f = 1 kHz.

Fig. 7.7. ε_{f,i}(r) for the listeners at different distances from the source, f = 5 kHz.

Fig. 7.8. ε_{f,i}(r) for the listeners at different distances from the source, f = 10 kHz.
2. An initial rise of the error towards a peak value, before a steady-state value is
reached, is evident at lower frequencies but not easily noticeable at higher frequencies.
3. The equalization error shows a sinc(2r/λ) dependence after the initial peak (as
emphasized in Fig. 7.6). This dependence arises from the finite correlation of the
reverberant field before it reaches a negligible value at steady state.
Finally, Fig. 7.9 summarizes the average equalization error (i.e., W̄_f of
(7.5)), over all listeners, for frequencies beyond f_s and mismatch parameter r ranging
from 0 m to 0.7 m. This composite measure weighs the equalization error at all
positions equally, and shows that the performance degrades at all frequencies with
increasing mismatch or displacement. Also, the degradation for small displacements r
(of the order of 0.1 m) is larger at higher frequencies. For example, it can be seen
that the slope of the W̄_f curves in the frequency region around 200 Hz is lower than
the slopes of the curves for frequencies around 10 kHz.
Alternate measures with nonuniform weighting, depending on the “importance” of
a listening position, may be used instead. Thus, such a measure could potentially be
used to give an overall picture during comparisons to other approaches of multiple
listener equalization.
7.5 Summary
In this chapter, we analyzed the performance of spatial average equalization, in
a multiple listener environment, used during sound playback. As is well known,
room equalization at multiple positions allows for high-quality sound playback in
the room.

Fig. 7.9. W̄_f for various mismatch parameters and frequencies between 20 Hz and 20 kHz.

However, as is typically the case in room equalization, the microphone positions
during measurement of the room response will not necessarily correspond to the
center of head of the listener, leading to a frequency-dependent degradation due
to mismatch between the measured response and the actual response corresponding
to the center of the listener's head during playback. Several interesting observations can
be made from the results, including: (i) the influence of frequency and distance on
the size of equalization region, (ii) the steady-state equalization error being depen-
dent on the distance of the listener from the source, and (iii) the dependence of the
reverberant field correlation on the equalization error. Future goals can be directed
to using the proposed method for comparing different multiple listener equalization
techniques in terms of their robustness to response mismatch.
8
Robustness of Equalization to Displacement Effects:
Part II
8.1 Introduction
In this chapter, we propose a statistical approach, using modal equations, for evaluating the robustness of magnitude response average equalization, due to variations
in room responses (arising, as in Chapter 7, from microphone/listener location mismatch or from listener movement). Specific results are obtained for a particular listener arrangement in a rectangular room. Modal equations have been used in
the analysis because they accurately model the magnitude response at low frequen-
cies. As is well known, dominant modes, in this low-frequency region, are relatively
harder to equalize than at higher frequencies. In the next section, we introduce the
necessary background used in the development of the proposed robustness analysis.
The subsequent section is devoted to the development of the robustness analysis for
spatial average-based equalization. Results based on simulations for a typical rectan-
gular listener arrangement relative to a fixed source and validation of the theoretical
analysis are presented.
p_ω(q_l) = jQωρ_0 Σ_n p_n(q_l) p_n(q_o) / [K_n (k² − k_n²)]
         = jQωρ_0 Σ_{n_x=0}^{N_x−1} Σ_{n_y=0}^{N_y−1} Σ_{n_z=0}^{N_z−1} p_n(q_l) p_n(q_o) / [K_n (k² − k_n²)]    (8.1)

n = (n_x, n_y, n_z);  k = ω/345;  q_l = (x_l, y_l, z_l)

k_n² = π² [ (n_x/L_x)² + (n_y/L_y)² + (n_z/L_z)² ]

∫_V p_n(q_l) p_m(q_l) dV = K_n  (n = m);  = 0  (n ≠ m)

p_n(q_l) = cos(n_x π x_l / L_x) cos(n_y π y_l / L_y) cos(n_z π z_l / L_z)    (8.2)
The magnitude response averaging method is popular for performing spatial equal-
ization over a wide area in a room. The magnitude response spatial averaging process
can be expressed in terms of the modal equations (8.1) as
Fig. 8.1. The eigenfunction distribution, for a tangential mode (3,2,0) in the z = 0 plane, over
a room of dimensions 6 m ×6 m ×6 m.
p_{ω,avg} = (1/N) Σ_{l=1}^{N} |p_ω(q_l)|    (8.3)
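The modal sum (8.1) with cosine eigenfunctions can be evaluated directly. The sketch below assumes a rigid-walled rectangular room with illustrative dimensions, source strength, and positions (none of these values come from the text); K_n is taken as V divided by 2 per nonzero index, the standard normalization for cosine eigenfunctions. Modal reciprocity (swapping source and listener) is used as a built-in check.

```python
import numpy as np

L = np.array([6.0, 6.0, 6.0])           # room dimensions Lx, Ly, Lz (m)
c, rho0, Q = 345.0, 1.21, 1e-3          # sound speed, air density, source strength
V = float(L.prod())

def p_n(q, n):
    """Eigenfunction (8.2): product of cosines along each axis."""
    return float(np.prod(np.cos(np.asarray(n) * np.pi * np.asarray(q) / L)))

def pressure(q_l, q_o, f, N=10):
    """Truncated modal sum (8.1): N modes per axis."""
    k = 2.0 * np.pi * f / c
    acc = 0.0
    for nx in range(N):
        for ny in range(N):
            for nz in range(N):
                n = (nx, ny, nz)
                kn2 = sum((ni * np.pi / Li) ** 2 for ni, Li in zip(n, L))
                # K_n = V / (eps_x eps_y eps_z), eps = 1 for a zero index, else 2
                Kn = V / np.prod([1.0 if ni == 0 else 2.0 for ni in n])
                acc += p_n(q_l, n) * p_n(q_o, n) / (Kn * (k ** 2 - kn2))
    return 1j * Q * (2.0 * np.pi * f) * rho0 * acc

source = (0.5, 0.5, 0.0)
listener = (3.0, 2.0, 0.0)
p = pressure(listener, source, 40.0)
print(abs(p))
# Modal reciprocity: swapping source and listener leaves the pressure unchanged.
print(abs(pressure(source, listener, 40.0) - p) < 1e-12)
```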
The robustness is analyzed through the intermediate performance measure

W_ω^{(i)}(ε) = E{ | p_ω(ν^{(i)}) p_{ω,avg}^{−1} − p_ω(q_i) p_{ω,avg}^{−1} |² }    (8.4)

where p_ω(ν^{(i)}) is the pressure at location ν^{(i)} in the ε-neighborhood of equalized position i having pressure p_ω(q_i) (the ε-neighborhood is defined as all points at a distance of
ε from location i), E{.} denotes the statistical expectation operator, and ω = 2πc/λ
(where c = 345 m/s).

The intermediate performance measure in (8.4) is defined in such a manner that
when the displacement ε, about position i (whose response p_ω(q_i) is originally
used for determining the spatially averaged equalization filter p_{ω,avg}^{−1}), is zero, then
W_ω^{(i)}(ε) = 0. Thus, the performance measure is computed as an average square
error between the response at the equalized location and the response at a displaced
location having distance ε from the equalized location.

Finally, using the intermediate performance function, a generalized average performance measure is expressed as W̄_ω(ε) = (1/N) Σ_{i=1}^{N} W_ω^{(i)}(ε).
For simplicity, we assume variations in responses due to displacements (or mis-
match) only in the horizontal plane (x and y plane). The analysis can be easily ex-
tended to include the mismatch in three dimensions. Thus, simplification of (8.4)
leads to
W_ω^{(i)}(ε) = [ N² / ( Σ_{l=1}^{N} |p_ω(q_l)| )² ]
 × [ E{p_ω(ν^{(i)}) p*_ω(ν^{(i)})} − E{p*_ω(ν^{(i)})} p_ω(q_i) − E{p_ω(ν^{(i)})} p*_ω(q_i) + |p_ω(q_i)|² ]    (8.5)

where the four bracketed terms are denoted Terms (I)–(IV), respectively.
We only need to compute the statistics associated with Terms (I), (II), and (III)
(the terms within the expectations) in (8.5), because Term (IV) is a deterministic
quantity.
Now, E{p_ω(ν^{(i)}) p*_ω(ν^{(i)})} is the average over all locations along a circle of radius ε from the ith listener location. Assuming the source, all listeners, and each of
the listener displacements are along the same z-plane (viz., z = 0), then Term (I) in (8.5)
can be simplified using

p_ω(ν^{(i)}) = (j8Qωρ_0 / V) Σ_n cos(n_x π φ_x^{(i)} / L_x) cos(n_y π φ_y^{(i)} / L_y) p_n(q_o) / (k² − k_n²)    (8.6)

E{p_ω(ν^{(i)}) p*_ω(ν^{(i)})} = |ψ_1|² Σ_{n,m} (1/ψ_2) ψ_3    (8.7)

ψ_1 = 8Qωρ_0 / V
ψ_2 = (k² − k_n²)(k² − k_m²)
ψ_3 = E{ cos(n_x π φ_x^{(i)} / L_x) cos(n_y π φ_y^{(i)} / L_y) cos(m_x π φ_x^{(i)} / L_x) cos(m_y π φ_y^{(i)} / L_y) }    (8.8)
Now, with φ_x^{(i)} = x_i + ε cos θ and φ_y^{(i)} = y_i + ε sin θ,

E{ cos(n_x π φ_x^{(i)} / L_x) cos(n_y π φ_y^{(i)} / L_y) cos(m_x π φ_x^{(i)} / L_x) cos(m_y π φ_y^{(i)} / L_y) }
 = (1/2π) ∫_0^{2π} cos(n_x π (x_i + ε cos θ)/L_x) cos(n_y π (y_i + ε sin θ)/L_y)
   × cos(m_x π (x_i + ε cos θ)/L_x) cos(m_y π (y_i + ε sin θ)/L_y) dθ.    (8.9)
Equation (8.9) can be solved using the MATLAB trapz function. However, we
found an approximate closed-form expression to be computationally much faster.
The following expressions were derived from standard trigonometric formulae, using
the first two terms in the polynomial expansion of the cosine function, and the first
term in the polynomial expansion of the sine function, because (ε/L_x, ε/L_y, ε/L_z) ≪ 1. Thus,
E{ cos(n_x π φ_x^{(i)} / L_x) cos(n_y π φ_y^{(i)} / L_y) cos(m_x π φ_x^{(i)} / L_x) cos(m_y π φ_y^{(i)} / L_y) } = (1/2π)(A + B + C)    (8.10)

where

A = π cos(n_x π x_i / L_x) cos(n_y π y_i / L_y) cos(m_x π x_i / L_x) cos(m_y π y_i / L_y)
  × [ 2 − ε_y² v_y − ε_x² v_x + (3/4)(ε_x⁴ u_x² + ε_y⁴ u_y²) − (1/8)(ε_x² ε_y⁴ u_y² v_x + ε_x⁴ ε_y² u_x² v_y)
      + (1/4) ε_x² ε_y² v_x v_y + (3/64) ε_x⁴ ε_y⁴ u_x² u_y² ]

B = π ε_y² u_y cos(n_x π x_i / L_x) cos(m_x π x_i / L_x) sin(n_y π y_i / L_y) sin(m_y π y_i / L_y)
  × [ 2 − 0.5 ε_x² (m_x² + n_x²) − 0.5 ε_x⁴ u_x² ]

C = π ε_y² ε_x² u_x u_y sin(n_x π x_i / L_x) sin(n_y π y_i / L_y) sin(m_x π x_i / L_x) sin(m_y π y_i / L_y)
  + π ε_x² u_x sin(n_x π x_i / L_x) sin(m_x π x_i / L_x) cos(m_y π y_i / L_y) cos(n_y π y_i / L_y)
  × [ 2 − 0.5 ε_y² (n_y² + m_y²) − 0.5 ε_y⁴ u_y² ]    (8.11)

where

ε_x = πε / (√2 L_x),  ε_y = πε / (√2 L_y)
u_x = n_x m_x,  u_y = n_y m_y
v_x = m_x² + n_x²,  v_y = m_y² + n_y²    (8.12)
By (8.6), Terms (II) and (III) require the expectation

E{p_ω(ν^{(i)})} = (j8Qωρ_0 / V) Σ_n p_n(q_o) E{ cos(n_x π φ_x^{(i)} / L_x) cos(n_y π φ_y^{(i)} / L_y) } / (k² − k_n²)    (8.13)

Again using φ_x^{(i)} = x_i + ε cos θ and φ_y^{(i)} = y_i + ε sin θ, we have

E{ cos(n_x π φ_x^{(i)} / L_x) cos(n_y π φ_y^{(i)} / L_y) }
 = (1/2π) ∫_0^{2π} cos(n_x π (x_i + ε cos θ)/L_x) cos(n_y π (y_i + ε sin θ)/L_y) dθ    (8.14)
Thus, upon again using the fact that (ε/L_x, ε/L_y, ε/L_z) ≪ 1, we can solve
(8.14) as

E{ cos(n_x π φ_x^{(i)} / L_x) cos(n_y π φ_y^{(i)} / L_y) }
 = (1/2π) cos(n_x π x_i / L_x) cos(n_y π y_i / L_y) [ 2π − π(ε_x² n_x² + ε_y² n_y²) + (π/4) ε_x² ε_y² n_x² n_y² ]    (8.15)
Substituting (8.15) in (8.13) and subsequently into (8.5) gives Terms II and III.
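The small-displacement approximation (8.15) can be checked against a direct numerical evaluation of the circular average in (8.14); the mode numbers, position, and displacement below are illustrative assumptions.

```python
import numpy as np

Lx = Ly = 6.0
nx, ny = 3, 2
xi, yi = 2.0, 1.2
eps = 0.1                      # displacement (m), eps << Lx, Ly

# Numerical evaluation of the circular average in (8.14); for a uniform grid
# over one full period the mean of the samples is the average.
theta = np.linspace(0.0, 2.0 * np.pi, 20000, endpoint=False)
integrand = (np.cos(nx * np.pi * (xi + eps * np.cos(theta)) / Lx)
             * np.cos(ny * np.pi * (yi + eps * np.sin(theta)) / Ly))
numeric = integrand.mean()

# Closed-form approximation (8.15)
ex = np.pi * eps / (np.sqrt(2.0) * Lx)
ey = np.pi * eps / (np.sqrt(2.0) * Ly)
closed = (np.cos(nx * np.pi * xi / Lx) * np.cos(ny * np.pi * yi / Ly)
          * (1.0 - 0.5 * (ex**2 * nx**2 + ey**2 * ny**2)
             + 0.125 * ex**2 * ey**2 * nx**2 * ny**2))

print(numeric, closed)         # agree closely for eps << Lx, Ly
```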
8.4 Results
8.4.1 Magnitude Response Spatial Averaging
Fig. 8.3. Simulated setup for a six-position displacement effects analysis in a rectangular con-
figuration in front of a source.
Fig. 8.4. Magnitude of the sound pressures, from 25 Hz up to 200 Hz at 1/3 octave frequencies,
that can be expected for each of the listener positions.
Symmetrically placed positions exhibit the same response, because these positions are symmetrically located relative to the
source, thereby making the cosine products in the eigenfunction equation (8.2) equal.
Figure 8.5 shows the spatial average of the magnitude responses, whereas Fig.
8.6 shows the equalized responses at the six positions based on direct frequency do-
main inversion. It is clear that these equalized responses will be affected due to (i)
microphone/listener mismatch (e.g., when a microphone is used for measuring the
response at a position, and the actual center of listener head is at a different posi-
tion) and/or (ii) listener head displacement, due to the variations in the eigenfunction
pn (q l ).
The sum Σ_{n_x=0}^{N_x−1} Σ_{n_y=0}^{N_y−1} p_n(q_l)/(k² − k_n²) will be accurate if the first ten integers of n_x, n_y
are used, because 1/(k² − k_n²) ≈ 0 for n_x > 10 and n_y > 10 for wavelength λ = 5.4 m.¹
For the present, we have confirmed that the first 10 integers for the quantum
number tuple (n_x, n_y, n_z) are sufficient for all the lower 1/3 octave frequencies under
consideration to accurately model Σ_{n_x} Σ_{n_y} Σ_{n_z} p_n(q_l)/(k² − k_n²).
1 Even though the plot shows significant attenuation of the 1/(k² − k_n²) term at n_x > 6 and
n_y > 6, we have chosen the limits N_x and N_y liberally, to account for the negligible, but
nonzero, additional terms.
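The truncation argument can be illustrated directly: the 1/(k² − k_n²) weighting for axial modes decays quickly once k_n exceeds k. Room dimension and wavelength below follow the text; the rest is a minimal sketch.

```python
import math

Lx = 6.0
lam = 5.4                       # wavelength of interest (m)
k = 2.0 * math.pi / lam

def mode_weight(n):
    """Magnitude of the 1/(k^2 - k_n^2) weighting for an axial mode n_x = n."""
    kn = n * math.pi / Lx
    return 1.0 / abs(k ** 2 - kn ** 2)

for n in (1, 3, 6, 10, 15):
    print(n, round(mode_weight(n), 4))
# The weight at n = 10 is already well below 10% of the weight at n = 1,
# so truncating at ten modes per axis leaves only negligible terms.
```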
8.4.4 Validation

This section validates the results obtained from using the closed-form expressions
from the previous section. For this we did the following, for a given 1/3 octave frequency.

1. Determine the sound pressures, p_ω(q_i), using (8.1) for the six positions (i = 1, 2, . . . , 6).
2. Determine the average of the sound pressure magnitudes at the six positions.
3. Determine the inverse of the average, p_{ω,avg}^{−1}.
A. For each position i (i = 1, 2, . . . , 6):
A.i. Compute p_ω(q_i) p_{ω,avg}^{−1}.
A.ii. Generate 250 positions in a circle at a displacement of ε with the center being the listener position i.
A.iii. For the given displacement ε, determine the sound pressures at each of the 250 positions, using (8.1), for the position i. The sound pressure at each of the 250 displaced positions can be expressed as p_ω(ν^{(i)}) (see Eq. (8.4)).
A.iv. For each of the 250 displaced positions compute |p_ω(ν^{(i)}) p_{ω,avg}^{−1} − p_ω(q_i) p_{ω,avg}^{−1}|².
A.v. Compute the average of |p_ω(ν^{(i)}) p_{ω,avg}^{−1} − p_ω(q_i) p_{ω,avg}^{−1}|² over all 250 positions. This effectively computes the expectation in (8.4) to obtain W_ω^{(i)}(ε).
B. Determine W̄_ω (in dB) by using W̄_ω(ε) = (1/6) Σ_{i=1}^{6} W_ω^{(i)}(ε).
Figures 8.11 to 8.13 show the results for all the frequencies. By comparing Figs.
8.8 to 8.10 with 8.11 to 8.13, it can be seen that the plots are quite similar, thus
confirming the validity of the proposed closed-form expression for characterizing
the equalization performance using spatial averaging due to mismatch/displacement
effects.
Fig. 8.11. Validation of the analytic solution for W_ω(ε) versus displacement ε for lower 1/3 octave frequencies.
Fig. 8.12. Validation of the analytic solution for W_ω(ε) versus displacement ε for middle 1/3 octave frequencies.

Fig. 8.13. Validation of the analytic solution for W_ω(ε) versus displacement ε for upper 1/3 octave frequencies.

In this section we present some results obtained from single listener equalization. It
is well known that single listener equalization may cause significant degradations
in the frequency response at other listening positions. In fact, the degradation at the
other positions could end up being more than what it would have been if the sin-
gle position were not equalized [102]. Thus, the goal of this section is to further
demonstrate that, besides degradation in frequency response at other listening posi-
tions, single listener equalization will cause significant degradation of equalization
performance in the presence of mismatch.
Figure 8.14 shows the equalization performance W_ω(ε), at 40 Hz, if either position 2 or position 6 (see Fig. 8.3) was equalized, and Fig. 8.15 shows the equalization
performance at 160 Hz. Specifically, the results in Figs. 8.14 and 8.15 were obtained
by replacing p_{ω,avg} with either |p_ω(q_2)| or |p_ω(q_6)| in (8.4).
It can be immediately observed from Figs. 8.14 and 8.15 that the equalization per-
formance depends on the position being equalized. In this particular listening setup,
the equalization performance is biased favorably towards position 2. Of course this
bias is introduced by a “favorable” weighted eigenfunction distribution (the weights
being the denominator term in (8.1)), and because there is generally no a priori in-
formation on the distribution, there is generally no way of knowing which position
will provide the “most favorable” equalization performance. Thus, a safe equaliza-
tion choice combines the modal equations (or room responses), at expected listener
positions, to get a good equalization performance for the room.
Fig. 8.14. Single listener equalization results for f = 40 Hz, where equalization is done for
position 6 (dashed line) or position 2 (solid line).
8.5 Summary
In this chapter we presented a statistical approach using modal equations for eval-
uating the robustness of equalization based on magnitude response averaging, due
to the variations in room responses, for a realistic listener arrangement relative to a
source. The simulations were performed for a six-listener setup with a simple source
in a cubic room.
Clearly, there is a degradation in the equalization performance due to displace-
ment effects. Furthermore, this degradation is different for different frequencies (viz.,
generally smaller for relatively lower frequencies as compared to higher frequen-
cies). We have also experimentally confirmed the validity of the proposed closed-
form solution for measuring degradation performance.
Furthermore, we also demonstrated the importance of average equalization over
single listener equalization when considering mismatch/displacement effects.
Finally, an interesting future research direction is the formulation of a percep-
tually motivated performance function and evaluation of the robustness using this
measure.
Fig. 8.15. Single listener equalization results for f = 160 Hz, where equalization is done for
position 6 (dashed line) or position 2 (solid line).
9
Selective Audio Signal Cancellation
9.1 Introduction
Integrated media systems are envisioned to have a significant impact on the way
media, such as audio, are transmitted to people in remote locations. In media ap-
plications, although a great deal of ongoing research has focused on the problem of
delivering high-quality audio to a listener, the problem of delivering appropriate au-
dio signals to multiple listeners in the same environment has not yet been adequately
addressed.
In this chapter we focus on one aspect of this problem that involves presenting
an audio signal at selected directions in the room, while simultaneously minimizing
the signal at other directions. For example, in home theater or television viewing
applications a listener in a specific location in the room may not want to listen to the
audio signal being transmitted, whereas another listener at a different location would
prefer to listen to the signal. Consequently, if the objective is to keep one listener in a region with a reduced sound pressure level, then one can view this problem as that of signal cancellation in the direction of that listener. Similar applications arise in the automobile (e.g., when only the driver would prefer to listen to an audio signal), or any other environment with multiple listeners in which only a subset wish to listen to the audio signal.

© 2003 IEEE. Reprinted, with permission, from S. Bharitkar and C. Kyriakakis, "Selective signal cancellation for multiple-listener audio applications using eigenfilters," IEEE Transactions on Multimedia, 5(3):329–338.
Several methods have been proposed in the literature to lower the signal level ei-
ther globally or in a local space within a region. Elliott and Nelson [105] proposed a
global active power minimization technique for reducing the time-averaged acoustic
pressure from a primary source in an enclosure, using a set of secondary source distri-
butions. This least squares-based technique demonstrated that reduction in potential
energy (and therefore sound pressure) can be achieved if the secondary sources are
separated from the primary source by a distance which is less than half the wave-
length of sound at the frequency of interest. It was suggested that this method can be
employed to reduce the cockpit noise in a propeller-powered aircraft. Similarly, Ross
[106] suggested the use of a filter that can minimize the signal power in the lobby of
a building due to a generator outside the lobby by blocking the dominant plane wave
mode with a loudspeaker. The reader is referred to several other interesting tutorial
papers that have been published in active noise control [107, 108, 109]. Other ex-
amples could include head-mounted reference sensors using adaptive beamforming
techniques [110].
In this chapter, the problem of signal cancellation is tackled by designing objec-
tive functions (criteria) that aim at reducing the sound pressure levels of signals in
predetermined directions. A first objective criterion is designed for maximizing the
difference in signal power between two different listener locations that have differ-
ent source and receiver response characteristics. Thus, one application of this sys-
tem lies in an environment having conflicting listening requirements, such as those
mentioned earlier (e.g., automobiles, home environment). The filter, known as the
eigenfilter, that is derived by optimizing the objective function, operates on the raw
signal before being linearly transformed by the room responses in the direction of
the listeners. Such filters aim at increasing the relative gain in signal power between
the two listeners with some associated tradeoffs such as: (i) spectral distortion that
may arise from the presence of the eigenfilter, and (ii) the sensitivity of the filter
to the length of the room impulse response (reverberation). Further issues that can be researched, and which are beyond the scope of this chapter, include the human perception of loudness, as well as other perceptual aspects such as coloration and speech intelligibility.
The organization of this chapter is as follows. In the next section, we derive the
required eigenfilter from a proposed objective function, and prove some of the theo-
retical properties of such filters. We provide experimental results for the performance
(and tradeoff) of the eigenfilters in two situations: (i) using a synthesized room im-
pulse response with a speech excitation, and (ii) using an actual room impulse re-
sponse with a stochastic excitation. We also investigate the performance differences
that are observed when using a minimum-phase model for the room response. We
conclude the chapter by discussing some future research directions for selective sig-
nal cancellation using eigenfilters.
9.2 Traditional Methods for Acoustic Signal Cancellation
Wads of Cotton

This is a well-known and by far the cheapest passive sound control method for ab-
sorbing unwanted audio signals. In this method, an uninterested listener places a
cotton ball inside each ear to limit the intensity (increase the attenuation) of the
sound signal that enters the ear canal and strikes the eardrums. The disadvantage of this method is that the attenuation provided by the wad of cotton decreases with decreasing frequency, so it is not well suited for passively canceling acoustic signals at low frequencies. Moreover, the insertion of the cotton ball can cause discomfort.
Ear Defenders
Ear defender [128] is a term used to designate a device that introduces attenuation
of sound between a point outside the head and the eardrum. There are two types,
namely, (i) the cushion type, and (ii) the insert type. The cushion type is similar to
a pair of headphones with soft cushion ear pads. The cushion types are heavy and
cumbersome. The insert type is a form of a plug that is pushed into the ear canal.
Soft plastics and synthetic rubbers are the commonly used materials for the insert
type defenders. A good ear defender will introduce an attenuation of 30 to 35 dB
from 60 Hz to 8 kHz. However, the insertion may cause discomfort. Furthermore,
they need to be custom made depending on the size of the ear.
Acoustic Barriers
Simply put, an acoustic barrier is a glorified wall. The aim of building a barrier is
to redirect acoustic signal power generated by a source away from an uninterested
listener. To be effective at this, the barrier must be constructed from “heavy” material
(i.e., having a high surface density). Clearly, this is prohibitive in a room or an automobile. Moreover, a reasonably sized wall provides something of the order of 10 dB of attenuation in acoustic signal pressure levels. Attenuation of 20 dB or more is almost impossible to achieve with a simple barrier.
There are several variants in this approach containing at least one secondary loud-
speaker. Historically, the fundamental concept was first presented in a patent granted
to Lueg [130], wherein Lueg suggested using a loudspeaker for canceling a one-dimensional acoustic wave. A source generates a primary acoustic waveform $p(x,t) = p(x)e^{j\omega t}$,¹ expressed as an instantaneous sound pressure, in a duct (the solid line indicates the primary waveform). A microphone located farther downstream in the duct detects the acoustic signal. The output from the microphone is used to drive a loudspeaker, a secondary source, after being manipulated by a controller. The output from the loudspeaker is another acoustic signal $s(x,t) = s(x)e^{j\omega t}$, indicated by the dotted line. The loudspeaker is positioned and the controller is designed in a manner such that the secondary source generates a signal that is of the same amplitude as, but opposite in phase to, the primary waveform; that is, $s(x,t) = -p(x,t)$.
¹ The decomposition of the acoustic wave into its time-dependent and frequency-dependent components has its origins in the solution to the one-dimensional wave equation, expressed as $\partial^2 p(x,t)/\partial x^2 - (1/c_0^2)\,\partial^2 p(x,t)/\partial t^2 = 0$, with $c_0$ being the speed of the acoustic wave in the given medium.
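As a quick numeric illustration of this superposition principle (a sketch not taken from the text; the frequency, grid, and sound speed values below are arbitrary choices), a secondary wave of equal amplitude and opposite phase drives the residual pressure to zero at every point in the duct:

```python
import numpy as np

# Primary monochromatic wave p(x, t) = cos(kx - wt) sampled along a duct;
# f, c0, and the spatial grid are illustrative values only.
f, c0 = 100.0, 343.0                  # frequency (Hz), speed of sound (m/s)
k, w = 2 * np.pi * f / c0, 2 * np.pi * f
x = np.linspace(0.0, 3.43, 1000)      # one wavelength is c0/f = 3.43 m
t = 0.005
p = np.cos(k * x - w * t)             # primary pressure field
s = -p                                # secondary source: equal amplitude, opposite phase
residual = p + s                      # superposition along the duct
print(np.max(np.abs(residual)))       # -> 0.0
```

In practice the controller must synthesize s(x, t) from the upstream microphone signal, so the cancellation is never this exact; the sketch only shows the ideal limit.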
where ⊗ represents the convolution operation. With this background, we view the
signal cancellation problem as a gain maximization problem (between two arbitrary
receivers); we can state the performance criterion as
$$J(n) = \max_{\mathbf{w}} \left[ \frac{1}{2}\,\frac{\sigma^2_{y_2}(n)}{\sigma^2_{v_2}(n)} - \frac{\lambda}{2}\left(\frac{\sigma^2_{y_1}(n)}{\sigma^2_{v_1}(n)} - \psi\right) \right] \qquad (9.5)$$

in which we would like to maximize the signal-to-noise ratio (or signal power) in the direction of listener 2, while keeping the power towards listener 1 constrained at $10^{\psi_{dB}/10}$ (where $\psi_{dB} = 10\log_{10}\psi$). In (9.5), $\sigma^2_{y_i}(n)/\sigma^2_{v_i}(n)$ denotes the transmitted signal to ambient noise power at listener $R_i$, with $y_i(n)$ as defined in (9.4). The quantity $\lambda$ is the well-known Lagrange multiplier.
It is interesting to see that, when x(n) and v(n) are mutually uncorrelated, the
two terms in the objective function (9.5) are structurally related to the mutual infor-
mation between the source and listeners R2 and R1 , respectively, under Gaussian
noise assumptions [103].
Now observe that
$$y_1(n) = h_1(n) \otimes \sum_{k=0}^{M-1} w_k\, x(n-k) + v_1(n) \qquad (9.6)$$
where $h_1(n)$ is the room response in the direction of the listener labeled 1. Let $\mathbf{w} = (w_0, w_1, \ldots, w_{M-1})^T$ and $\mathbf{x}(n) = (x(n), x(n-1), \ldots, x(n-M+1))^T$; then (9.6) can be expressed as
$$y_1(n) = h_1(n) \otimes z(n) + v_1(n) \qquad (9.7)$$
where $z(n) = \mathbf{w}^T\mathbf{x}(n)$. We assume that the zero mean noise and signal are real and
statistically independent (and uncorrelated in the Gaussian case). In this case signal
power in the direction of listener 1 is
$$\sigma^2_{y_1}(n) = E\left[\sum_{p=0}^{L-1}\sum_{q=0}^{L-1} h_1(p)h_1(q)\,z(n-p)z(n-q)\right] + \sigma^2_{v_1}(n)$$
$$= \sum_{p=0}^{L-1}\sum_{q=0}^{L-1} h_1(p)h_1(q)\,(\mathbf{w}^T R_x(p,q)\mathbf{w}) + \sigma^2_{v_1}(n) \qquad (9.8)$$
Similarly,
$$\sigma^2_{y_2}(n) = \sum_{p=0}^{S-1}\sum_{q=0}^{S-1} h_2(p)h_2(q)\,(\mathbf{w}^T R_x(p,q)\mathbf{w}) + \sigma^2_{v_2}(n) \qquad (9.10)$$
Solving $\nabla_{\mathbf{w}} J(n) = 0$ will provide the set of optimal tap coefficients. Hence from (9.5), (9.8), and (9.10), we obtain
$$\frac{\partial J(n)}{\partial \mathbf{w}} = \frac{1}{\sigma^2_{v_2}(n)}\sum_{p=0}^{S-1}\sum_{q=0}^{S-1} h_2(p)h_2(q)R_x(p,q)\,\mathbf{w}^* - \frac{\lambda}{\sigma^2_{v_1}(n)}\sum_{p=0}^{L-1}\sum_{q=0}^{L-1} h_1(p)h_1(q)R_x(p,q)\,\mathbf{w}^* = 0 \qquad (9.11)$$
where
$$A = \sum_{p=0}^{S-1}\sum_{q=0}^{S-1} h_2(p)h_2(q)R_x(p,q), \qquad B = \sum_{p=0}^{L-1}\sum_{q=0}^{L-1} h_1(p)h_1(q)R_x(p,q) \qquad (9.12)$$
By assuming equal ambient noise powers at the two receivers (i.e., $\sigma^2_{v_2}(n) = \sigma^2_{v_1}(n)$), (9.11) can be written as
$$\left.\frac{\partial J(n)}{\partial \mathbf{w}}\right|_{\mathbf{w}=\mathbf{w}^*} = (B^{-1}A - \lambda I)\,\mathbf{w}^* = 0 \qquad (9.13)$$
The reason for arranging the optimality condition in this fashion is to demonstrate
that the maximization is in the form of an eigenvalue problem (i.e., the eigenval-
ues corresponding to the matrix B −1 A), with the eigenvectors being w∗ . There are
in general M distinct eigenvalues for the M × M matrix, B −1 A, with the largest
eigenvalue corresponding to the maximization of the ratio of the signal powers be-
tween receiver 2 and receiver 1. The optimal filter that yields this maximization is
given by
$$\mathbf{w}^* = e_{\lambda_{\max}}[B^{-1}A] \qquad (9.14)$$
where $e_{\lambda_{\max}}[B^{-1}A]$ denotes the eigenvector corresponding to the maximum eigenvalue $\lambda_{\max}$ of $B^{-1}A$. An FIR filter whose impulse response corresponds to the elements of an eigenvector is called an eigenfilter [115, 8]. Finally, the gain between the two receiver locations can be expressed as
$$G_{dB} = 10\log_{10}\frac{\sigma^2_{y_2}(n)}{\sigma^2_{y_1}(n)} = 10\log_{10}\frac{\mathbf{w}^{*T}A\,\mathbf{w}^*}{\mathbf{w}^{*T}B\,\mathbf{w}^*} \qquad (9.15)$$
Clearly it can be seen from (9.14) that the optimal filter coefficients are determined by the channel responses between the source and the two listeners. The only degree of freedom for the eigenfilter is its order M.
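The computation in (9.12)–(9.15) translates directly into code. One convenient equivalence, valid under the WSS assumption introduced later in the chapter, is that $\mathbf{w}^T A\mathbf{w}$ is simply the power of the filter $\mathbf{w}$ applied to $u_2 = h_2 \otimes x$, so A and B can be estimated as Toeplitz autocorrelation matrices of the filtered excitations. The sketch below is ours, not the authors' implementation; the synthetic decaying-noise responses and the scipy generalized eigensolver are assumptions:

```python
import numpy as np
from scipy.linalg import toeplitz, eigh

def corr_matrix(h, x, M):
    """M x M Toeplitz autocorrelation matrix of u = h (*) x, i.e., an
    estimate of sum_{p,q} h(p)h(q) R_x(p,q) under WSS assumptions."""
    u = np.convolve(h, x)
    r = np.correlate(u, u, mode='full')[len(u) - 1 : len(u) - 1 + M] / len(u)
    return toeplitz(r)

def eigenfilter(h1, h2, x, M):
    """Principal generalized eigenvector of A w = lambda B w, cf. (9.14)."""
    A, B = corr_matrix(h2, x, M), corr_matrix(h1, x, M)
    vals, vecs = eigh(A, B)          # eigenvalues in ascending order
    w = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    return w / np.linalg.norm(w), A, B

def gain_db(w, A, B):                # the gain (9.15)
    return 10 * np.log10((w @ A @ w) / (w @ B @ w))

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)                      # stochastic excitation
decay = np.exp(-np.arange(256) / 40.0)
h1 = rng.standard_normal(256) * decay              # synthetic room responses
h2 = rng.standard_normal(256) * decay
w, A, B = eigenfilter(h1, h2, x, M=32)
delta = np.eye(32)[0]                              # trivial filter w(n) = delta(n)
print(gain_db(w, A, B) >= gain_db(delta, A, B))    # optimal w maximizes the gain
```

With measured responses $h_1, h_2$ and the actual excitation, the same routine reproduces (9.14)–(9.15); the final comparison holds because the principal generalized eigenvector maximizes the Rayleigh quotient over all filters, including the trivial one.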
Fundamentally, by recasting the signal cancellation problem as a gain maximization problem, we aim at introducing a gain of G dB between two listeners, $R_1$ and $R_2$. This G dB gain is equivalent to virtually positioning listener $R_1$ at a distance that is $\sqrt{10^{G_{dB}/10}}$ times the distance of listener $R_2$ from a fixed sound source C.³ This is depicted in Fig. 9.2, where $R_1$ (solid head) is experiencing the signal power levels that he would expect if he were positioned at $\sqrt{10^{G_{dB}/10}}$ times that distance (indicated by the dotted head).
³ Strictly speaking, in the free field, the gain based on the inverse square law is expressed as $Q = 10\log_{10}(r_1^2/r_2^2)$ (dB), where $r_1, r_2$ are the radial distances of listeners $R_1$ and $R_2$ from the source.
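The footnote's inverse square law can be inverted to make the virtual distance ratio explicit; the helper name below is ours:

```python
import math

def virtual_distance_ratio(gain_db: float) -> float:
    # Inverting Q = 10*log10(r1^2 / r2^2) gives r1/r2 = sqrt(10^(Q/10))
    return math.sqrt(10.0 ** (gain_db / 10.0))

print(virtual_distance_ratio(20.0))   # a 20 dB gain places R1 ten times farther -> 10.0
```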
Some interesting properties of the proposed eigenfilter emerge under wide-sense sta-
tionary (WSS) assumptions. In this section we derive some properties of eigenfilters
for selective signal cancellation, which we then use in a later section.
In signal processing applications, the statistics (ensemble averages) of a stochas-
tic process are often independent of time. For example, quantization noise exhibits
constant mean and variance, whenever the input signal is “sufficiently complex.”
Moreover, it is also assumed that the first-order and second-order probability density
functions (PDFs) of quantization noise are independent of time. These conditions
impose the constraint of stationarity. Because we are primarily concerned with sig-
nal power, which is characterized by the first-order and second-order moments (i.e.,
mean and correlation), and not directly with the PDFs, we focus on the wide-sense
stationarity aspect. It should be noted that in the case of Gaussian processes, wide-
sense stationarity is equivalent to strict-sense stationarity, which is a consequence of
the fact that Gaussian processes are completely characterized by the mean and vari-
ance. Below, we provide some definitions, properties, and a basic theorem pertaining
to eigenfilter structure for WSS processes.
Property 1: For WSS processes x(n) and y(n) with finite variances, the matrix $R_x(p,q)$ is Toeplitz, and the gain (9.15) can be expressed as
$$G_{dB} = 10\log_{10}\frac{\int_{-\pi}^{\pi} |W^*(e^{j\omega})|^2\, |H_2(e^{j\omega})|^2\, S_x(e^{j\omega})\,\frac{d\omega}{2\pi}}{\int_{-\pi}^{\pi} |W^*(e^{j\omega})|^2\, |H_1(e^{j\omega})|^2\, S_x(e^{j\omega})\,\frac{d\omega}{2\pi}} \qquad (9.16)$$
where the entries $r_x(k)$ of $R_x$ and $S_x(e^{j\omega})$ form a Fourier transform pair, and $h_1(n)$ and $h_2(n)$ are stable responses. Moreover, because we are focusing on real processes in
$$Q = JQJ \qquad (9.18)$$
Hence, from the above properties, the matrices A and B (in (9.12)) are persymmetric.

Property 5: The inverse of a persymmetric matrix is persymmetric.
$$Q = JQJ \implies Q^{-1} = (JQJ)^{-1} = J^{-1}Q^{-1}J^{-1} = JQ^{-1}J \qquad (9.21)$$
$$V_1 = \{\mathbf{w} : J\mathbf{w} = \mathbf{w}\}, \qquad V_2 = \{\mathbf{w} : J\mathbf{w} = -\mathbf{w}\} \qquad (9.22)$$
Now,
$$J\nu_1 = \nu_1, \quad \nu_1 \in V_1 \qquad (9.23)$$
$$\nu_2^T J \nu_1 = \nu_2^T \nu_1 \qquad (9.24)$$
But,
$$J\nu_2 = -\nu_2 \implies \nu_2^T J = -\nu_2^T \qquad (9.25)$$
$$-\nu_2^T \nu_1 = \nu_2^T \nu_1 \implies \nu_2^T \nu_1 = 0 \qquad (9.26)$$
Theorem 9.3. The optimal eigenfilter (9.14) is a linear phase FIR filter having a constant phase and group delay (symmetric case), or a constant group delay (skew-symmetric case).

Proof:
$$w^*(m) = \begin{cases} w^*(M-1-m) & \text{symmetric} \\ -w^*(M-1-m) & \text{skew-symmetric} \end{cases} \qquad m = 0, 1, \ldots, M-1 \qquad (9.29)$$
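These structural results are easy to check numerically. In the sketch below (our construction), symmetric Toeplitz matrices stand in for A and B, which the text shows are persymmetric under WSS inputs; we verify that $B^{-1}A$ is itself persymmetric and that its principal generalized eigenvector is symmetric or skew-symmetric:

```python
import numpy as np
from scipy.linalg import toeplitz, eigh

rng = np.random.default_rng(7)
M = 8
# Symmetric Toeplitz matrices satisfy JQJ = Q, i.e., they are persymmetric
A = toeplitz(rng.standard_normal(M))
B = toeplitz(rng.standard_normal(M)) + 8.0 * M * np.eye(M)  # Toeplitz + ridge: positive definite
J = np.fliplr(np.eye(M))                                    # exchange matrix

C = np.linalg.solve(B, A)                 # B^{-1} A
print(np.allclose(J @ C @ J, C))          # B^{-1}A inherits persymmetry

# The principal generalized eigenvector should satisfy Jv = v or Jv = -v
_, vecs = eigh(A, B)
v = vecs[:, -1]
print(min(np.linalg.norm(J @ v - v), np.linalg.norm(J @ v + v)))
```

The second printed value is the residual of the (skew-)symmetry condition (9.29); it is at the level of numerical round-off whenever the largest eigenvalue is simple.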
9.4 Results
The degree of freedom for the eigenfilter in (9.14) is its order M. Variabilities such as (i) the choice of the modeled duration (S, L) for the room responses in (9.12), (ii) the choice of the impulse response (i.e., whether it is minimum-phase or nonminimum-phase), and (iii) variations in the room response due to listener (or head) position changes all affect the performance (gain). We study (i) and (ii) in the present chapter, with the assumption that L = S. The choice for the filter order and
the modeled impulse response duration affects the gain (9.15) and distortion (defined
later in this section) of the signal at the microphones. Basically, a lower duration re-
sponse used for designing the eigenfilter will reduce the operations for computing the
eigenfilter, but may affect performance. In summary, the length of the room response
(reverberation) modeled in the design of the eigenfilter affects the performance and
this variation in performance is referred to as the sensitivity of the eigenfilter to the
length of the room response.
In this experiment, the excitation, x(n), was a segment of a speech signal obtained from [120]. The speech was an unvoiced fricative /S/, as in "sat," obtained from a male subject, and is shown in Fig. 9.3.
As is well known, this sound is obtained by exciting a locally time-invariant,
causal, stable vocal tract filter by a stationary uncorrelated white noise sequence,
which is independent from the vocal tract filter [114]. The stability of the vocal tract
Fig. 9.3. The speech signal segment for the unvoiced fricative /S/ as in sat.
9.4 Results 199
Fig. 9.4. Impulse responses for the front and back positions.
filter is essential, as it guarantees the stationarity of the sequence x(n) [122]. The
impulse responses were generated synthetically from the room acoustics simulator
software [123]. The estimation of these responses was based on the image method
(geometric modeling) of reflections created by ideal omnidirectional sources, and re-
ceived by ideal omnidirectional receivers [61]. For the present scenario the modeled
room was of dimensions 15 m × 10 m × 4 m. The source speaker was at (1 m, 1 m,
1 m) from a reference northwest corner. The impulse response for the “front” micro-
phone located at (4.9 m, 1.7 m, 1 m) relative to the reference, was denoted as h2 (n),
and the “back microphone” located at (4.5 m, 6.4 m, 1 m) had impulse response
measurement h1 (n). The two responses are plotted as positive pressure amplitudes
in Fig. 9.4 (ignoring the initial delay). This situation is similar to the case for listen-
ers in an automobile, where the front left speaker is active, and the relative gain to be
maximized is between the front driver and the back passenger.
A plot of the gain (9.15) as a function of the filter order for the aforementioned
signal and impulse responses is shown in Fig. 9.5.
Firstly, a different microphone positioning will require a new simulation for com-
puting (9.14), and determining the performance thereof. Secondly, larger duration fil-
ters increase the gain, but affect the signal characteristics at the receiver in the form
of distortion. Basically, a distortion measure is an assignment of a nonnegative num-
ber between two quantities to assess their fidelity. According to Gray et al. [124], a
distortion measure should satisfy the following properties: (1) it must be meaningful,
in that small and large distortions between the two quantities correspond to good and
bad subjective quality, (2) it must be tractable and should be easily tested via mathe-
matical analysis, and (3) it must be computable (the actual distortions in a real system
200 9 Selective Audio Signal Cancellation
$$S_{\hat{y}}(e^{j\omega}) = |H_2(e^{j\omega})|^2\, |W_M(e^{j\omega})|^2\, S_x(e^{j\omega}) = |W_M(e^{j\omega})|^2\, S_y(e^{j\omega}) \qquad (9.31)$$
where $S_{\hat{y}}(e^{j\omega})$ and $S_y(e^{j\omega})$ are the spectra associated with the presence and absence of the eigenfilter, respectively (an equivalent model is shown in Fig. 9.6), and $W_M(e^{j\omega}) = \sum_{i=0}^{M-1} w_i e^{-j\omega i}$.

Proof: From the $L_1$ definition, we have
$$E_M = \int_{-\pi}^{\pi} \left|\frac{S_{\hat{y}}(e^{j\omega})}{S_y(e^{j\omega})}\right| \frac{d\omega}{2\pi} \qquad (9.32)$$
¹ The evaluation of the distortion at listener 1 is not important, because the intention is to "cancel" the signal in her direction.
Fig. 9.6. Equivalent spectral model in the direction of listener 2 using the eigenfilter wk .
It is interesting to observe that a similar result can be established for the linear
prediction spectral matching problem [121]. Also, when the FIR eigenfilter is of the
lowest order with M = 1, and w0 = 1, then the impulse response of the eigenfilter
is w(n) = δ(n), and E1 is unity (observe that with w(n) = δ(n) we have h2 (n) ⊗
δ(n) = h2 (n)).
An interpretation of (9.33) is that irrespective of the filter order (M > 1), the
average spectral ratio is unity, which means that in terms of the two spectra, Sŷ (ejω )
will be greater than Sy (ejω ) in some regions, and less in other regions, such that
(9.33) holds.
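The unity average spectral ratio can be checked directly: by Parseval's relation, $E_M = \int_{-\pi}^{\pi}|W_M(e^{j\omega})|^2\,d\omega/2\pi = \sum_i w_i^2$, which equals one whenever the eigenfilter is scaled to unit norm (the usual eigenvector normalization; treating that normalization as given is our assumption). A short check:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 32
w = rng.standard_normal(M)
w /= np.linalg.norm(w)             # unit-norm scaling, as for an eigenvector

# E_M = (1/2π) ∫ |W_M(e^{jω})|² dω, approximated on a dense FFT grid;
# by Parseval this equals sum(w**2) = 1 regardless of the order M
W = np.fft.fft(w, 8192)
E_M = np.mean(np.abs(W) ** 2)
print(E_M)                          # ≈ 1.0 up to round-off
```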
The log-spectral distortion $d_M(S_{\hat{y}}(e^{j\omega}), S_y(e^{j\omega}))$ for an eigenfilter of order M on an $L_1$ space is defined as
$$d_M(S_{\hat{y}}(e^{j\omega}), S_y(e^{j\omega})) = \left\|\log S_y(e^{j\omega}) - \log S_{\hat{y}}(e^{j\omega})\right\|_1 = \left\|\log \frac{S_{\hat{y}}(e^{j\omega})}{S_y(e^{j\omega})}\right\|_1 = \left\|\log |W_M(e^{j\omega})|^2\right\|_1 = \int_{-\pi}^{\pi} \left|\log |W_M(e^{j\omega})|^2\right| \frac{d\omega}{2\pi} \qquad (9.34)$$
It can be easily shown that dM (Sŷ (ejω ), Sy (ejω )) ≥ 0, with equality achieved when
the eigenfilter is of unit order with w0 = 1. In Fig. 9.7, we have computed the
distortion (9.34), using standard numerical integration algorithms, as a function of
the filter order for the present problem. Figure 9.8 summarizes the results from Fig. 9.5 and Fig. 9.7 through the gain–distortion constellation diagram. Thus, depending on whether a certain amount of distortion is allowable, we can choose a certain point in the constellation (distortionless performance is obtained for the point located along the positive ordinate axis in the constellation).
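The numerical integration behind (9.34) can be sketched as follows (the FFT-grid discretization and the floor that guards the logarithm against spectral nulls are our choices):

```python
import numpy as np

def log_spectral_distortion(w, n_fft=8192):
    """d_M of (9.34): the L1 norm of log |W_M(e^{jw})|^2 over the unit circle."""
    mag2 = np.abs(np.fft.fft(w, n_fft)) ** 2
    mag2 = np.maximum(mag2, 1e-300)          # avoid log(0) at exact spectral nulls
    return np.mean(np.abs(np.log(mag2)))     # (1/2π) ∫ |log|W|²| dω on the FFT grid

print(log_spectral_distortion(np.array([1.0])))   # unit-order filter: zero distortion
rng = np.random.default_rng(1)
print(log_spectral_distortion(rng.standard_normal(64)) >= 0.0)
```

This reproduces the two endpoint facts stated in the text: the distortion vanishes for the unit-order filter with $w_0 = 1$, and it is nonnegative for any other filter.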
Clearly, there is an improvement in the gain-to-distortion ratio with the increase in filter order (e.g., from Fig. 9.8, M = 400 gives a gain-to-distortion ratio of $10^{1.6}/9.8 \approx 4$, whereas M = 250 gives a gain-to-distortion ratio of 3). Also, for example, with filter order M = 400, the relative gain between the two locations is as much as 16 dB. This ideally corresponds to a virtual position of listener 1, for whom the sound cancellation is relevant, at a distance four times as far from a fixed source as the other listener (listener 2).
From Eqs. (9.12), (9.14), and (9.15) we see that the eigenfilter performance can be
affected by (i) the room response duration modeled in the eigenfilter design, as well
as (ii) the nature of the room response (i.e., whether it is characterized by an equiv-
alent minimum-phase model). In summary, a short-duration room response, if used in (9.12) for determining (9.14), will reduce the computational requirements for designing the eigenfilter. However, this could reduce the performance, because the eigenfilter does not use all the information contained in the room responses. This introduces a performance tradeoff. The question, then, is whether an eigenfilter (9.14) can be designed with short-duration room responses (for savings in computation) in the A and B matrices of (9.12) without affecting the performance (9.15). Of course, care should be taken in evaluating the performance: the A and B matrices in (9.15) should contain the full-duration room responses.
To understand this performance tradeoff, we design the eigenfilter of length
M < L (L being the actual duration of the room impulse responses in the two direc-
tions), based on windowing both room responses with the window being rectangular
and having duration P < L. We then analyze the performance (9.15) of the filter
to increasing room response length. Basically the goal of this experiment is, can
we design an eigenfilter with sufficiently short room responses (in (9.14)) without
compromising the performance? To answer this question, the following procedure is
adopted.
(a) Design the eigenfilter
$$\hat{\mathbf{w}}^* = e_{\lambda_{\max}}[\hat{B}^{-1}\hat{A}] \in \mathbb{R}^{M\times 1} \qquad (9.35)$$
for a shortened room response duration P < L, with
$$\hat{A} = \sum_{p=0}^{P-1}\sum_{q=0}^{P-1} h_2(p)h_2(q)R_x(p,q), \qquad \hat{B} = \sum_{p=0}^{P-1}\sum_{q=0}^{P-1} h_1(p)h_1(q)R_x(p,q), \qquad M \le P < L \qquad (9.36)$$
where the hat above the matrices in (9.36) denotes an approximation to the true quantities in (9.12), and the corresponding eigenfilter (9.35) is the resulting approximation (due to the reduced duration P < L) to (9.14). We have included the constraint $M \le P < L$ to keep the order of the eigenfilter low (reduced processing) for a given real room response duration L = 8192, as explained below.
(b) Evaluate the performance (9.15) of the filter with the true matrices A and B (9.12)
containing the full duration room responses.
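Steps (a) and (b) can be sketched as follows, with synthetic decaying-noise responses standing in for the measured 8192-point responses (the response model, lengths, and the Toeplitz autocorrelation estimator below are our assumptions):

```python
import numpy as np
from scipy.linalg import toeplitz, eigh

def corr_matrix(h, x, M):
    # Toeplitz estimate of sum_{p,q} h(p)h(q) R_x(p,q) from u = h (*) x
    u = np.convolve(h, x)
    r = np.correlate(u, u, mode='full')[len(u) - 1 : len(u) - 1 + M] / len(u)
    return toeplitz(r)

rng = np.random.default_rng(3)
L, M, P = 1024, 32, 128                      # true duration, filter order, window (M <= P < L)
x = rng.standard_normal(4000)
decay = np.exp(-np.arange(L) / 100.0)
h1, h2 = rng.standard_normal(L) * decay, rng.standard_normal(L) * decay

# (a) design with P-sample rectangular-windowed responses: A_hat, B_hat of (9.36)
A_hat, B_hat = corr_matrix(h2[:P], x, M), corr_matrix(h1[:P], x, M)
w_hat = eigh(A_hat, B_hat)[1][:, -1]         # principal eigenvector, cf. (9.35)

# (b) evaluate (9.15) with the full-duration matrices A and B
A, B = corr_matrix(h2, x, M), corr_matrix(h1, x, M)
gain = lambda w: 10 * np.log10((w @ A @ w) / (w @ B @ w))
w_full = eigh(A, B)[1][:, -1]                # filter designed with full responses
print(gain(w_hat), gain(w_full))             # full-duration design can only do better
```

Sweeping P over the increments used in the figures, and repeating for the minimum-phase components of the responses, yields the sensitivity curves discussed below.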
We consider the performance when we select the responses according to (i) $h_i(n) = h_{i,\min}(n) \otimes h_{i,ap}(n)$, and (ii) $h_i(n) = h_{i,\min}(n)$; $i = 1, 2$; where $h_{i,\min}(n)$ and $h_{i,ap}(n)$ are the minimum-phase and all-pass components of the room responses.
The impulse responses h1 (n) and h2 (n) (comprising 8192 points) were obtained in
a highly reverberant room from the same microphones.
In Fig. 9.9, we show the performance of the eigenfilter design as a function of the
length of the impulse response. The length of the FIR filter was M = 64. The
performance in each subplot is shown as a function of the impulse response increments, where we chose $\Delta P = \{0\} \cup \{2^k : k \in [7, 12],\ k \in \mathbb{I}\}$, where $\mathbb{I}$ denotes the integer set. Thus, Fig. 9.9(a) represents an eigenfilter of length M = 64 designed with the duration P of the windowed impulse response set to 64 (after removing the pure delay). The second performance evaluation, marked by an asterisk, is at $P + \Delta P = 64 + 2^7 = 192$. In Fig. 9.10 and Fig. 9.11, we show the sensitivity of
the eigenfilter for filter lengths M = 128 and M = 256 for various windowed room
impulse responses.
From the figures, we confirm better gain performance with increased filter length. By considering a larger-duration room impulse response in the eigenfilter design, we lower the gain somewhat but improve its evenness (flatness). Ideally, we want a small filter length (relative to the length of the room responses) with a large gain and uniform performance (low sensitivity to the length of the room impulse response).
In Figs. 9.12 to 9.14, we show the performance of the eigenfilter, designed with minimum-phase room response models, for various windowed room responses and different filter lengths. The performance (in terms of uniformity and level of the gain) is better than that of the nonminimum-phase impulse response model. We will investigate this difference in the future.
9.5 Summary
There is a proliferation of integrated media systems that combine multiple audio and video signals to achieve tele-immersion among distant participants. One of the key aspects that must be addressed is the delivery of the appropriate sound to each local participant in the room. In addition, sound intended for other participants or originating from noise sources must be canceled. In this chapter we presented a technique for canceling audio signals using a novel approach based on information theory. We termed this technique the eigenfilter method, because the filter was derived by maximizing the relative power between the two listeners in an acoustic enclosure. We also derived some of its theoretical properties (e.g., linear phase). For fixed room responses, we investigated (i) the performance (gain) tradeoff against distortion, and (ii) the sensitivity of the performance to the modeled room impulse response duration. Our findings, for the present channel conditions, indicate that increasing the filter order improves the gain-to-distortion ratio. Thus, depending on the application, a suitable-order filter may be chosen from the gain-to-distortion constellation diagram or from the sensitivity results. Furthermore, our findings for a particular scenario indicate that by extracting the minimum-phase component we obtain better performance (in terms of uniformity and level of the gain) than with the nonminimum-phase impulse response model.

Fig. 9.12. Performance for minimum-phase room impulse response models. M = 64; (a) P = 64; (b) P = 128; (c) P = 512.

Fig. 9.13. Performance for minimum-phase room impulse response models. M = 128; (a) P = 128; (b) P = 256; (c) P = 512.

Fig. 9.14. Performance for minimum-phase room impulse response models. M = 256; (a) P = 256; (b) P = 512.
In summary, this chapter addressed a fairly new application area, and clearly not all questions have been answered. Hence, future directions include research in the following areas.
(a) The distortion measure that is introduced in Eq. (9.34) is easy to compute
and is well known in the literature. Of course speech intelligibility is affected by
a change in the frequency spectrum (this change in the spectrum is computed in
the form of the distortion measure), and large changes will result in a degradation
in speech intelligibility. To determine how large is large, as a next step one could
perform speech intelligibility tests for consonants, for example, using a “confusion
matrix”.
(b) Investigation of the characteristics of gain zones (regions in space around the microphones that have a gain improvement of at least 10 dB in SPL), and viewing them from the acoustical physics viewpoint. Also, the evaluation of loudness (which is frequency-dependent) criteria using eigenfilters is a topic for research.
(c) Performing psychoacoustical/subjective measurements. In this chapter, we
have addressed the effects of prefiltering an audio signal, objectively, through the
spectral distortion measure. Subjective (double-blind) listening tests need to be per-
formed for investigating the perceptual coloration of the transmitted signals.
(d) Investigation of the effects on the gain-to-distortion ratio of designing LPC filters to approximate the room transfer functions.
(e) Alternate objective functions can be evaluated (viz., those that minimize the SPL at one position while keeping the sound quality at other positions as high as possible).
References
68. Haneda Y, Makino S, and Kaneda Y (1997), IEEE Trans. on Speech and Audio Proc.,
5(4):325–333.
69. Neely S, Allen J (1979), J. Acoust. Soc. Amer., 66(1):165–169.
70. Radlović B, Kennedy R (2000), IEEE Trans. on Speech and Audio Proc., 8(6):728–737.
71. Karjalainen M, Piirilä E, Järvinen A, and Huopaniemi J (1999), J. Audio Eng. Soc., 47(1/2):15–31.
72. Karjalainen M, Härmä A, Laine UK, and Huopaniemi J (1997), Proc. 1997 IEEE Wkshp.
on Appl. Signal Proc. Audio and Acoust. (WASPAA ’97).
73. Härmä A, Karjalainen M, Savioja L, Välimäki V, Laine UK, and Huopaniemi J (2000),
J. Audio Eng. Soc., 48(11):1011–1031.
74. Chang PR, Lin CG, and Yeh BF (1994), J. Acoust. Soc. Amer., 95(6):3400–3408.
75. Mourjopoulos J, Clarkson P, and Hammond J (1982), Proc. ICASSP, 1858–1861.
76. Bezdek J (1981), Pattern recognition with fuzzy objective function algorithms, Plenum.
77. Dunn JC (1973), J. Cybern., 3:32–57.
78. Xie XL, Beni G (1991), IEEE Trans. on Pattern Analysis and Mach. Intelligence, 3:841–
846.
79. Pal NR, Bezdek JC (1995), IEEE Trans. on Fuzzy Syst., 3(3):370–379.
80. Markel JD, Gray, AH Jr. (1976), Linear Prediction of Speech, Springer-Verlag.
81. Alku P, Bäckström T (2004), IEEE Trans. on Speech and Audio Proc., 12(2):93–99.
82. Oppenheim A, Johnson D, and Steiglitz K (1971), Proc. IEEE, 59:299–301.
83. Smith JO, Abel JS (1999), IEEE Trans. on Speech and Audio Proc., 7(6):697–708.
84. Zwicker E, Fastl H (1990), Psychoacoustics: Facts and Models, Springer-Verlag.
85. Fukunaga K (1990), Introduction to Statistical Pattern Recognition, Academic Press.
86. Sammon, JW Jr. (1969), IEEE Trans. on Computers., C-18(5):401–409.
87. Kohonen T (1997), Self-Organizing Maps, Springer.
88. Torgerson WS (1952), Psychometrika, 17:401–419.
89. Young G, Householder AS (1938), Psychometrika, 3:19–22.
90. Pȩkalska E, Ridder D, Duin RPW, and Kraaijveld MA (1999), Proc. ASCI’95 (5th An-
nual Int. Conf. of the Adv. School for Comput. & Imag.), 221–228.
91. Woszczyk W (1982), Proc. of 72nd AES Conv., preprint 1949.
92. Lipshitz S, Vanderkooy J (1981), Proc. of 69th AES Conv., preprint 1801.
93. Thiele N (2001), Proc. of 108th AES Conv., preprint 5106.
94. Bharitkar S, Kyriakakis C (2003), IEEE Wkshp. on Appl. Signal Proc. Audio and Acoust. (WASPAA '03).
95. Radlović B, Kennedy R (2000), IEEE Trans. on Speech and Audio Proc., 8(6):728–737.
96. Bharitkar S, Kyriakakis C (2005), Proc. IEEE Conf. on Multimedia and Expo.
97. Toole FE, Olive SE (1988), J. Audio Eng. Soc., 36(3):122–141.
98. Bharitkar S, Kyriakakis C (2005), Proc. 13th Euro. Sig. Proc. Conf. (EUSIPCO).
99. Talantzis F, Ward DB (2003), J. Acoust. Soc. Amer., 114:833–841.
100. Cook RK, Waterhouse RV, Berendt RD, Edelman S, and Thompson MC (1955), J.
Acoust. Soc. Amer., 27(6):1072–1077.
101. Kendall M, Stuart A (1976), The Advanced Theory of Statistics, Griffin.
102. Bharitkar S (2004), Digital Signal Processing for Multi-channel Audio Equalization and
Signal Cancellation, Ph.D Thesis, University of Southern California, Los Angeles (CA).
103. Bharitkar S, Kyriakakis C (2000), Proc. IEEE Conf. on Mult. and Expo.
104. Bharitkar S, Kyriakakis C (2000), Proc. IEEE Int. Symp. on Intell. Signal Proc. and
Comm. Syst.
105. Nelson PA, Curtis ARD, Elliott SJ, and Bullmore AJ (1987), J. Sound and Vib.,
117(1):1–13.