
Immersive Audio Signal Processing

Information Technology: Transmission, Processing, and Storage

Series Editors:
Robert Gallager, Massachusetts Institute of Technology, Cambridge, Massachusetts
Jack Keil Wolf, University of California, San Diego, La Jolla, California

Immersive Audio Signal Processing


Sunil Bharitkar and Chris Kyriakakis
Digital Signal Processing for Measurement Systems: Theory and Applications
Gabriele D'Antona and Alessandro Ferrero
Coding for Wireless Channels
Ezio Biglieri
Wireless Networks: Multiuser Detection in Cross-Layer Design
Christina Comaniciu, Narayan B. Mandayam and H. Vincent Poor

The Multimedia Internet


Stephen Weinstein

MIMO Signals and Systems


Horst J. Bessai

Multi-Carrier Digital Communications: Theory and Applications of OFDM


Ahmad R.S. Bahai, Burton R. Saltzberg and Mustafa Ergen
Performance Analysis and Modeling of Digital Transmission Systems
William Turin

Wireless Communications Systems and Networks


Mohsen Guizani
Interference Avoidance Methods for Wireless Systems
Dimitrie C. Popescu and Christopher Rose

Stochastic Image Processing


Chee Sun Won and Robert M. Gray

Coded Modulation Systems


John B. Anderson and Arne Svensson

Communication System Design Using DSP Algorithms:


With Laboratory Experiments for the TMS320C6701 and TMS320C6711
Steven A. Tretter
A First Course in Information Theory
Raymond W. Yeung

Nonuniform Sampling: Theory and Practice


Edited by Farokh Marvasti
Simulation of Communication Systems, Second Edition: Methodology,
Modeling, and Techniques
Michael C. Jeruchim, Phillip Balaban and K. Sam Shanmugan
Immersive Audio Signal
Processing

Sunil Bharitkar
Audyssey Laboratories, Inc. and
University of Southern California
Los Angeles, CA, USA

Chris Kyriakakis
University of Southern California
Los Angeles, CA, USA

Springer
Sunil Bharitkar
Dept. of Electrical Eng.-Systems
University of Southern California
Los Angeles, CA 90089-2564, and
Audyssey Laboratories, Inc.
350 S. Figueroa Street, Ste. 196
Los Angeles, CA 90071
[email protected]

Chris Kyriakakis
Dept. of Electrical Eng.-Systems and
Integrated Media Systems Center (IMSC)
University of Southern California
Los Angeles, CA 90089-2564
[email protected]

Library of Congress Control Number: 2005934526

ISBN-10: 0-387-28453-2 e-ISBN: 0-387-28503-2

ISBN-13: 978-0-387-28453-8

Printed on acid-free paper.

© 2006 Springer Science+Business Media LLC


All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media LLC, 233 Spring Street, New York, NY
10013, U.S.A.), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection
with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not
identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.

Printed in the United States of America. (TXQIEB)

987654321

springer.com
To Aai and Baba

To Wee Ling, Anthony, and Alexandra


Preface

This book is the result of several years of research in acoustics, digital signal process-
ing (DSP), and psychoacoustics, conducted in the Immersive Audio Laboratory at the
University of Southern California’s (USC) Viterbi School of Engineering, the Signal
and Image Processing Institute at USC, and Audyssey Laboratories. The authors’
association began five years ago when Sunil Bharitkar joined Chris Kyriakakis’ re-
search group as a PhD candidate.
The title Immersive Audio Signal Processing refers to the fact that signal pro-
cessing algorithms that take into account human perception and room acoustics, as
described in this book, can greatly enhance the experience of immersion for listeners.
The topics covered are of widespread interest in both consumer and professional au-
dio, and have not been previously presented comprehensively in an audio processing
textbook.
Besides the basics of DSP and psychoacoustics, this book contains the latest
results in audio processing for audio synthesis and rendering, multichannel room
equalization, audio selective signal cancellation, signal processing for audio appli-
cations, surround sound synthesis and processing, and the incorporation of psychoa-
coustics in audio signal processing algorithms.
Chapter 1, “Foundations of Digital Signal Processing for Audio,” includes con-
cepts from signals and linear systems, analog–to–digital and digital–to–analog con-
version, convolution, digital filtering concepts, sampling rate alteration, and transfer
function representations (viz., z-transforms, Fourier transforms, bilinear transforms).
Chapter 2, “Filter Design Techniques for Audio Processing,” introduces the de-
sign of various filters such as FIR, IIR, parametric, and shelving filters.
Chapter 3, “Introduction to Acoustics and Auditory Perception,” introduces the
theory and physics behind sound propagation in enclosed environments, room acous-
tics, reverberation time, the decibel scale, loudspeaker and room responses, and
stimuli for measuring room responses (e.g., logarithmic chirp, maximum length se-
quences). We also briefly discuss some relevant topics in psychoacoustics, such as
loudness perception and frequency selectivity.
In Chapter 4, “Immersive Audio Synthesis and Rendering,” we present techniques
that can be used to automatically generate the multiple microphone signals needed
for multichannel rendering without having to record with multiple real microphones.
We also present techniques for spatial audio playback over loudspeakers. It is
assumed that readers have sufficient knowledge of head-related transfer functions
(HRTFs); however, adequate references are provided at the end of the book for
interested readers.
Chapter 5, “Multiple Position Room Response Equalization for Real-Time Ap-
plications,” provides the necessary theory behind equalization of room acoustics for
immersive audio playback. Theoretical analysis and examples for single listener and
multiple listener equalization are provided. Traditional techniques of single position
equalization using FIR and IIR filters are introduced. Subsequently, a multiple lis-
tener equalization technique employing a pattern recognition technique is presented.
For real-time implementations, warping for designing lower filter orders is intro-
duced. The motivation for the pattern recognition approach can be seen through a
statistical analysis and visual interpretation of the clustering phenomena through the
Sammon map algorithm. The Sammon map also permits a visual display of room
response variations as well as a multiple listener equalization performance measure.
The influence of reverberation on room equalization is also discussed. Results from
a typical home theater setup are presented in the chapter.
Chapter 6, “Practical Considerations for Multichannel Equalization,” discusses
distortions due to phase effects, and presents algorithms that minimize the effect
of phase distortions. Selecting proper choices of bass management filters, crossover
frequencies, as well as all-pass coefficients and time-delay adjustments that affect
crossover region response are presented.
Chapter 7, “Robustness of Equalization to Displacement Effects: Part I,” explores
robustness analysis (viz., mismatch between listener positions during playback and
microphone position during room response measurement) in room equalization for
frequencies above the Schroeder frequency.
Chapter 8, “Robustness of Equalization to Displacement Effects: Part II,” ex-
plores robustness analysis in room equalization for low frequencies.
Chapter 9, “Selective Audio Signal Cancellation,” presents a signal processing-
based approach for audio signal cancellation at predetermined positions. This is
important, for example, in automobile environments, for creating a spatial zone of
silence.
The material in this book is primarily intended for practicing engineers, scientists,
and researchers in the field. It is also suitable for a semester course at the upper-level
undergraduate and graduate level. A basic knowledge of signal processing and linear
system theory is assumed, although relevant topics are presented early in this book.
References to supplemental information are given at the end of the book.
Several individuals provided technical comments and insight on a preliminary
version of the manuscript. We would like to acknowledge and thank Dr. Randy Cole
of Texas Instruments, Prof. Tomlinson Holman of the University of Southern
California, and Prof. Stephan Weiss of the University of Southampton. Ana Bozicevic
and Vaishali Damle at Springer provided the incentive to produce the manuscript and
make the book a reality, and we are thankful for their valuable assistance throughout
the process. We would also like to thank Elizabeth Loew for the production of this
volume. Thanks also go to the people at Audyssey Laboratories, in particular
Philip Hilmes and Michael Solomon, for their support during the preparation of this
manuscript.
We invite you to join us on this exciting journey, where signal processing, acous-
tics, and auditory perception have merged to create a truly immersive experience.

Los Angeles, California Sunil Bharitkar


July, 2005 Chris Kyriakakis
Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Part I Digital Signal Processing for Audio and Acoustics

1 Foundations of Digital Signal Processing for Audio and Acoustics . . . . 3


1.1 Basics of Digital Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Discrete Time Signals and Sequences . . . . . . . . . . . . . . . . . . . 4
1.1.2 Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Time-Invariant Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.4 Linear and Time-Invariant Systems . . . . . . . . . . . . . . . . . . . . . 6
1.2 Fourier Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Transfer Function Representation . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 The z-Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Sampling and Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Ideal Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Reconstruction of Continuous Time Signals from Discrete
Time Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.3 Sampling Rate Reduction by an Integer Factor . . . . . . . . . . . . 19
1.4.4 Increasing the Sampling Rate by an Integer Factor . . . . . . . . 21
1.4.5 Resampling for Audio Applications . . . . . . . . . . . . . . . . . . . . . 22
1.5 Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.6 Bilinear Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2 Filter Design for Audio Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


2.1 Filter Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 Desired Response Specification . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 Approximating Error Function . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 FIR Filter Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Linear Phase Filter Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.2.2 Least Squares FIR Filter Design . . . . . . . . . . . . . . . . . . . . . . . . 29


2.2.3 FIR Windows for Filter Design . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4 Adaptive FIR Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3 IIR Filter Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.1 All-Pass Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.2 Butterworth Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.3 Chebyshev Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.4 Elliptic Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.5 Shelving and Parametric Filters . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.6 Autoregressive or All-Pole Filters . . . . . . . . . . . . . . . . . . . . . . 44
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Part II Acoustics and Auditory Perception

3 Introduction to Acoustics and Auditory Perception . . . . . . . . . . . . . . . . . 49


3.1 Sound Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Acoustics of a Simple Source in Free-Field . . . . . . . . . . . . . . . . . . . . . 50
3.3 Modal Equations for Characterizing Room Acoustics at Low
Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.1 Axial, Tangential, Oblique Modes and Eigenfrequencies . . . 53
3.4 Reverberation Time of Rooms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Room Acoustics from Schroeder Theory . . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Measurement of Loudspeaker and Room Responses . . . . . . . . . . . . . 61
3.6.1 Room Response Measurement with Maximum Length
Sequence (MLS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6.2 Room Response Measurement with Sweep Signals . . . . . . . . 63
3.7 Psychoacoustics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.7.1 Structure of the Ear . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.7.2 Loudness Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.7.3 Loudness Versus Loudness Level . . . . . . . . . . . . . . . . . . . . . . . 68
3.7.4 Time Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7.5 Frequency Selectivity of the Ear . . . . . . . . . . . . . . . . . . . . . . . . 70
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

Part III Immersive Audio Processing

4 Immersive Audio Synthesis and Rendering Over Loudspeakers . . . . . 75


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2 Immersive Audio Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2.1 Microphone Signal Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2.2 Subjective Evaluation of Virtual Microphone Signals . . . . . . 80
4.2.3 Spot Microphone Synthesis Methods . . . . . . . . . . . . . . . . . . . . 80
4.2.4 Summary and Future Research Directions . . . . . . . . . . . . . . . . 82

4.3 Immersive Audio Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83


4.3.1 Rendering Filters for a Single Listener . . . . . . . . . . . . . . . . . . 83
4.3.2 Rendering Filters for Multiple Listeners . . . . . . . . . . . . . . . . . 87
4.3.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5 Multiple Position Room Response Equalization . . . . . . . . . . . . . . . . . . . . 99


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3 Single-Point Room Response Equalization . . . . . . . . . . . . . . . . . . . . . 102
5.4 Multiple-Point (Position) Room Response Equalization . . . . . . . . . . . 103
5.5 Designing Equalizing Filters Using Pattern Recognition . . . . . . . . . . 105
5.5.1 Review of Cluster Analysis in Relation to Acoustical Room
Responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5.2 Fuzzy c-means for Determining the Prototype . . . . . . . . . . . . 105
5.5.3 Cluster Validity Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.5.4 Multiple Listener Room Equalization with Low Filter Orders 107
5.6 Visualization of Room Acoustic Responses . . . . . . . . . . . . . . . . . . . . . 109
5.7 The Sammon Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.9 The Influence of Reverberation on Room Equalization . . . . . . . . . . . 121
5.9.1 Image Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.9.2 RMS Average Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.9.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

6 Practical Considerations for Multichannel Equalization . . . . . . . . . . . . 125


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2 Objective Function-Based Crossover Frequency Selection . . . . . . . . . 130
6.3 Phase Interaction Between Noncoincident Loudspeakers . . . . . . . . . . 132
6.3.1 The Influence of Phase on the Net Magnitude Response . . . . 134
6.4 Phase Equalization with All-Pass Filters . . . . . . . . . . . . . . . . . . . . . . . 134
6.4.1 Second-Order All-Pass Networks . . . . . . . . . . . . . . . . . . . . . . . 134
6.4.2 Phase Correction with Cascaded All-Pass Filters . . . . . . . . . . 136
6.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.5 Objective Function-Based Bass Management Filter Parameter
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.6 Multiposition Bass Management Filter Parameter Optimization . . . . 146
6.6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.7 Spectral Deviation and Time Delay-Based Correction . . . . . . . . . . . . 150
6.7.1 Results for Spectral Deviation and Time Delay-Based
Crossover Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

7 Robustness of Equalization to Displacement Effects: Part I . . . . . . . . . 157


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.2 Room Acoustics for Simple Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.3 Mismatch Analysis for Spatial Average Equalization . . . . . . . . . . . . . 162
7.3.1 Analytic Expression for Mismatch Performance
Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.3.2 Analysis of Equalization Error . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

8 Robustness of Equalization to Displacement Effects: Part II . . . . . . . . . 171


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.2 Modal Equations for Room Acoustics . . . . . . . . . . . . . . . . . . . . . . . . . 172
8.3 Mismatch Analysis with Spatial Average Equalization . . . . . . . . . . . . 172
8.3.1 Spatial Averaging for Multiple Listener Equalization . . . . . . 172
8.3.2 Equalization Performance Due to Mismatch . . . . . . . . . . . . . . 173
8.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.4.1 Magnitude Response Spatial Averaging . . . . . . . . . . . . . . . . . . 177
8.4.2 Computation of the Quantum Numbers . . . . . . . . . . . . . . . . . . 178
8.4.3 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8.4.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.4.5 Magnitude Response Single-Listener Equalization . . . . . . . . 183
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

9 Selective Audio Signal Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.2 Traditional Methods for Acoustic Signal Cancellation . . . . . . . . . . . . 189
9.2.1 Passive Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9.2.2 Active Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
9.2.3 Parametric Loudspeaker Array . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.3 Eigenfilter Design for Conflicting Listener Environments . . . . . . . . . 191
9.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.3.2 Determination of the Eigenfilter . . . . . . . . . . . . . . . . . . . . . . . . 192
9.3.3 Theoretical Properties of Eigenfilters . . . . . . . . . . . . . . . . . . . . 195
9.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
9.4.1 Eigenfilter Performance as a Function of Filter
Order M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
9.4.2 Performance Sensitivity as a Function of the Room
Response Duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Part I

Digital Signal Processing for Audio and Acoustics


1
Foundations of Digital Signal Processing for Audio
and Acoustics

The content presented in this chapter includes relevant topics in digital signal pro-
cessing such as the mathematical foundations of signal processing (viz., convolution,
sampling theory, etc.), basics of linear and time-invariant (LTI) systems, minimum-
phase and all-pass systems, sampling and reconstruction of signals, discrete time
Fourier transform (DTFT), discrete Fourier transform (DFT), z-transform, bilinear
transform, and linear-phase finite impulse response (FIR) filters.

1.1 Basics of Digital Signal Processing


Digital signal processing (DSP) involves either one or more of the following [1]:
(i) modeling or representation of continuous time signals (viz., analog signals) by
discrete time or digital signals, (ii) operation on digital signals, typically through
linear or nonlinear filtering and/or time to frequency mapping, to transform them
to desirable signals, and (iii) the generation of continuous time signals from digital
signals. The block diagram for a general DSP system is shown in Fig. 1.1 where the
blocks identify each of the three processes described above. In this book, there is
an implicit assumption of such DSP systems satisfying linearity and time-invariance
(i.e., LTI property as explained below) unless explicitly stated.

Fig. 1.1. General digital signal processing system.



1.1.1 Discrete Time Signals and Sequences

A discrete time signal, x(n), is represented as a sequence of numbers x with corresponding
time indices represented by integer values n [2]. In reality, such discrete
time signals can arise from sampling a continuous time signal (viz., the first block
in Fig. 1.1). Specifically,

x(n) = x_c(nT_s)  \quad (1.1)

where the continuous time signal xc (t) is sampled with a sampling period Ts which
is the inverse of the sampling frequency fs . Typical sampling frequencies used in
audio processing applications include 32 kHz, 44.1 kHz, 48 kHz, 64 kHz, 96 kHz,
and 192 kHz.
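As an illustrative sketch of (1.1), given here in Python rather than the MATLAB used elsewhere in this chapter, the snippet below samples a hypothetical 1 kHz sinusoid at fs = 48 kHz; the choice of signal and sample count is an assumption for the example, not taken from the text.

```python
import math

def sample(xc, fs, n_samples):
    """Sample a continuous-time signal xc(t) with period Ts = 1/fs (Eq. 1.1)."""
    Ts = 1.0 / fs
    return [xc(n * Ts) for n in range(n_samples)]

# Hypothetical example: a 1 kHz sinusoid sampled at fs = 48 kHz.
fs = 48_000
xc = lambda t: math.sin(2 * math.pi * 1000 * t)
x = sample(xc, fs, 48)   # 48000/1000 = 48 samples covers exactly one period

print(round(x[0], 6), round(x[12], 6))  # n = 12 is a quarter period: 0.0 1.0
```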
Some examples of discrete time signals include:
(i) Kronecker delta function shown in Fig. 1.2 and as defined by

x(n) = \delta(n) = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0 \end{cases}  \quad (1.2)

(ii) Exponential sequence shown in Fig. 1.3 and as defined by

x(n) = A\alpha^{n}  \quad (1.3)

(iii) Step sequence shown in Fig. 1.4 and as defined by

x(n) = u(n) = \sum_{k=-\infty}^{n} \delta(k) = \begin{cases} 1, & n \geq 0 \\ 0, & n < 0 \end{cases}  \quad (1.4)

Fig. 1.2. Kronecker delta signal.



Fig. 1.3. Exponentially decaying sequence with A = 1 and α = 0.98.

(iv) The special case when \alpha = e^{j\omega_0} and A = |A|e^{j\phi} yields the complex
exponential sequence, whose magnitude and phase are shown in Fig. 1.5 (in the
equivalent continuous form) and given by

x(n) = |A| e^{j(\omega_0 n + \phi)} = |A| \left( \cos(\omega_0 n + \phi) + j \sin(\omega_0 n + \phi) \right)  \quad (1.5)

Fig. 1.4. Step sequence.



Fig. 1.5. Real and imaginary part of a complex exponential sequence with φ = 0.25π and
ω0 = 0.1π.
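The elementary sequences (1.2) through (1.5) can be written down directly; the following Python sketch (with parameter values taken from Figs. 1.3 and 1.5) is one way to do so.

```python
import cmath

def delta(n):                 # Kronecker delta, Eq. (1.2)
    return 1 if n == 0 else 0

def step(n):                  # step sequence, Eq. (1.4)
    return 1 if n >= 0 else 0

def expseq(n, A=1.0, alpha=0.98):   # exponential sequence, Eq. (1.3)
    return A * alpha ** n

def cexp(n, omega0=0.1 * cmath.pi, phi=0.25 * cmath.pi, A=1.0):
    # complex exponential, Eq. (1.5): A e^{j(omega0 n + phi)}
    return A * cmath.exp(1j * (omega0 * n + phi))

# The step is the running sum of delta functions, as in Eq. (1.4):
print(step(5) == sum(delta(k) for k in range(-20, 6)))  # True
```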

1.1.2 Linear Systems

Operations performed by a DSP system are based on the premise that the system
satisfies the properties of linearity and time-invariance. Specifically, from the theory
of linear systems, if T {.} denotes the transformation performed by a linear system
(i.e., y(n) = T {x(n)}), then the input and output of a linear system satisfy the
following properties of additivity and homogeneity, respectively,

T\{x_1(n) + x_2(n)\} = T\{x_1(n)\} + T\{x_2(n)\} = y_1(n) + y_2(n)

T\{a x(n)\} = a T\{x(n)\} = a y(n)  \quad (1.6)

1.1.3 Time-Invariant Systems

A time-invariant (or shift-invariant) system is one for which a delayed input of n_0
samples results in a delayed output of n_0 samples. Specifically, if x_1(n) = x(n - n_0),
then y_1(n) = y(n - n_0) for a time-invariant system.
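The shift-invariance test above can be run numerically. In the hypothetical Python sketch below, a causal moving average passes the test, while a compressor y(n) = x(2n), which discards samples, does not; both systems are illustrative choices, not examples from the text.

```python
def moving_average(x, M=3):
    # y(n) = (1/M) * sum_{k=0}^{M-1} x(n-k), treating x as zero outside its samples
    return [sum(x[n - k] for k in range(M) if 0 <= n - k < len(x)) / M
            for n in range(len(x))]

def compressor(x):
    # y(n) = x(2n): keeps every other sample (a time-varying operation)
    return [x[2 * n] for n in range(len(x) // 2)]

x  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 0.0, 0.0]
x1 = [0.0] + x[:-1]                      # x1(n) = x(n-1), delayed by n0 = 1

# Moving average: delaying the input delays the output by the same amount.
print(moving_average(x1)[1:] == moving_average(x)[:-1])   # True

# Compressor: the delayed input does NOT give a delayed output.
print(compressor(x1) == compressor(x))   # False
```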

1.1.4 Linear and Time-Invariant Systems

With a Kronecker delta function, δ(n), applied as an input to a linear system, the
output of the linear system is the impulse response h(n), which completely
characterizes the linear system, as shown in Fig. 1.6. Systems that are both linear
and time-invariant (LTI) form a class that is extremely important for designing
immersive audio signal processing systems. Thus, if the input x(n) is represented as
a series of delayed impulses as

Fig. 1.6. A linear system with impulse response h(n).



x(n) = \sum_{k=-\infty}^{\infty} x(k)\, \delta(n-k)  \quad (1.7)

then the output y(n) can be expressed as the well-known convolution formula where
h(n) is the impulse response,
y(n) = T\left\{ \sum_{k=-\infty}^{\infty} x(k)\, \delta(n-k) \right\}
     = \sum_{k=-\infty}^{\infty} x(k)\, T\{\delta(n-k)\}
     = \sum_{k=-\infty}^{\infty} x(k)\, h(n-k)
     = \sum_{k=-\infty}^{\infty} h(k)\, x(n-k)
     = x(n) \otimes h(n)  \quad (1.8)
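For finite-length sequences, the convolution sum (1.8) can be sketched directly; the Python below does the same job as MATLAB's conv command, mentioned later in this section.

```python
def convolve(x, h):
    """Direct form of Eq. (1.8): y(n) = sum_k x(k) h(n-k), for finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

# Convolving with a delayed impulse delta(n - 2) simply delays the input (Eq. 1.9).
x = [1.0, 2.0, 3.0]
h = [0.0, 0.0, 1.0]          # delta(n - 2), an ideal delay of nd = 2 samples
print(convolve(x, h))        # [0.0, 0.0, 1.0, 2.0, 3.0]
```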

Some common examples of LTI systems include:


(i) Ideal Delay
h(n) = δ(n − nd ) (1.9)
(ii) Moving Average (MA)

h(n) = \frac{1}{M_1 + M_2 + 1} \sum_{k=-M_1}^{M_2} \delta(n-k)  \quad (1.10)

(iii) Autoregressive (AR)



\sum_{i=N_1}^{N_2} a_i\, h(n-i) = \delta(n)  \quad (1.11)

(iv) Autoregressive and Moving Average (ARMA)



\sum_{i=N_1}^{N_2} a_i\, h(n-i) = \sum_{k=M_1}^{M_2} b_k\, \delta(n-k)  \quad (1.12)

The input and output signals from any LTI system can be found through various
methods. For simple impulse responses (as given in the above examples), substitut-
ing δ(n) with x(n) and h(n) with y(n) provides the input and output signal descrip-
tion of the LTI system. For example, the input and output signal description for the
ARMA system can be written as


\sum_{i=N_1}^{N_2} a_i\, y(n-i) = \sum_{k=M_1}^{M_2} b_k\, x(n-k)  \quad (1.13)
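As a sketch, the ARMA difference equation (1.13) can be run as a causal recursion, assuming N1 = M1 = 0, a0 ≠ 0, and zero initial conditions (assumptions made for this example, not stated in the text).

```python
def arma_filter(b, a, x):
    """Run the difference equation (1.13) causally, assuming N1 = M1 = 0
    and a[0] != 0:  sum_i a_i y(n-i) = sum_k b_k x(n-k)."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, len(a)) if n - i >= 0)
        y.append(acc / a[0])
    return y

# A one-pole AR system: y(n) - 0.5 y(n-1) = x(n).
# Its impulse response is h(n) = 0.5**n.
h = arma_filter(b=[1.0], a=[1.0, -0.5], x=[1.0, 0.0, 0.0, 0.0])
print(h)  # [1.0, 0.5, 0.25, 0.125]
```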

A second method for finding the output from an LTI system involves the convolution
formula (1.8) (along with graphical plotting). For example, if x(n) = \alpha_1^n u(n) and
h(n) = \alpha_2^n u(n) are two right-sided sequences (where |\alpha_1| < 1, |\alpha_2| < 1, and
u(n) is the step function), then with (1.8),

y(n) = x(n) \otimes h(n)
     = \sum_{k=0}^{n} \alpha_1^{k} \alpha_2^{n-k}
     = \alpha_2^{n} \sum_{k=0}^{n} \left( \frac{\alpha_1}{\alpha_2} \right)^{k}
     = \frac{\alpha_2^{n+1} - \alpha_1^{n+1}}{\alpha_2 - \alpha_1}  \quad (1.14)
where the following series expansion has been used:

\sum_{k=N_1}^{N_2} \alpha^{k} = \frac{\alpha^{N_1} - \alpha^{N_2+1}}{1 - \alpha}  \quad (1.15)

MATLAB software, from Mathworks, Inc. (http://www.mathworks.com), includes
the conv(a,b) command for convolving two signals. An example of using this
function with \alpha_1 = 0.3 and \alpha_2 = 0.6 is shown in Fig. 1.7. Finally, another
method for determining the output signal is through the use of linear transform
theory (viz., the Fourier transform and the z-transform).
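A quick numerical check of the closed form (1.14), using the same α1 = 0.3 and α2 = 0.6 as the MATLAB example, can be done in Python:

```python
a1, a2 = 0.3, 0.6

def y_direct(n):
    # The convolution sum of Eq. (1.8) specialised to two right-sided exponentials
    return sum(a1 ** k * a2 ** (n - k) for k in range(n + 1))

def y_closed(n):
    # Closed form of Eq. (1.14)
    return (a2 ** (n + 1) - a1 ** (n + 1)) / (a2 - a1)

# The two agree for every n >= 0 (checked here up to n = 19).
print(all(abs(y_direct(n) - y_closed(n)) < 1e-12 for n in range(20)))  # True
```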

1.2 Fourier Transforms


Complex exponential sequences are eigenfunctions of LTI systems: the response
to a sinusoidal input is a sinusoid at the same frequency as the input, with amplitude
and phase determined by the system [2]. Specifically, when x(n) = e^{j\omega n}, the
output y(n) can be expressed by the convolution formula (1.8) as


y(n) = \sum_{k=-\infty}^{\infty} h(k)\, e^{j\omega(n-k)}
     = e^{j\omega n} \sum_{k=-\infty}^{\infty} h(k)\, e^{-j\omega k}
     = e^{j\omega n} H(e^{j\omega})  \quad (1.16)

Fig. 1.7. The convolution of two exponentially decaying right-sided sequences (viz., (1.14))
with α1 = 0.3 and α2 = 0.6.

where H(e^{j\omega}) = H_R(e^{j\omega}) + jH_I(e^{j\omega}) = |H(e^{j\omega})| e^{j \angle H(e^{j\omega})} is the discrete time
Fourier transform of the system and characterizes the LTI system along a real
frequency axis \omega expressed in radians. The real frequency variable \omega is related to
the analog frequency \Omega = 2\pi f (f in Hz), as shown in a subsequent section on
sampling, through the relation \omega = \Omega T_s, where the sampling frequency f_s and the
sampling period T_s are related by T_s = 1/f_s.
An important property of the discrete time Fourier transform or frequency re-
sponse, H(ejω ), is that it is periodic with a period of 2π, as

H(e^{j\omega}) = \sum_{k=-\infty}^{\infty} h(k)\, e^{-j\omega k} = \sum_{k=-\infty}^{\infty} h(k)\, e^{-j(\omega + 2\pi)k} = H(e^{j(\omega + 2\pi)}).

A periodic discrete time frequency response of a low-pass filter, with a cutoff fre-
quency ωc , is shown in Fig. 1.8.
Thus, the discrete time forward and inverse Fourier transforms can be expressed as

h(n) ←→ H(e^{jω})                                             (1.17)

H(e^{jω}) = Σ_{k=−∞}^{∞} h(k) e^{−jωk}                        (1.18)

h(n) = (1/2π) ∫_{2π} H(e^{jω}) e^{jωn} dω                     (1.19)
10 1 Foundations of Digital Signal Processing for Audio and Acoustics

Fig. 1.8. Periodicity of a discrete time Fourier transform.

The properties and theorems on Fourier transforms can be found in several texts
(e.g., [2, 3]).

1.2.1 Transfer Function Representation

In general, if an LTI system can be expressed as a ratio of numerator and denominator polynomials,

H(e^{jω}) = ( Σ_{k=0}^{M} b_k e^{−jωk} ) / ( Σ_{k=0}^{N} a_k e^{−jωk} )
          = (b0/a0) Π_{k=1}^{M} (1 − c_k e^{−jω}) / Π_{k=1}^{N} (1 − d_k e^{−jω})       (1.20)

and is stable (i.e., |d_k| < 1 ∀k), then the magnitude response of the transfer function can be expressed as

|H(e^{jω})| = sqrt( H(e^{jω}) H*(e^{jω}) )
            = |b0/a0| Π_{k=1}^{M} |1 − c_k e^{−jω}| / Π_{k=1}^{N} |1 − d_k e^{−jω}|      (1.21)

and, on a decibel (dB) scale,

|H(e^{jω})| (dB) = 20 log10 |b0/a0| + 20 Σ_{k=1}^{M} log10 |1 − c_k e^{−jω}| − 20 Σ_{k=1}^{N} log10 |1 − d_k e^{−jω}|

The phase response can be written as

∠H(e^{jω}) = ∠(b0/a0) + Σ_{k=1}^{M} ∠(1 − c_k e^{−jω}) − Σ_{k=1}^{N} ∠(1 − d_k e^{−jω})      (1.22)

and the group delay is

grd[H(e^{jω})] = −(∂/∂ω) ∠H(e^{jω})                                                          (1.23)
               = −Σ_{k=1}^{M} (∂/∂ω) arg(1 − c_k e^{−jω}) + Σ_{k=1}^{N} (∂/∂ω) arg(1 − d_k e^{−jω})

The numerator roots ck and the denominator roots dk of the transfer function H(ejω )
(1.20) are called the zeros and poles of the transfer function, respectively.
Because H(e^{jω}) = H(e^{j(ω+2π)}), the phase of each of the terms in the phase response (1.22) is ambiguous. A correct phase response can be obtained by taking the principal value ARG(H(e^{jω})) (which lies between −π and π), computed by any computer subroutine (e.g., the angle command in MATLAB) or the arctangent function on a calculator, and adding 2πr(ω), where r(ω) is an integer-valued function [4]. Thus,

∠H(e^{jω}) = ARG(H(e^{jω})) + 2πr(ω)                          (1.24)

The unwrap command in MATLAB computes the unwrapped phase from the principal values. For example, a fourth-order (N = 4) low-pass Butterworth transfer function used for audio bass management, with a cutoff frequency ωc, can be expressed as

H(e^{jω}) = Π_{k=0}^{N/2−1} (b_{0,k} + b_{1,k} e^{−jω} + b_{2,k} e^{−j2ω}) / (a_{0,k} + a_{1,k} e^{−jω} + a_{2,k} e^{−j2ω})      (1.25)

b_{0,k} = b_{2,k} = K^2
b_{1,k} = 2K^2
a_{0,k} = 1 + 2K cos(π(2k + 1)/2N) + K^2
a_{1,k} = 2(K^2 − 1)
a_{2,k} = 1 − 2K cos(π(2k + 1)/2N) + K^2

where K = tan(ωc /2) = tan(πfc /fs ). The magnitude response, principal phase,
and unwrapped phase, for the fourth-order Butterworth low-pass filter with cutoff
frequency fc = 80 Hz and fs = 48 kHz, as shown in Fig. 1.9, reveal the 2π phase
rotation in principal value (viz., Fig. 1.9(b)).
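The biquad-cascade form (1.25) can be sketched and sanity-checked numerically. The following Python/SciPy snippet (an illustrative mirror of what one would do in MATLAB) builds the fourth-order filter from the stated coefficient formulas and verifies unity gain at DC and the −3 dB point at the cutoff:

```python
import numpy as np
from scipy.signal import freqz

# Fourth-order Butterworth low-pass of (1.25) as a cascade of two biquads
# (fc = 80 Hz, fs = 48 kHz), built directly from the coefficient formulas.
fc, fs, N = 80.0, 48000.0, 4
K = np.tan(np.pi * fc / fs)

b_tot, a_tot = np.array([1.0]), np.array([1.0])
for k in range(N // 2):
    c = 2 * K * np.cos(np.pi * (2 * k + 1) / (2 * N))
    b = np.array([K**2, 2 * K**2, K**2])                # K^2 (1 + z^-1)^2
    a = np.array([1 + c + K**2, 2 * (K**2 - 1), 1 - c + K**2])
    b_tot, a_tot = np.convolve(b_tot, b), np.convolve(a_tot, a)

# Evaluate at DC and at the cutoff frequency (in rad/sample)
w, H = freqz(b_tot, a_tot, worN=[0.0, 2 * np.pi * fc / fs])
print(abs(H[0]), abs(H[1]))   # |H| = 1 at DC and 1/sqrt(2) (-3 dB) at fc
```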

Minimum-Phase Systems

From system theory, if H(ejω ) is assumed to correspond to a causal1 and stable sys-
tem, then the magnitude of all its poles is less than unity [2]. For certain classes of
problems it is important to constrain the inverse of the transfer function, H(ejω ),
to be causal and stable. Hence, if H(e^{jω}) is causal and stable and its inverse, 1/H(e^{jω}), is also constrained to be causal and stable, then the magnitude of each of the zeros (i.e., the roots of the numerator polynomial) of H(e^{jω}) must be less than unity. Classes of systems that satisfy this property (where the transfer
function, as well as its inverse, is causal and stable) are called minimum-phase sys-
tems [2] and the transfer function is usually denoted by Hmin (ejω ).

1
A causal system is one for which the output signal depends on the present value and/or the
past values of the input signal or y(n) = f [x(n), x(n − 1), . . . , x(n − p)], where p ≥ 0.

Fig. 1.9. (a) Magnitude response of the fourth-order Butterworth low-pass filter; (b) principal
value of the phase; (c) unwrapped phase.

All-Pass Systems

An all-pass system or transfer function is one whose magnitude response is flat for
all frequencies. A first-order all-pass transfer function can be expressed as

Hap(e^{jω}) = (e^{−jω} − λ*) / (1 − λ e^{−jω}) = e^{−jω} (1 − λ* e^{jω}) / (1 − λ e^{−jω})      (1.26)

where the roots of the numerator and denominator (viz., 1/λ* and λ, respectively) are conjugate reciprocals of each other. Thus,

|Hap(e^{jω})|^2 = Hap(e^{jω}) Hap*(e^{jω})
                = [e^{−jω} (1 − λ* e^{jω}) / (1 − λ e^{−jω})] [e^{jω} (1 − λ e^{−jω}) / (1 − λ* e^{jω})] = 1      (1.27)
A generalized all-pass transfer function, providing a real time-domain response, can be expressed as [2],

Hap(e^{jω}) = Π_{i=1}^{Nreal} (e^{−jω} − d_i)/(1 − d_i e^{−jω}) · Π_{k=1}^{Ncomplex} [(e^{−jω} − g_k*)(e^{−jω} − g_k)] / [(1 − g_k e^{−jω})(1 − g_k* e^{−jω})]      (1.28)

where d_i is a real pole and g_k is a complex pole.


The phase responses for a first-order and second-order all-pass transfer function
are [2],
Fig. 1.10. (a) Magnitude response of a second-order real response all-pass filter; (b) principal value of the phase; (c) unwrapped phase.

∠[ (e^{−jω} − r e^{−jθ}) / (1 − r e^{jθ} e^{−jω}) ] = −ω − 2 arctan[ r sin(ω − θ) / (1 − r cos(ω − θ)) ]

∠[ (e^{−jω} − r e^{−jθ})(e^{−jω} − r e^{jθ}) / ((1 − r e^{jθ} e^{−jω})(1 − r e^{−jθ} e^{−jω})) ]
  = −2ω − 2 arctan[ r sin(ω − θ) / (1 − r cos(ω − θ)) ] − 2 arctan[ r sin(ω + θ) / (1 − r cos(ω + θ)) ]      (1.29)

The magnitude and phase response for a second-order all-pass transfer function
with complex poles (r = 0.2865 and θ = 0.1625π) is shown in Fig. 1.10.
As shown in a subsequent chapter, using an all-pass filter in cascade with other filters allows the overall phase response of a system to approximate a desired phase response, which is useful for correcting phase interactions between the subwoofer and satellite speaker responses in a multichannel sound playback system.

Decomposition of Transfer Functions

An important fact about any rational transfer function is that it can be decomposed as a product of two individual transfer functions (viz., a minimum-phase function and an all-pass function) [2]. Thus,

H(e^{jω}) = Hmin(e^{jω}) Hap(e^{jω})                          (1.30)
h(n) = hmin(n) ⊗ hap(n)

Specifically, (1.30) specifies that any transfer function having poles and/or zeros,
some of whose magnitudes are greater than unity, can be decomposed as a product
of two transfer functions. The minimum-phase transfer function Hmin (ejω ) includes
poles and zeros whose magnitude is less than unity, whereas the all-pass transfer
function Hap (ejω ) includes poles and zeros that are conjugate reciprocal of each
other (i.e., if λ is a zero then 1/λ∗ is a pole of the all-pass transfer function).
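A small numerical sketch makes the decomposition concrete. The example below (Python/NumPy, with an arbitrarily chosen zero at z = 1.25) reflects the non-minimum-phase zero inside the unit circle and verifies both H = Hmin · Hap and the magnitude identities:

```python
import numpy as np

# Decompose H(z) = 1 - 1.25 z^{-1} (zero at z = 1.25, outside the unit
# circle) into a minimum-phase part and a first-order all-pass (1.30).
w = np.linspace(0.01, np.pi, 256)
z1 = np.exp(-1j * w)                        # z^{-1} evaluated on the unit circle

H    = 1.0 - 1.25 * z1                      # non-minimum-phase: zero at 1.25
Hmin = -1.25 * (1.0 - 0.8 * z1)             # zero reflected to 1/1.25 = 0.8
Hap  = (z1 - 0.8) / (1.0 - 0.8 * z1)        # all-pass: pole 0.8, zero 1.25

print(np.allclose(np.abs(Hap), 1.0))        # flat all-pass magnitude
print(np.allclose(np.abs(H), np.abs(Hmin))) # same magnitude response
print(np.allclose(H, Hmin * Hap))           # product recovers H
```

Reflecting the zero leaves the magnitude response untouched while moving all the phase "excess" into the all-pass factor.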

Linear Phase Systems

As shown in the next chapter, traditional filter techniques do not consider the phase
response during the design of filters. This can cause degradation to the shape and
quality of the signal that is being filtered by such a filter, especially in the relevant
frequency regions, due to phase distortion induced by a nonlinear phase response of
the filter. Thus, in many cases, it is desirable that the phase response of the filter be
kept a linear function of frequency ω (or the group delay be kept constant).
A classic example of a linear phase system is the delay, h(n) = δ(n − k), which delays the input signal x(n) by k samples. Specifically,

y(n) = h(n) ⊗ x(n) = x(n − k)
H(e^{jω}) = e^{−jωk}                                          (1.31)
|H(e^{jω})| = 1
∠H(e^{jω}) = −kω
grd[H(e^{jω})] = k

A generalized linear phase system function, H(e^{jω}), may be expressed as a product of a real function A(e^{jω}) with a complex exponential e^{−jαω+jβ}, such that

H(e^{jω}) = A(e^{jω}) e^{−jαω+jβ}                             (1.32)
∠H(e^{jω}) = β − αω
grd[H(e^{jω})] = α

More details on designing linear phase filters are given in the next chapter.

1.3 The z-Transform


The z-transform is a generalization of the Fourier transform. It can be expressed as a two-sided power series (also called a Laurent series) of a signal x(n) in the complex variable z,

X(z) = Σ_{n=−∞}^{∞} x(n) z^{−n}                               (1.33)

whereas the inverse z-transform can be expressed in terms of the contour integral

x(n) = (1/2πj) ∮_C X(z) z^{n−1} dz                            (1.34)

A principal motivation for using this transform is that the Fourier transform does not converge for all discrete time signals, or sequences, and the generalization via the z-transform encompasses a broader class of signals. Furthermore, powerful complex variable techniques can be used to analyze signals when using the z-transform.
Because z is a complex variable, the poles and zeros of the resulting system function can be depicted on the two-dimensional complex z-plane of Fig. 1.11. Because the z-transform is related to the Fourier transform through the transformation z = e^{jω}, the Fourier transform exists on the unit circle depicted in Fig. 1.11.
The region of convergence (ROC) is defined to be the set of values on the com-
plex z-plane where the z-transform converges.
Some examples of using z-transforms for determining system functions corresponding to discrete time signals are given below.

(i) x(n) = δ(n − n_d) ⟹ X(z) = Σ_{n=−∞}^{∞} δ(n − n_d) z^{−n} = z^{−n_d}, where the ROC is the entire z-plane (except possibly z = 0 or z = ∞).

(ii) x(n) = a^n u(n) ⟹ X(z) = Σ_{n=0}^{∞} a^n z^{−n} = Σ_{n=0}^{∞} (a z^{−1})^n = 1/(1 − a z^{−1}), where the ROC is the region |z| > |a| exterior to the dotted circle in Fig. 1.12 (where |a| = 0.65 and the ROC includes the unit circle).

(iii) x(n) = (1/3)^n for n ≥ 0 and x(n) = 2^n for n < 0:

X(z) = Σ_{n=−∞}^{−1} (2 z^{−1})^n + Σ_{n=0}^{∞} ((1/3) z^{−1})^n
     = Σ_{n=1}^{∞} ((1/2) z)^n + Σ_{n=0}^{∞} ((1/3) z^{−1})^n
     = ((1/2) z) / (1 − (1/2) z) + 1 / (1 − (1/3) z^{−1})

where the ROC is the intersection of the regions in the complex z-plane defined by |z| < 2 and |z| > 1/3, i.e., 1/3 < |z| < 2.

Fig. 1.11. The complex z-plane.

Fig. 1.12. The ROC in the complex z-plane for the sequence x(n) = a^n u(n) with a = 0.65.
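Example (ii) can be checked numerically: inside the ROC the power series converges to the closed form. The sketch below (Python/NumPy for illustration; the evaluation point z is arbitrary, chosen with |z| > |a|) truncates the series and compares:

```python
import numpy as np

# For x(n) = a^n u(n), X(z) = 1/(1 - a z^{-1}) when |z| > |a| (the ROC).
a = 0.65
z = 1.2 * np.exp(1j * 0.3)                 # an arbitrary point inside the ROC
n = np.arange(400)

partial = np.sum(a**n / z**n)              # truncated power series sum
closed = 1.0 / (1.0 - a / z)               # closed form 1/(1 - a z^{-1})
print(np.allclose(partial, closed))
```

Outside the ROC (|z| < |a|) the terms grow geometrically and the same sum diverges, which is precisely why the region of convergence must accompany the algebraic expression.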
Again, several properties of the z-transform, the determination of a time domain
signal from the z-transform using various techniques (e.g., residue theorem), and
theory behind the z-transform can be found in several texts including [2] and [4].

1.4 Sampling and Reconstruction


Up to this point the signals were assumed to be discrete time. However, audio signals
that are to be delivered to the loudspeaker have to be converted to the analog coun-
terpart through a digital-to-analog converter, whereas the signals measured through
the microphone need to be converted to discrete time for DSP processing using an
analog-to-digital converter. Accordingly, a fundamental process used for convert-
ing an analog signal to its digital counterpart is called sampling, whereas the basic
process for converting a discrete time signal to its analog counterpart is called recon-
struction.

1.4.1 Ideal Sampling

A continuous time to discrete time conversion is achieved through ideal sampling, where a periodic pulse train (shown in Fig. 1.13),

p(t) = Σ_{k=−∞}^{∞} δ(t − kTs)                                (1.35)

of sampling period Ts = 1/fs is multiplied with a continuous time signal x(t) to obtain the sampled version xs(t), given as

xs(t) = x(t) p(t) = x(t) Σ_{k=−∞}^{∞} δ(t − kTs) = Σ_{k=−∞}^{∞} x(kTs) δ(t − kTs)      (1.36)

Based on the properties of the Fourier transform (viz., multiplication in the time domain is equivalent to convolution in the frequency domain), the continuous time frequency response, Xs(jΩ), of the sampled signal xs(t) can be expressed as

Xs(jΩ) = (1/2π) X(jΩ) ⊗ P(jΩ)                                 (1.37)
       = (1/Ts) Σ_{k=−∞}^{∞} X(jΩ − jkΩs)

where Ω = 2πf is the continuous time angular frequency in rad/s, Ωs = 2πfs (fs = 1/Ts), and P(jΩ) = (2π/Ts) Σ_{k=−∞}^{∞} δ(Ω − kΩs).
Thus, (1.37) represents a periodicity in the frequency domain upon sampling, as shown in Fig. 1.14 for a bandlimited signal x(t) with a limiting or cutoff frequency of Ωc.

Fig. 1.13. The periodic pulse train of period Ts = 125 µs.

Fig. 1.14. (a) Fourier transform of a bandlimited signal x(t) with limiting frequency Ωc; (b) periodicity of the Fourier transform of the signal x(t) upon ideal sampling.

From the figure, it can be observed that the signal x(t) can be obtained by simply low-pass filtering the baseband spectrum of Xs(jΩ) with a cutoff frequency Ωc and inverse Fourier transforming the result. For recovering the signal x(t), as can be seen from Fig. 1.14(b), it is required that Ωs − Ωc > Ωc, or Ωs > 2Ωc, to prevent an aliased signal recovery. This condition is called the Nyquist condition, Ωc is called the Nyquist frequency, and 2Ωc is called the Nyquist rate.
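A short numerical sketch shows what goes wrong when the Nyquist condition is violated (Python/NumPy for illustration; the 8 kHz rate and 7 kHz tone are arbitrary choices):

```python
import numpy as np

# Aliasing: with fs = 8 kHz (Nyquist frequency 4 kHz), a 7 kHz sine violates
# the Nyquist condition and its samples coincide with those of a 1 kHz sine
# of opposite sign, so the two tones are indistinguishable after sampling.
fs = 8000.0
n = np.arange(64)
x_7k = np.sin(2 * np.pi * 7000.0 * n / fs)   # under-sampled 7 kHz tone
x_1k = np.sin(2 * np.pi * 1000.0 * n / fs)   # 1 kHz tone at the same instants

print(np.allclose(x_7k, -x_1k))              # 7 kHz aliases onto 1 kHz
```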
Subsequently, the frequency response of the discrete time signal can be obtained from the sampled signal (1.36) by using the continuous time Fourier transform relation,2

Xs(jΩ) = Σ_{k=−∞}^{∞} x(kTs) e^{−jkTsΩ}                       (1.38)

With x(n) = x(t)|_{t=nTs} = x(nTs), the discrete time Fourier transform is

X(e^{jω}) = Σ_{n=−∞}^{∞} x(n) e^{−jωn} = Σ_{n=−∞}^{∞} x(nTs) e^{−jωn}      (1.39)

Comparing (1.38) and (1.39), it can be seen that

Xs(jΩ) = X(e^{jω})|_{ω=ΩTs}                                   (1.40)

1.4.2 Reconstruction of Continuous Time Signals from Discrete Time Sequences

Reconstruction of bandlimited signals can be done by appropriately filtering the discrete time signal by means of a low-pass filter (e.g., as shown in Fig. 1.14(b)). This

2 The continuous time forward and inverse Fourier transforms are X(jΩ) = ∫_{−∞}^{∞} x(t) e^{−jΩt} dt and x(t) = (1/2π) ∫_{−∞}^{∞} X(jΩ) e^{jΩt} dΩ, respectively.

Fig. 1.15. The sinc interpolation filter for Ts = 1.

is mathematically described by

 ∞

xr (t) = x(nTs )hr (t − nTs ) = x(n)hr (t − nTs ) (1.41)
n=−∞ n=−∞

By selecting hr (t) to be an ideal low-pass filter with a response of hr (t) =


sin(πt/Ts )/(πt/Ts ),3 (1.41) can be written as

 sin(π(t − nTs )/Ts )
xr (t) = x(n) (1.42)
n=−∞
π(t − nTs )/Ts
∞
sin(π(m − n))
xr (mTs ) = x(n) = x(m) = x(mTs )
n=−∞
π(m − n)

because the sinc function is unity at time index zero and is zero at other discrete
time indices as shown in Fig. 1.15 for Ts = 1. At other noninteger time values, the
sinc filter acts as an interpolator by performing interpolation between the impulses
of xs (t) to form the continuous time signal xr (t).
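The interpolation formula (1.42) can be sketched directly. The snippet below (Python/NumPy for illustration; the tone frequency and grid length are arbitrary, and the infinite sum is necessarily truncated) reconstructs a bandlimited sine from its samples with Ts = 1:

```python
import numpy as np

# Sinc interpolation (1.42): reconstruct a bandlimited sine from its samples.
f0 = 0.1                                   # cycles/sample, well below Nyquist (0.5)
n = np.arange(-200, 201)                   # sample grid (truncated)
x = np.sin(2 * np.pi * f0 * n)             # samples x(n)

def reconstruct(t):
    # truncated form of (1.42); np.sinc(u) = sin(pi u)/(pi u)
    return np.sum(x * np.sinc(t - n))

# At the sample instants the reconstruction is exact, because sinc(m - n)
# vanishes at every nonzero integer offset
print(np.allclose([reconstruct(m) for m in range(-5, 6)],
                  np.sin(2 * np.pi * f0 * np.arange(-5, 6))))
# Between samples the truncated sum is a close (not exact) interpolant
print(abs(reconstruct(0.5) - np.sin(2 * np.pi * f0 * 0.5)) < 0.05)
```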

1.4.3 Sampling Rate Reduction by an Integer Factor

As shown in the previous section, a discrete time sequence can be obtained by sam-
pling a continuous time signal x(t) with a sampling frequency fs = 1/Ts , and the
subsequent sequence can be expressed as x(n) = x(t)|t=nTs = x(nTs ). In many
situations, it is necessary to reduce the sampling rate or frequency by an integer
3
The function sin(πx)/(πx) is referred to as the sinc function.

Fig. 1.16. (a) x(t) having response X(jΩ) being bandlimited such that −π/D < ΩTs < π/D; (b) X(e^{jω}); (c) Xd(e^{jω}) with D = 2.

amount.4 Thus, in order to reduce the sampling rate by an amount D, the discrete time sequence is obtained by using a period Ts′ such that Ts′ = DTs ⇒ fs′ = fs/D, or xd(n) = x(nTs′) = x(nDTs). The signal xd(n) is called a downsampled or decimated signal, which is obtained from x(n) by reducing the sampling rate by a factor of D.
In order for xd(n) to be free of aliasing error, the continuous time signal x(t) shown in Fig. 1.16(a), from which x(n) is obtained, must be bandlimited a priori such that −π/D < ΩTs < π/D, or the original sampling rate should be at least D times the Nyquist rate.
The Fourier expression for the decimated signal xd(n) is

Xd(e^{jω}) = (1/(DTs)) Σ_{k=−∞}^{∞} X( j(ω/(DTs)) − j(2πk/(DTs)) )           (1.43)

In the time domain, xd(n) is obtained by retaining every Dth sample of x(n) (i.e., discarding D − 1 of every D samples). The block diagram for performing decimation is shown in Fig. 1.17, where the low-pass filter H(e^{jω}) (also called the anti-aliasing filter) is used for bandlimiting the signal x(n) such that −π/D < ω < π/D, and the arrow indicates decimation.

4
In audio applications, there are several sampling frequencies in use, including 32 kHz, 44.1
kHz, 64 kHz, 48 kHz, 96 kHz, 128 kHz, and 192 kHz, and in many instances it is required
that the sampling rate be reduced by an integer amount, such as from 96 kHz to 48 kHz.
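Decimation can be sketched in a few lines (Python/NumPy for illustration). The tone below is an arbitrary choice already bandlimited to |ω| < π/D, so the anti-aliasing filter of Fig. 1.17 is not needed and decimation reduces to keeping every Dth sample:

```python
import numpy as np

# Decimation by D = 2 of a tone already bandlimited below pi/D.
D = 2
w0 = 0.2 * np.pi                       # below pi/D, so no aliasing occurs
n = np.arange(100)
x = np.sin(w0 * n)                     # x(n) at the original rate

xd = x[::D]                            # xd(n) = x(nD): keep every Dth sample
m = np.arange(len(xd))
print(np.allclose(xd, np.sin(w0 * D * m)))   # the tone appears at w0*D
```

If w0 exceeded π/D, the same slicing would fold the tone back into the baseband, which is exactly what the anti-aliasing filter prevents.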

1.4.4 Increasing the Sampling Rate by an Integer Factor

In this case, the sampling rate increase is reflected by altering the sampling period such that Ts′ = Ts/L ⇒ fs′ = Lfs, or xi(n) = x(nTs′) = x(nTs/L) = x(n/L), n = 0, ±L, ±2L, . . . . The signal xi(n) is called an interpolated signal and is obtained from x(n) by increasing the sampling rate by a factor of L. To obtain the interpolated signal xi(n), the first step involves an expander stage [5, 6] that generates a signal xe(n) such that

xe(n) = x(n/L),  n = 0, ±L, ±2L, . . .
      = 0,       otherwise

xe(n) = Σ_{k=−∞}^{∞} x(k) δ(n − kL)                           (1.44)

The Fourier transform of the expander output can be expressed as

Xe(e^{jω}) = Σ_{n=−∞}^{∞} xe(n) e^{−jωn}
           = Σ_{n=−∞}^{∞} Σ_{k=−∞}^{∞} x(k) δ(n − kL) e^{−jωn}
           = Σ_{k=−∞}^{∞} x(k) e^{−jωkL}
           = X(e^{jωL})                                       (1.45)
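The identity Xe(e^{jω}) = X(e^{jωL}) has a clean DFT counterpart: zero-stuffing by L makes the DFT of the expanded sequence equal to the original DFT tiled L times. A sketch (Python/NumPy for illustration; the sequence values are arbitrary):

```python
import numpy as np

# DFT check of (1.45): the expander compresses the spectrum, so the DFT of
# the zero-stuffed sequence is the original DFT repeated L times.
L = 3
x = np.array([1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0])

xe = np.zeros(L * len(x))
xe[::L] = x                            # expander: insert L-1 zeros per sample

print(np.allclose(np.fft.fft(xe), np.tile(np.fft.fft(x), L)))
```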

Instead of leaving zero valued samples between the original samples, a better approach is to follow the expander with an interpolation stage using an ideal low-pass filter hi(n) = sin(πn/L)/(πn/L), bandlimited between −π/L and π/L, so that interpolated values are obtained for the intervening L − 1 samples. Thus, (1.44) becomes

xi(n) = Σ_{k=−∞}^{∞} x(k) sin(π(n − kL)/L) / (π(n − kL)/L)    (1.46)

Fig. 1.17. System for performing decimation.

Fig. 1.18. (a) x(t) having response X(jΩ); (b) X(e^{jω}); (c) extraction of the baseband expanded and interpolated spectrum of X(e^{jω}) with L = 2.

Fig. 1.18 shows the spectrum of a bandlimited continuous time signal x(t) along with the expanded signal spectrum and the interpolated spectrum. As is evident from Fig. 1.18(c), the expander introduces L − 1 copies of the continuous time spectrum between −π and π. Subsequently, an ideal low-pass interpolation filter, having a cutoff frequency of π/L and a gain of L (shown by dotted lines in Fig. 1.18(c)), extracts the baseband discrete time interpolated spectrum of Xe(e^{jω}).
A block diagram employing an L-fold expander (depicted by an upwards arrow) and the interpolation filter Hi(e^{jω}) is shown in Fig. 1.19.

Fig. 1.19. System for performing interpolation.

1.4.5 Resampling for Audio Applications


Combining the decimation and interpolation processes, it is possible to obtain the
various sampling rates used for audio processing. For example, if the original sam-
pling rate that was used to convert the continuous time signal x(t) to a discrete time
sequence x(n) was 48 kHz, and if it is required to generate an audio signal at 44.1
kHz (i.e., compact disc or CD-rate), then the block diagram of Fig. 1.20 may be used
to generate the 44.1 kHz resampled audio signal. The decimation factor D = 160
and interpolation factor L = 147 are used.5
5
This ratio can be obtained, for example, using the [N,D] = rat(X,tol) function in MATLAB,
where X = 44100/48000 and tol is the tolerance for the approximation to determine the
numerator and denominator integers.

Fig. 1.20. System for performing resampling.

The function resample(x,A,B) in MATLAB (where A is the new sampling rate and B is the original sampling rate) also converts an audio signal x(n) between different rates.
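The L and D factors follow from reducing the rate ratio to lowest terms. In Python this mirrors the MATLAB rat() call mentioned above (standard library only):

```python
from fractions import Fraction

# Reduce the 48 kHz -> 44.1 kHz rate ratio to lowest terms to obtain the
# interpolation and decimation factors of Fig. 1.20.
ratio = Fraction(44100, 48000)
print(ratio)            # 147/160, i.e., interpolate by L = 147, decimate by D = 160
```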

1.5 Discrete Fourier Transform


The forward (analysis) and inverse (synthesis) discrete Fourier transform (DFT) of a
finite duration signal x(n), of length N , are expressed as


X(k) = x(n)e(−jk2πn/N ) : Analysis (1.47)
n=−∞
∞
1
x(n) = X(k)e(jk2πn/N ) : Synthesis (1.48)
N n=−∞

The relation between the DFT and the discrete time Fourier transform (DTFT) is

X(ejω )|ω=2πk/N = X(k), k = 0, . . . , N − 1 (1.49)

Equation (1.49) basically states that the DFT is obtained by uniformly, or equally,
sampling the DTFT (i.e., uniform sampling along the unit circle in the complex z-
plane).
An important property of the DFT is the circular shift of an aperiodic signal, where any delay of the signal constitutes a circular shift. The relation between the DFT and the m-sample circularly shifted sequence is

x((n − m))_N = x((n − m) modulo N) ↔ e^{−jk(2π/N)m} X(k)      (1.50)



Fig. 1.21. Example of modulo shifting with m = 2.

An example of modulo shifting operation for m = −2 for the sequence x(n) is


shown in Fig. 1.21.
The N-point circular convolution of two finite length sequences, x1(n) and x2(n), each of length N, is expressed as

x3(n) = x1(n) ⊛_N x2(n) = Σ_{m=0}^{N−1} x1(m) x2((n − m))_N,  n = 0, . . . , N − 1
x3(n) ↔ X3(k) = X1(k) X2(k)                                   (1.51)

By simply considering each of the N length sequences, x1(n) and x2(n), as 2N length sequences (i.e., by appending N zeros to each sequence), the 2N-point circular convolution of the augmented sequences is identical to the linear convolution of (1.8).
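The zero-padding argument can be verified directly with the DFT, using X3(k) = X1(k) X2(k) from (1.51) (Python/NumPy for illustration; the two length-5 sequences are arbitrary):

```python
import numpy as np

# Appending N zeros makes 2N-point circular convolution (computed here in
# the DFT domain) identical to linear convolution.
x1 = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
x2 = np.array([0.5, -1.0, 2.0, 1.0, -0.5])
N = len(x1)

circ = np.fft.ifft(np.fft.fft(x1, 2 * N) * np.fft.fft(x2, 2 * N)).real
lin = np.convolve(x1, x2)              # linear convolution, length 2N - 1

print(np.allclose(circ[:2 * N - 1], lin))   # first 2N-1 samples coincide
print(np.isclose(circ[-1], 0.0))            # the padded sample is zero
```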
Again, properties of the DFT can be found in several texts including [2].

1.6 Bilinear Transform

The bilinear transform is used to convert between the continuous time and discrete time frequency variables. If Hc(s) and H(z) are the continuous time and discrete time frequency responses (where s = σ + jΩ and z = e^{jω}), the bilinear transform between them can be expressed as

s = (2/Td) (1 − z^{−1}) / (1 + z^{−1})
z = (1 + (Td/2)s) / (1 − (Td/2)s)
H(z) = Hc( (2/Td) (1 − z^{−1}) / (1 + z^{−1}) )               (1.52)

where Td is a sampling parameter representing a numerical integration step size. If σ < 0, then |z| < 1 for any value of Ω, and if σ > 0, then |z| > 1. Thus, stable poles of Hc(s) in the left half of the complex s-plane are mapped inside the unit circle in the complex z-plane.
After some simplification, the relation between the continuous time frequency variable, Ω, and the discrete time frequency variable, ω, as determined through the bilinear transform, is

Ω = (2/Td) tan(ω/2)
ω = 2 arctan(Ω Td/2)                                          (1.53)
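The frequency warping of (1.53) can be checked with a small sketch (Python/SciPy for illustration; the one-pole analog prototype and the 8 kHz rate are arbitrary choices):

```python
import numpy as np
from scipy.signal import bilinear, freqz

# Map the analog one-pole low-pass Hc(s) = Wc/(s + Wc) to H(z) via the
# bilinear transform and confirm that the -3 dB point lands exactly at the
# warped digital frequency w = 2 arctan(Wc Td / 2), per (1.53).
fs = 8000.0                         # sampling rate, so Td = 1/fs
Wc = 2 * np.pi * 1000.0             # analog cutoff in rad/s

b, a = bilinear([Wc], [1.0, Wc], fs)
wc = 2 * np.arctan(Wc / (2 * fs))   # warped digital cutoff from (1.53)

w, H = freqz(b, a, worN=[wc])
print(abs(H[0]))                    # 1/sqrt(2): -3 dB exactly at wc
```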

1.7 Summary
In this chapter we have presented the fundamental prerequisites in digital signal pro-
cessing such as convolution, sampling theory, basics of linear and time-invariant
(LTI) systems, minimum-phase and all-pass systems, sampling and reconstruction of
signals, discrete time Fourier transform (DTFT), discrete Fourier transform (DFT),
z-transform, bilinear transform, and linear-phase finite impulse response (FIR) filters.
2
Filter Design for Audio Applications

In this chapter we present a summary of various approaches for finite impulse re-
sponse (FIR) and infinite duration impulse response (IIR) filter designs.

2.1 Filter Design Process


A typical filter design approach includes the following steps.
• Specify a desired response Hd (ejω ) (including magnitude and/or phase speci-
fication).
• Select an FIR (or IIR) model filter having frequency response H(ejω ) for mod-
eling the desired response.
• Establish a weighted or unweighted (frequency domain or time domain) ap-
proximation error criterion for comparing Hd (ejω ) with H(ejω ).
• Minimize the error criterion by optimizing the model filter parameters.
• Analyze the model filter performance (error criterion, computational complex-
ity, etc.).

2.1.1 Desired Response Specification

The desired response can be specified in the frequency domain (viz., Hd(e^{jω})) or in the time domain (e.g., hd(n) = δ(n − nd)). For example, a low-pass filter specification is

Hd(e^{jω}) = 1,  ω ∈ [0, ωc]
           = 0,  ω ∈ [ωs, π]                                  (2.1)

The domains [0, ωc], (ωc, ωs), and [ωs, π] are called the pass-band, transition-band, and stop-band, respectively, and are specified by their tolerance parameters. Examples of tolerance parameters include the allowable ripples δp and δs, which describe the pass-band amplitude variation Ap and stop-band attenuation As:

Ap = 20 log10 [(1 + δp)/(1 − δp)]  (dB)
As = −20 log10 (δs)  (dB)                                     (2.2)
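As a worked example of (2.2), with the illustrative tolerances δp = 0.01 and δs = 0.001:

```python
import numpy as np

# Convert the ripple tolerances of (2.2) to dB.
dp, ds = 0.01, 0.001
Ap = 20 * np.log10((1 + dp) / (1 - dp))   # pass-band ripple in dB
As = -20 * np.log10(ds)                   # stop-band attenuation in dB
print(round(Ap, 4), As)                   # about 0.1737 dB and 60.0 dB
```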

Alternatively, if the signal waveform needs to be preserved, then the phase response of the desired response is specified with a linearity constraint,

φ(e^{jω}) = −τ0 ω + τ1                                        (2.3)

where τ0 and τ1 are constants.

2.1.2 Approximating Error Function

The specifications (2.1) and (2.2) of the low-pass filter, for example, can also be written in terms of a frequency weighted approximation such as

−δp ≤ W(e^{jω}) (|H(e^{jω})| − Hd(e^{jω})) ≤ δp,  ω ∈ Xp
W(e^{jω}) |H(e^{jω})| ≤ δs,                        ω ∈ Xs     (2.4)

where the accuracy of the amplitude of the selected filter, H(e^{jω}), in the pass-band domain, Xp, and stop-band domain, Xs, is controlled by the frequency weighting function, W(e^{jω}).
Thus, according to (2.4), the frequency weighted approximating error function, E(e^{jω}), can be written as E(e^{jω}) = W(e^{jω})(|H(e^{jω})| − Hd(e^{jω})), with Hd(e^{jω}) = 0 on the stop-band domain Xs.
Other widely used error criteria are:
• Minimax error, where ε, the maximum error in a particular frequency band, is minimized. Specifically, ε = max_{ω∈X} |E(e^{jω})|.
• Minimization of the Lp norm, where the quantity Jp = ∫_{ω∈X} |E(e^{jω})|^p dω is minimized for p > 0. When p → ∞, the solution that minimizes the integral approaches the minimax solution. The classic case is the L2 norm, where p = 2.
• Maximally flat approximation, which is obtained by means of a Taylor series expansion of the desired response at a particular frequency point.
• Combination of any of the above approximating schemes.

2.2 FIR Filter Design

There are many advantages of using FIR filters (over their IIR counterparts) which
include [7] linear-phase constraint design, computationally efficient realizations, sta-
ble designs free of limit cycle oscillations when implemented on finite-wordlength
systems, arbitrary specification-based designs, low output noise due to multiplica-
tion roundoff errors, and low sensitivity to variations in the filter coefficients. The
disadvantages include a larger length filter for extremely narrow or stringent transition bands, which increases the computational requirements, although these can be minimized through fast convolution algorithms and multiplier-efficient realizations.

2.2.1 Linear Phase Filter Design

There are four types of causal linear phase responses of finite duration, or finite impulse response (FIR).
1. Type 1 linear phase filter of length M + 1 (M even, constant group delay M/2, β = {0, π}), having finite duration response h(n) = h(M − n) and frequency response H(e^{jω}) = e^{−jωM/2} Σ_{k=0}^{M/2} a_k cos(kω), with a0 = h(M/2) and a_k = 2h((M/2) − k), 1 ≤ k ≤ M/2. Type 1 filters are used to design low-pass, high-pass, and band-pass filters.
2. Type 2 linear phase filter of length M + 1 (M odd, a delay M/2 corresponding to an integer plus one-half, β = {0, π}), having finite duration response h(n) = h(M − n) and frequency response H(e^{jω}) = e^{−jωM/2} Σ_{k=1}^{(M+1)/2} b_k cos(ω(k − 1/2)), with b_k = 2h((M + 1)/2 − k), 1 ≤ k ≤ (M + 1)/2. Type 2 filters have a zero at z = −1 (i.e., ω = π) and hence cannot be used for designing high-pass filters.
3. Type 3 linear phase filter of length M + 1 (M even, a delay M/2, β = {π/2, 3π/2}), having finite duration response h(n) = −h(M − n) and frequency response H(e^{jω}) = j e^{−jωM/2} Σ_{k=1}^{M/2} c_k sin(kω), with c_k = 2h((M/2) − k), 1 ≤ k ≤ M/2. Type 3 filters have zeros at z = 1 and z = −1 and hence cannot be used for designing a low-pass or a high-pass filter.
4. Type 4 linear phase filter of length M + 1 (M odd, the delay M/2 being an integer plus one-half, β = {π/2, 3π/2}), having finite duration response h(n) = −h(M − n) and frequency response H(e^{jω}) = j e^{−jωM/2} Σ_{k=1}^{(M+1)/2} d_k sin(ω(k − 1/2)), with d_k = 2h((M + 1)/2 − k), 1 ≤ k ≤ (M + 1)/2. Type 4 filters have a zero at z = 1 and hence cannot be used in the design of a low-pass filter.
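The defining property of a type 1 filter can be checked numerically: since H(e^{jω}) = e^{−jωM/2} times a real amplitude, multiplying H(e^{jω}) by e^{jωM/2} must leave a purely real function. A sketch with arbitrary symmetric coefficients (Python/NumPy for illustration):

```python
import numpy as np

# Type 1 linear phase check: h(n) = h(M - n) with M even implies that
# e^{jwM/2} H(e^{jw}) is purely real for all w.
h = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0])   # symmetric, M = 8
M = len(h) - 1

Nfft = 512
H = np.fft.fft(h, Nfft)                       # H(e^{jw}) on a dense grid
w = 2 * np.pi * np.arange(Nfft) / Nfft

print(np.allclose((H * np.exp(1j * w * M / 2)).imag, 0.0, atol=1e-9))
```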
For simplicity of notation in subsequent sections, the general linear-phase filter frequency response can be described in the following functional form,

H(e^{jω}) = Σ_n t_n ψ(ω, n)                                   (2.5)

where the trigonometric function ψ(·, ·) is a symbolic description for the sin or cos term in the four types of linear phase filters described above.
The design of linear-phase FIR filters, depending on the zero locations of these
filters, is shown in [2]. As in the case of the decomposition of a general transfer func-
tion into minimum-phase and all-pass components, any linear-phase system function
can also be decomposed into a product of three terms comprising: (i) a minimum-
phase function, (ii) a maximum-phase function (where all of the poles and zeros have
magnitude strictly greater than unity), and (iii) a function comprising zeros having
strictly unit magnitude.

2.2.2 Least Squares FIR Filter Design

The least squares FIR filter design approximation criterion is given as

J2 = ∫_{ω∈X} E^2(e^{jω}) dω = ∫_{ω∈X} [W(e^{jω})(|H(e^{jω})| − Hd(e^{jω}))]^2 dω      (2.6)

For a discrete frequency representation {ωk : k = 1, . . . , K}, (2.6) can be recast as

J2 = Σ_{k=1}^{K} [W(e^{jωk})(|H(e^{jωk})| − Hd(e^{jωk}))]^2                            (2.7)

In order to design a linear-phase FIR filter model to approximate Hd(e^{jωk}) ∀k, as given in the previous section, the generalized functional representation, H(e^{jωk}) = Σ_{n=0}^{M} t_n ψ(ωk, n), is used. Thus,

J2 = Σ_{k=1}^{K} [ W(e^{jωk}) ( Σ_{n=0}^{M} t_n ψ(ωk, n) − Hd(e^{jωk}) ) ]^2           (2.8)

which can be expressed in matrix-vector notation as

J2 = e^T e                                                                             (2.9)

where

e = Xt − d                                                                             (2.10)

and the matrix X ∈ R^{K×(M+1)} is given by

X = [ W(ω1)ψ(ω1, 0)  W(ω1)ψ(ω1, 1)  . . .  W(ω1)ψ(ω1, M);
      W(ω2)ψ(ω2, 0)  W(ω2)ψ(ω2, 1)  . . .  W(ω2)ψ(ω2, M);
      . . .
      W(ωK)ψ(ωK, 0)  W(ωK)ψ(ωK, 1)  . . .  W(ωK)ψ(ωK, M) ]                             (2.11)

The vectors t ∈ R^{(M+1)×1} and d ∈ R^{K×1} are

t = (t0, t1, . . . , tM)^T
d = (W(ω1)Hd(e^{jω1}), . . . , W(ωK)Hd(e^{jωK}))^T                                     (2.12)

The least squares optimal solution is then given by

t = (X^T X)^{−1} X^T d                                                                 (2.13)
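A minimal sketch of this procedure for a type 1 low-pass design, using the basis ψ(ω, n) = cos(nω) and solving (2.13) with NumPy's least squares routine (the order, band edges, grid, and W(e^{jω}) = 1 are all illustrative assumptions):

```python
import numpy as np

# Least squares design of a small type 1 low-pass filter per (2.6)-(2.13).
M = 20                                      # filter order (M even, length M+1)
wp, ws = 0.2 * np.pi, 0.3 * np.pi           # pass-band / stop-band edges
wk = np.concatenate([np.linspace(0, wp, 80),        # pass-band grid
                     np.linspace(ws, np.pi, 120)])  # stop-band grid
Hd = np.where(wk <= wp, 1.0, 0.0)           # desired response on the grid

X = np.cos(np.outer(wk, np.arange(M // 2 + 1)))     # X[k, n] = cos(n wk), W = 1
t, *_ = np.linalg.lstsq(X, Hd, rcond=None)          # normal-equation solution (2.13)

err = X @ t - Hd                            # residual e = Xt - d of (2.10)
print(np.sum(err**2) < np.sum(Hd**2))       # J2 reduced relative to t = 0
print(abs(np.sum(t) - 1.0) < 0.15)          # amplitude at DC close to 1
```

The amplitude at DC is simply Σ t_n, since cos(n·0) = 1 for every basis term.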

2.2.3 FIR Windows for Filter Design

In many instances the FIR (or IIR) filters designed can be of very large duration, which increases the computational requirements for implementing such filters beyond what is typically available in real-time DSP devices. One approach is then to limit the duration of the filter without significantly affecting its resulting performance. There are several windowing functions that limit the signal duration and achieve a tradeoff between the main-lobe width and side-lobe amplitudes.

Fig. 2.1. (a) Impulse response of a rectangular window function hr (n) with N = 10; (b)
magnitude response of the rectangular window.

A direct truncation of a signal x(n) with a rectangular window hr(n) gives a shortened duration signal xr(n),

xr(n) = hr(n) x(n)
hr(n) = 1,  −N ≤ n ≤ N
      = 0,  |n| > N                                           (2.14)

The frequency response of the rectangular window is given by Hr(e^{jω}) = sin[(2N + 1)ω/2] / sin(ω/2). The time domain response and the magnitude response of the rectangular window are shown in Fig. 2.1.
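The closed form for Hr(e^{jω}) (the Dirichlet kernel) can be checked against the defining sum (Python/NumPy for illustration; the evaluation frequency is an arbitrary nonzero point):

```python
import numpy as np

# Verify Hr(e^{jw}) = sin((2N+1)w/2)/sin(w/2) against the sum over n = -N..N.
N = 10
w = 0.37
n = np.arange(-N, N + 1)

direct = np.sum(np.exp(-1j * w * n))                 # sum definition (real-valued)
closed = np.sin((2 * N + 1) * w / 2) / np.sin(w / 2)
print(np.allclose(direct, closed))
```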
The Bartlett window time domain and frequency responses are given by hBt(n) = 1 − |n|/(N + 1) and HBt(e^{jω}) = (1/(N + 1)) [sin((N + 1)ω/2) / sin(ω/2)]^2, and are shown in Fig. 2.2.
The Hann window is given by hHn(n) = 0.5(1 + cos(2πn/(2N + 1))) and its frequency response, HHn(e^{jω}), is expressed in relation to the frequency response of the rectangular window, Hr(e^{jω}), as

HHn(e^{jω}) = 0.5 Hr(e^{jω}) + 0.25 Hr(e^{j(ω−2π/(2N+1))}) + 0.25 Hr(e^{j(ω+2π/(2N+1))})

The time domain and magnitude response for this window are shown in Fig. 2.3.
Fig. 2.2. (a) Impulse response of a Bartlett window function with N = 10; (b) magnitude response of the Bartlett window.

Fig. 2.3. (a) Impulse response of a Hann window function with N = 10; (b) magnitude response of the Hann window.

Fig. 2.4. (a) Impulse response of a Hamming window function with N = 10; (b) magnitude response of the Hamming window.

The Hamming window is given by hHm(n) = 0.54(1 + 0.8519 cos(2πn/(2N + 1))) and its frequency response, HHm(e^{jω}), is expressed in relation to the frequency response of the rectangular window, Hr(e^{jω}), as HHm(e^{jω}) = 0.54 Hr(e^{jω}) + 0.23 Hr(e^{j(ω−2π/(2N+1))}) + 0.23 Hr(e^{j(ω+2π/(2N+1))}). The time domain and magnitude response for this window are shown in Fig. 2.4.
The Blackman window is given by hBl(n) = 0.42 + 0.5 cos(2πn/(2N + 1)) + 0.08 cos(4πn/(2N + 1)) and its frequency response, HBl(e^{jω}), is expressed in relation to the frequency response of the rectangular window, Hr(e^{jω}), as

HBl(e^{jω}) = 0.42 Hr(e^{jω}) + 0.25 Hr(e^{j(ω−2π/(2N+1))}) + 0.25 Hr(e^{j(ω+2π/(2N+1))}) + 0.04 Hr(e^{j(ω−4π/(2N+1))}) + 0.04 Hr(e^{j(ω+4π/(2N+1))})

The time domain and magnitude response for this window are shown in Fig. 2.5.
The time domain and magnitude response for this window are shown in Fig. 2.5.
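As a quick sketch of the window formulas above (Python/NumPy; the variable names are ours), each window can be evaluated directly on the symmetric support n ∈ [−N, N]:

```python
import numpy as np

N = 10
n = np.arange(-N, N + 1)          # symmetric support, 2N + 1 samples
L = 2 * N + 1

hann = 0.5 * (1 + np.cos(2 * np.pi * n / L))
hamming = 0.54 * (1 + 0.8519 * np.cos(2 * np.pi * n / L))
blackman = (0.42 + 0.5 * np.cos(2 * np.pi * n / L)
            + 0.08 * np.cos(4 * np.pi * n / L))
```

Each window is (approximately) unity at n = 0 and tapers smoothly toward the end points, which is what suppresses the side lobes relative to the rectangular window.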
The Kaiser window is given by hK = I0 [β(1 − [(n − α)/α]² )^{0.5} ]/I0 (β), 0 ≤
n ≤ N , with α = N/2, where I0 represents the zeroth-order modified Bessel function of the
first kind. The main lobe width and side-lobe levels can be adjusted by varying the length
(N + 1) and β. The parameter, β, performs a tapering operation with high values
of β achieving a sharp taper. In the extreme case, where β = 0, the Kaiser window
becomes the rectangular window hr (n).

Fig. 2.5. (a) Impulse response of a Blackman window function with N = 10; (b) magnitude
response of the Blackman window.

Fig. 2.6. (a) The effect of β on the shape of the Kaiser window; (b) the magnitude response of
the Kaiser window for the various values of β.

Fig. 2.7. The effect of N on the magnitude response of the Kaiser window with β = 6.

Figure 2.6(a) shows the Kaiser window response for various values of β, whereas Figure 2.6(b) shows the corresponding magnitude responses, which exhibit lower side-lobe levels but increasing main-lobe width as β increases in value. Fig. 2.7 shows the magnitude responses of the Kaiser window filter as a function of the filter or window length, N , with β = 6. As is evident, as the filter length increases, the main-lobe width as well as the side-lobe amplitude decreases, which is advantageous. Kaiser determined empirically that to achieve a
specified stop-band ripple amplitude A = −20 log10 δs , the value of β can be set as

    ⎧ 0.1102(A − 8.7)                          A > 50
β = ⎨ 0.5842(A − 21)^{0.4} + 0.07886(A − 21)   21 ≤ A ≤ 50   (2.15)
    ⎩ 0                                        A < 21

with N = (A − 8)/(2.285∆ω) and ∆ω = ωs − ωp , where ωs and ωp are the stop-band and pass-band cutoff frequencies.
Other approaches to “static” FIR filter design (e.g., the Remez exchange algo-
rithm based on the alternation theorem) can be found in [7, 2].
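The empirical rules (2.15) can be applied directly; the following sketch (Python, with our own function names) computes β and the window length from a stop-band specification:

```python
import math

def kaiser_beta(A):
    """Empirical rule (2.15): shape parameter from stop-band attenuation A (dB)."""
    if A > 50:
        return 0.1102 * (A - 8.7)
    if A >= 21:
        return 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21)
    return 0.0

def kaiser_order(A, delta_omega):
    """N = (A - 8)/(2.285 * delta_omega), rounded up; delta_omega in rad/sample."""
    return math.ceil((A - 8) / (2.285 * delta_omega))

# Example: 60 dB stop-band ripple, transition width 0.2*pi rad/sample
beta = kaiser_beta(60.0)
N = kaiser_order(60.0, 0.2 * math.pi)
```

For A = 60 dB this gives β ≈ 5.65 and a window length of N + 1 = 38 taps.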

2.2.4 Adaptive FIR Filters

Adaptive filtering is found widely in applications involving radar, sonar, acoustic sig-
nal processing (noise and echo cancellation, source localization, etc.), speech com-
pression, and others. The advantage of using adaptive filters is their ability to track
information from a nonstationary environment in real-time by optimization of its in-
ternal parameters. Haykin [8] provides a number of references to various types of
adaptive filters and their applications. Of them, the most popular is the least mean
square (LMS) algorithm of Widrow and Hoff [13] which is explained in this section.
An adaptive FIR filter (also called the transversal or tapped delay line filter) struc-
ture, shown in Fig. 2.8, differs from a fixed FIR filter in that the filter coefficients
Wk = (w0 (k), w1 (k), . . . , wN −1 (k))T are varied with time as a function of the

Fig. 2.8. An adaptive FIR filter structure.

filter inputs X(n) = [x(n), x(n − 1), . . . , x(n − N + 1)] and an approximation error
signal e(n) = d(n) − y(n). The filter coefficients are adapted such that the mean
square error Jm (n) = E{e²(n)} ≈ (1/m) Σ_{k=0}^{m−1} e²(n − k) (the operator E{·} is the
statistical expectation operator) is minimized. For the well-known LMS method of
[13], the instantaneous error, J1 (n), where m = 1, is minimized.
The LMS filter adaptation equations, for a complex input signal vector X(n) =
[x(n), x(n − 1), . . . , x(n − N + 1)], are expressed as
W(n) = W(n − 1) + µe(n − 1)X∗ (n − 1) (2.16)
where µ is the adaptation rate that controls the rate of convergence to the solution,
and the superscript ∗ denotes complex conjugation. Details on the convergence and
steady-state performance of adaptive filters (e.g., based on LMS, the recursive least
squares or RLS error criteria) can be found in various texts and articles including [8,
126]. Other variations include the filtered-X, frequency domain, and block adaptive
filters.
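For real-valued signals the conjugate in (2.16) drops out, and the LMS recursion can be sketched as follows (NumPy; the plant coefficients and step size are illustrative values, not from the text):

```python
import numpy as np

def lms(x, d, num_taps, mu):
    """Adapt FIR weights W so that y(n) = W^T X(n) tracks d(n); see (2.16).
    Returns the final weights and the error history."""
    w = np.zeros(num_taps)
    e_hist = np.zeros(len(x))
    for n in range(len(x)):
        # Tap-delay-line input X(n) = [x(n), x(n-1), ..., x(n-N+1)]
        xn = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)])
        e = d[n] - w @ xn
        w = w + mu * e * xn          # real-signal form of the update
        e_hist[n] = e
    return w, e_hist

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
h_true = np.array([0.8, -0.4, 0.2])          # "unknown" plant to identify
d = np.convolve(x, h_true)[: len(x)]          # desired (reference) signal
w, e = lms(x, d, num_taps=3, mu=0.01)
```

With white-noise input and no measurement noise, the weights converge to the plant coefficients; the step size µ trades convergence speed against steady-state misadjustment.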

2.3 IIR Filter Design


The infinite duration impulse response (IIR) digital filter, H(z), of numerator order
and denominator order M and N , respectively, has a transfer function that resembles

H(z) = (b0 + b1 z−1 + b2 z−2 + · · · + bM −1 z−(M −1) ) / (a0 + a1 z−1 + a2 z−2 + · · · + aN −1 z−(N −1) )

y(n) = (1/a0 ) [ Σ_{k=0}^{M −1} bk x(n − k) − Σ_{k=1}^{N −1} ak y(n − k) ]   (2.17)

Generally IIR filters can approximate specific frequency responses with shorter or-
ders than an FIR (especially where sharp and narrow transition bands are required),

but have associated problems, including (i) converging to a stable design, (ii) higher
computational complexity requiring nonlinear optimization to converge to a valid
solution, and (iii) numerical problems in computing the equivalent polynomial that
defines the transfer function when multiple-order poles are densely packed near the unit
circle. Furthermore, IIR filters cannot be designed to have linear phase, like FIR fil-
ters, but by cascading an all-pass filter the phase can be approximately linearized in
a particular band of the frequency response.
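The difference equation in (2.17) can be realized directly; the following sketch (plain Python, our own function name) assumes a0 ≠ 0:

```python
def iir_filter(b, a, x):
    """Direct-form realization of the IIR difference equation (2.17)."""
    y = []
    for n in range(len(x)):
        # Feed-forward (numerator) part: sum of b[k] * x(n - k)
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        # Feedback (denominator) part: subtract a[k] * y(n - k), k >= 1
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

# One-pole example: y(n) = x(n) + 0.5 y(n-1); an impulse in gives a
# geometric decay out, illustrating the infinite-duration response.
h = iir_filter([1.0], [1.0, -0.5], [1.0, 0.0, 0.0, 0.0])
```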

2.3.1 All-Pass Filters


A well-known class of IIR filters, introduced in Chapter 1, are the all-pass filters
having a transfer function of the form
b0 + b1 z −1 + b2 z −2 + · · · + bM −1 z −(M −1)
Hap (z) = (2.18)
bM −1 + bM −2 z −1 + · · · + b1 z −(M −2) + b0 z −(M −1)
where the coefficients of the numerator and denominator polynomial are reversed in
relation to each other.
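A quick numerical check of the all-pass property (NumPy; the denominator coefficients are an assumed example, not from the text): reversing the coefficients as in (2.18) yields |Hap (ejω )| = 1 at every frequency:

```python
import numpy as np

# Assumed example: a stable second-order denominator; the all-pass
# numerator is the same polynomial with its coefficients reversed.
a = np.array([1.0, -0.2, 0.5])   # a0, a1, a2
b = a[::-1]                      # b0, b1, b2 = a2, a1, a0

w = np.linspace(0, np.pi, 512)
z1 = np.exp(-1j * w)             # z^{-1} evaluated on the unit circle
B = b[0] + b[1] * z1 + b[2] * z1 ** 2
A = a[0] + a[1] * z1 + a[2] * z1 ** 2
H = B / A                        # all-pass frequency response
```

The magnitude is exactly unity everywhere; only the phase (and hence the group delay) varies with frequency, which is what makes all-pass sections useful for phase correction.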

2.3.2 Butterworth Filters


Butterworth (IIR) filters are widely used in audio systems as high-pass and low-pass
bass management filters. Bass management ensures that specific loudspeakers repro-
duce audio content, with minimal distortion, when supplied with signals in specific
frequency bands.1 The magnitude responses of the high-pass Butterworth IIR filter
employed in the satellite channel, and the low-pass Butterworth IIR filter employed
in the subwoofer channel can be expressed as
|Hhp,ωc ,N (ejω )| = [1 − 1/(1 + (ω/ωc )^{2N} )]^{1/2}

|Hlp,ωc ,M (ejω )| = [1/(1 + (ω/ωc )^{2M} )]^{1/2}   (2.19)
The filters exhibit decay rates of 6N dB/octave and 6M dB/octave
for the high-pass and low-pass filters, below and above the crossover frequency, fc ,
respectively. For example, a typical choice of the bass management filter parameters,
used in consumer electronics applications, involves N = 2 and M = 4 with a
crossover frequency ωc corresponding to 80 Hz. The magnitude responses of the bass
management filters, as well as the magnitude of the recombined response (i.e., the
magnitude of the complex sum of the bass management filter frequency responses),
for the 80 Hz crossover with N = 2 and M = 4, are shown in Fig. 2.9.2
1 As an example, the satellite speakers may be driven with signals above 80 Hz and the subwoofer may be driven with signals below 80 Hz.
2 Optionally, two second-order Butterworth low-pass filters (i.e., M = 2) are cascaded in a manner such that the speaker roll-off is initially effected with the first second-order Butterworth filter and the bass-management system includes a second second-order Butterworth filter, such that the net response, from the fourth-order low-pass Butterworth and the two second-order Butterworth high-pass filters, has unit amplitude through half the sampling rate.

Fig. 2.9. Magnitude response of typically used bass management filters in consumer electron-
ics, and the recombined response (the sum of the low-pass and high-pass frequency responses).
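A sketch of the bass-management pair described above, using SciPy's Butterworth designer (the 48 kHz sampling rate is our assumption):

```python
import numpy as np
from scipy import signal

fs, fc = 48000, 80.0                       # sampling rate and crossover (Hz)
# N = 2 high-pass for the satellites, M = 4 low-pass for the subwoofer
sos_hp = signal.butter(2, fc, btype='highpass', output='sos', fs=fs)
sos_lp = signal.butter(4, fc, btype='lowpass', output='sos', fs=fs)

w, H_hp = signal.sosfreqz(sos_hp, worN=4096, fs=fs)
_, H_lp = signal.sosfreqz(sos_lp, worN=4096, fs=fs)
H_sum = H_hp + H_lp                        # recombined (complex) response
```

The low-pass branch passes DC with unit gain, the high-pass branch passes the top of the band with unit gain, and |H_sum| shows the recombined crossover behavior plotted in Fig. 2.9.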

2.3.3 Chebyshev Filters

Type-1 Chebyshev IIR filters exhibit equiripple error in the pass-band and a monotonically decreasing response in the stop-band. A low-pass N th-order Type-1 Chebyshev IIR filter is specified by its squared magnitude response,

|Hcheby,1 (ejω )|² = 1/(1 + ε² T_N²(ω/ωP ))   (2.20)

where the magnitude-squared response oscillates between 1 and 1/(1 + ε²) in the pass-band, in which it has a total of N local maxima and minima.
where TN (x) is the N th-order Chebyshev polynomial. For nonnegative integers k,
the kth-order Chebyshev polynomial is expressed as

Tk (x) = cos(k cos−1 x) for |x| ≤ 1, and Tk (x) = cosh(k cosh−1 x) for |x| ≥ 1.   (2.21)
A Type-1 low-pass Chebyshev filter, with pass-band frequency of 1 kHz, stop-band
frequency of 2 kHz, and attenuation of 60 dB, is shown in Fig. 2.10.
Type-2 Chebyshev IIR filters exhibit equiripple error in the stop-band and the
response decays monotonically in the pass-band. The squared magnitude response is
expressed by,
|Hcheby,2 (ejω )|² = 1/(1 + ε² [T_N (ωs /ωP )/T_N (ωs /ω)]²)   (2.22)

and is shown in Fig. 2.11.



Fig. 2.10. (a) Magnitude response, between 20 Hz and 1500 Hz, of Type-1 Chebyshev low-
pass filter of order 7 having pass-band frequency of 1000 Hz, stop-band frequency of 2000
Hz, and attenuation of 60 dB; (b) magnitude response of the filter in (a), between 1000 Hz and
20,000 Hz.

Fig. 2.11. (a) Magnitude response, between 20 Hz and 1500 Hz, of Type-2 Chebyshev low-
pass filter of order 7 having pass-band frequency of 1000 Hz, stop-band frequency of 2000
Hz, and attenuation of 60 dB; (b) magnitude response of the filter in (a), between 1000 Hz and
20,000 Hz.
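The Type-1 design of Fig. 2.10 can be reproduced approximately with SciPy; the 1 dB pass-band ripple below is our assumption, since the text specifies only the band edges and stop-band attenuation:

```python
import numpy as np
from scipy import signal

fs = 48000
# Spec from the text: pass-band edge 1 kHz, stop-band edge 2 kHz, 60 dB
# stop-band attenuation; the 1 dB pass-band ripple is our assumption.
N, Wn = signal.cheb1ord(wp=1000, ws=2000, gpass=1, gstop=60, fs=fs)
b, a = signal.cheby1(N, 1, Wn, btype='low', fs=fs)

# Attenuation at the two band edges
f, H = signal.freqz(b, a, worN=[1000, 2000], fs=fs)
atten_db = -20 * np.log10(np.abs(H))
```

The order-selection helper returns N = 7, consistent with the seventh-order filter shown in Fig. 2.10, and the designed filter meets the 60 dB requirement at 2 kHz.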
40 2 Filter Design for Audio Applications

Fig. 2.12. Magnitude response of an N = 5-order elliptic filter having pass-band frequency
of 200 Hz and stop-band frequency of 300 Hz, with stop-band attenuation of 60 dB.

2.3.4 Elliptic Filters

Elliptic filters exhibit an equiripple magnitude response in both the pass-band and the stop-band.
For a specific filter order N , pass-band ripple ε, and maximum stop-band amplitude
1/A, the elliptic filter provides the fastest transition from pass-band to stop-band.
This feature is advantageous when a low-pass response with a rapid roll-off is required,
since it can be achieved with low-order direct form two implementations.
tions. The magnitude response of an N th-order low-pass elliptic filter can be written
as
|H(ejω )|² = 1/(1 + ε² F_N²(ω/ωP ))

F_N (ω) = γ² (ω1² − ω²)(ω3² − ω²) · · · (ω_{2N−1}² − ω²) / [(1 − ω1²ω²)(1 − ω3²ω²) · · · (1 − ω_{2N−1}²ω²)],   N even

F_N (ω) = γ² ω (ω2² − ω²)(ω4² − ω²) · · · (ω_{2N}² − ω²) / [(1 − ω2²ω²)(1 − ω4²ω²) · · · (1 − ω_{2N}²ω²)],   N odd   (2.23)

The magnitude response of a fifth-order elliptic filter having pass-band frequency at 200 Hz and stop-band frequency at 300 Hz, with a stop-band attenuation of 60 dB, is shown in Fig. 2.12.
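A sketch of the fifth-order elliptic design with SciPy (the 1 dB pass-band ripple and 48 kHz sampling rate are our assumptions):

```python
import numpy as np
from scipy import signal

fs = 48000
# Text spec: N = 5, pass-band edge 200 Hz, 60 dB stop-band attenuation;
# the 1 dB pass-band ripple is our assumption.
b, a = signal.ellip(5, 1, 60, 200, btype='low', fs=fs)

# Gain in the pass-band (100 Hz) and deep in the stop-band (2 kHz)
f, H = signal.freqz(b, a, worN=[100, 2000], fs=fs)
gain_db = 20 * np.log10(np.abs(H))
```

The equiripple behavior confines the pass-band gain between 0 and −1 dB and caps the stop-band gain at −60 dB.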

2.3.5 Shelving and Parametric Filters

Commonly used IIR filters in audio applications include the second-order parametric filter, for designing filters with specific gain and bandwidth (or Q value), and the shelving filter, which introduces amplitude boosts or cuts in low-frequency or high-frequency regions. A second-order transfer function is expressed as

b0 + b1 z −1 + b2 z −2
H(z) = (2.24)
a0 + a1 z −1 + a2 z −2
and the coefficients ai and bj for the various filters are given below.

Low-Frequency Shelving Filter

1. Boost of G (dB) (i.e., g = 10^{G/20} ) with K = tan(Ωc Ts /2):

b0 = (1 + √(2g) K + gK²)/(1 + √2 K + K²)
b1 = 2(gK² − 1)/(1 + √2 K + K²)
b2 = (1 − √(2g) K + gK²)/(1 + √2 K + K²)
a0 = 1   (2.25)
a1 = 2(K² − 1)/(1 + √2 K + K²)
a2 = (1 − √2 K + K²)/(1 + √2 K + K²)

2. Cut of G (dB) (i.e., g = 10^{G/20} ) with K = tan(Ωc Ts /2):

b0 = (1 + √2 K + K²)/(1 + √(2g) K + gK²)
b1 = 2(K² − 1)/(1 + √(2g) K + gK²)
b2 = (1 − √2 K + K²)/(1 + √(2g) K + gK²)
a0 = 1   (2.26)
a1 = 2(gK² − 1)/(1 + √(2g) K + gK²)
a2 = (1 − √(2g) K + gK²)/(1 + √(2g) K + gK²)

Figures 2.13 and 2.14 show examples of the magnitude response for low-frequency boost and cut shelving filters, respectively, for a 48 kHz sampling rate with fc = 200 Hz (i.e., Ωc = 400π) and G = 10 dB.
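A sketch of the boost coefficients in Python (our own function name, following the structure of (2.24)); a useful sanity check is that the gain is exactly g at DC and unity at the Nyquist frequency:

```python
import math

def low_shelf_boost(G_db, fc, fs):
    """Second-order low-frequency shelving boost; returns (b, a) with a0 = 1."""
    g = 10 ** (G_db / 20.0)
    K = math.tan(math.pi * fc / fs)        # K = tan(Omega_c * Ts / 2)
    D = 1 + math.sqrt(2) * K + K * K       # common denominator term
    b = [(1 + math.sqrt(2 * g) * K + g * K * K) / D,
         2 * (g * K * K - 1) / D,
         (1 - math.sqrt(2 * g) * K + g * K * K) / D]
    a = [1.0, 2 * (K * K - 1) / D, (1 - math.sqrt(2) * K + K * K) / D]
    return b, a

b, a = low_shelf_boost(10.0, 200.0, 48000.0)
dc_gain = sum(b) / sum(a)                               # H(z) at z = +1
nyq_gain = (b[0] - b[1] + b[2]) / (a[0] - a[1] + a[2])  # H(z) at z = -1
```

For G = 10 dB the DC gain is 10^{10/20} ≈ 3.16 and the high-frequency gain is 1, matching the shelf shape in Fig. 2.13.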

Fig. 2.13. Magnitude response for a low-frequency boost shelving filter.

Fig. 2.14. Magnitude response for a low-frequency cut shelving filter.



High-Frequency Shelving Filter

1. Boost of G (dB) (i.e., g = 10^{G/20} ) with K = tan(Ωc Ts /2):

b0 = (g + √(2g) K + K²)/(1 + √2 K + K²)
b1 = 2(K² − g)/(1 + √2 K + K²)
b2 = (g − √(2g) K + K²)/(1 + √2 K + K²)
a0 = 1   (2.27)
a1 = 2(K² − 1)/(1 + √2 K + K²)
a2 = (1 − √2 K + K²)/(1 + √2 K + K²)

2. Cut of G (dB) (i.e., g = 10^{−G/20} ) with K = tan(Ωc Ts /2):

b0 = (1 + √2 K + K²)/(g + √(2g) K + K²)
b1 = 2(K² − 1)/(g + √(2g) K + K²)
b2 = (1 − √2 K + K²)/(g + √(2g) K + K²)
a0 = 1   (2.28)
a1 = 2((K²/g) − 1)/(1 + √(2/g) K + K²/g)
a2 = (1 − √(2/g) K + K²/g)/(1 + √(2/g) K + K²/g)

Parametric Filters

Parametric filters are specified in terms of the gain, G (g = 10G/20 ), center frequency
fc , and the Q value which is inversely related to the bandwidth of the filter. The
equations characterizing the second-order parametric filter for a sampling frequency
of fs are

ωc = 2πfc /fs
β = (2ωc /Q) + ωc² + 4
b0 = [(2gωc /Q) + ωc² + 4]/β
b1 = (2ωc² − 8)/β
b2 = [4 − (2gωc /Q) + ωc²]/β
a0 = 1   (2.29)
a1 = (2ωc² − 8)/β
a2 = [4 − (2ωc /Q) + ωc²]/β

Fig. 2.15. Magnitude response of a parametric filter with various Q values.

Figure 2.15 shows the effect of varying Q from 0.5 through 4 on the magnitude
response of the parametric filter with fcenter = 187 Hz and G = 5 dB.
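A sketch of (2.29) in Python (our own function names); the filter is unity-gain at DC and Nyquist and boosts by approximately g at fc :

```python
import cmath
import math

def parametric_eq(G_db, fc, Q, fs):
    """Second-order parametric filter coefficients, following (2.29)."""
    g = 10 ** (G_db / 20.0)
    wc = 2 * math.pi * fc / fs
    beta = (2 * wc / Q) + wc * wc + 4
    b = [((2 * g * wc / Q) + wc * wc + 4) / beta,
         (2 * wc * wc - 8) / beta,
         (4 - (2 * g * wc / Q) + wc * wc) / beta]
    a = [1.0, (2 * wc * wc - 8) / beta, (4 - (2 * wc / Q) + wc * wc) / beta]
    return b, a

def mag(b, a, f, fs):
    """|H(e^{jw})| of a biquad at frequency f (Hz)."""
    z = cmath.exp(1j * 2 * math.pi * f / fs)
    num = b[0] + b[1] / z + b[2] / z ** 2
    den = a[0] + a[1] / z + a[2] / z ** 2
    return abs(num / den)

b, a = parametric_eq(5.0, 187.0, 2.0, 48000.0)
```

Note that b1 = a1 , so only the gain term distinguishes numerator from denominator; the boost is confined to a band around fc whose width is set by Q.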

2.3.6 Autoregressive or All-Pole Filters

IIR filters based on the autoregressive (AR) or autoregressive and moving average
(ARMA) process are determined based on the second-order statistics of the input
data. These filters are widely used for spectral modeling and the denominator poly-
nomial for the AR process (or the numerator and denominator polynomials for the
ARMA process) are generated through an optimization process that minimizes an
error norm.
One of the popular AR processes, yielding an all-pole IIR filter, is the linear
predictive coding (LPC) filter or model. The LPC model is widely used in speech
recognition applications [10]: (i) it provides an excellent all-pole vocal tract spectral
envelope model for a speech signal; (ii) the filter is minimum-phase, is analytically
tractable, and straightforward to implement in software or hardware; and (iii) the
model works well in speech recognition applications.
The LPC or all-pole filter of order P is characterized by the polynomial coeffi-
cients {ak , k = 1, . . . , P } with a0 = 1. Specifically, a signal x(n) at time index n
can be modeled with an all-pole filter of the form

Fig. 2.16. Modeling performance of the LPC for differing filter orders.

H(z) = 1 / (1 + Σ_{k=1}^{P} ak z−k )   (2.30)

In order to determine the filter coefficients {ak , k = 1, . . . , P }, a prediction error
signal e(n) is defined as the output of the inverse (all-zero) filter A(z) = 1 + Σ_{k=1}^{P} ak z−k
driven by x(n). In the z-domain,

E(z) = X(z)A(z) = X(z) (1 + Σ_{k=1}^{P} ak z−k )   (2.31)

or, equivalently, in the time domain,

e(n) = x(n) + Σ_{k=1}^{P} ak x(n − k)   (2.32)
To determine {ak , k = 1, . . . , P }, the error signal power, E = Σ_{n=0}^{N −1} |e(n)|², is
minimized with respect to the filter coefficients, with N being the duration of x(n).
The filter coefficients are thus determined by setting the gradient of the error
function E to zero:

∂E/∂ak = 0;  ∀k   (2.33)

giving rise to the following all-pole normal equations,

Σ_{l=1}^{P} al rx (k, l) = −rx (k, 0);  k = 1, 2, . . . , P   (2.34)

where rx (k, l) denotes the correlation of the signal x(n) for various lags. Specifically,

rx (k, l) = Σ_{n=0}^{N −1} x(n − l)x∗ (n − k)   (2.35)

or using matrix vector notation,

Rx a = −rx (2.36)

The above system of equations can be solved through the autocorrelation method or
the covariance method [10]. The autocorrelation method is popular because the autocorrelation matrix, comprising the autocorrelations rx (k, l) at the various lags, is a symmetric Toeplitz matrix (i.e., a matrix whose elements are constant along each diagonal). Such a system can be solved efficiently through well-established procedures such as the Levinson–Durbin algorithm.
An example of the modeling performance of the LPC approach is shown
in Fig. 2.16, where the solid line depicts the response to be modeled by the LPC. The
order of the LPC is p = 128 and p = 256, and it can be seen that the model shows
a very good approximation at higher frequencies. Unfortunately, for such low filter
orders, necessary for real-time implementations, the low-frequency performance is
not satisfactory. In Chapter 6, we present a technique widely used for improving the
low-frequency modeling performance of the LPC algorithm.
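The autocorrelation method can be sketched as follows (NumPy; the AR(2) test signal is our own synthetic example). Solving the normal equations (2.34) on a signal generated by a known all-pole model should recover that model's coefficients:

```python
import numpy as np

def lpc_autocorrelation(x, P):
    """Solve the all-pole normal equations R a = -r by the autocorrelation
    method; R is the symmetric Toeplitz matrix of autocorrelation lags."""
    N = len(x)
    r = np.array([np.dot(x[: N - k], x[k:]) for k in range(P + 1)])
    R = np.array([[r[abs(i - j)] for j in range(P)] for i in range(P)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))      # A(z) coefficients, a0 = 1

# Synthetic AR(2) signal: x(n) = 0.75 x(n-1) - 0.5 x(n-2) + w(n),
# i.e., a1 = -0.75 and a2 = 0.5 in the notation of (2.30)
rng = np.random.default_rng(1)
w = rng.standard_normal(100000)
x = np.zeros_like(w)
for n in range(2, len(w)):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + w[n]

a = lpc_autocorrelation(x, 2)   # expect approximately [1, -0.75, 0.5]
```

In practice the Toeplitz structure would be exploited via Levinson–Durbin recursion rather than a general linear solve, but the estimated coefficients are the same.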

2.4 Summary
In this chapter we have presented various filter design techniques, including FIR, IIR,
parametric, shelving, and all-pole filters, the last of which use second-order statistical
information for signal modeling.
Part II

Acoustics and Auditory Perception


3
Introduction to Acoustics and Auditory Perception

This chapter introduces the theory behind sound propagation in enclosed environ-
ments, room acoustics, reverberation time, and the decibel scale. Also included are
basics of loudspeakers and microphone acoustics and responses, room impulse re-
sponses, and stimuli for measuring loudspeaker and room responses. We conclude
the chapter with a brief discussion on the structure of the ear, and some relevant
concepts such as loudness perception and frequency selectivity.

3.1 Sound Propagation


Any complex sound field can be represented as a linear superposition of numerous
simple sound waves such as plane waves. This is particularly true in the case of
room acoustics, where boundaries such as walls are a source of reflections. In all of
the analysis below, we assume that the medium is homogeneous and at rest, in which
case the speed of sound, c, is constant with reference to space and time and is only a
function of temperature, Θ. Thus,

c = 331.4 + 0.6Θ m/s (3.1)

The spectrum of interest for a signal from a sound source, that is affected by
room acoustics, is frequency dependent and is a function of the type of source. For
example, human speech comprises a fundamental frequency located between 50 Hz
and 350 Hz, and is identical to the frequency of vibration of the vocal cords, in
addition to harmonics that extend up to about 3500 Hz. Musical instruments range
from 16 Hz to about 15 kHz. Above 10 kHz, the attenuation of the signal in air is
so large that the influence of a room on high-frequency sound components can be
neglected [11], whereas below 50 Hz the wavelength of sound is so large that sound
propagation analysis using geometrical considerations is almost of no use. Thus, the
frequency range of relevance to room acoustics extends from 50 Hz to 10 kHz.
Finally, the directionality (or the intensity of sound as a function of direction)
of a sound source will vary with the type of source. For example, for speech, the
directionality is less pronounced at low frequencies as the wavelengths of sound, λ
(i.e., λ = c/f , where f is the frequency), at such low frequencies are less affected
by diffraction effects around the head (or head shadowing), whereas musical instru-
ments display pronounced directivity as the linear dimensions of the instrument are
large compared to the wavelengths of emitted sound.

3.2 Acoustics of a Simple Source in Free-Field


The propagation of an acoustic wave in three dimensions can be modeled through a
linear relationship (referred to as the wave equation), relating pressure p at position
r = (x, y, z) and time t, as
∇²p = (1/c²) ∂²p/∂t²   (3.2)
where ∇2 p = (∂ 2 p/∂x2 ) + (∂ 2 p/∂y 2 ) + (∂ 2 p/∂z 2 ). The solution to the wave equa-
tion, describing the sound pressure, p, is given in terms of two arbitrary functions,
f1 (·) and f2 (·), as
p(x, t) = f1 (ct − x) + f2 (ct + x) (3.3)
where f1 (·) and f2 (·) describe acoustic waves traveling in the positive and negative
x-direction, respectively. The functions f1 (·) and f2 (·) can be sinusoidal functions
that satisfy (3.2). Specifically,
p(x, t) = p0 ejk(ct−x) (3.4)
where k = ω/c = 2π/λ is the wavenumber corresponding to analog frequency
ω. The plane wave assumption is valid whenever, at a distance r from the source,
kr >> 1. Furthermore, to accommodate the nonvanishing dimensions of real-world
sound sources (viz., loudspeakers), (3.4) can be generalized as
p(r, φ, θ, t) = (A/r)Γ (φ, θ)ejk(ct−r) (3.5)
where r is the distance of the sound pressure measurement point and Γ (φ, θ) repre-
sents the directivity function of the source, normalized to unity.
In a more specific form, the free-field, or unbounded medium, sound pressure pω
at any point r due to a simple harmonic source having an outward flow of Sω e−jωt
from a position r0 can be written as
pω (r|r0 )e−jωt = −jkρcSω gω (r|r0 )e−jωt

gω (r|r0 ) = (1/4πR) e^{jkR}   (3.6)

R² = |r − r0 |² = (x − x0 )² + (y − y0 )² + (z − z0 )²
where k is referred to as the wavenumber and relates to the harmonic frequency ω
through k = ω/c = 2π/λ, ρ is the density of air,1 and gω (r|r0 ) is called the Green’s
1 This is a function of the temperature of the air, moisture, and barometric pressure, but for all practical purposes a value of 1.25 kg/m³ can be used.

function. The time-dependent pressure function p(r|r0 , t) can be found through the
inverse Fourier transform of (3.6).
Finally, the sound pressure level at a distance r, with p̃ representing the root mean
square pressure (viz., p̃ = (E{p²})^{1/2} = [(1/t) ∫_t p² dτ ]^{1/2}, where E{·} is the statistical
expectation operator), can be expressed as

SPL = 20 log10 (p̃/p̃ref ) dB   (3.7)

where p̃ref is an internationally standardized reference root mean square pressure
with a value of 2 × 10−5 N/m².
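As a small worked example of (3.7) (Python/NumPy; the 1 Pa, 1 kHz tone is our own choice), a sinusoid with 1 Pa amplitude has an RMS pressure of 1/√2 Pa and therefore a level of about 91 dB SPL:

```python
import numpy as np

def spl_db(p, p_ref=2e-5):
    """Sound pressure level per (3.7) from a pressure waveform p (in Pa)."""
    p_rms = np.sqrt(np.mean(p ** 2))       # root mean square pressure
    return 20 * np.log10(p_rms / p_ref)

t = np.linspace(0, 1, 48000, endpoint=False)
p = np.sin(2 * np.pi * 1000 * t)           # 1 Pa peak, 1 kHz tone
level = spl_db(p)                          # about 91 dB SPL
```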

3.3 Modal Equations for Characterizing Room Acoustics at Low Frequencies

The free-field sound propagation behavior was described in the previous section.
However, the sound field in a real room is much more complicated to characterize
because of the large number of reflection components, standing waves influenced by
room dimensions, and the variabilities in the geometry, size, and absorption between
various rooms. For example, if a wave strikes a wall, in general, a part of the inci-
dent sound energy will be absorbed and some of it will be reflected back with some
phase change. The resulting wave, created by the interference of the reflected and
incident wave, is called a standing wave. Given that a room includes various walls,
surfaces, and furniture, the prediction of the resulting sound field is very difficult.
Hence several room models have evolved that characterize the sound field through
deterministic or statistical techniques. One such model is based on the wave theory
of acoustics where the acoustic wave equation of (3.2) is solved with boundary con-
ditions that are set up which describe, mathematically, the acoustical properties of
the walls, ceiling, floor, and other surfaces.
Without going into the derivation, the Green’s function, or the sound pressure in
a room for a frequency ω, derived from the wave theory of acoustics in a bounded
enclosure is given by [11] and [12]
pω (q_l ) = jQωρ0 Σ_n pn (q_l )pn (q_o ) / [Kn (k² − kn²)]
        = jQωρ0 Σ_{nx =0}^{Nx −1} Σ_{ny =0}^{Ny −1} Σ_{nz =0}^{Nz −1} pn (q_l )pn (q_o ) / [Kn (k² − kn²)]

n = (nx , ny , nz );   k = ω/c;   q_l = (xl , yl , zl )

kn = π[(nx /Lx )² + (ny /Ly )² + (nz /Lz )²]^{1/2}

∫_V pn (q_l )pm (q_l ) dV = Kn if n = m, and 0 if n ≠ m   (3.8)

where kn are referred to as the eigenvalues, and where the eigenfunctions pn (q l ) can
be assumed to be orthogonal to each other under certain conditions, and the point
source being at q o . The modal equations in (3.8) are valid for wavelengths, λ, where
λ > (1/3) min[Lx , Ly , Lz ] [12]. At these low frequencies, a few standing waves are
excited, so that the series terms in (3.8) converge quickly.
For a rectangular enclosure with dimensions (Lx , Ly , Lz ), q o = (0, 0, 0), the
eigenfunctions pn (q l ) in (3.8) are
     
pn (q_l ) = cos(nx πxl /Lx ) cos(ny πyl /Ly ) cos(nz πzl /Lz )
pn (q_o ) = 1

Kn = ∫_0^{Lx} cos²(nx πxl /Lx ) dx ∫_0^{Ly} cos²(ny πyl /Ly ) dy ∫_0^{Lz} cos²(nz πzl /Lz ) dz
   = Lx Ly Lz /8 = V /8   (3.9)
Each of the terms in the series expansion can be considered to excite a resonant
frequency of about fn = ωn /2π = c/λn Hz, with a specific amplitude and phase as
determined by the numerator and denominator terms of (3.8). Because the different
terms in the series expansion can be considered mutually independent, the central
limit theorem can be applied to the real and imaginary parts of pω (q l ), according
to which both quantities can be considered to be random variables obeying a nearly
Gaussian distribution. Thus, according to the theory of probability, |pω (q l )| follows
the Rayleigh distribution. If z denotes pω (q l ), then the probability of finding a sound
pressure amplitude between z and z + dz is given by
P (z)dz = (π/2) z e^{−πz²/4} dz   (3.10)
2
Thus, in essence, the distribution of the sound pressure amplitude is independent of
the type of room, volume, or its acoustical properties. The probability distribution is
shown in Fig. 3.1.
The eigenfunction distribution in the z = 0 plane, for a room of dimension 6 m
×6 m ×6 m, and tangential mode (nx , ny , nz ) = (3, 2, 0) is shown in Fig. 3.2.
Finally, the time domain sound pressure, p(r, t), can be found through the Fourier
transform using
p(r, t) = ∫_{−∞}^{∞} pω (r)e−jωt dω   (3.11)

Rooms such as concert halls, theaters, and irregular-shaped rooms deviate from
the wave theory assumed rectangular shape due to the presence of pillars, columns,
balconies, and other irregularities. As such, the methods of wave theory cannot be
readily applied as the boundary conditions are difficult to formulate.

Fig. 3.1. The sound pressure amplitude density function in a room excited by a sinusoidal
tone.

Fig. 3.2. The eigenfunction distribution, for a tangential mode (3,2,0) over a room of dimen-
sions 6 m ×6 m ×6 m.

3.3.1 Axial, Tangential, Oblique Modes and Eigenfrequencies

By using the expression cos(x) = (ejx +e−jx )/2 in (3.9), the eigenfunction equation
can be written as

pn (q_l ) = (1/8) Σ e^{jπ(±(nx xl /Lx )±(ny yl /Ly )±(nz zl /Lz ))}   (3.12)

where the summation covers the expansion over the eight possible sign combinations
in the exponent. Each of the components, multiplied with the time-dependent expo-
nent ejωt represents a plane wave making an angle αx , αy , and αz with the x, y, and
z axis, respectively, as
     
cos(αx ) : cos(αy ) : cos(αz ) = ±(nx /Lx ) : ±(ny /Ly ) : ±(nz /Lz )   (3.13)

If any one of the angles is 90 degrees (i.e., the cosine of the angle is zero), then
the resulting wave represents a tangential mode, or a wave traveling in a plane, that
is orthogonal to the axis that makes an angle of 90 degrees with the plane wave. For
example, if αz = π/2, then the resulting wave travels in the plane defined by the
x-axis and y-axis. If any two of the angles are 90 degrees, then the resulting wave is
called an axial mode that is orthogonal to two axes that make an angle of 90 degrees
with the plane wave. If none of the angles is 90 degrees (i.e., all of the cosine terms
are nonzero) then the resulting wave represents an oblique mode.
The eigenfrequencies, fn , for the enclosure, are related to the eigenvalues, kn , as
fn = (c/2π) kn   (3.14)
Without going into the derivations (see [12, 11] for details), it can be shown that the
number of eigenfrequencies, Nf , that exist for a limiting frequency, f , in an enclosed
rectangular space is
Nf = (4π/3) V (f /c)³ + (π/4) S (f /c)² + (L/8)(f /c)   (3.15)

where V = Lx Ly Lz , S is the sum of the surface areas (viz., S = 2(Lx Lz + Lx Ly + Ly Lz )), and L is the sum of all edge lengths, with L = 4(Lx + Ly + Lz ). The eigenfrequency density (i.e., the number of eigenfrequencies per Hz) is

∆f = dNf /df = 4πV (f ²/c³) + (π/2) S (f /c²) + L/(8c)   (3.16)
The lowest ten eigenfrequencies (including degenerate cases) for a room with dimensions 4 m × 3 m × 2 m are shown in Table 3.1.
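Combining (3.14) with the expression for kn in (3.8) gives fn = (c/2)[(nx /Lx )² + (ny /Ly )² + (nz /Lz )²]^{1/2}. The sketch below (Python, our own function name, c = 343 m/s) reproduces the lowest modes of the 4 m × 3 m × 2 m room of Table 3.1:

```python
import itertools
import math

def eigenfrequencies(Lx, Ly, Lz, c=343.0, n_max=4):
    """Lowest room-mode frequencies fn = (c/2) sqrt((nx/Lx)^2 + ...)."""
    freqs = []
    for nx, ny, nz in itertools.product(range(n_max + 1), repeat=3):
        if (nx, ny, nz) == (0, 0, 0):
            continue                       # skip the trivial zero mode
        f = (c / 2.0) * math.sqrt((nx / Lx) ** 2
                                  + (ny / Ly) ** 2 + (nz / Lz) ** 2)
        freqs.append((f, (nx, ny, nz)))
    return sorted(freqs)[:10]

modes = eigenfrequencies(4.0, 3.0, 2.0)    # lowest axial/tangential/oblique modes
```

The first entries, 42.875 Hz for (1, 0, 0) and 57.167 Hz for (0, 1, 0), match the table, including the degenerate pair at 85.75 Hz.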

3.4 Reverberation Time of Rooms


Given a mean sound intensity I(t) due to a source, such as a loudspeaker, transmitting with power Π(t) at time t in a room of volume V and absorption a = Σi αi Si

fn (Hz) nx ny nz
42.875 1 0 0
57.167 0 1 0
71.458 1 1 0
85.75 0 0 1
85.75 2 0 0
95.871 1 0 1
103.06 0 1 1
103.06 2 1 0
111.62 1 1 1
114.33 0 2 0
Table 3.1. The ten lowest eigenfrequencies for a room of dimension 4 m × 3 m × 2 m.

(where αi and Si are the absorption coefficient and surface area of wall i, respec-
tively), then the rate of change of total acoustic energy in the room can be expressed
through the following conservation rule,
(4V /c) dI(t)/dt = Π(t) − aI(t)   (3.17)
where c is the speed of sound in the medium.
The solution to (3.17) can be written as

I(t) = (c/4V ) e^{−act/4V} ∫_{−∞}^{t} Π(τ )e^{acτ /4V} dτ   (3.18)

If the sound power Π(t) fluctuates slowly, relative to the time constant 4V /ac, then
the intensity I(t) will be approximately proportional to Π(t) as
I(t) ≈ Π(t)/a   (3.19)

Intensity Level ≈ 10 log10 (Π(t)/a) + 90 dB above 10−16 watt/cm²

if Π(t) is in ergs per second and a is in square centimeters.
In the event that the sound power Π(t) fluctuates in a time short compared to
the time constant 4V /ac, then the intensity will not follow the fluctuations of Π(t),
and if the sound is shut off suddenly at time t = 0, the subsequent intensity can be
expressed using (3.17) as
I(t) = I0 e^{−act/4V}   (3.20)

Intensity Level = 10 log10 I0 + 90 − 4.34(act/4V ) (dB)

Thus, upon turning off the source, the intensity level drops off linearly with time at a
rate of 4.34ac/4V dB per second.

The reverberation time of the room, which characterizes the time where the en-
ergy of reflections arriving from walls or boundary surfaces is non-negligible, is
defined as the time it takes for the intensity level to drop by 60 dB after the source is
switched off. Thus, if the dimensions of the room are measured in meters, then
the reverberation time T60 is given by

T60 = 60 (4V /4.34ac) = 0.161 V / Σi αi Si   (3.21)

The reverberation time computed through (3.21) is based on geometrical room
acoustics where the walls are considered to be sufficiently irregular so that sound
energy distribution is uniform “throughout” the room. In other words the square
sound pressure amplitude is independent of the distance “R” between the source
and microphone and angles (α, θ) corresponding to an assemblage of plane waves
reflecting from walls. Morse and Ingard [12] state that if the sound is not uniformly
distributed in the room, (3.21) will not be valid and “. . . in fact, the term absorption
coefficient will have no meaning.”
The actual measurement of T60 can be done by the “method of integrated impulse response” proposed by Schroeder [14]. The method uses the following integration rule to determine an ensemble average of decay curves, ⟨g²(t)⟩, from the room impulse response, h(t),² using

⟨g²(t)⟩ = ∫_t^{∞} h²(x) dx   (3.22)

Subsequently, the result from (3.22) is converted to the dB scale and the following expression is used for computing T60 ,

T60 = 60 (∆L/∆t)^{−1}   (3.23)
where ∆L/∆t is in dB/seconds. Frequently the slope of the decay curve is deter-
mined in the range of −5 dB to −35 dB relative to the steady-state level. In addition, the choice of integration interval for (3.22) is important in practice. Of course, an upper limit of integration of ∞ is not possible in real-world applications, so a finite integration interval is chosen.
interval is not too long as the decay curve will have a tail which limits the useful
dynamic range, nor should it be too short as it would cause a downward bend of the
curve.
Figure 3.3 shows a room impulse response recorded in a room using a loud-
speaker A, whereas the measured T60 , based on Fig. 3.4 (and using the measured
length L = 8192 samples of the room response as the upper limit for integration), is
found to be approximately 0.25 seconds. The effect of the upper limit of integration
2
Recall from Section 1.1.4 that the impulse response is the output of a linear system when
the input is δ(n). In practice, room responses are obtained by applying a broadband signal
to the room (such as a logarithmic chirp or a noise-type sequence) and measuring the result
through a microphone. More information is provided in a subsequent section.
3.4 Reverberation Time of Rooms 57

Fig. 3.3. A room impulse response measured in a room.

corresponding to 0.0625L is shown in Fig. 3.5, whereas the upper limit of integration
is 0.5L in Fig. 3.6.
Figure 3.7 shows a room impulse response recorded at a different position using
loudspeaker B, whereas the measured T60 , based on Fig. 3.8

Fig. 3.4. The energy decay curve based on the Schroeder integrated impulse response tech-
nique for loudspeaker A.
58 3 Introduction to Acoustics and Auditory Perception

Fig. 3.5. The energy decay curve based on using 0.0625L as an upper limit of integration.

(and using the measured length L of the room response as the upper limit for
integration), is again found to be approximately 0.25 seconds, showing reasonable
independence from the type of loudspeaker used to measure the room response and
from the position at which the response was measured.
Finally, large reverberation degrades the quality of audio signals such as speech.
Thus, to maintain high speech quality in rooms, one can design

Fig. 3.6. The energy decay curve based on using 0.5L as an upper limit of integration.

Fig. 3.7. A room impulse response measured in a room with a loudspeaker B.

the reverberation time to be small by increasing the room absorption a. However,
this conflicts with the requirement that the transient intensity (3.19) be kept high.
A compromise between these two opposing requirements is therefore needed when
designing rooms.

Fig. 3.8. The energy decay curve based on the Schroeder integrated impulse response tech-
nique for loudspeaker B.

3.5 Room Acoustics from Schroeder Theory


The sound pressure, pf,i , at location i and frequency f can be expressed as a sum of
direct field component, pf,d,i , and a reverberant field component, pf,rev,i , as given
by

pf,i = pf,d,i + pf,rev,i (3.24)

The direct field component for sound pressure, pf,d,i , of a plane wave, at far field
listener location i for a sound source of frequency f located at i0 can be expressed
as [12]

pf,d,i = −jkρc Sf gf(i|i0) e^{−jωt}

gf(i|i0) = e^{jkR}/(4πR),    R = |i − i0|                           (3.25)

where pf,d,i is the direct component sound pressure amplitude, Sf is the source
strength, k = 2π/λ is the wavenumber, c = λf is the speed of sound (343 m/s), and
ρ is the density of the medium (1.25 kg/m³ at sea level).
The normalized correlation function [100] which expresses a statistical relation
between sound pressures, of reverberant components, at separate locations i and j,
is given by

E{pf,rev,i p∗f,rev,j} / √( E{pf,rev,i p∗f,rev,i} E{pf,rev,j p∗f,rev,j} ) = sin(kRij)/(kRij)      (3.26)

where Rij is the distance between the two locations i and j, and E{·} is the
expectation operator.
The reverberant-field mean square pressure is defined as

E{pf,rev,i p∗f,rev,i} = 4cρΠa(1 − ᾱ)/(S ᾱ)                          (3.27)
where Πa is the power of the acoustic source, ᾱ is the average absorption coefficient
of the surfaces in the room, and S is the surface area of the room.
The assumption of a statistical description for reverberant fields in rooms is
justified if the following conditions are fulfilled [16]: (i) the linear dimensions of
the room must be large relative to the wavelength; (ii) the average spacing of the
resonance frequencies must be smaller than one-third of their bandwidth (this condition
is fulfilled in rectangular rooms at frequencies above the Schroeder frequency,
fs = 2000 √(T60/V) Hz, where T60 is the reverberation time in seconds and V is the
volume in m³); and (iii) both source and microphone must be in the interior of the
room, at least a half-wavelength away from the walls.
Furthermore, under the conditions in [16], the direct and reverberant sound pres-
sures are uncorrelated.
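The Schroeder frequency condition is easy to evaluate numerically; a quick sketch follows, where the room volume is an assumed example value (not from the text).

```python
import math

def schroeder_frequency(t60, volume):
    """Schroeder frequency f_s = 2000 * sqrt(T60 / V) in Hz,
    with T60 in seconds and V in m^3."""
    return 2000.0 * math.sqrt(t60 / volume)

# For the ~0.25 s reverberation time measured earlier, in an assumed 60 m^3
# room, the statistical (reverberant-field) description holds above roughly:
print(round(schroeder_frequency(0.25, 60.0), 1))  # 129.1 Hz
```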

3.6 Measurement of Loudspeaker and Room Responses


Measuring loudspeaker and room acoustical responses, and determining the corresponding
frequency responses, is one of the most important tasks in acoustics and
audio signal processing. In fact, a loudspeaker designer will evaluate the loudspeaker
response in an anechoic room before releasing it for production. An example
of a room impulse response showing the direct path of the sound, the early reflec-
tions, and reverberation is shown in Fig. 3.9.
There are several methods for measuring speaker and/or room acoustical re-
sponses, the popular ones being based on applying a pseudo-random sequence, such
as the maximum length sequence (MLS), to the loudspeaker and deconvolving the
response at a microphone [17], or applying a frequency sweep signal such as the
logarithmic chirp, to the speaker and deconvolving the microphone response.
Müller and Massarani [18] discuss various popular approaches for room acous-
tical response measurement. In this section we briefly discuss room response mea-
surement approaches using logarithmic sweep and the maximum length sequences.

3.6.1 Room Response Measurement with Maximum Length Sequence (MLS)

The MLS-based method for finding the impulse response is based on cross-correlating
a measured signal with a pseudo-random (or deterministic) sequence. The motivation
for this approach is explained through the following derivation. Let x(t) be a station-
ary sound signal having autocorrelation φxx (t), which is applied to the room with
response h(t) through a loudspeaker. Then the signal received at the microphone is

Fig. 3.9. (a) Room impulse response; (b) zoomed version of the response showing direct, early
reflections, and reverberation.
y(t) = ∫_{−∞}^{∞} x(t − t′) h(t′) dt′                               (3.28)

Forming the cross-correlation, φyx (τ ) between the received signal y(t) and the trans-
mitted signal x(t), we have
φyx(τ) = lim_{T0→∞} (1/T0) ∫_{−T0/2}^{T0/2} ∫_{−∞}^{∞} x(t + τ − t′) h(t′) x(t) dt′ dt

       = ∫_{−∞}^{∞} φxx(τ − t′) h(t′) dt′                           (3.29)

Now if φxx(τ − t′) = δ(τ − t′),3 then (3.29) results in the cross-correlation being
equal to the room impulse response, i.e., φyx(t) = h(t).
More useful than white noise are pseudo-random signals, called MLS, which
have similar properties to white noise, but are binary or two-valued in nature. Such
binary sequences can be easily generated by means of a digital computer and can
be processed rapidly through signal processing algorithms. Specifically, an MLS
sequence, s(n), of period L = 2^n − 1, where n is a positive integer, satisfies the
following relations:


Σ_{k=0}^{L−1} s(k) = −1

φss(k) = (1/L) Σ_{n=0}^{L−1} s(n) s(n + k) = { 1 for k = 0, L, 2L, . . . ;  −1/L for k ≠ 0, L, 2L, . . . }      (3.30)

Thus, transmitting an MLS sequence from a loudspeaker to a microphone,


through a room with response h(n), yields the following relations over a period L of
the maximum length sequence,


y(n) = s(n) ⊗ h(n) = Σ_{p=0}^{L−1} s(n − p) h(p)

φyx(k) = (1/L) Σ_{n=0}^{L−1} s(n) y(n + k)

       = (1/L) Σ_{p=0}^{L−1} Σ_{n=0}^{L−1} s(n) s(n + k − p) h(p)

       = Σ_{p=0}^{L−1} φss(k − p) h(p) = h(k) − (1/L) Σ_{p=1}^{L−1} h(k − p)      (3.31)

The first term in (3.31) is the recovered response, whereas the second term represents
a DC bias that becomes negligible for sufficiently large L. An
3
Signals satisfying such autocorrelation functions are referred to as white noise signals.

example of an MLS sequence with n = 3 or L = 7 is: −1,−1,1,−1,1,−1,1. For


practical reasons, the MLS sequence is transmitted repeatedly, possibly with an
intervening silence interval, and the measured signal is averaged in order to improve
the signal-to-noise ratio (SNR). Also, L is kept sufficiently high so as to prevent any
time aliasing problems in the deconvolved response, where the late reflection (or re-
verberation part) folds back into the early part of the room response. This happens
if the period of the repeatedly transmitted signal is smaller than the length of the
impulse response h(n).
Another stimulus signal constructed from the MLS signal is the inverse repeated
sequence (IRS) [19], defined by

x(n) = { MLS(n)   for n even, 0 ≤ n < 2L
       { −MLS(n)  for n odd,  0 < n < 2L                            (3.32)
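The MLS procedure can be sketched concretely in Python. The LFSR taps, the primitive polynomial, and the toy response h below are illustrative choices; the generated sequence differs in phase and sign from the example printed in the text, since those depend on the particular generator.

```python
import numpy as np

def mls(n, taps):
    """One period (L = 2**n - 1) of a +/-1 maximum length sequence from an
    n-bit Fibonacci LFSR; `taps` are 1-indexed feedback positions."""
    state = [1] * n
    seq = []
    for _ in range(2 ** n - 1):
        seq.append(1 - 2 * state[-1])          # output bit mapped to +/-1
        fb = 0
        for t in taps:
            fb ^= state[t - 1]
        state = [fb] + state[:-1]
    return np.array(seq, dtype=float)

s = mls(3, taps=(1, 3))                        # L = 7 (primitive poly x^3 + x + 1)
L = s.size

# Properties of (3.30): the period sum is -1, and the circular
# autocorrelation is 1 at lag 0 and -1/L at every other lag.
phi = np.array([np.dot(s, np.roll(s, -k)) for k in range(L)]) / L

# Recover a toy "room" response by circular cross-correlation, as in (3.31)
h = np.array([1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0])
y = np.real(np.fft.ifft(np.fft.fft(s) * np.fft.fft(h)))   # circular conv of s, h
phi_yx = np.array([np.dot(s, np.roll(y, -k)) for k in range(L)]) / L
h_rec = (phi_yx + h.sum() / L) * L / (L + 1)   # remove the small DC term exactly
print(np.round(h_rec, 3))
```

For such a short period the DC term of (3.31) is not negligible, so the last line removes it exactly; with a realistically long L (e.g. 2^16 − 1) the raw cross-correlation already approximates h well.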

3.6.2 Room Response Measurement with Sweep Signals


Another approach for obtaining the impulse response is circular deconvolution, in
which the measured signal is Fourier transformed, divided by the Fourier transform
of the input signal, and the result inverse transformed to obtain the time domain
impulse response. Specifically, with F and F −1 representing the forward and
inverse Fourier transform, respectively, and x(t) and y(t) representing the input and
measured signal,
 
h(t) = F^{−1}[ F[y(t)] / F[x(t)] ]                                  (3.33)
In the case of a linear sweep, the phase increment of x(t) grows by a constant
amount ψ per sample (with N being the number of samples to be generated):

x(t) = A cos(φ(t))
φ(t) = φ(t − 1) + ∆φ(t)
∆φ(t) = ∆φ(t − 1) + ψ                                               (3.34)

ψ = 2π (fstop − fstart)/(N fs)                                      (3.35)
The time domain and magnitude response plot (white excitation spectrum) for a lin-
ear sweep is shown in Fig. 3.10.
The time domain and magnitude response (3 dB per octave decay) of a logarith-
mic sweep [20] characterized by

x(t) = sin[ (ω1 T / log(ω2/ω1)) ( e^{(t/T) log(ω2/ω1)} − 1 ) ]      (3.36)

is shown in Fig. 3.11.
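Equations (3.33) and (3.36) combine into a compact measurement sketch. The sample rate, sweep parameters, and the toy 3-tap "room" below are illustrative assumptions, and the deconvolution here is the plain spectral division of (3.33) without any regularization.

```python
import numpy as np

fs = 8000                       # all values here are illustrative
T, f1, f2 = 2.0, 20.0, 4000.0   # duration (s), start/stop frequencies (Hz)
t = np.arange(int(T * fs)) / fs
w1, w2 = 2 * np.pi * f1, 2 * np.pi * f2
K = np.log(w2 / w1)
x = np.sin((w1 * T / K) * (np.exp(t * K / T) - 1.0))  # logarithmic sweep, (3.36)

# Pass the sweep through a toy 3-tap "room" and deconvolve via (3.33)
h_true = np.zeros(64)
h_true[[0, 10, 25]] = [1.0, 0.4, 0.2]
y = np.convolve(x, h_true)

N = 1 << int(np.ceil(np.log2(y.size)))                # zero-pad: avoid wrap-around
H = np.fft.rfft(y, N) / np.fft.rfft(x, N)             # F[y] / F[x]
h_est = np.fft.irfft(H, N)[:h_true.size]
print(np.round(h_est[[0, 10, 25]], 3))
```

With measurement noise present, the raw division would amplify noise at frequencies where the sweep carries little energy, which is one motivation for the regularized inverse-filter methods discussed in [18, 20].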


The advantage of a logarithmic sweep over a linear sweep is the larger SNR at
lower frequencies, thereby allowing better characterization of room modes. An
advantage of the logarithmic sweep over MLS is the separation of loudspeaker
distortion products from the actual impulse response, in addition to improved SNR.

Fig. 3.10. (a) Time domain response of linear sweep; (b) magnitude response of the linear
sweep.

Fig. 3.11. (a) Time domain response of logarithmic sweep; (b) magnitude response of the log
sweep.

Furthermore, it has been shown [21] that in the presence of nonwhite noise the MLS
and IRS methods for room impulse response measurement are the most accurate,
whereas in quiet environments the logarithmic sine sweep is the most appropriate
signal of choice.

3.7 Psychoacoustics
The perception of sound is an important area and recent systems employing audio
compression techniques use principles from auditory perception, or psychoacoustics,
for designing lower bit-rate systems without significantly sacrificing audio quality.
Likewise, it seems a natural extension that certain properties of human auditory per-
ception (e.g., frequency selectivity) be exploited to design efficient systems, such as
room equalization systems, which aim at minimizing the detrimental effects of room
acoustics.

3.7.1 Structure of the Ear

To understand relevant concepts from psychoacoustics, it is customary to summarize


the structure of the ear. Shown in Fig. 3.12 is the peripheral part of the ear comprising
the outer, middle, and inner ear sections.
The outer ear is composed of the pinna and the auditory canal. The pinna is
responsible primarily for identifying the location of the source sound, particularly
at high frequencies. Considerable variation exists in the conformation of the pinna
and hence different people are able to localize sound differently. Sound travels down
the auditory or ear canal and then strikes the tympanic membrane. The air-filled
middle ear includes the tympanic membrane, the ossicles (malleus, incus, stapes),
their associated muscles and ligaments, and the opening of the auditory tube, which
provides communication with the pharynx as well as a route for infection. Thus,

Fig. 3.12. The structure of the ear.



sound vibrations in the ear canal are transmitted to the tympanic membrane, and in
turn are transmitted through the articulations of the ossicles to the attachment of the
foot plate of the stapes on the membrane of the oval window. The ossicles amplify
the vibrations of sound and in turn pass them on to the fluid-filled inner ear.
The cochlea, which is the snail-shaped structure, and the semicircular canals con-
stitute the inner ear. The cochlea, enclosing three fluid-filled chambers, is encased in
the temporal bone with two membranous surfaces exposed at its base (viz., the oval
window and the round window). The foot plate of the stapes adheres to the oval
window, transmitting sound vibrations into the cochlea. Two of the three cochlear
chambers are contiguous at the apex. Inward deflections of the oval window caused
by the foot plate of the stapes compress the fluid in the scala vestibuli; this compres-
sion wave travels along the coils of the cochlea in the scala vestibuli to the apex,
then travels back down the coils in the scala tympani. The round window serves as
a pressure-relief vent, bulging outward with inward deflections of the oval window.
The third cochlear chamber, the scala media or cochlear duct, is positioned between
the scala vestibuli and scala tympani. Pressure waves from sound traveling up the
scala vestibuli and back down the scala tympani produce a shearing force on the hair
cells of the organ of Corti in the cochlear duct. Within the cochlea, hair cell sensitiv-
ity to frequencies progresses from high frequencies at the base to low frequencies at
the apex. The cells in the single row of inner hair cells passively respond to deflec-
tions of sound-induced pressure waves. Thus, space (or distance) along the cochlea is
mapped to the excitation or resonant frequency, and hence the cochlea can be viewed
as an auditory filtering device responsible for selective frequency amplification or
attenuation depending on the frequency content of the source sound.

3.7.2 Loudness Perception

Loudness perception is an important topic as it allows design of systems that take


into account sensitivity to sound intensity. For example, loudness compensation or
control techniques, which compensate for the differences in original sound level and
loudspeaker playback sound level (possibly including loudspeaker and room acous-
tics) based on equal loudness contours, allow “tonally balanced” sound perception
while listening to audio content in home theater, automotive, or movie theater envi-
ronments.
One way to judge loudness level is on a relative scale where the intensity of a
1 kHz tone is fixed at a given sound pressure level and a tone at another frequency
is adjusted by the subject until it sounds equally loud as the 1 kHz tone. The tones
are presented to the subject either via headphones where the microphone probe is
inserted and placed near the eardrum to measure the sound level, or in an anechoic
room (which provides a high degree of sound absorption) in which case the mea-
surement of the sound level is done at a point roughly corresponding to the center
of the listener head position after the listener is removed from the sound field. The
plot of the sound level as a function of frequency for various loudness levels is called
the equal loudness contours [22, 23], and the data which are now an International
Standards Organization (ISO) standard (ISO 226, 1987)[24] are shown in Fig. 3.13.

Fig. 3.13. Equal loudness contours from Robinson and Dadson [23].

The contour for 0 phon (i.e., 0 dB SPL for a 1 kHz tone) is the minimum
audible field contour and represents, on average among listeners, the absolute
lower limit of human hearing at various frequencies. Thus, for example, at 0 phon,
human hearing is most sensitive at frequencies between 3 kHz and 5 kHz as these
represent the lowest part of the 0 phon curve. Furthermore, for example, the sound
pressure level (SPL) has to be increased by as much as 75 dB at 20 Hz in order for
a 20 Hz tone to sound equally as loud as the 1 kHz tone at 0 phon loudness level
(or 0 dB SPL). In addition, the rate of growth of loudness level with SPL is much
greater at low frequencies than at middle frequencies: for example, going from
0 phon to 90 phon requires an SPL increase of only about 50 dB at 20 Hz, whereas
at 1 kHz the SPL must increase by 90 dB. Because a smaller SPL increase produces
the same loudness-level change at low frequencies, loudness grows faster there. This
can be observed when human voices are played back at high levels via loudspeakers,
making them sound “boomy,” as at higher intensities the ear becomes relatively
more sensitive to lower frequencies than to higher frequencies.
Various SPL meters attempt to give an approximate measure of the loudness of
complex tones. Such meters contain weighting networks (e.g., A, B, C, and RLB)
that weight the intensities computed in third-octave frequency bands with the
appropriate weighting curves before summing across frequencies.
The A weighting is based on a 30 phon equal loudness contour for measuring
complex sounds having relatively low sound levels, the B weighting is used for in-
termediate levels and approximates the 70 phon contour, whereas the C weighting is
for relatively high sound levels and approximates the 100 phon contour. The three
weighting network contours are shown in Fig. 3.14. Thus if a level is specified as 105

Fig. 3.14. The A, B, and C weighting networks.

dBC then the inverse C weighting (i.e., inverse of the contour shown in Fig. 3.14) is
used for computing the SPL.
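For reference, the A-weighting curve has a standard analytic pole-zero form; the constants below come from the IEC 61672 sound level meter standard, not from the text, so treat this as an external sketch.

```python
import math

def a_weight_db(f):
    """A-weighting in dB relative to 1 kHz, per the standard analytic
    approximation (IEC 61672 pole-zero form; an assumption, not from the text)."""
    f2 = f * f
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.00   # normalize to 0 dB at 1 kHz

print(round(a_weight_db(1000.0), 2))  # ~0.0 dB at 1 kHz by construction
print(round(a_weight_db(100.0), 1))   # strong low-frequency attenuation
```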

3.7.3 Loudness Versus Loudness Level

In the previous section the equal loudness contours were presented which were a
function of the loudness level in phons. Stevens [25] presented some data that derive
scales relating the physical magnitude of sounds to their subjective loudness. In this
process, the subject is asked to adjust the level of a test tone until it has a specified
loudness, either in absolute terms or relative to a standard (e.g., twice as loud, half
as loud, etc.). Stevens derived a closed-form expression that relates loudness L to
intensity I through a constant k as

L = k I^0.3                                                         (3.37)

which states that a doubling of loudness is achieved by a 10 dB difference in intensity


level. Stevens defined a “sone”, a unit for loudness, as the loudness of a 1 kHz tone
at 40 dB SPL. A 1 kHz tone at 50 dB SPL then would have a loudness of 2 sones.
Figure 3.15 shows the relation between sones and phons for a 1 kHz tone.
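The sone/phon relation implied by (3.37) can be written down directly; the function name below is illustrative.

```python
def sones_from_phons(phons):
    """Stevens' power law L = k*I^0.3: every 10-phon step doubles loudness,
    with 1 sone defined as a 1 kHz tone at 40 dB SPL (valid above ~40 phons)."""
    return 2.0 ** ((phons - 40.0) / 10.0)

print(sones_from_phons(40.0))  # 1.0 sone by definition
print(sones_from_phons(50.0))  # 2.0 sones: 10 dB more sounds twice as loud
```

The doubling per 10 dB follows from (3.37), since a 10 dB intensity increase multiplies loudness by 10^0.3 ≈ 2.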

3.7.4 Time Integration

The detection of tones (such as those represented by the absolute threshold of hearing
or the equal loudness contours) is also based on the duration of the stimulus tone.

Fig. 3.15. Loudness in sones versus loudness level in phons.

Fig. 3.16. The detectability of a 1 kHz tone as a function of the tone duration in milliseconds.

The relation between duration of the tone, t, and threshold intensity, I, required for
detection can be expressed as [26],

(I − IL ) × t = k (3.38)

where k is a constant, and IL is a threshold intensity of a long duration tone pulse.


For example, as shown in [27], the detectability of a tone pulse was constant
between 15 and 150 ms, but fell off as the duration increased or decreased beyond
these limits, as shown in Fig. 3.16 for a 1 kHz tone.
The fall in detectability with longer duration indicates that there is a limit to the
time over which the ear can integrate energy of the stimulus signal, whereas the fall
in detectability at low durations may be connected with the spread of energy over
frequency which occurs for signals of short duration. Specifically, it is hypothesized

that the ear can integrate energy over a fairly narrow frequency range and this range
is exceeded for short duration signals.

3.7.5 Frequency Selectivity of the Ear

The peripheral ear acts as a bank of band-pass filters due to the space-to-frequency
transformation induced by the basilar membrane [26]. These filters are known as
auditory filters; they have been studied by several researchers [28, 29, 30, 31] and
are often conceptualized as having either a rectangular or triangular shape, with a
simplifying assumption of symmetry around the filter’s center frequency.
The shape and bandwidth of these filters can be estimated, for example, through
the notched noise approach [32] where the width of the notch of a band-stop noise
spectrum is varied. Figure 3.17 shows a symmetric auditory filter which is cen-
tered on a sinusoidal tone with frequency f0 and a band-stop noise spectrum with a
notch of width 2∆f . By increasing the width of the notch, less noise passes through
the auditory filter and hence the threshold required to detect the sinusoidal tone of
frequency f0 decreases. By decreasing the notch width, more noise energy passes
through the auditory filter thereby making it harder for the sinusoidal tone to be de-
tected and thereby increasing the threshold.
The filter is parameterized in terms of the equivalent rectangular bandwidth
(ERB) and is expressed as a function of the filter center frequency f0 (expressed
in kHz) as

ERB(f0 ) = 24.7(4.37f0 + 1) (3.39)

and is shown in Fig. 3.18.


Another approach for estimating the shape and bandwidth of the auditory filters
is by assuming the noise spectrum to be centered on a sinusoidal tone where the filter
is assumed to be rectangular [28]. Fletcher measured the threshold of the sinusoidal
tone as a function of the bandwidth of the bandpass noise by keeping the overall
noise power density constant. It is generally observed that the threshold of the signal

Fig. 3.17. Estimation of the auditory filter shape or bandwidth with the notched noise ap-
proach.

Fig. 3.18. The equivalent rectangular bandwidth in Hz obtained from (3.39).

increases at first as the bandwidth increases, but then flattens out beyond a critical
bandwidth, such that any additional increase in noise bandwidth does not affect
detectability of the sinusoidal tone. Fletcher referred to the bandwidth CB at which
the signal threshold ceased to increase as the critical bandwidth (the basis of the
Bark scale). The critical bandwidth
can be modeled, for example, as

CB(f) = 25 + 75(1 + 1.4f^2)^0.69                                    (3.40)

and is shown in Fig. 3.19, where f is in kHz.

Fig. 3.19. The critical band model obtained from Eq. (3.40).

Fig. 3.20. Comparison between ERB and critical bandwidth models.

Figure 3.20 shows the differences between the ERB and critical band filter band-
widths as a function of center frequency. It is evident that the ERB-based auditory
filter models have better frequency resolution at lower frequencies than the critical
band-based auditory filter models, whereas the differences are generally not substan-
tial at higher frequencies.
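The two bandwidth models of (3.39) and (3.40) are easy to compare numerically; a sketch (bandwidths in Hz, center frequency in kHz):

```python
def erb_hz(f_khz):
    """Equivalent rectangular bandwidth in Hz, Eq. (3.39); f in kHz."""
    return 24.7 * (4.37 * f_khz + 1.0)

def critical_bw_hz(f_khz):
    """Critical-band model in Hz, Eq. (3.40); f in kHz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# ERB filters are narrower (finer frequency resolution) than critical bands
# at low center frequencies; the difference shrinks toward high frequencies.
for f in (0.1, 0.5, 1.0, 4.0):
    print(f, round(erb_hz(f), 1), round(critical_bw_hz(f), 1))
```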

3.8 Summary
In this chapter we have presented the fundamentals of acoustics and sound
propagation in rooms, including reverberation time and its measurement. We have also
presented the concept of a room response and the popular stimulus signals used for
measuring room impulse responses. Finally, concepts from psychoacoustics relating
to perception of sound were presented.
Part III

Immersive Audio Processing


4
Immersive Audio Synthesis and Rendering Over
Loudspeakers

4.1 Introduction
Multichannel sound systems such as those used in movie or music reproduction in
5.1 channel surround sound systems or new formats such as 10.2 channel immersive
audio require many more tracks for content production than the number of audio
channels used in reproduction. This has been true since the early days of monophonic
and two-channel stereo recordings that used multiple microphone signals to create
the final one- or two-channel mixes.
In music recording there are several constraints that dictate the use of multiple
microphones. These include the sound pressure level of various instruments, the ef-
fects of room acoustics and reverberation, the spectral content of the sound source,
the spatial distribution of the sound sources in the space, and the desired perspective
that will be rendered over the loudspeaker system.
As a result it is not uncommon to find that tens of microphones may be used
to capture a realistic musical performance that will be rendered in surround sound.
Some of these are placed close to instruments or performers and others farther away
so as to capture the interaction of the sound source with the environment.
Despite the emergence of new consumer formats that support multiple audio
channels for music, the growth of content has been slow. In this chapter we de-
scribe methods that can be used to automatically generate the multiple microphone
signals needed for a multichannel rendering without having to record using multiple
real microphones, which we refer to as immersive audio synthesis. The applications
of such virtual microphones include both the conversion of older recordings to
today’s 5.1 channel formats and the upconversion of today’s 5.1 channel content to
future multichannel formats that will inevitably consist of more channels for more
realistic reproduction.

© 2000 IEEE. Reprinted, with permission, from C. Kyriakakis and A. Mouchtaris, “Virtual
microphones for multichannel audio applications”, Proc. IEEE Conf. on Multimedia
and Expo, 1:11–14.

Immersive audio rendering involves accurate reproduction of three-dimensional


sound fields that preserve the desired spatial location, frequency response, and dy-
namic range of multiple sound sources in the environment. An immersive audio sys-
tem is capable of rendering sound images positioned at arbitrary locations around a
listener. There are two general approaches to building these systems. The first is to
completely surround the listener with a large number of loudspeakers to reproduce
the sound field of the target scene. The second is to reproduce the necessary acoustic
signals at the ears of the listener as they would occur under natural listening condi-
tions. This method, called binaural audio, is applicable to both headphone and, with
some modifications, to loudspeaker reproduction.
In this chapter we describe a general methodology for rendering binaural sound
over loudspeakers that can be generalized to multiple listeners. We present the math-
ematical formulation for the necessary processing of signals to be played from the
loudspeakers for the case of two listeners. The methods presented here can also be
scaled to more listeners. In theory, loudspeaker rendering of binaural signals requires
fewer loudspeakers than the multiple loudspeaker methods proposed by surround
sound systems. This is because binaural rendering relies on the reproduction of the
sound pressures at each listener’s ear. However, binaural rendering over loudspeak-
ers has traditionally been thought of as a single listener solution and has not been
applied to loudspeaker systems intended for multiple listeners. There are two gen-
eral methods for single listener binaural audio rendering that can be categorized as
headphone reproduction and loudspeaker reproduction [33, 34, 35, 36].
Head-related binaural recording, or dummy-head stereophony methods, attempt
to accurately capture and reproduce at each eardrum the sound pressure generated by
sound sources and their interactions with the acoustic environment and the pinnae,
head, and torso of the listeners [37, 38, 39]. Transaural audio is a method that was
developed to deliver binaural signals to the ears of listeners using two loudspeakers.
The basic idea is to filter the binaural signal such that the crosstalk terms from the
loudspeakers to the opposite side ear are reduced so that the signals at the ears of
the listener approach those captured by a binaural recording. The technique was pre-
sented in [41, 42] and later developed fully by Cooper and Bauck [43], who coined
the term “transaural audio”. Previous work in the literature [43, 44, 45, 46] has fo-
cused on both theoretical and practical methods for generalizing crosstalk cancella-
tion filter design using matrix formulations. Cooper and Bauck [46] also discussed
some ideas for developing transaural systems for multiple listeners with multiple
loudspeakers.
Crosstalk cancellation filter design for loudspeaker reproduction systems has
been proposed using least mean squares (LMS) adaptive algorithm methods in sym-
metric or nonsymmetric environments [44, 45, 51, 52]. As the eigenvalue spread of
the input autocorrelation matrix increases, the convergence speed of the LMS adap-
tive algorithm for multichannel adaptation decreases. To solve this problem, algo-
rithms such as discrete Fourier transform (DFT)/LMS and discrete cosine transform
(DCT)/LMS can be used to decorrelate the input signals by preprocessing with a
transformation that is independent of the input signal. In this chapter we present
crosstalk cancellation filter design methods based on the LMS adaptive inverse algo-

rithm, and the normalized frequency domain adaptive filter (NFDAF) LMS inverse
algorithm [55, 54]. The authors wish to acknowledge Dr. Athanasios Mouchtaris
whose PhD dissertation at the USC Immersive Audio Laboratory formed the basis
for much of the work described regarding synthesis and Dr. Jong-Soong Lim whose
PhD dissertation formed the basis for much of the work on rendering.

4.2 Immersive Audio Synthesis


4.2.1 Microphone Signal Synthesis

The problem of synthesizing a virtual microphone signal from a real microphone


signal recorded at a different position in the room can be formulated as a general
filtering problem. In order to derive the filter, it is first necessary to train the system
using a pair of real microphones at the desired locations. If we call the signals in
these microphones m1 and m2 then it is desirable to construct a filter V that when
applied to m1 results in a signal that is as close as possible to m2 . The difference
between the synthesized signal and the real signal must be as small as possible, both
from an objective measure as well as from a psychoacoustic point of view.
There are several possible methods that can be used to find the necessary filters
that will synthesize the virtual microphone signals. The most common among these
methods may be to use an adaptive filter approach. The drawback of such a choice in
this case is that acoustical performance spaces used for recording may exhibit very
long reverberation times (sometimes longer than two seconds), and this would impose
unreasonably large tap counts and associated computational and memory requirements
on finite impulse response filters derived from adaptive methods.
In the sections below, we describe algorithms that utilize infinite impulse re-
sponse filters. This is a particularly relevant choice for synthesizing virtual micro-
phones that are used to capture the interactions of sound sources with the acoustical
characteristics of the space and thus are placed at large distances from the sound
sources. In a way, these microphones represent the reverberant field that has been
modeled extensively using comb-filter techniques based on IIR. The resulting virtual
microphone synthesis filters are more computationally efficient than their FIR coun-
terparts.
To ensure that the resulting filters are stable, it is important to follow a design ap-
proach that results in a minimum-phase IIR filter. Using a classical music recording
in a concert hall as an example, we can define the direct signal from the orchestra
on stage as s, and m1 and m2 as the microphone signals captured at two positions.
In effect, these signals are the result of convolution of the dry
orchestra signal with the room impulse response for each microphone position. The
method described below is based on an all-pole model of these filters. That is, the
filtering between s and m1 can be modeled as an all-pole filter, resulting in filter A1
and the filtering between s and m2 as another all-pole model resulting in filter A2 .
Then, the desired filter V can be applied to the signal m1 to generate m2p , which is an estimate of m2 . The filter V = A1 /A2 is a stable IIR filter.
78 4 Immersive Audio Synthesis and Rendering Over Loudspeakers

We can consider each of the microphone signals to be wide-sense stationary, or at least, that they can be separated into blocks that are wide-sense stationary. Under
that assumption, we can model each signal as an autoregressive process. If the direct
signal from the source is denoted by S in the frequency domain (s(n) in the time
domain), and the room impulse responses from the source to each microphone are
denoted as V1 and V2 , respectively, then M1 = V1 S and M2 = V2 S.
Then, V1 and V2 can be modeled as all-pole filters of order r, resulting in the following AR models

m_1(n) = \sum_{i=1}^{r} a_1(i)\, m_1(n-i) + s(n)

m_2(n) = \sum_{i=1}^{r} a_2(i)\, m_2(n-i) + s(n)    (4.1)

These can be expressed in the z-domain as

V_1(z) = \frac{M_1(z)}{S(z)} = \frac{1}{A_1(z)}

V_2(z) = \frac{M_2(z)}{S(z)} = \frac{1}{A_2(z)}    (4.2)

in which the denominator terms of the all-pole filters are given by

A_1(z) = 1 - \sum_{k=1}^{r} a_1(k)\, z^{-k}

A_2(z) = 1 - \sum_{k=1}^{r} a_2(k)\, z^{-k}    (4.3)

The required filter V, which can be used to synthesize the microphone signal m2 from the reference signal m1, is

V = \frac{V_2}{V_1} = \frac{A_1}{A_2}    (4.4)
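In practice, once the coefficient sets of A1 and A2 have been estimated, applying V reduces to a single recursive filtering call. The following sketch assumes SciPy; the helper name and argument layout are ours, not from the text:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_virtual_mic(m1, a1, a2):
    """Apply V(z) = A1(z)/A2(z) to the reference microphone signal m1.

    a1 and a2 hold the predictor coefficients a(1)..a(r) of the two
    all-pole models, so that A(z) = 1 - sum_k a(k) z^-k.  Illustrative
    helper, not taken from the text.
    """
    A1 = np.concatenate(([1.0], -np.asarray(a1, float)))  # numerator of V
    A2 = np.concatenate(([1.0], -np.asarray(a2, float)))  # denominator of V
    return lfilter(A1, A2, m1)
```

Because A2 is minimum phase by construction, the recursion is stable, and the cost is O(r) operations per sample regardless of the reverberation time.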
One way to ensure that each virtual microphone filter V is both stable and computationally efficient is to use linear prediction analysis to design a stable all-pole filter. With linear prediction, a certain number of past samples in each signal time domain record are linearly combined to provide an estimate of future samples. For example, if at time n in the signal m1, q past samples are considered, then the estimate of the signal at time n can be written as

m_1^{lp}(n) = \sum_{k=1}^{q} a(k)\, m_1(n-k)    (4.5)

The prediction error of this process from the actual microphone signal is then

e(n) = m_1(n) - m_1^{lp}(n)    (4.6)
which can be written as a transfer function in the z-domain

\frac{E(z)}{M_1(z)} = 1 - \sum_{k=1}^{q} a(k)\, z^{-k}    (4.7)

In real-world situations, an autoregressive model will not provide an exact model of the system and will result in a modeling error term given by

e(n) = s(n) + e_{AR}(n)    (4.8)
in which eAR (n) is the error that arises from the incorrect modeling of the source.
Minimizing the error e(n) is equivalent to minimizing the error eAR (n). Further-
more, minimization of the error e(n) produces the coefficients a(i) in (4.7) that in
fact are the same as those of filter A1 . This is easily seen if E(z) is substituted by
S(z) in (4.7). These coefficients can be calculated using linear prediction to minimize the mean squared error between m_1(n) and m_1^{lp}(n) [10, 8]. Linear prediction
is a special case of linear optimum filtering, thus the principle of orthogonality holds.
Accordingly, minimization of the error is equivalent to the error e(n) being orthog-
onal to all the input samples m1 (n − k) from which the error at time n is calculated
(such that k lies in the interval [1, q]); that is,
E\{m_1(n-k)\, e(n)\} = 0    (4.9)

E\Big\{m_1(n-k)\Big(m_1(n) - \sum_{i=1}^{q} a(i)\, m_1(n-i)\Big)\Big\} = 0    (4.10)

r(-k) = \sum_{i=1}^{q} a(i)\, r(i-k)    (4.11)

in which r(n) is the autocorrelation function of m1 (n). Equation (4.11) makes use of
the fact that the process m1 is wide-sense stationary in the block under consideration.
Finally, because the autocorrelation function is symmetric and the absolute value of i − k lies in the interval [0, q − 1], (4.11) can be rewritten in matrix form as

\begin{bmatrix} r(0) & r(1) & \cdots & r(q-1) \\ r(1) & r(0) & \cdots & r(q-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(q-1) & r(q-2) & \cdots & r(0) \end{bmatrix} \begin{bmatrix} a(1) \\ a(2) \\ \vdots \\ a(q) \end{bmatrix} = \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(q) \end{bmatrix}    (4.12)
The coefficients a(i) of the virtual microphone filter can be found from the above
equation by inverting the correlation matrix R. This can be performed very efficiently
using a recursive method such as the Levinson–Durbin algorithm because of the Toeplitz form of the correlation matrix R and under the assumption of ergodicity.
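This estimation step can be sketched in a few lines with SciPy's Levinson-based Toeplitz solver. The biased autocorrelation estimator used here is one common choice, not mandated by the text:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_coefficients(m1, q):
    """Solve the normal equations (4.12) for the predictor coefficients.

    Uses biased autocorrelation estimates r(0)..r(q); solve_toeplitz
    applies the Levinson recursion, exploiting the Toeplitz structure
    of the correlation matrix R."""
    n = len(m1)
    r = np.array([np.dot(m1[:n - k], m1[k:]) / n for k in range(q + 1)])
    # first column of R is [r(0) .. r(q-1)]; right-hand side is [r(1) .. r(q)]
    return solve_toeplitz(r[:q], r[1:])
```

For a signal that truly is autoregressive, the recovered coefficients approach the generating ones as the block length grows.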

4.2.2 Subjective Evaluation of Virtual Microphone Signals

The methods described in the previous section must be applied in blocks of data of
the two microphone signal processes m1 and m2 . A set of experiments was con-
ducted to subjectively verify the validity of these methods. Signal block lengths of 100,000 samples were chosen because the reverberation time of the hall in which the recordings were made is 2 s at a 48 kHz sampling rate. Experiments
were performed with various orders of filters A1 and A2 to obtain an understanding
of the tradeoffs between performance and computational efficiency. Relatively high
orders were required to synthesize a signal m2 from m1 with an acceptable error
between m2p (the reproduced process) and m2 (the actual microphone signal). The
error was assessed through blind A/B/X listening evaluations. An order of 10,000 co-
efficients for both the numerator and denominator of V resulted in an error between
the original and synthesized signals that was not detectable by the listeners. The per-
formance of the filter was also evaluated by synthesizing blocks from a section of the
signal different from the one that was used for designing the filter. Again, the A/B/X
evaluation showed that for orders higher than 10,000 the synthesized signal was in-
distinguishable from the original. Although such high order filters are impractical for
real-time applications, the performance of this method is an indication that the model
is valid and therefore worthy of further investigation to achieve filter optimization.
In addition to listening evaluations, a mathematical measure of the distance be-
tween the synthesized and the original processes can be found. This measure can be
used during the optimization process in order to achieve good performance and at
the same time minimize the number of coefficients. The difficulty in defining such a
measure is that it must also be psychoacoustically valid. This problem has been ad-
dressed in speech processing in which measures such as the log spectral distance and
the Itakura distance are used [47]. In the case presented here, the spectral character-
istics of long sequences must be compared with spectra that contain a large number
of peaks and dips that are narrow enough to be imperceptible to the human ear. To approximately match the spectral resolution of the human ear, 1/3-octave smoothing was performed [26], followed by a comparison of the resulting smoothed spectral cues. The results are shown in Fig. 4.1, in which the error between the spectra of the original (measured) microphone signal and the synthesized signal is plotted.
The two spectra are practically indistinguishable below 10 kHz. Although the error
increases somewhat at higher frequencies, the listening evaluations show that this is
not perceptually significant.
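A fractional-octave smoother of this kind can be sketched as a sliding-band average over each bin's 1/3-octave neighborhood. This is a simple illustrative version; the smoother actually used in [26] may differ in detail:

```python
import numpy as np

def third_octave_smooth(mag, freqs):
    """Smooth a magnitude spectrum with a sliding 1/3-octave window.

    For each bin at frequency f, average the magnitudes of all bins
    in [f * 2**(-1/6), f * 2**(1/6)] -- a rough stand-in for the
    ear's spectral resolution (illustrative only)."""
    lo, hi = 2.0 ** (-1.0 / 6.0), 2.0 ** (1.0 / 6.0)
    out = np.empty_like(mag)
    for i, f in enumerate(freqs):
        if f <= 0:
            out[i] = mag[i]          # leave the DC bin untouched
            continue
        band = (freqs >= f * lo) & (freqs <= f * hi)
        out[i] = mag[band].mean()
    return out
```

A flat spectrum passes through unchanged, while narrow peaks and dips are averaged away, which is exactly what makes the smoothed spectra comparable on a perceptual footing.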

4.2.3 Spot Microphone Synthesis Methods

The method described in the previous section is appropriate for the synthesis of
microphones placed far from the source that capture mostly reverberant sound in
the recording environment. However, it is common practice in music recording to
also use microphones that are placed very close to individual instruments. Synthe-
sis of such virtual microphone signals requires a different approach because these

Fig. 4.1. Magnitude response error between the approximating and approximated spectra.

signals exhibit quite different spectral characteristics compared to the reference microphones. These microphones are used, for example, near the tympani or the woodwinds in classical music so that these instruments can be emphasized in the multichannel mix during certain passages. The signal in these microphones is typically not reverberant because of their proximity to the instruments.
As suggested in the previous section, this problem can be classified as a system
identification problem. The most important consideration in this case is that it is not
theoretically possible to design a generic time-invariant filter that will be suitable for
any recording. Such a filter would have to vary with the temporal characteristics of
the frequency response of the signal. The response is closely related to the joint time
and frequency properties of the reference microphone signals.
The approach that we followed for recreating these virtual microphones is based
on a method used for synthesizing percussive instrument sounds [48]. Thus, the
method described here is applicable only for microphones located near percussion
instruments. According to [48], it is possible to synthesize percussive sounds in a
natural way, by following an excitation/filter model. The excitation part corresponds
to the interaction between the exciter and the resonating body of the instrument and
lasts until the structure reaches a steady vibration, and the resonance part corresponds
to the free vibration of the instrument body. The resonance part can be easily de-
scribed from the frequency response of the instrument using several modeling meth-
ods (e.g., the AR modeling method that was described in the previous paragraph).
Then, the excitation part can be derived by filtering the instrument’s response with
the inverse of the resonance filter. The excitation part is independent of the frequen-
cies and decays of the harmonics of the instrument at a given time (after the in-
strument has reached a steady vibration) so it can be used for synthesizing different
sounds by using an appropriate resonance filter. Therefore, it is possible to derive an

excitation signal from a recording that contains only the instrument we wish to en-
hance and then filter it with the resonance filter at a given time point of the reference
recording in order to enhance the instrument at that particular time point. It is im-
portant to mention that the recreated instrument does not contain any reverberation
if the excitation part was derived from a recording that did not originally contain any
reverberation.
The above analysis has been successfully tested with tympani sounds. A poten-
tial drawback of this method is that the excitation part depends on the way that the
instrument was struck so it is possible that more than one excitation signal might be
required for the same instrument. Also, for the case of the tympani sounds, it is not
an easy task to define a procedure for finding the exact time points that the tympani
was struck, that is, the points when the enhancement procedure should take place.
Solutions to overcome the drawbacks described above are under investigation.
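The excitation/filter decomposition itself is compact to express. In the sketch below (our own reduction of the method in [48]), the resonance is an all-pole filter 1/A(z) with predictor coefficients a, the excitation is obtained by inverse filtering with A(z), and resynthesis drives a resonance filter with the stored excitation:

```python
import numpy as np
from scipy.signal import lfilter

def split_excitation(note, a):
    """Deconvolve a close-miked percussive note into its excitation.

    `a` holds the AR coefficients a(1)..a(r) of the resonance filter
    1/A(z) (estimated, e.g., by linear prediction); filtering the note
    with A(z) itself yields the excitation residual."""
    A = np.concatenate(([1.0], -np.asarray(a, float)))
    return lfilter(A, [1.0], note)        # excitation = A(z) applied to note

def resynthesize(excitation, a):
    """Drive a (possibly different) resonance filter with the excitation."""
    A = np.concatenate(([1.0], -np.asarray(a, float)))
    return lfilter([1.0], A, excitation)
```

Resynthesizing with the same resonance coefficients reconstructs the original note; substituting the resonance filter estimated at a given time point of the reference recording enhances the instrument there instead.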

4.2.4 Summary and Future Research Directions

The methods described above are effective for synthesizing signals in virtual mi-
crophones that are placed at a distance from the sound source (e.g., orchestra) and
therefore, contain more reverberation. The IIR filtering solution was proposed ex-
actly for addressing the long reverberation-time problem, which meant long impulse
responses for the filters to be designed. On the other hand, signals from microphones
located close to individual sources (e.g., spot microphones near a particular musical
instrument) do not contain very much reverberation. A completely different prob-
lem arises when trying to synthesize such signals. Placing such microphones near
individual sources with varying spectral characteristics results in signals whose fre-
quency content will depend highly on the microphone positions.
In order to synthesize signals in such closely placed microphones it is necessary
to identify the frequency bands that need to be amplified or attenuated for each mi-
crophone. This can be easily achieved when the reference microphone is relatively
far from the orchestra, so that we can consider that all frequency bands were equally
weighted during the recording. In order to generate a reference signal from such
a distant microphone that can be used to synthesize signals in the nonreverberant
microphones it is necessary to find some method for dereverberating the reference
signal.
One complication with this approach is that we do not know the filter that trans-
forms the signal from the orchestra to the reference microphone. We are investigating
methods for estimating these filters based on a technique for blind channel identifi-
cation using cross-spectrum analysis [49]. The idea is to use the two closely spaced
microphones in the center (hanging above the conductor’s head) as two different ob-
servations of the same signal processed by two different channels (the path from the
orchestra to each of the two microphones). The Pozidis and Petropulu algorithm uses
the phase of the cross-spectrum of the two observations and allows us to estimate the
two channels. Further assumptions, though, need to be made in order to have a unique
solution to the problem. The most important is that the two channels are assumed to
be of finite length. In general, however, they can be nonminimum phase, which is

a desired property. All the required assumptions are discussed in [49] and their im-
plications for the specific problem examined here are currently under investigation.
After the channels have been identified, the recordings can be equalized using the
estimated filters. These filters can be nonminimum phase, as explained earlier, so a
method for equalizing nonminimum phase channels must be used. Several methods
exist for this problem; see, for example, [50]. The result will be not only a derever-
berated signal but an equalized signal, ideally equal to the signal that the microphone
would record in an anechoic environment. That signal could then be used as the seed
to generate virtual microphone signals that would result in multichannel mixes sim-
ulating various recording venues.

4.3 Immersive Audio Rendering


4.3.1 Rendering Filters for a Single Listener

A typical two-loudspeaker listening situation is shown in Fig. 4.2, in which X_L and X_R are the binaural signals sent to the listener's ears E_L and E_R through loudspeakers S_1 and S_2.
The system can be fully described by the following matrix equation

\begin{bmatrix} E_L \\ E_R \end{bmatrix} = \begin{bmatrix} T_1 & T_2 \\ T_3 & T_4 \end{bmatrix} \cdot \begin{bmatrix} S_1 \\ S_2 \end{bmatrix} = \begin{bmatrix} T_1 & T_2 \\ T_3 & T_4 \end{bmatrix} \cdot W \cdot \begin{bmatrix} X_L \\ X_R \end{bmatrix}    (4.13)

in which W is the matrix of the crosstalk canceller, and T_1, T_2, T_3, and T_4 are
the head-related transfer functions (HRTFs) between the loudspeakers and ears. To
generate a spatially rendered sound image, a rendering filter is required that delivers
the left channel binaural signal XL to EL , and the right channel binaural signal
XR to ER , while simultaneously eliminating unwanted crosstalk terms. If the above
conditions are satisfied exactly, matrix equation (4.13) can be formulated as follows

E=T·W·X=X (4.14)

Fig. 4.2. Geometry and signal paths from input binaural signals to ears that show the ipsilateral
signal paths (T1 , T4 ) and contralateral signal paths (T2 , T3 ).

In (4.14), the listener's ear signals and input binaural signals are E = [E_L E_R]^T and X = [X_L X_R]^T, respectively. The rendering system transfer function matrix T is

T = \begin{bmatrix} T_1 & T_2 \\ T_3 & T_4 \end{bmatrix}    (4.15)

To obtain optimum performance and deliver the desired signal to each ear, the
matrix product of transfer function matrix T and crosstalk canceling weight vector
matrix W should be the identity matrix

T \cdot W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}    (4.16)

Therefore the generalized rendering filter weight matrix W requires four weight
vectors to produce the desired signal at the ears of a single listener.
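When the four HRTFs are known, condition (4.16) can also be satisfied directly per frequency bin with the analytic 2 × 2 inverse; the adaptive designs below avoid this explicit inversion, but the closed form is a useful reference. A sketch (the regularization guard is our own addition, not part of the text):

```python
import numpy as np

def crosstalk_canceller(T1, T2, T3, T4, eps=1e-6):
    """Per-bin inverse of T = [[T1, T2], [T3, T4]] so that T.W = I.

    T1..T4 are complex frequency responses sampled on common bins;
    returns W1..W4 laid out as in (4.17), i.e. W = [[W1, W3], [W2, W4]]."""
    det = T1 * T4 - T2 * T3
    det = det + eps * (np.abs(det) < eps)   # guard ill-conditioned bins
    W1, W2 = T4 / det, -T3 / det            # first column of T^-1
    W3, W4 = -T2 / det, T1 / det            # second column of T^-1
    return W1, W2, W3, W4
```

On well-conditioned bins this yields exact cancellation; the adaptive filters of the following subsections trade this exactness for robustness, causality control via the modeling delay, and a tractable filter length.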

Time Domain Adaptive Inverse Control Filter

The weight vector matrix W described above can be implemented using the least
mean squares adaptive inverse algorithm [51]. Matrix equations (4.14) and (4.16)
must be modified based on the adaptive inverse algorithm for multiple channels as
follows [44, 45, 52]

E = T \cdot W \cdot X = T \cdot \begin{bmatrix} W_1 & W_3 \\ W_2 & W_4 \end{bmatrix} \cdot X = X    (4.17)

The desired result is to find W so that it cancels the crosstalk signals perfectly.
Then the signals E arriving at the ears are exactly the same as the input binaural
signals X. Equation (4.17) can be written as

E = \begin{bmatrix} T_1W_1 + T_2W_2 & T_1W_3 + T_2W_4 \\ T_3W_1 + T_4W_2 & T_3W_3 + T_4W_4 \end{bmatrix} \cdot X = \begin{bmatrix} T_1W_1X_L + T_2W_2X_L + T_1W_3X_R + T_2W_4X_R \\ T_3W_1X_L + T_4W_2X_L + T_3W_3X_R + T_4W_4X_R \end{bmatrix}    (4.18)

in which the diagonal elements T_1W_1 + T_2W_2 and T_3W_3 + T_4W_4 are the ipsilateral transfer functions, and the off-diagonal elements T_1W_3 + T_2W_4 and T_3W_1 + T_4W_2 are the contralateral transfer functions (crosstalk terms). All the vectors in (4.18) are
in the frequency domain. Equation (4.18) is then separated into two matrices: one
that is the matrix product of the crosstalk canceller weight vectors and the other with
the remaining terms

E = \begin{bmatrix} T_1X_L & T_2X_L & T_1X_R & T_2X_R \\ T_3X_L & T_4X_L & T_3X_R & T_4X_R \end{bmatrix} \cdot W = X    (4.19)

in which W is a column matrix [W_1 W_2 W_3 W_4]^T. The block diagram is shown in Fig. 4.3.

Fig. 4.3. LMS block diagram for the estimation of the crosstalk cancellation filter with d_1(n) = X_L(n − m) and d_2(n) = X_R(n − m) for the left and right channels, respectively.

Using the time domain LMS adaptive algorithm, the weight vectors are updated as follows,

W_i(n+1) = W_i(n) + \mu(-\hat{\nabla}_i(n)), \quad i = 1, \ldots, 4    (4.20)
The positive scalar step size µ controls the convergence rate and steady-state performance of the algorithm. The gradient estimate \hat{\nabla}(n) is simply the derivative of e^2(n) with respect to W(n) [56]. Therefore gradient estimates in the time domain can be found as

\hat{\nabla}_i(n) = -2\{e_1(n)[T_i(n) * X_L(n)] + e_2(n)[T_{i+2}(n) * X_L(n)]\}, \quad i = 1, 2

\hat{\nabla}_i(n) = -2\{e_1(n)[T_{i-2}(n) * X_R(n)] + e_2(n)[T_i(n) * X_R(n)]\}, \quad i = 3, 4    (4.21)
in which all input binaural signals and transfer functions are time domain sequences.
The output error is given by

e_i(n) = d_i(n) - y_i(n) = X_i(n-m) - \{[W_1(n) * T_{2i-1}(n) + W_2(n) * T_{2i}(n)] * X_L(n) + [W_3(n) * T_{2i-1}(n) + W_4(n) * T_{2i}(n)] * X_R(n)\}, \quad i = 1, 2    (4.22)

in which X_i(n) is

X_i(n) = \begin{cases} X_L(n) & i = 1 \\ X_R(n) & i = 2 \end{cases}    (4.23)
Figure 4.3 shows that di (n) could simply be a pure delay, say of m samples,
which will assist in the equalization of the minimum phase components of the trans-
fer function matrix in (4.18). The inclusion of an appropriate modeling delay sig-
nificantly reduces the mean square error produced by the equalization process. The
filter length, as well as the delay m, can be selected based on the minimization of the
mean squared error. This method can be used either offline or in real-time according
to the location of the virtual sound source and the position of the listener’s head. The
weight vectors of the crosstalk canceller can be chosen to be either an FIR or an IIR
filter.
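A reduced, single-path version of this adaptive inverse (one plant response t, no crosstalk terms) can be written in a few lines; the structure — the plant output feeding the adaptive filter, the error taken against a delayed copy of the input — mirrors Fig. 4.3. Tap count, delay, and step size below are illustrative values, not the book's:

```python
import numpy as np

def adaptive_inverse(t, x, num_taps=32, delay=16, mu=0.002, epochs=5):
    """Single-channel LMS adaptive inverse with a modeling delay.

    Adapts w so that w convolved with the plant response t approximates
    a pure delay of `delay` samples, i.e. the error x(n - delay) - y(n)
    is driven toward zero.  Returns the learned inverse filter."""
    w = np.zeros(num_taps)
    u = np.convolve(x, t)[:len(x)]          # plant output = filter input
    for _ in range(epochs):
        for n in range(num_taps, len(x)):
            u_vec = u[n:n - num_taps:-1]    # last num_taps samples, newest first
            e = x[n - delay] - np.dot(w, u_vec)
            w += 2 * mu * e * u_vec         # update of (4.20)/(4.21), scalar form
    return w
```

The cascade of t and the converged w should then be close to an impulse at the modeling delay, which is exactly the role the delay m plays in Fig. 4.3.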

Fig. 4.4. Frequency domain adaptive LMS inverse algorithm block diagrams using overlap-
save method for the estimation of crosstalk canceller weighting vectors based on Fig. 4.3
(i = 1, 2).

Frequency Domain Adaptive Inverse Filter

Frequency domain implementations of the LMS adaptive inverse filter have several
advantages over time domain implementations that include improved convergence
speed and reduced computational complexity. In practical implementations of fre-
quency domain LMS adaptive filters, the input power varies dramatically over the
different frequency bins. To overcome this, the frequency domain adaptive filter
(FDAF) LMS inverse algorithm [59] can be used to estimate the input power in each
frequency bin. The power estimate can be included directly in the frequency domain
LMS algorithm [55]. The adaptive inverse filter algorithm shown in Fig. 4.3 is mod-
ified in the frequency domain using the overlap-save method FDAF LMS inverse
algorithm [54], which is shown in Fig. 4.4.
The general form of FDAF LMS algorithms can be expressed as follows,

W(k+1) = W(k) + 2\mu(k) X^H(k) E(k)    (4.24)

in which the superscript H denotes the complex conjugate transpose. The time-varying matrix µ(k) is diagonal and contains the step sizes µ_l(k). Generally, each step size is varied according to the signal power in that frequency bin l. In the crosstalk canceller implementation described here

W_i(k+1) = W_i(k) + \mu \times F\Big\{F^{-1}\Big\{\frac{S_i^H(k)}{P_i(k)} \cdot E_1(k) + \frac{S_{4+i}^H(k)}{P_{4+i}(k)} \cdot E_2(k)\Big\}\Big\}, \quad i = 1, \ldots, 4    (4.25)

In (4.25), µ is a fixed scalar and S_i(k) is the product of the input signal (X_L or X_R) and the corresponding transfer function shown in Fig. 4.4. P_i(k) is an estimate of the signal power in the ith input signal

P_i(k) = \lambda P_i(k-1) + \alpha |S_i(k)|^2    (4.26)

in which λ = 1 − α is a forgetting factor. P_i(k) and S_i(k) are vectors composed of N different bins.
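The power-normalized update of (4.25)–(4.26) condenses to a few lines per block. This sketch drops the gradient (overlap–save) constraint and the multichannel bookkeeping, keeping only the per-bin normalization; the default mu and alpha are illustrative:

```python
import numpy as np

def fdaf_step(W, S, E, P, mu=0.1, alpha=0.1):
    """One block of the power-normalized frequency domain LMS update.

    S: input spectrum of the current block, E: error spectrum,
    P: running per-bin power estimate, refreshed with forgetting
    factor lambda = 1 - alpha as in (4.26) before the weight update."""
    P = (1.0 - alpha) * P + alpha * np.abs(S) ** 2
    W = W + 2.0 * mu * np.conj(S) * E / (P + 1e-12)
    return W, P
```

Because P is refreshed before the weight update, the per-bin step factor 2µ|S(k)|²/P(k) stays below 2µ/α, so bins with momentarily high input power cannot destabilize the recursion — the motivation given above for normalizing per bin.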

Fig. 4.5. Geometry for four loudspeakers and two listeners with the ipsilateral signal paths
(solid lines) and contralateral undesired signal paths (dotted lines).

4.3.2 Rendering Filters for Multiple Listeners

Filters for rendering immersive audio to multiple listeners simultaneously can be implemented using the filter design methods described above. For the case of two
listeners with four loudspeakers, the crosstalk canceller weighting vectors for the
necessary FIR filters can be determined using the least mean squares adaptive inverse
algorithm [8, 58] in which the adaptation occurs in the sampled time domain and in
the frequency domain. The purpose of virtual loudspeaker rendering for two listeners
is to generate virtual sound sources at two listener’s ears. In order to deliver the
appropriate sound field to each ear, it is necessary to eliminate crosstalk signals that
are inherent in all loudspeaker-based systems (Fig. 4.5).
For the case of two listeners and four loudspeakers there exist 12 crosstalk paths
that should be removed. A method is presented here for implementing such a system
based on adaptive algorithms. Results from four different configurations of adaptive
filter implementations are discussed that can deliver a binaural audio signal to each
listener’s ears for four different combinations of geometry and direction of rendered
image. The basic configuration is shown in Fig. 4.5.

General Nonsymmetric Case

Each listener's head is assumed to be at an arbitrary position relative to each loudspeaker pair S_1–S_2 and S_3–S_4. Therefore 16 possible head-related transfer functions can be generated from each of the four loudspeakers to the two listeners (denoted T_1 ∼ T_{16} as shown in Fig. 4.6).
The purpose of this rendering filter is to deliver left-channel and right-channel
audio signals XL and XR to each listener’s ears, respectively, with both listeners
perceiving the same spatial location for the rendered image. Transfer functions in
Fig. 4.6 are formulated into matrix equations as

E=T·S=T·W·X (4.27)

in which the ear signal matrix is E = [E_{L1} E_{R1} E_{L2} E_{R2}]^T, the loudspeaker signal matrix is S = [S_1 S_2 S_3 S_4]^T, and the input binaural signal matrix is X = [X_L X_R]^T. The rendering system head-related transfer function matrix T and the crosstalk cancellation filter W are defined as follows

T = \begin{bmatrix} T_1 & T_2 & T_3 & T_4 \\ T_5 & T_6 & T_7 & T_8 \\ T_9 & T_{10} & T_{11} & T_{12} \\ T_{13} & T_{14} & T_{15} & T_{16} \end{bmatrix} \qquad W = \begin{bmatrix} W_1 & W_5 \\ W_2 & W_6 \\ W_3 & W_7 \\ W_4 & W_8 \end{bmatrix}    (4.28)

From (4.28), the signal paths T1 , T6 , T11 , and T16 are ipsilateral signal paths,
and T2 , T3 , . . . , T15 are undesired contralateral crosstalk signal paths. If the product
of matrices T, W, and X can be simplified as (4.29) for the same side rendered
image, the crosstalk cancellation filter W will be the optimum inverse control filter.
Therefore the desired binaural input signals XL and XR can be delivered to the ears
of each listener without crosstalk.

E = T \cdot W \cdot X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} X = \begin{bmatrix} X_L \\ X_R \\ X_L \\ X_R \end{bmatrix}    (4.29)

Equation (4.27) is modified as follows

E = G X    (4.30)

G = \begin{bmatrix} T_1W_1 + T_2W_2 + T_3W_3 + T_4W_4 & T_1W_5 + T_2W_6 + T_3W_7 + T_4W_8 \\ T_5W_1 + T_6W_2 + T_7W_3 + T_8W_4 & T_5W_5 + T_6W_6 + T_7W_7 + T_8W_8 \\ T_9W_1 + T_{10}W_2 + T_{11}W_3 + T_{12}W_4 & T_9W_5 + T_{10}W_6 + T_{11}W_7 + T_{12}W_8 \\ T_{13}W_1 + T_{14}W_2 + T_{15}W_3 + T_{16}W_4 & T_{13}W_5 + T_{14}W_6 + T_{15}W_7 + T_{16}W_8 \end{bmatrix}

Equation (4.30) can then be modified as

E = \begin{bmatrix} T_1X_L & T_2X_L & T_3X_L & T_4X_L & T_1X_R & T_2X_R & T_3X_R & T_4X_R \\ T_5X_L & T_6X_L & T_7X_L & T_8X_L & T_5X_R & T_6X_R & T_7X_R & T_8X_R \\ T_9X_L & T_{10}X_L & T_{11}X_L & T_{12}X_L & T_9X_R & T_{10}X_R & T_{11}X_R & T_{12}X_R \\ T_{13}X_L & T_{14}X_L & T_{15}X_L & T_{16}X_L & T_{13}X_R & T_{14}X_R & T_{15}X_R & T_{16}X_R \end{bmatrix} \cdot W

Fig. 4.6. Geometry and transfer functions for four loudspeakers and two listeners (general
nonsymmetric case).

Fig. 4.7. LMS block diagrams for the estimation of crosstalk canceller weighting vectors in the general nonsymmetric case, with d_i(n) = X_L(n − m) for the left channels (i = 1, 3) and d_i(n) = X_R(n − m) for the right channels (i = 2, 4).

= [X_L\ X_R\ X_L\ X_R]^T    (4.31)

in which W is a column matrix [W_1 W_2 W_3 W_4 W_5 W_6 W_7 W_8]^T. Figure 4.7 shows a block diagram of weight vectors generated using the LMS algorithm for the general nonsymmetric case.
The weight vectors of the crosstalk canceller are updated based on the LMS
adaptive algorithm
W_i(n+1) = W_i(n) + \mu \times (-\hat{\nabla}_i(n)), \quad i = 1, \ldots, 8    (4.32)

In (4.32), the convergence rate of the LMS adaptive algorithm is controlled by the step size µ. The gradient estimate \hat{\nabla}(n) is simply the derivative of e^2(n) with respect to W(n). Therefore the gradient estimates in the time domain are

\hat{\nabla}_i(n) = -2\{e_1(n)[T_i(n) * X_L(n)] + e_2(n)[T_{4+i}(n) * X_L(n)] + e_3(n)[T_{8+i}(n) * X_L(n)] + e_4(n)[T_{12+i}(n) * X_L(n)]\}, \quad i = 1, \ldots, 4

\hat{\nabla}_i(n) = -2\{e_1(n)[T_{i-4}(n) * X_R(n)] + e_2(n)[T_i(n) * X_R(n)] + e_3(n)[T_{4+i}(n) * X_R(n)] + e_4(n)[T_{8+i}(n) * X_R(n)]\}, \quad i = 5, \ldots, 8    (4.33)

In (4.33), all input binaural signals and transfer functions are sample sequences
in the time domain. The output error is given by

ei (n) = di (n) − yi (n) = Xi (n − m) − yi (n) i = 1, . . . , 4 (4.34)

in which X_i(n) is

X_i(n) = \begin{cases} X_L(n) & i = 1, 3 \\ X_R(n) & i = 2, 4 \end{cases}    (4.35)

Fig. 4.8. Frequency domain adaptive LMS inverse algorithm block diagrams using overlap-
save method for the estimation of crosstalk canceller weighting vectors based on Fig. 4.7
(i = 1, . . . , 4).

The filter output y_i(n) is

y_i(n) = [W_1(n) * T_{4i-3}(n) + W_2(n) * T_{4i-2}(n) + W_3(n) * T_{4i-1}(n) + W_4(n) * T_{4i}(n)] * X_L(n) + [W_5(n) * T_{4i-3}(n) + W_6(n) * T_{4i-2}(n) + W_7(n) * T_{4i-1}(n) + W_8(n) * T_{4i}(n)] * X_R(n), \quad i = 1, \ldots, 4    (4.36)

Figure 4.8 shows the block diagram of the frequency domain adaptive LMS
inverse algorithm for the general nonsymmetric case.
By using the weight vector adaptation algorithm in (4.24),

W_i(k+1) = W_i(k) + \mu \times F\Big\{F^{-1}\Big\{\frac{S_i^H(k)}{P_i(k)} \cdot E_1(k) + \frac{S_{8+i}^H(k)}{P_{8+i}(k)} \cdot E_2(k) + \frac{S_{16+i}^H(k)}{P_{16+i}(k)} \cdot E_3(k) + \frac{S_{24+i}^H(k)}{P_{24+i}(k)} \cdot E_4(k)\Big\}\Big\}, \quad i = 1, \ldots, 8    (4.37)

In (4.37), S_i(k) and P_i(k) are as defined for the single-listener case above.

Symmetric Case

Each of the two listeners in this configuration is seated at the center line of each loud-
speaker pair. This implies that several of the HRTFs in this geometry are identical
due to symmetry (assuming that no other factors such as room acoustics influence
the system). Therefore the 16 HRTFs from T1 to T16 can be reduced to just 6 HRTFs

Fig. 4.9. Geometry and transfer functions for four loudspeakers and two listeners with symmetric geometry.

(T_1 = T_6 = T_{11} = T_{16}, T_2 = T_5 = T_{12} = T_{15}, T_3 = T_{14}, T_4 = T_{13}, T_7 = T_{10}, and T_8 = T_9) as shown in Fig. 4.9.
This filter has the same property as in the general nonsymmetric case, but the eight filters (W_1 ∼ W_8) of the general case reduce to four filters (W_1 ∼ W_4) using the symmetry property. From (4.27), (4.29), and the symmetry property we find

T_s \cdot W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}    (4.38)

The symmetric rendering system HRTF matrix T_s is

T_s = \begin{bmatrix} T_1 & T_2 & T_3 & T_4 \\ T_2 & T_1 & T_7 & T_8 \\ T_8 & T_7 & T_1 & T_2 \\ T_4 & T_3 & T_2 & T_1 \end{bmatrix}    (4.39)

Therefore the crosstalk canceller weighting vector matrix W will be

W = T_s^{-1} \cdot \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}    (4.40)

From (4.40), the matrix inverse is expanded as follows:

W = \frac{1}{\det(T_s)} \begin{bmatrix} A_{11} & A_{21} & A_{31} & A_{41} \\ A_{12} & A_{22} & A_{32} & A_{42} \\ A_{13} & A_{23} & A_{33} & A_{43} \\ A_{14} & A_{24} & A_{34} & A_{44} \end{bmatrix} \cdot \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} = \frac{1}{\det(T_s)} \begin{bmatrix} A_{11} + A_{31} & A_{21} + A_{41} \\ A_{12} + A_{32} & A_{22} + A_{42} \\ A_{13} + A_{33} & A_{23} + A_{43} \\ A_{14} + A_{34} & A_{24} + A_{44} \end{bmatrix}    (4.41)

in which A_{ij} is the (i, j) cofactor of T_s. Based on symmetry, A_{11} = A_{44}, A_{12} = A_{43}, A_{13} = A_{42}, A_{14} = A_{41}, A_{21} = A_{34}, A_{22} = A_{33}, A_{23} = A_{32}, and A_{24} = A_{31}. Therefore, (4.41) becomes

W = \frac{1}{\det(T_s)} \begin{bmatrix} A_{11} + A_{24} & A_{14} + A_{21} \\ A_{12} + A_{23} & A_{13} + A_{22} \\ A_{13} + A_{22} & A_{12} + A_{23} \\ A_{14} + A_{21} & A_{11} + A_{24} \end{bmatrix} = \begin{bmatrix} W_1 & W_3 \\ W_2 & W_4 \\ W_4 & W_2 \\ W_3 & W_1 \end{bmatrix}    (4.42)

From the above equation, (4.27) can be rearranged as follows.

E = T_s \cdot W \cdot X = \begin{bmatrix} T_1X_L + T_4X_R & T_2X_L + T_3X_R & T_4X_L + T_1X_R & T_3X_L + T_2X_R \\ T_2X_L + T_8X_R & T_1X_L + T_7X_R & T_8X_L + T_2X_R & T_7X_L + T_1X_R \\ T_8X_L + T_2X_R & T_7X_L + T_1X_R & T_2X_L + T_8X_R & T_1X_L + T_7X_R \\ T_4X_L + T_1X_R & T_3X_L + T_2X_R & T_1X_L + T_4X_R & T_2X_L + T_3X_R \end{bmatrix} \cdot \begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W_4 \end{bmatrix} = \begin{bmatrix} X_L \\ X_R \\ X_L \\ X_R \end{bmatrix}    (4.43)
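The block symmetry claimed in (4.42) — the third and fourth rows of W repeating the second and first rows with their columns swapped — follows from the centrosymmetric structure of T_s and can be spot-checked numerically per frequency bin. In the sketch below, random real values stand in for the six distinct HRTFs at one bin:

```python
import numpy as np

def symmetric_renderer(T1, T2, T3, T4, T7, T8):
    """Solve Ts . W = [[1,0],[0,1],[1,0],[0,1]] of (4.38) for one bin."""
    Ts = np.array([[T1, T2, T3, T4],
                   [T2, T1, T7, T8],
                   [T8, T7, T1, T2],
                   [T4, T3, T2, T1]])
    D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
    return np.linalg.solve(Ts, D)   # W of Eq. (4.40)
```

For any invertible T_s of this form, the solution satisfies W[3] = reversed W[0] and W[2] = reversed W[1], confirming that only the four filters W_1 ∼ W_4 need to be stored.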

Figure 4.10 shows a block diagram for generating weight vectors using LMS
algorithms with the symmetry property.
The weight vectors of the crosstalk canceller are updated based on the LMS
adaptive algorithm

Fig. 4.10. LMS block diagrams for the estimation of crosstalk canceller weighting vectors for
the symmetric case.

Fig. 4.11. Frequency domain adaptive LMS inverse algorithm block diagrams using overlap-
save method for the estimation of crosstalk canceller weighting vectors based on Fig. 4.10
(i = 1, . . . , 4).

W_i(n+1) = W_i(n) + \mu \times (-\hat{\nabla}_i(n)), \quad i = 1, \ldots, 4    (4.44)
In (4.44), the convergence rate of the LMS adaptive algorithm is controlled by the step size µ. The gradient estimate \hat{\nabla}(n) is simply the derivative of e^2(n) with respect to W(n). Therefore gradient estimates in the time domain can be written as

\hat{\nabla}_i(n) = -2[e_1(n)C_{1i}(n) + e_2(n)C_{2i}(n) + e_3(n)C_{3i}(n) + e_4(n)C_{4i}(n)], \quad i = 1, \ldots, 4    (4.45)
in which C_{ij} is the element in the ith row and jth column of the matrix C

C = \begin{bmatrix} T_1X_L + T_4X_R & T_2X_L + T_3X_R & T_4X_L + T_1X_R & T_3X_L + T_2X_R \\ T_2X_L + T_8X_R & T_1X_L + T_7X_R & T_8X_L + T_2X_R & T_7X_L + T_1X_R \\ T_8X_L + T_2X_R & T_7X_L + T_1X_R & T_2X_L + T_8X_R & T_1X_L + T_7X_R \\ T_4X_L + T_1X_R & T_3X_L + T_2X_R & T_1X_L + T_4X_R & T_2X_L + T_3X_R \end{bmatrix}    (4.46)
In (4.43), all input binaural signals and transfer functions are sample sequences in
the time domain. The output error is shown in Fig. 4.10. Figure 4.11 shows the block
diagram of the frequency domain adaptive LMS inverse algorithm for the symmetric
case.
By using the weight vector adaptation algorithm in (4.24) we find

W_i(k+1) = W_i(k) + \mu \times F\Big\{F^{-1}\Big\{\frac{S_i^H(k)}{P_i(k)} \cdot E_1(k) + \frac{S_{4+i}^H(k)}{P_{4+i}(k)} \cdot E_2(k) + \frac{S_{8+i}^H(k)}{P_{8+i}(k)} \cdot E_3(k) + \frac{S_{12+i}^H(k)}{P_{12+i}(k)} \cdot E_4(k)\Big\}\Big\}, \quad i = 1, \ldots, 4    (4.47)

4.3.3 Simulation Results

Single Listener Case

In this section we describe the performance of crosstalk cancellation filters implemented using the algorithms described in the previous sections. The values of
the delay m and tap size of each FIR filter in the adaptive algorithm were cho-
sen so as to minimize adaptation error and make the FIR filter causal. The train-
ing input data used for each adaptive algorithm consisted of random noise signals
with zero mean and unity variance in the frequency range between 200 Hz and
10 kHz. The performance of the crosstalk canceller was measured using (4.18). It
was found that the desired characteristics in the frequency domain for the ipsilat-
eral and contralateral signal transfer functions require that the magnitude response
of the ipsilateral signal transfer functions in the frequency domain should satisfy
|T1 (ω)W1 (ω) + T2 (ω)W2 (ω)| = 1, and |T3 (ω)W3 (ω) + T4 (ω)W4 (ω)| = 1 for
lossless signal transfer in the expected frequency band. The ipsilateral signal transfer
function should be linear phase: ∠(T1 (ω)W1 (ω) + T2 (ω)W2 (ω)) = exp(−jnω),
and ∠(T3 (ω)W3 (ω) + T4 (ω)W4 (ω)) = exp(−jnω). The magnitude response of the
contralateral signal transfer functions should satisfy |T1 (ω)W3 (ω)+T2 (ω)W4 (ω)| =
0, and |T3 (ω)W1 (ω) + T4 (ω)W2 (ω)| = 0 for perfect crosstalk cancellation. All of
the requirements described above apply to the frequency range between 200 Hz and
10 kHz.
Figure 4.12 shows some typical results for the LMS adaptive inverse algorithm.
The magnitude response of the ipsilateral signal is about 0 dB in the frequency
range between 200 Hz and 10 kHz with linear phase. Therefore the desired signal can
be transferred from loudspeaker to ear without distortion. The magnitude response

Fig. 4.12. Frequency response of crosstalk canceller adapted in the time domain. (a) Magni-
tude response of T1 (ω)W1 (ω) + T2 (ω)W2 (ω); (b) magnitude response of T1 (ω)W3 (ω) +
T2 (ω)W4 (ω); (c) phase response of T1 (ω)W1 (ω) + T2 (ω)W2 (ω).

Fig. 4.13. Frequency response of crosstalk canceller adapted in the frequency domain.
(a) Magnitude response of T1 (ω)W1 (ω) + T2 (ω)W2 (ω); (b) magnitude response of
T1 (ω)W3 (ω) + T2 (ω)W4 (ω); (c) phase response of T1 (ω)W1 (ω) + T2 (ω)W2 (ω).

of the contralateral signal is at least 20 dB below the ipsilateral signal in the same
range. Figure 4.13 presents the result of the normalized frequency domain adaptive
filter inverse algorithm.
The magnitude response of the ipsilateral signal is about 0 dB in the frequency
range between 200 Hz and 10 kHz with linear phase. It has almost the same mag-
nitude response as Fig. 4.12. However, the magnitude response of the contralateral
signal is suppressed more than 40 dB below the ipsilateral signal.

Multiple Listener Case

The experiments in this case were conducted as shown in Fig. 4.5. The tap size of
the measured HRTFs was 256 samples at a sampling rate of 44.1 kHz. A random
noise signal was used for the input of the adaptive LMS algorithm. This signal was
sampled at 44.1 kHz in the frequency bands between 200 Hz and 10 kHz. The filter
coefficients were obtained using LMS in the time and frequency domain as described
above. The performance of the rendering filter for the general nonsymmetric case was
measured based on the matrix equation (4.18). The desired magnitude and phase
response in the frequency domain should satisfy
|M| = ⎡ |A_11|  |A_12| ⎤   ⎡ 1  0 ⎤
      ⎢ |A_21|  |A_22| ⎥ = ⎢ 0  1 ⎥ ,   200 Hz ≤ f ≤ 10 kHz    (4.48)
      ⎢ |A_31|  |A_32| ⎥   ⎢ 1  0 ⎥
      ⎣ |A_41|  |A_42| ⎦   ⎣ 0  1 ⎦
where A11 = T1 W1 +T2 W2 +T3 W3 +T4 W4 , A12 = T1 W5 +T2 W6 +T3 W7 +T4 W8 ,
A21 = T5 W1 + T6 W2 + T7 W3 + T8 W4 , A22 = T5 W5 + T6 W6 + T7 W7 + T8 W8 ,
A31 = T9 W1 +T10 W2 +T11 W3 +T12 W4 , A32 = T9 W5 +T10 W6 +T11 W7 +T12 W8 ,
A41 = T13 W1 + T14 W2 + T15 W3 + T16 W4 , and A42 = T13 W5 + T14 W6 + T15 W7 +
T16 W8 . Furthermore, all transfer functions (HRTFs) and weight vectors (crosstalk
canceller coefficients) are in the frequency domain. Define M_{ij} as the element in
the ith row and jth column of the magnitude matrix |M| above. In this matrix, the
desired magnitude responses of the ipsilateral and contralateral signals are 1 and 0,
respectively. The phase response is
∠(P) = ⎡ ∠A_11  ∠A_12 ⎤   ⎡ e^{−jnω}  X ⎤
       ⎢ ∠A_21  ∠A_22 ⎥ = ⎢ X  e^{−jnω} ⎥ ,   200 Hz ≤ f ≤ 10 kHz    (4.49)
       ⎢ ∠A_31  ∠A_32 ⎥   ⎢ e^{−jnω}  X ⎥
       ⎣ ∠A_41  ∠A_42 ⎦   ⎣ X  e^{−jnω} ⎦
in which X in the matrix indicates “don’t care” because of its small magnitude re-
sponse. For optimum performance, the ipsilateral signals in (4.49) should have linear
phase in the frequency band between 200 Hz and 10 kHz so that there is no phase
distortion. Let us define P_{ij} as the element in the ith row and jth column of the phase
matrix P. Simulations of (4.48) using LMS adaptation in the time domain are shown
in Fig. 4.14 in the frequency domain.
It can be seen that the frequency response of the ipsilateral signal in equation
(4.48) is very close to 0 dB in the frequency range between 200 Hz and 10 kHz with
linear phase. Therefore the desired ipsilateral signal (input binaural signal) reaches
the ear from the same-side loudspeaker without distortion as desired. The magnitude
response of the undesired contralateral signal is suppressed between 20 dB and 40
dB relative to the ipsilateral signal in the same frequency range. The same results are
shown in the frequency domain in Fig. 4.15.
The frequency response of the ipsilateral signal is nearly identical to the response
in Fig. 4.14. The magnitude response of the contralateral signal is suppressed around
40 dB relative to the ipsilateral signal in the same frequency range.

4.3.4 Summary

We described a general methodology for rendering binaural sound over loudspeakers
that can be generalized to multiple listeners. We presented the mathematical
formulation for the necessary processing of signals to be played from the loudspeakers for

Fig. 4.14. Frequency response where the weight vectors were obtained based on the LMS
algorithm in the time domain.

Fig. 4.15. Frequency response where the weight vectors were obtained based on the LMS
algorithm in the frequency domain.

the case of two listeners. The methods presented here can also be scaled to more
listeners.
5
Multiple Position Room Response Equalization

This chapter is concerned with the simultaneous equalization of acoustical responses
at multiple locations in a room. The importance of equalization is well known,
in that it allows (i) delivery of high-quality audio to listeners in a room,
and (ii) improved rendering of spatial audio effects for a sense of audio immer-
sion. Typical applications include home theater, movie theaters, automobiles, and
any loudspeaker based playback environment (headphones, cell phones, etc.). Be-
cause experiencing movies and music is now primarily a group experience (such as
in home theaters, automobiles, and movie theaters), and headphone/earbud acous-
tics vary due to ear coupling effects, it is important to include acoustic variations in
the design of an equalization filter. Thus, an equalization filter designed to compen-
sate for the room effects (viz., multipath reflections) at a single location performs
poorly at other locations in a room. This is because room impulse responses vary
significantly with differing source receiver (viz., listener) positions. A good equal-
ization filter should compensate the effects of multipath reflections simultaneously
over multiple locations in a room. This chapter briefly introduces some traditional
room equalization techniques, and presents in detail a new multiple listener (or mul-
tiple position) equalization filter using pattern recognition techniques. Because the
filter lengths can be large, a popular psychoacoustic scheme described in this chap-
ter allows design of low filter orders, using the pattern recognition technique, for
real-time implementation. Additionally, a room response and equalization visualiza-
tion technique, the Sammon map, is presented to interpret the results. Furthermore,
one of the major factors that affects equalization performance is the reverberation
of the room. In this chapter, the equalization performance of the pattern recognition
method [60] is compared with the well-known root mean square averaging-based
equalization, using the image method [61] for synthesizing responses with varying
reverberation times T60 .

© 2006 IEEE. Reprinted, with permission, from S. Bharitkar and C. Kyriakakis, “Visualization
of multiple listener room acoustic equalization with the Sammon map”, IEEE
Trans. on Speech and Audio Proc., (in press).

5.1 Introduction
An acoustic enclosure can be modeled as a linear system whose behavior is char-
acterized by a response, known as the impulse response, h(n); n ∈ {0, 1, 2, . . . }.
When the enclosure is a room the impulse response is known as the room impulse
response with a frequency response, H(ejω ). Generally, H(ejω ) is also referred to
as the room transfer function (RTF). The impulse response yields a complete de-
scription of the changes a sound signal undergoes when it travels from a source to
a receiver (microphone/listener) via a direct path and multipath reflections due to
the presence of reflecting walls and objects. By its very definition the room impulse
response is obtained at a receiver (e.g., a microphone) located at a predetermined
position in a room, after the room is excited by a broadband source signal such as
the MLS or the logarithmic chirp signal (described in chapter 3).
It is well established that room responses change with source and receiver loca-
tions in a room [11, 63]. Other reasons for minor variations in the room responses are
due to changes in the room, such as opening/closing of doors and windows. When
these minor variations are ignored, a room response can be uniquely defined by a set

of spatial coordinates, l_i = (x_i, y_i, z_i). It is assumed that the source is at the origin
and that receiver i is at the three spatial coordinates, x_i, y_i, and z_i, relative to the
source in the room.
When an audio signal is transmitted in a room, the signal is distorted by the
presence of reflecting boundaries. One scheme to minimize this distortion, from a
source to a specific position, is to introduce an equalizing filter that is an inverse of
the room impulse response measured between the source and the listening position.
This equalizing filter is applied to the source signal before transmitting it in a room. If
heq (n) is the equalizing filter for room response h(n), then, for perfect equalization
h_eq(n) ⊗ h(n) = δ(n), where ⊗ is the convolution operator and δ(n) (equal to 1 for
n = 0 and 0 for n ≠ 0) is the Kronecker delta function. However, two problems arise due to this
approach: (i) the room response is not necessarily invertible (i.e., it is not minimum
phase), and (ii) designing an equalizing filter for a specific position will introduce
a poor equalization performance at other positions in a room. In other words, the
multiple point equalization cannot be achieved by an equalizing filter that is designed
for equalizing the response at only one location.
A classic multiple location equalization technique is to average the room re-
sponses and invert the resulting minimum-phase part to form the equalizing filter.
Elliott and Nelson [64] propose a least squares method for designing an equalization
filter for a sound reproduction system by adjusting the filter coefficients to mini-
mize the sum of the squares of the errors between the equalized signals at multi-
ple points in a room and the delayed version of an electrical signal applied to a
loudspeaker. In [65], Mourjopoulos proposes a technique of using a spatial equal-
ization library, based on the position of a listener, for equalizing the response at the
listener position. The library is formed via vector quantization of room responses.
Miyoshi and Kaneda [67] present an “exact” equalization of multiple point room
responses. Their argument is based on the MINT (multiple-input/multiple-output in-
verse theorem) which requires that the multiple room responses have uncommon

zeros among them. A multiple point equalization algorithm using common acousti-
cal poles is demonstrated by Haneda et al. [68]. Fundamentally, the aforementioned
multiple point equalization algorithms are based on a linear least squares approach.
Weiss et al. [62] proposed an efficient and effective multirate signal processing-based
approach for performing equalization at the listeners’ ear positions.
Thus, the main objective of room equalization is the formation of an inverse fil-
ter, heq (n), that compensates for the effects of the loudspeaker and room that cause
sound quality degradation at a listener position. In other words, the goal is to satisfy
heq (n) ⊗ h(n) = δ(n), where ⊗ denotes the convolution operator and δ(n) is the
Kronecker delta function. Because it is well established that room responses change
with source (i.e., loudspeaker) and listener locations in a room [11, 63], clearly, due
to the variations in the impulse responses, between positions, equalization has to
be done simultaneously such that the goal is satisfied at all listening positions. In
practice an ideal delta function is not achievable with low filter orders as room re-
sponses are nonminimum-phase. Furthermore, from a psychoacoustic standpoint, a
target curve, such as a low-pass filter having a reasonably high cutoff frequency is
generally applied to the equalization filter (and hence the equalized response) to pre-
vent the played-back audio from sounding exceedingly “bright”. An example of a
low-pass cutoff frequency is the frequency where the loudspeaker begins its high-
frequency roll-off in the magnitude response. Additionally, the target curve may also
be customized according to the size and/or the reverberation time of the room. A
high-pass filter may also be applied to the equalized response, depending on the
loudspeaker size and characteristics (e.g., a satellite channel loudspeaker), in order to
minimize distortions at low frequencies. Examples of environments where multiple
listener room response equalization is used are in home theater (e.g., a multichannel
5.1 system), automobile, movie theaters, and the like.

5.2 Background
To understand the effects of single location equalization on other locations, consider
a simple first-order specular room reflection model as follows (with the assumption
that the response at the desired location for equalization is invertible). Let the impulse
responses, h1 (n) and h2 (n), from a source to two positions 1 and 2 be represented
as

h_1(n) = δ(n) + α_2 δ(n − 1);   |α_2| < 1
h_2(n) = δ(n) + β_2 δ(n − 1);   β_2 ≠ α_2    (5.1)

This first-order reflection model is valid, for example, when the two positions are
located along the same radius from a source, and each position has a differently
absorbing neighboring wall with negligible higher-order reflections from each wall.
For simplicity, the absorption due to air and the propagation delay nd in samples
(nd ≈ fs r/c; r is the distance, fs is the sampling rate and c is the speed of sound
which is temperature dependent) is ignored in this model. Ideal equalization at
position 1 is achieved if the equalizing filter, h_eq(n), is

h_eq(n) = (−α_2)^n u(n)    (5.2)

because heq (n) ⊗ h1 (n) = δ(n). However, the equalized response at position 2 can
be easily shown to be

h_eq(n) ⊗ h_2(n) = δ(n) − (α_2 − β_2)(−α_2)^{n−1} u(n − 1)    (5.3)

where u(n) = 1, n ≥ 0 is the discrete step function. There are two objective mea-
sures of equalization performance for position 2: (i) frequency domain error function
(used subsequently in the chapter), and (ii) time domain error function. The time do-
main error function is easy to compute for the present problem, and is defined as

ε = (1/I) Σ_{n=0}^{I−1} e²(n) = (1/I) Σ_{n=0}^{I−1} (δ(n) − h_eq(n) ⊗ h_2(n))²

  = ((α_2 − β_2)² / I) Σ_{n=1}^{I−1} (−α_2)^{2n−2}    (5.4)

Clearly, the response at position 2 is unequalized because ε > 0. A plot of the error
as a function of the distance |α2 − β2 | between the two coefficients, α2 and β2 , that
differentiate the two responses is shown in Fig. 5.1. Hence, the error is reduced at
position 2 if a good equalizer is designed that accounts for the changes in the room
response due to variations in the source and listening positions.
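The first-order model of (5.1)–(5.4) is easy to verify numerically. A small sketch follows; the values of α_2 and β_2 (and the names eq1, eq2, err) are illustrative assumptions, not measured data.

```python
import numpy as np

# First-order reflection model of Eq. (5.1); alpha2 and beta2 are
# illustrative values, not measured data.
alpha2, beta2, I = 0.6, 0.4, 64
h1 = np.array([1.0, alpha2])        # h1(n) = delta(n) + alpha2*delta(n-1)
h2 = np.array([1.0, beta2])         # h2(n) = delta(n) + beta2*delta(n-1)

# Ideal equalizer for position 1 (Eq. 5.2), truncated to I taps:
# heq(n) = (-alpha2)^n u(n)
heq = (-alpha2) ** np.arange(I)

delta = np.zeros(I)
delta[0] = 1.0

# Position 1: heq * h1 recovers the Kronecker delta (perfect equalization)
eq1 = np.convolve(heq, h1)[:I]
# Position 2: the residual of Eq. (5.3) remains, giving the error of Eq. (5.4)
eq2 = np.convolve(heq, h2)[:I]
err = np.mean((delta - eq2) ** 2)   # epsilon > 0: position 2 stays unequalized
```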

5.3 Single-Point Room Response Equalization


Neely and Allen [69] discuss a method of using a minimum-phase inverse filter,
wherein the minimum-phase inverse filter is obtained by inverting the causal and
stable minimum-phase part of the room impulse response. If the room response is
minimum-phase, then perfect (flat) equalization is achieved. However, if the room
response is nonminimum-phase, then the resulting minimum-phase inverse filter
produces an equalization that contains an audible tone (with a test speech signal).
The paper also proposes a method for determining whether the room response
is minimum-phase or nonminimum-phase. They state that the Nyquist criterion, used
to detect nonminimum-phase zeros, provides a necessary and sufficient condition for
determining whether the room response is nonminimum-phase.
Radlović and Kennedy in [70], propose a minimum-phase and all-pass decom-
position of a room impulse response. The minimum-phase component is obtained by
a combined means of homomorphic processing (cepstrum analysis) and an iterative
algorithm. The authors argue that equalizing the phase response in a mixed phase
room impulse response is important along with the magnitude equalization, because

phase distortions can audibly degrade speech. By using the concept of matched fil-
tering they are able to objectively minimize the total equalization error (magnitude
and phase).
Some recent literature on spectral modeling using psychoacoustically motivated
filters for single-position equalization can be found in [71, 72, 73].

5.4 Multiple-Point (Position) Room Response Equalization


Miyoshi and Kaneda [67] present an “exact” equalization of multiple-point room
responses. Their argument is based on the MINT (multiple-input/multiple-output in-
verse theorem) which requires that the multiple room responses have uncommon
zeros among them. Clearly this is a limiting approach, as uncommon zeros between
room responses cannot be guaranteed. The authors take precautions for avoiding
common zeros between room responses in their experiments. This is done by avoid-
ing symmetrical positions of the microphones and loudspeakers that are used to mea-
sure the room responses in a room.
Chang [74] proposes a universal approximator, such as a neural network, for
equalizing the combined loudspeaker and room response system. The authors sug-
gest that the loudspeaker’s nonlinear transfer function necessitates the use of a non-
linear inverse filter (neural network). The authors use a time plot to show the low
equalization error obtained on using a neural network.

Fig. 5.1. Equalization error at position 2 as a function of the “separation” between the re-
sponses at position 1 and 2.

A technique of inverting a mixed-phase response via a least squares approach is
presented in [75]. Ideally an inverted version of the room response convolved with
the response should provide a Kronecker delta function. However, because room re-
sponses are mixed-phase, the room response can be decomposed into a stable causal
part, and a noncausal part. Because this noncausal part is unavailable for real-time
computation, the author proposes a Levinson algorithm for determining an optimal
inverse filter. The objective function for minimization in the Levinson algorithm is


J = Σ_{k=0}^{M−p} (δ(k − p) − y(k))²    (5.5)

where p is the modeling delay, M is the duration of the room response h(k), and
y(k) = h(k) ⊗ hi (k) (⊗ denotes the linear convolution operator, and hi (k) is the
causal and finite duration inverse filter of h(k)).
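The least-squares inverse of (5.5) can be sketched directly with a convolution matrix. Here a generic least-squares solver stands in for the Levinson algorithm the text mentions; the recursion exploits the Toeplitz structure for efficiency, but the minimizer is the same.

```python
import numpy as np

def ls_inverse(h, Li, p):
    """Least-squares inverse filter of length Li for response h with
    modeling delay p: minimizes J = sum_k (delta(k - p) - y(k))^2,
    y = h * hi, cf. Eq. (5.5)."""
    M = len(h)
    # Convolution (Toeplitz) matrix: column j holds h delayed by j samples
    H = np.zeros((M + Li - 1, Li))
    for j in range(Li):
        H[j:j + M, j] = h
    d = np.zeros(M + Li - 1)
    d[p] = 1.0                      # delayed Kronecker delta target
    hi, *_ = np.linalg.lstsq(H, d, rcond=None)
    return hi
```

For a minimum-phase toy response the residual is essentially zero; for mixed-phase responses the modeling delay p trades off causality against error, as discussed above.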
Elliott and Nelson [64] propose a method for designing an equalization filter for
a sound reproduction system by adjusting the filter coefficients to minimize the sum
of the squares of the errors between the equalized responses at multiple points in a
room and the delayed version of an electrical signal. Basically, the objective function
is expressed as a square of the instantaneous error signal, where the error signal is the
difference between a delayed replica of the electrical signal which is supplied as an
input to a channel with given room response, and an output signal. The disadvantage
of this approach is the relatively limited equalization performance due to the equal
weighting provided to all the responses when designing the equalization filter.
Haneda et al. [68] propose a room response model, the CAPZ model (common
acoustical pole and zero model), that is causal and stable. The authors suggest that
there exist common poles in a room transfer function (i.e., the Fourier transform of
the room impulse response) irrespective of the measurement position of the room
response within a room. A multiple-point equalization filter comprising the common
acoustical poles is then determined via a linear least squares method.
The RMS spatial averaging method is used widely due to its simplicity for com-
puting the equalization filter and the spatial average of measured responses is given
by:


H_avg(e^{jω}) = √( (1/N) Σ_{i=1}^{N} |H_i(e^{jω})|² )    (5.6)

H_eq(e^{jω}) = H_avg^{−1}(e^{jω})

where N is the number of listening positions, with responses Hi (ejω ), that are to be
equalized.
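A minimal sketch of the RMS averaging of (5.6); the FFT size and the small regularizer eps are assumptions of the example, not part of the method as stated.

```python
import numpy as np

def rms_average_equalizer(responses, n_fft=512, eps=1e-12):
    """RMS spatial average of N measured responses and the resulting
    magnitude equalization filter (Eq. 5.6).

    responses : iterable of impulse responses h_i(n)
    Returns (H_avg, H_eq) sampled on n_fft frequency bins.
    """
    spectra = np.array([np.fft.fft(h, n_fft) for h in responses])
    H_avg = np.sqrt(np.mean(np.abs(spectra) ** 2, axis=0))  # RMS average
    H_eq = 1.0 / (H_avg + eps)      # magnitude-inverse equalizer
    return H_avg, H_eq
```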
The following section presents a novel pattern recognition technique for grouping
responses having “similar” acoustical structure and subsequently forming a general-
ized representation that models the variations of responses between positions. The
generalized representation is then used for designing an equalization filter.

5.5 Designing Equalizing Filters Using Pattern Recognition


5.5.1 Review of Cluster Analysis in Relation to Acoustical Room Responses
From a broad perspective, clustering algorithms group data that have a high degree
of similarity into classes or clusters having centroids. Clustering techniques typically
use an objective function, such as a sum of squared distances from the centroids,
and seek a grouping (viz., cluster formation) that extremizes this objective func-
tion [76]. In particular, clustering refers to assigning data, such as room responses
{hi (n); i = 1, 2, . . . , M ; n = 0, 1, . . . , N − 1}, from a data universe X d com-
prising a collection of room responses, such that X d is optimally or suboptimally
partitioned into c clusters (1 < c < M, M < N ). The trivial case of c = 1
denotes a rejection of the hypothesis that there are clusters in the data comprising
the room responses, whereas c = M constitutes the case where each data vector

hi = (hi (0), hi (1), . . . , hi (N − 1))T is in a cluster by itself. Upon clustering, the
room responses bearing strong similarity to each other are grouped in the same clus-
ter. The similarity between the room responses is used to determine the cluster cen-
troids, and these centroids are then used as a model for the data in their respective
clusters. A similarity measure widely used in clustering is the Euclidean distance
between pairs of room responses. If the clustering algorithm yields clusters that are
well formed, the Euclidean distance between data in the same cluster is significantly
less than the distance between data in different clusters.

5.5.2 Fuzzy c-means for Determining the Prototype


In the hard c-means clustering algorithm, a room response, hj , can strictly belong to
one and only one cluster. This is accomplished by the binary membership function
µi (hj ) ∈ {0, 1} which indicates the presence or absence of the response hj within a
cluster i.
However, in fuzzy clustering, a room response hj may belong to more than one
cluster by different “degrees”. This is accomplished by a continuous membership
function µi (hj ) ∈ [0, 1]. The motivation for using the fuzzy c-means clustering ap-
proach can be best understood from Fig. 5.2, where the direct path component of the
response associated with position 3 is similar (in the Euclidean sense) to the direct
path component of the response associated with position 1 (because positions 1 and
3 are at same radial distance from the loudspeaker). Furthermore, it is likely that
the reflection components at the position 3 response will be similar to the reflection
components of the position 2 response (due to the proximity of these two positions
relative to each other). Thus, it is clear that if responses at positions 1 and 2 are clus-
tered separately into two different clusters, then the response at position 3 should
belong to both clusters to some degree. Thus, this clustering approach permits an
intuitively reasonable model for centroid formation.
The centroids and membership functions, as given in [77, 76], are determined by
ĥ_i = Σ_{k=1}^{M} (μ_i(h_k))² h_k  /  Σ_{k=1}^{M} (μ_i(h_k))²

μ_i(h_k) = [ Σ_{j=1}^{c} (d²_{ik} / d²_{jk}) ]^{−1} = 1 / ( d²_{ik} Σ_{j=1}^{c} (1 / d²_{jk}) ),

d²_{ik} = ‖h_k − ĥ_i‖²,   i = 1, 2, …, c;  k = 1, 2, …, M    (5.7)

Fig. 5.2. Motivation for using fuzzy c-means clustering for room acoustic equalization.

where ĥi denotes the ith cluster room response centroid. An iterative optimization
procedure proposed by Bezdek [76] was used for determining the quantities in (5.7).
Care was taken to ensure that the minimum phase room responses were used to form
the centroids so as to avoid undesirable time and frequency domain effects, due to
incoherent linear combination, resulting from using the excess phase parts.
Once the centroids are formed from minimum-phase responses, they are com-
bined to form a single final prototype. One approach to do this is by using the fol-
lowing model,
h_final = Σ_{j=1}^{c} ( Σ_{k=1}^{M} (μ_j(h_k))² ) ĥ_j  /  Σ_{j=1}^{c} ( Σ_{k=1}^{M} (μ_j(h_k))² )    (5.8)

The final prototype (5.8) is formed from a nonuniform weighting of the cluster
membership functions. Specifically, the “heavier” the weight of a cluster j, in terms
of the fuzzy membership sum Σ_{k=1}^{M} (μ_j(h_k))², the larger is the contribution of
the corresponding centroid ĥ_j in the formation of the prototype and the subsequent
multiple position equalization filter.
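The alternating updates of (5.7) and the prototype of (5.8) can be sketched as follows. This is a minimal illustration (fuzzifier m = 2, random initialization, fixed iteration count), not the iterative optimization procedure of Bezdek [76] as actually used by the authors.

```python
import numpy as np

def fuzzy_cmeans(H, c, n_iter=50, eps=1e-9):
    """Fuzzy c-means (fuzzifier m = 2) over room responses, following
    Eq. (5.7). H is (M, N): M responses of length N. Returns the
    (c, N) centroids and the (c, M) membership matrix."""
    M = H.shape[0]
    rng = np.random.default_rng(0)
    U = rng.random((c, M))
    U /= U.sum(axis=0)                                   # init memberships
    for _ in range(n_iter):
        W = U ** 2
        centroids = (W @ H) / W.sum(axis=1, keepdims=True)   # Eq. (5.7), top
        d2 = ((H[None] - centroids[:, None]) ** 2).sum(-1) + eps
        U = (1.0 / d2) / (1.0 / d2).sum(axis=0)              # Eq. (5.7), bottom
    return centroids, U

def final_prototype(centroids, U):
    """Membership-weighted combination of the centroids (Eq. 5.8)."""
    w = (U ** 2).sum(axis=1)
    return (w[:, None] * centroids).sum(axis=0) / w.sum()
```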
The multiple listener equalization filter can subsequently be obtained by
determining the minimum-phase component, h_min,final, of the final prototype h_final =
h_min,final ⊗ h_ap,final (h_ap,final is the all-pass component), where the minimum-phase
sequence h_min,final is obtained from the cepstrum of h_final. It is noted that
the final prototype, h_final, need not be minimum-phase because the linear
combination of minimum-phase signals need not be minimum-phase.
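The cepstral extraction of the minimum-phase component can be sketched with the standard real-cepstrum folding; the FFT size is an assumption chosen to limit cepstral aliasing, and the small offset inside the log guards against zeros of the spectrum.

```python
import numpy as np

def minimum_phase(h, n_fft=1024):
    """Minimum-phase counterpart of a (possibly mixed-phase) sequence
    via the real cepstrum: same magnitude spectrum, minimum phase."""
    H = np.fft.fft(h, n_fft)
    cep = np.fft.ifft(np.log(np.abs(H) + 1e-12)).real   # real cepstrum
    # Fold: keep c[0], double the positive-time part, zero the rest
    fold = np.zeros(n_fft)
    fold[0] = cep[0]
    fold[1:n_fft // 2] = 2.0 * cep[1:n_fft // 2]
    fold[n_fft // 2] = cep[n_fft // 2]
    return np.fft.ifft(np.exp(np.fft.fft(fold))).real[:len(h)]
```

For example, the mixed-phase pair [0.5, 1] and the minimum-phase pair [1, 0.5] share a magnitude spectrum, so both map to [1, 0.5].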

5.5.3 Cluster Validity Index

One approach to determine the optimal number of clusters c∗ , based on given data,
is to use a validity index κ. One example of a validity index, that is popular in the
pattern recognition literature, is the Xie–Beni cluster validity index κXB [78]. This
index is expressed as

1 
c M

κXB = (µj (hk ))2 ĥj − hk 22 (5.9)
N β j=1
k=1
∗ ∗
β = min ĥi − ĥj 22
i=j

The term included with the double summation is simply the objective function
used in fuzzy c-means clustering, whereas the denominator term β analyzes the inter-
cluster centroid distances. The larger this distance, the better is the cluster separation
and thus the lower is the Xie–Beni index.
Thus, the clustering process involves (i) choosing the number of clusters, c, initially
to be 2; (ii) performing fuzzy clustering and determining the centroid positions
according to Eq. (5.7); (iii) determining κ_XB via Eq. (5.9); (iv) increasing the number
of clusters by unity and repeating steps (ii) to (iv) until c = M − 1; and (v) plotting
κ_XB as a function of the number of clusters, where the minima of this plot will
provide the optimal number of clusters, c*, according to this index. Typically, κ_XB
returns the optimal number of clusters c* ≪ M for applications involving very large
data sets [79]. In such cases the plot of κ_XB versus c increases beyond c* and then
monotonically decreases towards c = M − 1. The prototype is then formed from the
centroids for c* via (5.8).
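Eq. (5.9) translates directly to code; a minimal sketch, where H holds the M responses row-wise and U the fuzzy memberships from the clustering step.

```python
import numpy as np

def xie_beni(H, centroids, U):
    """Xie-Beni cluster validity index (Eq. 5.9): fuzzy within-cluster
    scatter divided by the data size times the minimum squared
    inter-centroid distance. Lower values indicate better separation."""
    N = H.shape[0]
    c = centroids.shape[0]
    d2 = ((H[None] - centroids[:, None]) ** 2).sum(-1)   # (c, M) distances
    num = ((U ** 2) * d2).sum()                          # fuzzy scatter
    beta = min(((centroids[i] - centroids[j]) ** 2).sum()
               for i in range(c) for j in range(c) if i != j)
    return num / (N * beta)
```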
It is to be noted that the equalization filter computed simply by using the fuzzy c-
means approach is generally high-order (i.e., at most the length of the room impulse
response). Thus, a technique for mapping the large filter lengths to a smaller length
is introduced in the subsequent section.

5.5.4 Multiple Listener Room Equalization with Low Filter Orders

Linear predictive coding (LPC) [80, 81] is used widely for modeling speech spectra
with a fairly small number of parameters. It can also be used for modeling room
responses in order to form low-order equalization filters.
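A compact LPC sketch via the autocorrelation method and the Levinson–Durbin recursion; this is the generic textbook recursion, not the authors' code, and the sign conventions assume a model 1/A(z) with A(z) = 1 + a_1 z^{-1} + … + a_p z^{-p}.

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients via the autocorrelation method and the
    Levinson-Durbin recursion; returns a = [1, a1, ..., a_order] so
    that 1/A(z) models the spectrum of x (an all-pole parameterization
    suitable for low-order equalization filters)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / err
        a[1:i] = a[1:i] + k * a[1:i][::-1]   # update interior coefficients
        a[i] = k
        err *= (1.0 - k * k)                 # shrink the prediction error
    return a
```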
In addition, in order to obtain a better fit of a low-order model to a room response,
especially in the low-frequency region of the room response spectrum, the concept
of warping was introduced by Oppenheim et al. in [82]. Warping involves the use
of a chain of all-pass blocks, D1 (z), instead of conventional delay elements z −1 , as
shown in Fig. 5.3. With an all-pass filter, D1 (z), the frequency axis is warped and
Fig. 5.3. The structure for implementing warping.

the resulting frequency response is obtained at nonuniformly sampled points along
the unit circle. Thus, for warping, the axis transformation is achieved by

D_1(z) = (z^{−1} − λ) / (1 − λ z^{−1})    (5.10)

The group delay of D_1(z) is frequency-dependent, so that positive values of the
warping coefficient λ yield higher frequency resolutions in the original response
for low frequencies, whereas negative values of λ yield higher resolutions for high
frequencies. The cascade of all-pass filters results in an infinite duration sequence,
hence typically a window is employed that truncates this infinite duration sequence
to a finite duration to yield an approximation.
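The warped frequency axis induced by D_1(z) of (5.10) is the negated phase of D_1(e^{jω}); a small sketch (the helper name is an assumption of the example):

```python
import numpy as np

def warped_frequency(omega, lam):
    """Warped frequency axis induced by the all-pass D1(z) of Eq. (5.10),
    computed as the negated phase of D1(e^{jw}). Positive lam stretches
    low frequencies (higher resolution there); negative lam stretches
    high frequencies."""
    z = np.exp(-1j * np.asarray(omega, dtype=float))
    D1 = (z - lam) / (1.0 - lam * z)
    return np.mod(-np.angle(D1), 2.0 * np.pi)   # map into [0, 2*pi)
```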
Smith and Abel [83] proposed a bilinear conformal map based on the all-pass
transformation (5.10) that achieves a frequency warping nearly identical to the Bark
frequency scale (also called the critical band rate) [84, 26]. They found a closed-form
expression that related the warping coefficient of the all-pass transformation to the
sampling frequencies fs ∈ (1 kHz, 54 kHz] that achieved this psychoacoustically
motivated warping transformation. Specifically it was shown that

λ = 0.8517 √(arctan(0.06583 f_s / 1000)) − 0.1916    (5.11)
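Eq. (5.11) in code; the square root over the arctangent term is assumed here, since it reproduces the λ = 0.77 value the text quotes for f_s = 48 kHz (without it the formula would give roughly 0.89).

```python
import numpy as np

def bark_warp_lambda(fs):
    """Warping coefficient that approximates the Bark scale (Eq. 5.11),
    valid for sampling rates fs in (1 kHz, 54 kHz]."""
    return 0.8517 * np.sqrt(np.arctan(0.06583 * fs / 1000.0)) - 0.1916
```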

In the details that follow, λ = 0.77 (for f_s = 48 kHz) was used. The warping induced
between two frequency axes by Eq. (5.10) is
depicted in Fig. 5.4 for different values of the warping coefficients λ. The frequency

Fig. 5.4. Frequency warping for various values of λ.

resolution plot for different warping coefficients, λ, is shown in Fig. 5.5. It can be
seen that the warping to the Bark scale for λ = 0.77 gives a “balanced” mapping be-
cause it provides a good resolution at low frequencies while retaining the resolution
at mid and high frequencies (e.g., compare with λ = 0.99). Some recent literature
on spectral modeling using warping can be found in [71, 72, 73].
The general system-level approach for determining the cluster-based multiple
listener equalization filter is shown in Fig. 5.6. Specifically, the room responses are
initially warped to the psychoacoustical Bark scale. As later shown, the Xie–Beni
cluster validity index gives an indication of the number of clusters that are generated
for the given data set (particularly for the case where the number of data samples,
M , is relatively small). Subsequently, the number of clusters is used for performing
clustering in order to determine the cluster centroids and prototype, respectively. The
minimum-phase part of the prototype, having length N ≫ 2, is then parameterized
by a low-order model, such as the LPC, for realizable implementation. The inverse
filter is then found from the LPC coefficients, and the reverse step of unwarping is
performed to obtain the filter in the linear domain. The equalization performance
can then be assessed by inspecting the equalized responses along a log frequency
domain.
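The modeling and inversion steps of this pipeline can be sketched as follows. This is a minimal illustration only (the function names and the Levinson–Durbin helper are ours, and the warping/unwarping stages are omitted), assuming numpy:

```python
import numpy as np

def lpc(r, order):
    """Levinson-Durbin recursion: returns the prediction polynomial
    A(z) = 1 + a1 z^-1 + ... + ap z^-p and the residual energy e,
    given the autocorrelation sequence r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= 1.0 - k * k
    return a, e

def lpc_inverse_filter(h_min, order):
    """Fit an all-pole model to the minimum-phase prototype h_min and
    return the (FIR) inverse filter A(z), gain-normalized."""
    r = np.correlate(h_min, h_min, mode="full")[len(h_min) - 1:]
    a, e = lpc(r, order)
    return a / np.sqrt(e)

# example: the minimum-phase prototype 0.5**n is inverted exactly at order 1
h = 0.5 ** np.arange(200)
w = lpc_inverse_filter(h, 1)
print(np.round(np.convolve(h, w)[:3], 3))
```

Because the LPC model is all-pole, its inverse A(z) is a short FIR filter, which is what makes the low-order realizable implementation possible.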

5.6 Visualization of Room Acoustic Responses


Visualizing information generated by a system is important for understanding its
characteristics. Frequently, signals in such systems are multidimensional and need
to be displayed on a two-dimensional or three-dimensional display for facilitating
110 5 Multiple Position Room Response Equalization

visual analysis. Thus, feature extraction and dimensionality reduction tools have be-
come important in pattern recognition and exploratory data analysis [85]. Dimen-
sionality reduction can be done by generating a lower-dimensional data set, from a
higher-dimensional data set, in a manner to preserve the distinguishing characteris-
tics of the higher-dimensional data set. The goal of dimensionality reduction then is
to allow data analysis by assessment of the clustering tendency, if any, in order to
identify the number of clusters.
The Sammon map is used as a tool for visualizing the relationship between room
responses measured at multiple positions. This depiction of the room responses can
also aid in the design of an equalization filter for multiple position equalization.
Subsequently, the performance of this multiple-listener equalization filter, in terms
of the uniformity of the equalized responses, can also be evaluated using the Sammon
map. This chapter expands on the work from the last chapter, by comparing the Xie–
Beni cluster validity index with the Sammon map, to show that the Sammon map can
be used for determining clusters of room responses when the number of responses is
relatively small as in the present case.

5.7 The Sammon Map


In 1969, Sammon [86] introduced a method to map multidimensional data onto lower
dimensions (e.g., 2 or 3). The main property of the Sammon map is that it retains the
geometrical distances between signals, from multidimensional space, on the two-
dimensional or three-dimensional space [87].

Fig. 5.5. The frequency resolution with different warping coefficients λ.



Fig. 5.6. System for determining the multiple listener equalization filter based on perceptual pattern recognition.

Consider {hi(n), n = 0, 1, . . . , N − 1} to be the room responses associated
with locations i = 1, . . . , M (M ≥ 2), where each of these responses is of duration
N ≫ 2 samples. Let dij = ‖hi(n) − hj(n)‖₂ be the Euclidean distance between
the room responses at positions i and j, respectively. Let {ri(l)}, l ∈ {0, 1} be the
location of the image of {hi(n)} on a two-dimensional display. The goal is to position
the {ri(l)} (i = 1, . . . , M) onto the display in such a way that all their mutual
Euclidean distances ‖ri(l) − rj(l)‖₂ approximate the corresponding distances dij.
Thus, distances in multidimensional spaces are mapped to approximately equivalent
distances in two dimensions via the Sammon mapping.
The objective function, E, that governs the adaptive Sammon map algorithm to
converge to a locally optimal solution (where distances are approximated) is given
as

E = [1 / (Σ_{i=1}^{M} Σ_{j>i} dij)] Σ_{i=1}^{M} Σ_{j>i} [(dij − ‖ri(l) − rj(l)‖₂)² / dij]    (5.12)

Fundamentally, it is desired to adjust the ri(l) ∈ ℝ² so as to minimize the
objective function E by a gradient descent scheme. Once a locally optimal solution is
found, the ri(l)s are configured on a two-dimensional plane such that the relative
distances between the different hi(n) are visually discernible. In this chapter, Sammon
mapping for M = 6 responses was performed, but the technique can be easily
adapted for more responses.
With the following notation,

φ = Σ_{i=1}^{M} Σ_{j>i} dij

rp = (rp(0), rp(1))^T    (5.13)

d*pj = ‖rp − rj‖₂

the gradient descent algorithm, at iteration m, for determining rp is given by

rp^{(m+1)}(l) = rp^{(m)}(l) − α [∂E(m)/∂rp(l)] / |∂²E(m)/∂rp²(l)| ;  l = {0, 1}    (5.14)

∂E(m)/∂rp(l) = −(2/φ) Σ_{j≠p} [(dpj − d*pj)/(dpj d*pj)] (rp(l) − rj(l))

∂²E(m)/∂rp²(l) = −(2/φ) Σ_{j≠p} (1/(dpj d*pj)) { (dpj − d*pj) − [(rp(l) − rj(l))²/d*pj] [1 + (dpj − d*pj)/d*pj] }

where α was set to 0.3.¹


In essence, the Sammon map, which is a nonlinear projection algorithm belonging
to the class of metric multidimensional scaling (MDS) algorithms ([88, 89]),
permits visualization of class or cluster distributions of multidimensional data on a
2-D plane. The computational complexity of the map can be fairly high if the number
of data points, M, is large because the objective function (5.12) is based on O(M²)
distances. For small M, as in this chapter, the Sammon mapping imposed negligible
computational requirements and results were obtained in less than half a minute
on a Pentium IV 2.66 GHz. Speedups of the Sammon algorithm can be found, for
example, in [90].
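The iteration of Eqs. (5.12)–(5.14) can be sketched directly in code. The following is a minimal numpy illustration (function names are ours) of the diagonal-Newton descent with the magic factor α = 0.3, assuming the input responses are pairwise distinct so that no dij vanishes:

```python
import numpy as np

def sammon_stress(X, R):
    """Sammon stress E of Eq. (5.12) for data X (M x N) and map R (M x 2)."""
    iu = np.triu_indices(len(X), 1)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))[iu]
    Ds = np.sqrt(((R[:, None, :] - R[None, :, :]) ** 2).sum(-1))[iu]
    return ((D - Ds) ** 2 / D).sum() / D.sum()

def sammon_map(X, n_iter=300, alpha=0.3, seed=0):
    """Map the M rows of X onto 2-D, approximately preserving all
    pairwise Euclidean distances (Sammon's algorithm)."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # d_ij
    phi = D[np.triu_indices(M, 1)].sum()
    R = rng.standard_normal((M, 2)) * 0.1 * D.max()  # random initial layout
    eps = 1e-9
    for _ in range(n_iter):
        Rold = R.copy()
        Ds = np.sqrt(((Rold[:, None, :] - Rold[None, :, :]) ** 2).sum(-1))
        for p in range(M):
            g = np.zeros(2)  # dE/dr_p(l), per coordinate l
            h = np.zeros(2)  # d2E/dr_p(l)^2, per coordinate l
            for j in range(M):
                if j == p:
                    continue
                d, ds = D[p, j], max(Ds[p, j], eps)
                diff = Rold[p] - Rold[j]
                g += (d - ds) / (d * ds) * diff
                h += ((d - ds) - diff**2 / ds * (1 + (d - ds) / ds)) / (d * ds)
            g *= -2.0 / phi
            h *= -2.0 / phi
            R[p] = Rold[p] - alpha * g / np.maximum(np.abs(h), eps)  # Eq. (5.14)
    return R
```

For a configuration that is exactly two-dimensional (e.g., planar points padded with zero coordinates), the stress should descend close to zero.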

5.8 Results

A listening arrangement is shown in Fig. 5.7, where the microphone locations for
measuring the room responses were at the center of the listener head at each posi-
tion. The distance from the loudspeaker to the listener 2 position was about 7 meters,
whereas the average intermicrophone distance in the listener arrangement was about
1 meter. The room was roughly of dimensions 10 m × 20 m × 6 m. The result-
ing measurement was captured by an omnidirectional flat response microphone and
deconvolved by the SYSid system. The chirp was transmitted around 30 times and
the resulting measurements were averaged to get a higher SNR. The omnidirectional
microphone had a substantially flat magnitude response (viz., the Countryman ISO-
MAX B6 Lavalier microphone). The loudspeaker was a center channel speaker from
a typical commercially available home theater speaker system.
First, the Sammon map is applied to the six psychoacoustically warped responses
for visualizing the responses on a 2-D plane. One of the goals of this step is to see if
the map captures any perceptual grouping between the responses. Subsequently, the
fuzzy c-means clustering algorithm is applied to the M = 6 psychoacoustically
warped room responses (each response being a vector of length 8192) and
the optimal number of clusters is determined using the Xie–Beni index (Eq. (5.9) in the
previous chapter). As shown, the Sammon map, when compared to the Xie–Beni
cluster validity index, gives a clear indication of the number of clusters that are
generated for the given data set.
¹ Sammon [86] called this a magic factor and recommended α to be ≈ 0.3 or 0.4.

Fig. 5.7. The experimental setup for measuring M = 6 acoustical room responses at six
positions in a reverberant room.

Figure 5.8 shows the responses at the six listener positions in the time domain.
Clearly, there are significant differences in the responses. Firstly, it can be visually
observed that there is a certain similarity in the time of arrival of the direct path com-
ponent of the responses at positions 1, 2, and 3. Also, noticeable is the path delay
difference between the responses at positions 4, 5, and 6 in relation to the responses
in positions 1, 2, and 3. Figure 5.9 shows the corresponding 1/3 octave smoothed
magnitude responses along a linear frequency axis in Hertz. Figure 5.10 shows the
corresponding 1/3 octave smoothed magnitude responses in the Bark domain. Specif-
ically, the x-axis (viz., the Bark axis) was computed using the expression provided in
[84],
z = 13 tan⁻¹(0.76 f/1000) + 3.5 tan⁻¹((f/7500)²)    (5.15)
where f is the frequency in Hz. A comparison between the plots of Figs. 5.9 and
5.10 shows the transformation effected by the mapping of (5.10), in the sense that low
frequencies are mapped higher, which effectively "stretches" the magnitude response.
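Eq. (5.15) is straightforward to evaluate; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def hz_to_bark(f_hz):
    """Critical band rate z (in Bark) for frequency f in Hz, Eq. (5.15)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.76 * f / 1000.0) + 3.5 * np.arctan((f / 7500.0) ** 2)

# the low-frequency octaves occupy many Bark bands, the top octaves few
print(hz_to_bark([100.0, 1000.0, 10000.0]))
```

This compressive behavior at high frequencies is exactly why plotting the responses on the Bark axis "stretches" the low-frequency detail relative to Fig. 5.9.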
The Sammon map for the warped responses is shown in Fig. 5.11. The map
shows the relative proximity of the responses at positions 1, 2, and 3 that could be

identified as a group. Table 5.1 below shows all symmetric distances, d*ij = ‖ri −
rj‖₂, between the warped responses i, j, as computed through the Sammon map on
the 2-D plane (i.e., the ith row and jth column element is the distance between the
Sammon mapped coordinates corresponding to the ith and jth position).
From Fig. 5.11 and Table 5.1, a dominant perceptual grouping of responses 1, 2,
and 3 can be seen on the map. This can be confirmed from the distance metrics in Ta-
ble 5.1 especially where the distance between responses 1, 2, and 3 are significantly

Fig. 5.8. The time domain responses at the six listener positions for the setup of Fig. 5.7.

Fig. 5.9. The corresponding 1/3 octave smoothed magnitude responses along the linear fre-
quency axis obtained from Fig. 5.8.

Fig. 5.10. The 1/3 octave smoothed magnitude responses of Fig. 5.9 along the Bark axis.

Fig. 5.11. The Sammon map for the warped impulse responses.

Table 5.1. The symmetric distances, d*ij = ‖ri − rj‖₂, between the warped responses i, j, as
computed through the Sammon map on the 2-D plane

Pos 1 2 3 4 5 6
1 0 2.0422 2.1559 2.8549 2.5985 3.0975
2 — 0 2.1774 4.6755 2.9005 4.9133
3 — — 0 3.3448 4.4522 5.1077
4 — — — 0 5.1052 3.4204
5 — — — — 0 3.2431
6 — — — — — 0

lower than the distances between these and the remaining responses. Also, note from
the last column in the table, the response from position 6 is close to responses from
positions 4 and 5 (i.e., distances of 3.42 and 3.24, respectively), whereas responses
from positions 4 and 5 are significantly far apart from each other (distance of 5.1052).
Thus, there are at least three clusters, where the first cluster is formed dominantly of
responses from positions 1, 2, and 3; the second cluster having, dominantly, response
from position 4 and the third cluster having, dominantly, response from position 5.
Finally, the response from position 6 could be grouped with either of the distinct clusters
containing the responses from positions 4 and 5. In essence, this clustering represents a
grouping of signals using the psychoacoustic scale.
Figure 5.12 shows the plot of the Xie–Beni cluster validity index as a function
of the number of clusters for the warped responses. From the plot it may be interpreted
that the optimal number of clusters is 5, even though there is no clear c* = 5
such that the index increases beyond this c* and then decreases monotonically towards
c = M − 1 = 5. This inconclusive result was partly resolved by observing the
combined plot comprising the numerator double-sum term and the denominator separation
term, β, in the Xie–Beni index. Figure 5.13 shows that the largest separation
between cluster centroids is obtained at c = 3 for a reasonably small error,
thereby indicating that c∗ = 3 is a reasonable choice for the number of clusters.
Thus, this procedure validated the results from the Sammon map (viz., Fig. 5.11 and
Table 5.1).
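Since Eq. (5.9) appears in the previous chapter, the sketch below assumes the standard form of the Xie–Beni index (fuzzy within-cluster scatter divided by N times the minimum squared centroid separation); the function name and test data are ours:

```python
import numpy as np

def xie_beni(X, V, U, m=2.0):
    """Xie-Beni cluster validity index in its standard form.
    X: (N, d) data, V: (c, d) centroids, U: (c, N) membership matrix.
    Smaller values indicate compact, well-separated clusters."""
    N = X.shape[0]
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)  # (c, N) squared distances
    compactness = (U ** m * d2).sum()
    sep2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    sep2 = sep2[~np.eye(len(V), dtype=bool)].min()       # min centroid separation^2
    return compactness / (N * sep2)
```

With compact, far-apart clusters the index is small; poorly separated centroids inflate it through the shrinking denominator, which is the behavior exploited in Fig. 5.13.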
The membership functions, µj (hk ), j = 1, 2, 3; k = 1, . . . , 6 are shown in Table
5.2 (Ci corresponds to cluster i). From the membership function table (viz., Table
5.2) it can be seen that there is a high degree of similarity among the warped re-
sponses in positions 1, 2, and 3 as they are largely clustered in cluster 1, whereas
positions 4 and 5 are dissimilar to the members of cluster 1 and each other (as they
are clustered in clusters 3 and 2, respectively, and have a low membership in other
clusters). Again warped response at position 6 has a similarity to members in all the
three clusters (which is also predicted through the Sammon map of Fig. 5.11 and the
table relating distances on the Sammon map).
In the next step, equalization is performed according to Fig. 5.6, with c* = 3
and LPC order p = 512, and the equalization performance results are depicted using the
Sammon map. An inherent goal in this step is to view the equalization performance

Fig. 5.12. The Xie–Beni cluster validity index as a function of the number of clusters for the
warped responses.

results visually, and demonstrate that the uniformity and similarity of the magnitude
responses (unequalized and equalized) can be shown on a 2-D plane using the map.
The LPC order p = 512 was selected as it gave the best results upon equalization and

Fig. 5.13. The numerator (viz., the objective function) and denominator term (viz., the sepa-
ration term) of the Xie–Beni cluster validity index of Fig. 5.12.

Table 5.2. The membership functions, µj (hk ), j = 1, 2, 3; k = 1, . . . , 6

Pos 1 Pos 2 Pos 3 Pos 4 Pos 5 Pos 6


C1 0.6724 0.7635 0.6146 0.0199 0.0382 0.3292
C2 0.1821 0.1314 0.2149 0.0185 0.9334 0.3567
C3 0.1456 0.1051 0.1705 0.9617 0.0284 0.3141

furthermore this filter order is practically realizable (viz., the equivalent FIR length
filter can be easily implemented in various commercially available audio processors.)
Figure 5.14 shows the unequalized magnitude responses (of Fig. 5.9) in the
log frequency axis, and Fig. 5.15 depicts the equalized magnitude responses using
c∗ = 3 clusters. Clearly, substantial equalization is achieved at all of the six listener
positions as can be seen by comparing Fig. 5.15 with Fig. 5.14. The equalized mag-
nitude responses were then processed by subtracting the individual means, computed
between 80 Hz and 10 kHz (which is typically the region of interest for equalization
in the room of given size), to give the zero mean equalized magnitude responses.
Under ideal equalization, all of the magnitude responses would be 0 dB between 80
Hz and 10 kHz. Hence, upon applying the Sammon map, all of the ideal equalized
responses would be located at the origin of the Sammon map. Any deviation away
from 0 dB would show up on the map as a displacement away from the origin. If the

Fig. 5.14. The 1/3 octave smoothed unequalized magnitude responses of Fig. 5.9 (shown in
the log frequency domain).

Fig. 5.15. The 1/3 octave smoothed equalized magnitude responses (shown in the log fre-
quency domain for better depiction of performance at low frequencies) using c∗ = 3 clusters.

equalized responses were uniform in distribution, then they would appear in a tight
circle about the origin in the 2-D plane after applying the Sammon map.
Now, applying the Sammon map algorithm to the original magnitude responses
of Fig. 5.9, between 80 Hz and 10 kHz, results in Fig. 5.16. Specifically, the re-
sponses in a 2-D plane for different positions show significant non-uniformity as
these are not located equidistant from the origin. Applying the mean corrected and
equalized responses to the Sammon map algorithm gives the distribution of the
equalized responses on a 2-D plane as shown in Fig. 5.17.
Comparing Fig. 5.16 with Fig. 5.17 shows an improved uniformity among the
responses as many of the responses lie at approximately the same distance from
the origin. Specifically, from Fig. 5.17 it is evident that the distances of equalized
responses 1, 2, 4, and 5 are close to each other from the origin, thereby reflecting
a larger uniformity between these responses. Furthermore, the standard deviation of
the distances of the equalized responses is much smaller than that of the unequalized
responses (viz., 4.64 as opposed to 11.06) indicating a better similarity between the
equalized responses. The improved similarity of the equalized magnitude responses
1, 2, 4, and 5 can be checked by visually comparing the equalized responses in Fig.
5.15. Also, it can be seen that the equalized magnitude responses 3 and 6 are quite
a bit different from each other, and from equalized responses 1, 2, 4, and 5, and this
reflects in the Sammon map as points 3 and 6 substantially offset from a circular
distribution.

Fig. 5.16. The Sammon map of the unequalized magnitude responses.

Fig. 5.17. The Sammon map of the equalized responses.



5.9 The Influence of Reverberation on Room Equalization


As is known, room equalization is important for delivering high-quality audio in mul-
tiple listener environments and for improving speech recognition rates. Lower-order
equalization filters can be designed at perceptually relevant frequencies through
warping. However, one of the major factors that affects equalization performance
is the reverberation of the room. In this chapter, we compare the equalization
performance of the pattern recognition method to the well-known root mean square
(RMS) averaging-based equalization using the image method [61].

5.9.1 Image Method

The room impulse response, p(t, X, X′), for the image model [61] with loudspeaker
at X = (x, y, z) and microphone at X′ = (x′, y′, z′) and room dimensions L =
(Lx, Ly, Lz) (with walls having absorption coefficient α = 1 − β²) is given as

p(t, X, X′) = Σ_{p=0}^{1} Σ_{r=−∞}^{∞} β_{x1}^{|n−q|} β_{x2}^{|n|} β_{y1}^{|l−j|} β_{y2}^{|l|} β_{z1}^{|m−k|} β_{z2}^{|m|} · δ[t − (|Rp + Rr|/c)] / (4π|Rp + Rr|)    (5.16)

p = (q, j, k)
Rp = (x − x′ + 2qx′, y − y′ + 2jy′, z − z′ + 2kz′)
r = (n, l, m)
Rr = 2(nLx, lLy, mLz)

The room image model (5.16), thus, can be simulated for different reverberation
times by adjusting the reflection coefficients (viz., the βs), because the Schroeder
reverberation time T60 is related to the absorption coefficients by the equation
T60 = 0.161V / Σ_i Si αi (Si is the surface area of wall i).
Hence, the robustness to reverberation of different equalization techniques can
be modeled by varying the reflection coefficients in the image model.
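The T60 relation can be inverted to drive such simulations; a minimal sketch (the helper is ours, and identical absorption on all six walls is assumed):

```python
import numpy as np

def beta_for_t60(t60, room_dims):
    """Uniform wall reflection coefficient beta that yields a target
    Schroeder reverberation time t60 (seconds) via
    T60 = 0.161 V / (S * alpha), with alpha = 1 - beta**2 and all six
    walls assumed identical."""
    Lx, Ly, Lz = room_dims
    V = Lx * Ly * Lz                          # room volume
    S = 2.0 * (Lx * Ly + Ly * Lz + Lx * Lz)   # total wall surface area
    alpha = 0.161 * V / (S * t60)
    if not 0.0 < alpha <= 1.0:
        raise ValueError("target T60 is not achievable for this room")
    return np.sqrt(1.0 - alpha)

# e.g. the 8 m x 8 m x 4 m room simulated in Section 5.9.3
print(beta_for_t60(0.5, (8.0, 8.0, 4.0)))
```

Sweeping t60 and feeding the resulting β into the image model produces the family of responses used for the robustness comparison.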

5.9.2 RMS Average Filters

The RMS average filter (as used traditionally for movie theater equalization) is obtained
as

Havg(e^{jω}) = √( (1/N) Σ_{i=1}^{N} |Hi(e^{jω})|² )    (5.17)

Heq(e^{jω}) = [Havg(e^{jω})]^{−1}

where |Hi(e^{jω})| is the magnitude response at position i. To obtain lower-order filters,
we used the approach as shown in Fig. 5.6 (but using RMS averaging instead of fuzzy
c-means prototype formation).
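A minimal numpy sketch of Eq. (5.17) follows (the lower-order fitting stage of Fig. 5.6 is omitted, and the inverse is taken as a magnitude-only target; the small floor on Havg is our addition to guard against division by zero at deep nulls):

```python
import numpy as np

def rms_average_eq(impulse_responses, nfft=8192):
    """RMS-average magnitude across the N measured positions, Eq. (5.17),
    and the corresponding inverse-magnitude equalizer target."""
    H = np.array([np.abs(np.fft.rfft(h, nfft)) for h in impulse_responses])
    h_avg = np.sqrt((H ** 2).mean(axis=0))
    return h_avg, 1.0 / np.maximum(h_avg, 1e-12)
```

The averaging is done on power spectra, so a single position with a strong peak pulls the average up more than it would under plain magnitude averaging.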

5.9.3 Results

We have compared the pattern recognition and warping-based method to the RMS
averaging and warping-based method, for multiposition equalization, to determine
their robustness to reverberation variations. Ideally, it is required that the equalization
performance does not degrade significantly, when the reverberation time increases,
for (i) a fixed room, and (ii) fixed positions of the listeners in a room. The room
image model allows ease in simulating changes in responses (due to changes in re-
verberation times) thereby allowing the equalization performance of these methods
to be compared.
To quantify the equalization performance, we used the well-known spectral devi-
ation measure, σE , which indicates the degree of flatness of the spectrum. The lower
the measure, the better is the performance. The performance measure is defined as

σE = √( (1/P) Σ_{i=0}^{P−1} (10 log10 |E(e^{jωi})| − B)² )    (5.18)

B = (1/P) Σ_{i=0}^{P−1} 10 log10 |E(e^{jωi})|

|E(e^{jω})| = |H(e^{jω})| |Heq(e^{jω})|
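A minimal sketch of Eq. (5.18) (the function name is ours):

```python
import numpy as np

def spectral_deviation(mag):
    """Spectral deviation sigma_E of Eq. (5.18): the standard deviation,
    in dB, of the magnitude spectrum about its mean dB level; 0 dB
    corresponds to a perfectly flat equalized response."""
    db = 10.0 * np.log10(np.maximum(np.asarray(mag, dtype=float), 1e-12))
    return np.sqrt(((db - db.mean()) ** 2).mean())
```

A perfectly flat |E| gives 0 dB; larger values mean a less flat equalized spectrum.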

The image model was simulated for a room of volume Lx × Ly × Lz = 8 m × 8 m × 4 m,
the source speaker at (x, y, z) = (0 m, 4 m, 1.5 m), and six listeners arranged
in a rectangular configuration in front of the source with y′ ∈ [3 m, 5 m], x′ ∈ [3
m, 4 m], z′ = 1.5 m. The sampling frequency, fs, was set at 16 kHz (wideband
speech/music equalization in the 20 Hz to 8 kHz range).
Figure 5.18 shows the reverberation robustness for the equalization using the
pattern recognition method and with 512 FIR taps, whereas Fig. 5.19 shows the re-
verberation robustness using the RMS averaging method with 512 taps. The equal-
ization performance measure, σE , was determined in the 20 Hz to 8 kHz range. The
x-axis, in each plot, corresponds to the reverberation time T60 used in the simulation,
and the y-axis is σE .
It can be observed that the pattern recognition method outperforms the RMS
averaging method due to lower σE for larger T60 s at most of the six listener positions.
Only four curves can be clearly seen, because the absorption coefficients were kept
the same for all walls for a given simulation run. Thus, the symmetry induced by the
relative positioning of the source to the microphones delivered the same performance
for positions 1 and 3, as well as the same performance for positions 4 and 6.
Also, an interesting observation is that σE (an average measure) does not increase
monotonically with increasing T60 for all listener positions for both methods as some
positions are prone to better equalization.

Fig. 5.18. Performance of proposed method (x-axis: T60 , y-axis: σE ).

5.10 Summary

In this chapter some background on various single position and multiple position
equalization techniques, including the importance of performing multiple position
equalization over single position was presented. Also presented was a pattern recog-
nition method of performing simultaneous multiple listener equalization with low fil-
ter orders, and some comparisons between the RMS averaging equalization method
and the pattern recognition equalization method in terms of reverberation robustness.

Fig. 5.19. Performance of RMS averaging method (x-axis: T60 , y-axis: σE ).



A technique for visualizing room impulse responses and simultaneous multiple lis-
tener equalization performance using the Sammon map was also presented. The map
is able to display results obtained through clustering algorithms such as the fuzzy
c-means method. Specifically, distances of signals in multidimensional spaces are
mapped onto distances in two dimensions, thereby displaying the clustering behav-
ior of the proposed clustering scheme. Upon determining the equalization filter from
the final prototype, the resulting equalization performance can be determined from
the size and shape (viz., circular shape indicates uniform equalization performance
at all locations) of the equalization map.
6
Practical Considerations for Multichannel Equalization

Given a multichannel loudspeaker system, the selection of the crossover frequency


between the subwoofer and the satellite speakers is important for accurate (i.e.,
distortion-free) reproduction of playback sound. Presently, many home theater systems
have selectable crossover frequencies, which are part of the bass management
filter capabilities and are set by the consumer through listening tests. Alterna-
tively, if the loudspeakers are industry certified, the crossover frequency is set at 80
Hz. A desirable feature is that, besides distortion-free sound output from the individ-
ual subwoofer and the satellite speakers, the combined subwoofer and satellite room
acoustical response should exhibit negligible variations around the selected crossover
frequency. In this chapter, we present an automatic crossover frequency selection al-
gorithm based on an objective measure (viz., the spectral deviation measure) for
multichannel home theater applications that allows better control of the combined
subwoofer and satellite response, thereby significantly improving audio quality. Ini-
tially, some results are presented that show the effect of crossover frequency on low
frequency performance. Additional parameter optimization of the bass management
filters is shown to yield improved performance. Comparison between the results from
crossover and all-parameter optimization, of the bass management filters, for mul-
tiposition equalization is presented. As also shown, cascading an all-pass filter, or
adding in time-delays in loudspeaker channels, can provide further improvements to
the equalization result in the crossover region. Alternative techniques for fixing the
crossover blend, using a cascade of all-pass filters, are also presented.

© 2004 AES. Reprinted, with permission, from S. Bharitkar, "Phase equalization for
multi-channel loudspeaker-room responses", Proc. of AES 117th Convention, (preprint
6272).
© 2005 AES. Reprinted, with permission, from S. Bharitkar and C. Kyriakakis, "Comparison
between time delay based and nonuniform phase based equalization for multichannel
loudspeaker-room responses," Proc. of AES 119th Convention, (preprint 6607).

Fig. 6.1. A 5.1 system.

6.1 Introduction
A room is an acoustic enclosure that can be modeled as a linear system whose be-
havior at a particular listening position is characterized by an impulse response,
h(n); n ∈ {0, 1, 2, . . . } with an associated frequency response or room trans-
fer function H(ejω ). The impulse response yields a complete description of the
changes a sound signal undergoes when it travels from a source to a receiver (micro-
phone/listener). The signal at a listening position consists of direct path components,
discrete reflections that arrive a few milliseconds after the direct sound, as well as a
reverberant field component.
A typical 5.1 system and its system-level description are shown in Figs. 6.1 and
6.2, respectively, where the satellites (left, center, right, left surround, and right sur-
round speakers) are positioned surrounding the listener and the subwoofer may be
placed in the corner or near the edges of a wall. The high-pass (satellite) and low-pass
(subwoofer) bass management filters,

|H^{hp}_{bm,ωc}(ω)| = √(1 − 1/(1 + (ω/ωc)⁴))  and  |H^{lp}_{bm,ωc}(ω)| = 1/√(1 + (ω/ωc)⁸),

are Butterworth second-order high-pass (12
dB/octave roll-off) and fourth-order low-pass (24 dB/octave roll-off), respectively,
and are designed with a crossover frequency ωc (i.e., the intersection of the cor-
responding −3 dB points) corresponding to 80 Hz. Alternatively, the fourth-order
Butterworth can be implemented as a cascade of two second-order Butterworth fil-
ters which modifies the magnitude response slightly around the crossover region. If
the satellite response rolls off at a second-order rate, then the resulting response ob-
tained through complex summation has a flat magnitude response in the crossover
region. The analysis and techniques presented in this chapter can be modified in a
straightforward manner to include cascaded Butterworth or any other bass manage-
ment filter. Examples of other crossover networks that split the signal energy between
the subwoofer and the satellites, according to predetermined crossover frequency and

slopes, can be found in [91, 92, 93]. The magnitude responses of the individual bass
management filters as well as the magnitude of the recombined response (i.e., the
magnitude of the complex sum of the filter frequency responses), are shown in Fig.
6.3. If the satellite response is smooth and rolls off at a second-order Butterworth
rate, then the complex summation yields a flat magnitude response in the audio signal
pass-band. In real rooms, the resulting magnitude response from the bass management
filter set, combined with the loudspeaker and room responses, will exhibit
substantial variations in the crossover region. This effect can be mitigated by proper
selection of the crossover frequency (and/or the bass management filter orders). In
essence, the bass management filter parameter selection should be such that the sub-
woofer and the satellite channel output be substantially distortion-free with minimal
variations in the crossover region.
The acoustical block diagram for a subwoofer channel and a satellite channel is
shown in Fig. 6.4, where Hsub (ω) and Hsat (ω) are the acoustical loudspeaker and
room responses at a listening position. The resulting net acoustic transfer function,
H(ω), and magnitude response, |H(ω)|², can be written as

H(ω) = H^{hp}_{bm,ωc}(ω) Hsat(ω) + H^{lp}_{bm,ωc}(ω) Hsub(ω)

|H(ω)|² = |A(ω)|² + |B(ω)|² + Γ(ω)

|A(ω)|² = |H^{hp}_{bm,ωc}(ω)|² |Hsat(ω)|²

|B(ω)|² = |H^{lp}_{bm,ωc}(ω)|² |Hsub(ω)|²    (6.1)

Γ(ω) = 2|A(ω)||B(ω)| cos(φsub(ω) + φ^{lp}_{bm,ωc}(ω) − φsat(ω) − φ^{hp}_{bm,ωc}(ω))

Fig. 6.2. System-level description of the 5.1 multichannel system of Fig. 6.1.

Fig. 6.3. Magnitude response of the industry standard bass management filters and the recom-
bined response.

Fig. 6.4. Block diagram for the combined acoustical response at a position.

where φ^{hp}_{bm,ωc}(ω) and φ^{lp}_{bm,ωc}(ω) are the phase responses of the bass management
filters, whereas φsub(ω) and φsat(ω) are the phase responses of the subwoofer-and-room
and satellite-and-room responses, respectively.
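The role of the interaction term Γ(ω) can be illustrated numerically. The sketch below uses hypothetical equal branch magnitudes at the crossover point (the function name and values are ours):

```python
import numpy as np

def combined_mag_sq(A, B, dphi):
    """|H(w)|^2 from Eq. (6.1), given the branch magnitudes |A|, |B| and
    the net phase difference dphi = phi_sub + phi_lp - phi_sat - phi_hp."""
    return A ** 2 + B ** 2 + 2.0 * A * B * np.cos(dphi)

# hypothetical equal branch levels (about -3 dB each) at the crossover point
a = b = 1.0 / np.sqrt(2.0)
print(combined_mag_sq(a, b, 0.0))    # in phase: the branches add constructively
print(combined_mag_sq(a, b, np.pi))  # anti-phase: a deep spectral notch
```

With equal branch levels, the combined power swings between twice the branch power and (nearly) zero purely as a function of the phase difference, which is why an incorrect crossover choice can carve a broad notch out of an otherwise well-behaved response.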
However, many of the loudspeaker systems, in a real room, interact with the room
giving rise to standing wave phenomena that manifest as significant variations in the
magnitude response measured between a loudspeaker and a microphone position. As
can be readily observed from (6.1), with an incorrect crossover frequency choice the
phase interactions will show up in the magnitude response as a region with a broad
spectral notch indicating a substantial attenuation of sound around the crossover re-
gion. In this chapter, we show that a correct choice of the crossover frequency will
influence the combined magnitude response around the crossover region.

Fig. 6.5. (a) Magnitude response of the subwoofer measured in a reverberant room; (b) mag-
nitude response of the satellite measured in the same room.

As an example, individual subwoofer and satellite (in this case a center channel)
frequency responses (1/3rd octave smoothed), as measured in a room at a sampling
frequency of 48 kHz with a reverberation time T60 ≈ 0.75 sec, are shown in Figs.
6.5 (a) and (b), respectively. Clearly, the satellite is capable of playing audio below
100 Hz (down to about 40 Hz), whereas the subwoofer is most efficient and generally
used for audio playback at frequencies less than 200 Hz. For example, as shown in
Fig. 6.6, the resulting magnitude response, according to (6.1), obtained by summing
the impulse responses, has a severe spectral notch for a crossover frequency ωc
corresponding to 60 Hz. This has been verified through real measurements in which the
subwoofer and the satellite channels were excited with a broadband stimulus (e.g.,
a log-chirp signal) and the net response was subsequently deconvolved from the measured
signal.
Although room equalization has been widely used to solve problems in the mag-
nitude response, the equalization filters do not necessarily solve the problems around
the crossover frequency. In fact, many of these filters are minimum-phase and as such
may do little to influence the result around the crossover. As shown in this chapter,
automatic selection of a proper crossover frequency through an objective function al-
lows the magnitude response to be flattened in the crossover region. All-pass-based
optimization can overcome any additional limitations.

6.2 Objective Function-Based Crossover Frequency Selection


An objective function that is particularly useful for characterizing the magnitude
response is the spectral deviation measure [94, 95]. Given that the effects of the
choice of the crossover frequency are bandlimited around the crossover frequency, it
is shown that this measure is quite effective in predicting the behavior of the resulting
magnitude response around the crossover. The spectral deviation measure, σH (ωc ),
which indicates the degree of flatness of the magnitude spectrum, is defined as

σH(ωc) = √( (1/P) Σ_{i=0}^{P−1} (10 log10 |H(ωi)| − Δ)² )    (6.2)

where Δ = (1/P) Σ_{i=0}^{P−1} 10 log10 |H(ωi)|, |H(ωi)| can be found from (6.1), and P is
the number of frequency points selected around the crossover region. Specifically,
the smaller the σH(ωc) value, the flatter is the magnitude response.
For real-time applications, a typical home theater receiver includes a selectable
(either by a user or automatically as shown in this chapter) finite integer set of
crossover frequencies. For example, typical home theater receivers have selectable
crossover frequencies, in 10 Hz increments, from 20 Hz through 150 Hz (i.e.,
Ω = [20 Hz, 30 Hz, 40 Hz, . . . , 150 Hz]). Thus, although a near-optimal solution
ωc∗ can be found through a gradient descent optimization process that minimizes the
spectral deviation measure with respect to ωc (viz., by driving ∂σH(ωc)/∂ωc|ωc=ωc∗ to
zero), this is unnecessarily complicated. Clearly, the choice of the crossover frequency is limited to

Fig. 6.6. Magnitude of the net response obtained from using a crossover frequency of 60 Hz.

Fig. 6.7. Plots of the resulting magnitude response for crossover frequencies: (a) 50 Hz, (b) 60
Hz, (c) 70 Hz, (d) 80 Hz, (e) 90 Hz, (f) 100 Hz, (g) 110 Hz, (h) 120 Hz, (i) 130 Hz.

this finite set of integers (viz., as given in Ω). Hence, a simpler yet effective means
of selecting a proper crossover frequency is to characterize the effect of each of
the selectable integer crossover frequencies on the magnitude response in the
crossover region.
Figure 6.7 shows the resulting magnitude responses, as obtained via (6.1), for
different integer choices of the crossover frequencies from 50 Hz through 130 Hz.
The corresponding spectral deviation values, as a function of the crossover frequency,
for the crossover region around the crossover frequencies are shown in Fig. 6.8.
Clearly, comparing the results in Fig. 6.8 with the plots in Fig. 6.7, it can be seen that
the spectral deviation measure is an excellent measure for accurately modeling the
performance in the crossover region for a given choice of crossover frequency. The
best crossover frequency is then that which minimizes the spectral deviation measure,
in the crossover region, over the integer set of crossover frequencies. Specifically,

\omega_c^* = \arg\min_{\omega_c \in \Omega} \sigma_H(\omega_c) \qquad (6.3)

In this example 120 Hz provided the best choice for the crossover frequency as it
gave the smallest σH (ωc ).

6.3 Phase Interaction Between Noncoincident Loudspeakers


In this section, we describe, in particular, the phase interaction between the sub-
woofer and a satellite channel in a multichannel (e.g., 5.1) system. The analysis can
be extended to understand the complex additive interaction between nonsubwoofer
channels.
The nature of the phase interaction can be understood through the complex ad-
dition of frequency responses (i.e., time domain addition) from linear system theory.
Specifically, this addition is most interesting when observed through the magnitude
response of the resulting addition between the subwoofer and satellite speaker. Thus,
given the bass managed subwoofer response as H̃sub (ejω ) and bass managed satellite
response as H̃sat (ejω ), then the resulting squared magnitude response is

\begin{aligned}
|H(e^{j\omega})|^2 &= |\tilde{H}_{\mathrm{sub}}(e^{j\omega}) + \tilde{H}_{\mathrm{sat}}(e^{j\omega})|^2 \\
&= \bigl(\tilde{H}_{\mathrm{sub}}(e^{j\omega}) + \tilde{H}_{\mathrm{sat}}(e^{j\omega})\bigr)\bigl(\tilde{H}_{\mathrm{sub}}(e^{j\omega}) + \tilde{H}_{\mathrm{sat}}(e^{j\omega})\bigr)^{\dagger} \qquad (6.4)\\
&= |\tilde{H}_{\mathrm{sub}}(e^{j\omega})|^2 + |\tilde{H}_{\mathrm{sat}}(e^{j\omega})|^2 + |\tilde{H}_{\mathrm{sub}}(e^{j\omega})||\tilde{H}_{\mathrm{sat}}(e^{j\omega})|\,e^{j(\phi_{\mathrm{sub}}(\omega)-\phi_{\mathrm{sat}}(\omega))} + |\tilde{H}_{\mathrm{sub}}(e^{j\omega})||\tilde{H}_{\mathrm{sat}}(e^{j\omega})|\,e^{-j(\phi_{\mathrm{sub}}(\omega)-\phi_{\mathrm{sat}}(\omega))} \\
&= |\tilde{H}_{\mathrm{sub}}(e^{j\omega})|^2 + |\tilde{H}_{\mathrm{sat}}(e^{j\omega})|^2 + 2|\tilde{H}_{\mathrm{sub}}(e^{j\omega})||\tilde{H}_{\mathrm{sat}}(e^{j\omega})|\cos\bigl(\phi_{\mathrm{sub}}(\omega)-\phi_{\mathrm{sat}}(\omega)\bigr)
\end{aligned}

where H̃sat (ejω ) and H̃sub (ejω ) are bass managed satellite and subwoofer channel
room responses measured at a listening position in a room, and A† (ejω ) is the com-
plex conjugate of A(ejω ). The phase responses of the subwoofer and the satellite are

Fig. 6.8. Spectral deviation versus crossover frequency.



Fig. 6.9. Combined subwoofer satellite response at a particular listening position in a rever-
berant room.

given by φsub (ω) and φsat (ω), respectively. Furthermore, H̃sat (ejω ) and H̃sub (ejω )
may be expressed as

\tilde{H}_{\mathrm{sat}}(e^{j\omega}) = BM_{\mathrm{sat}}(e^{j\omega})\,H_{\mathrm{sat}}(e^{j\omega}), \qquad \tilde{H}_{\mathrm{sub}}(e^{j\omega}) = BM_{\mathrm{sub}}(e^{j\omega})\,H_{\mathrm{sub}}(e^{j\omega}) \qquad (6.5)

where BMsat(e^{jω}) and BMsub(e^{jω}) are the bass management IIR filters,
whereas Hsat(e^{jω}) and Hsub(e^{jω}) are the full-range satellite and subwoofer
responses, respectively.
The influence of phase on the net magnitude response is via the additive term
Λ(ejω ) = 2|H̃sub (ejω )||H̃sat (ejω )| cos(φsub (ω) − φsat (ω)). This term influences
the combined magnitude response, generally, in a detrimental manner when it adds
incoherently to the magnitude response sum of the satellite and the subwoofer.
Specifically, when φsub (ω) = φsat (ω) + kπ, k = 1, 3, . . . , the resulting mag-
nitude response is actually the difference between the magnitude responses of the
subwoofer and the satellite thereby, possibly, introducing a spectral notch around the
crossover frequency. For example, Fig. 6.9 shows an exemplary combined subwoofer
center channel response in a room with reverberation time of about 0.75 seconds.
Clearly, a large spectral notch is observed around the crossover, and one of the rea-
sons for the introduction of this notch is the additive term Λ(ejω ) which adds inco-
herently to the magnitude response sum. Figure 6.10 is a third octave smoothed mag-
nitude response corresponding to Fig. 6.9, whereas Fig. 6.11 shows the effect of the
Λ(ejω ) term clearly exhibiting an inhibitory effect around the crossover region due to
the phase interaction between the subwoofer and the satellite speaker response at the
listening position. The cosine of the phase difference (viz., φsub(ω) − φsat(ω)) that
causes the inhibition of the net magnitude response is shown in Fig. 6.12. Clearly,
intelligently controlling this Λ(e^{jω}) term will allow an improved net magnitude response
around the crossover.
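The complex addition in Eq. (6.4) can be illustrated with a small numerical sketch; `H_sub` and `H_sat` below are hypothetical stand-ins for the measured bass managed responses, not the data of Fig. 6.9.

```python
import numpy as np

# Hypothetical stand-ins for the bass managed responses of Eq. (6.4): a unit
# gain subwoofer and a satellite that is phase-inverted relative to it.
w = np.linspace(0.0, np.pi, 512)
H_sub = 1.0 * np.exp(-3j * w)                    # linear-phase subwoofer
H_sat = 0.8 * np.exp(1j * (-3.0 * w + np.pi))    # satellite, pi out of phase

# The interaction term Lambda: what the plain power sum misses.
Lam = (2 * np.abs(H_sub) * np.abs(H_sat)
       * np.cos(np.angle(H_sub) - np.angle(H_sat)))

combined = np.abs(H_sub + H_sat) ** 2
power_sum = np.abs(H_sub) ** 2 + np.abs(H_sat) ** 2

# Eq. (6.4): |H|^2 = |Hsub|^2 + |Hsat|^2 + Lambda, bin by bin.
assert np.allclose(combined, power_sum + Lam)
# With phi_sub - phi_sat = pi the cosine is -1, so Lambda subtracts and the
# combined level collapses to |1 - 0.8|^2 = 0.04: a deep notch.
assert np.all(Lam < 0) and np.allclose(combined, 0.04)
```

With matched magnitudes (0.8 replaced by 1.0) the same phase condition would null the response entirely, which is the worst-case spectral notch described above.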

6.3.1 The Influence of Phase on the Net Magnitude Response

In this brief digression, we explain some of the results obtained in Section
6.2. Specifically, we demonstrate that an appropriate crossover frequency enables
coherent addition of the phase interaction term Γ (ω) with the |A(ω)|2 and |B(ω)|2
terms in Eq. (6.1).
For example, Fig. 6.13(b) shows the Γ (ω) term for crossover frequency ωc corre-
sponding to 60 Hz. Clearly, this term is negative and will contribute to an incoherent
addition in (6.1) around the crossover region (marked by arrows). In contrast, by se-
lecting the crossover frequency to be 100 Hz, the Γ (ω), as shown in Fig. 6.13(a), is
positive around the crossover region. This results in a coherent addition around the
crossover region. These complex addition results are clearly reflected in the plots of
Figs. 6.7(b) and (f), as well as in the σH(ωc) values at 60 Hz and 100 Hz in Fig. 6.8.

6.4 Phase Equalization with All-Pass Filters


6.4.1 Second-Order All-Pass Networks

A second-order all-pass filter, A(z), can be expressed as

Fig. 6.10. The 1/3 octave smoothed combined magnitude response of Fig. 6.9.

Fig. 6.11. The influence of Λ(e^{jω}) on the combined magnitude response.

Fig. 6.12. Plot of the cosine of the phase difference that contributes to the incoherent addition
around the crossover.

A(z) = \frac{\bigl(z^{-1} - z_i^{\dagger}\bigr)\bigl(z^{-1} - z_i\bigr)}{\bigl(1 - z_i z^{-1}\bigr)\bigl(1 - z_i^{\dagger} z^{-1}\bigr)}\Bigg|_{z=e^{j\omega}} \qquad (6.6)

where z_i = r_i e^{jθ_i} is a pole of radius r_i and angle θ_i ∈ [0, 2π). Figure 6.14 shows the
unwrapped phase (viz., arg(A(e^{jω}))) for different r_i and θ_i = 0.25π, whereas Fig.
6.15 shows the group delay plots for the same radii. As can be observed, the closer
the pole is to the unit circle the larger is the group delay (i.e., the larger is the phase
change with respect to frequency). One of the main advantages of an all-pass filter
is that the magnitude response is unity at all frequencies, thereby not changing the
magnitude response of any filter that is cascaded with an all-pass filter.
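The two properties just stated, unit magnitude at every frequency and a group delay that grows as the pole approaches the unit circle, can be checked numerically by evaluating Eq. (6.6) directly. This is a sketch; `allpass2` is an illustrative name.

```python
import numpy as np

def allpass2(w, r, theta):
    """Second-order all-pass section of Eq. (6.6) evaluated at z = e^{jw},
    with pole z_i = r e^{j theta} (and its conjugate) and zeros at the
    conjugate-reciprocal locations; this pairing is what forces |A| = 1."""
    zinv = np.exp(-1j * w)                       # z^{-1} on the unit circle
    zi = r * np.exp(1j * theta)
    num = (zinv - np.conj(zi)) * (zinv - zi)
    den = (1 - zi * zinv) * (1 - np.conj(zi) * zinv)
    return num / den

w = np.linspace(0.01, np.pi - 0.01, 2048)
gd = {}
for r in (0.5, 0.9):
    A = allpass2(w, r, 0.25 * np.pi)
    assert np.allclose(np.abs(A), 1.0)           # unity magnitude everywhere
    gd[r] = -np.gradient(np.unwrap(np.angle(A)), w)  # group delay (samples)
# The closer the pole is to the unit circle, the larger the peak group delay,
# i.e., the faster the phase changes near the pole angle (cf. Figs. 6.14-6.15).
assert gd[0.9].max() > gd[0.5].max()
```

Because the magnitude is identically one, cascading this section with an equalization filter leaves the equalizer's magnitude response untouched and modifies only its phase.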

6.4.2 Phase Correction with Cascaded All-Pass Filters

To combat the effects of incoherent addition of the Λ term, it is preferable to include
the all-pass filter in the satellite channel (e.g., the center channel). In contrast,
if the all-pass were to be placed in the subwoofer channel, the net response between
the subwoofer and the remaining channels (e.g., left, right, and surrounds) could be
affected in an undesirable manner. Thus, the all-pass filter is cascaded with the satel-
lite to remove the effects of phase between this satellite and the subwoofer channel
at a particular listening position. Of course, the method can be adapted to include
information of the net response at multiple listening positions so as to optimize the
Λ term in order to minimize the effects of phase interaction over multiple positions.
Now, the net response with an M-cascade all-pass filter, A_M(e^{jω}), in the satellite
channel can be expressed as

|H(e^{j\omega})|^2 = |\tilde{H}_{\mathrm{sub}}(e^{j\omega})|^2 + |\tilde{H}_{\mathrm{sat}}(e^{j\omega})|^2 + 2|\tilde{H}_{\mathrm{sub}}(e^{j\omega})||\tilde{H}_{\mathrm{sat}}(e^{j\omega})|\cos\bigl(\phi_{\mathrm{sub}}(\omega) - \phi_{\mathrm{sat}}(\omega) - \phi_{A_M}(\omega)\bigr) \qquad (6.7)

Fig. 6.13. Γ(ω) term from (6.1) for crossover frequency (a) 100 Hz, (b) 60 Hz.

where

A_M(e^{j\omega}) = \prod_{k=1}^{M} \frac{\bigl(e^{-j\omega} - r_k e^{-j\theta_k}\bigr)\bigl(e^{-j\omega} - r_k e^{j\theta_k}\bigr)}{\bigl(1 - r_k e^{j\theta_k} e^{-j\omega}\bigr)\bigl(1 - r_k e^{-j\theta_k} e^{-j\omega}\bigr)}

\phi_{A_M}(\omega) = \sum_{k=1}^{M} \phi_{A_M}^{(k)}(\omega) \qquad (6.8)

\phi_{A_M}^{(i)}(\omega) = -2\omega - 2\tan^{-1}\!\left(\frac{r_i\sin(\omega-\theta_i)}{1 - r_i\cos(\omega-\theta_i)}\right) - 2\tan^{-1}\!\left(\frac{r_i\sin(\omega+\theta_i)}{1 - r_i\cos(\omega+\theta_i)}\right)

and \Lambda_F(e^{j\omega}) = 2|\tilde{H}_{\mathrm{sub}}(e^{j\omega})||\tilde{H}_{\mathrm{sat}}(e^{j\omega})|\cos(\phi_{\mathrm{sub}}(\omega) - \phi_{\mathrm{sat}}(\omega) - \phi_{A_M}(\omega)). Thus,
to minimize the inhibitory effect of the Λ term (or, in effect, cause it to coherently add to
|H̃sub(e^{jω})|² + |H̃sat(e^{jω})|²), in the example above, one can define an average square
error function (or objective function) for minimization as

J(n) = \frac{1}{N}\sum_{l=1}^{N} W(\omega_l)\bigl(\phi_{\mathrm{sub}}(\omega_l) - \phi_{\mathrm{sat}}(\omega_l) - \phi_{A_M}(\omega_l)\bigr)^2 \qquad (6.9)

where W (ωl ) is a frequency-dependent weighting function.

Fig. 6.14. Plot of the unwrapped phase of a second-order all-pass filter for different values of
the pole magnitude for θ = 0.25π.

Fig. 6.15. Plot of the group delay of a second-order all-pass filter for different values of the
pole magnitude for θ = 0.25π.

The terms r_i and θ_i (i = 1, 2, . . . , M) can be determined adaptively by minimizing
the objective function with respect to these unknown parameters. The update
equations are

r_i(n+1) = r_i(n) - \frac{\mu_r}{2}\nabla_{r_i} J(n)
\theta_i(n+1) = \theta_i(n) - \frac{\mu_\theta}{2}\nabla_{\theta_i} J(n) \qquad (6.10)

where μ_r and μ_θ are adaptation rate control parameters judiciously chosen to guarantee
stable convergence.
The following relations are obtained:

\nabla_{r_i} J(n) = -\sum_{l=1}^{N} W(\omega_l)\,E(\phi(\omega_l))\,\frac{\partial \phi_{A_M}(\omega_l)}{\partial r_i(n)}
\nabla_{\theta_i} J(n) = -\sum_{l=1}^{N} W(\omega_l)\,E(\phi(\omega_l))\,\frac{\partial \phi_{A_M}(\omega_l)}{\partial \theta_i(n)}
E(\phi(\omega_l)) = \phi_{\mathrm{sub}}(\omega_l) - \phi_{\mathrm{sat}}(\omega_l) - \phi_{A_M}(\omega_l) \qquad (6.11)

where

\frac{\partial \phi_{A_M}(\omega_l)}{\partial r_i(n)} = -\frac{2\sin(\omega_l - \theta_i(n))}{r_i^2(n) - 2r_i(n)\cos(\omega_l - \theta_i(n)) + 1} - \frac{2\sin(\omega_l + \theta_i(n))}{r_i^2(n) - 2r_i(n)\cos(\omega_l + \theta_i(n)) + 1} \qquad (6.12)

and

\frac{\partial \phi_{A_M}(\omega_l)}{\partial \theta_i(n)} = -\frac{2r_i(n)\bigl(r_i(n) - \cos(\omega_l - \theta_i(n))\bigr)}{r_i^2(n) - 2r_i(n)\cos(\omega_l - \theta_i(n)) + 1} + \frac{2r_i(n)\bigl(r_i(n) - \cos(\omega_l + \theta_i(n))\bigr)}{r_i^2(n) - 2r_i(n)\cos(\omega_l + \theta_i(n)) + 1} \qquad (6.13)

(Differentiating the (ω + θ_i) arctangent term of (6.8) gives a positive second term in (6.13).)

During the update process, care was taken to ensure that |r_i(n)| < 1 to guarantee
stability. This was done by randomizing any r_i element whose magnitude exceeded
unity. Clearly, this could increase the convergence time; hence, other methods to
minimize the number of iterations for determining the solution may be investigated
in the future.
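The adaptation just described can be sketched as follows. This is an illustrative reimplementation of Eqs. (6.8) and (6.10)-(6.13), not the authors' code; the finite-difference check at the end supports a plus sign on the (ω + θ) term of the ∂φ/∂θ derivative.

```python
import numpy as np

def phase_cascade(w, r, th):
    """Unwrapped phase of the M-section all-pass cascade, Eq. (6.8)."""
    phi = np.zeros_like(w)
    for ri, ti in zip(r, th):
        phi += (-2 * w
                - 2 * np.arctan2(ri * np.sin(w - ti), 1 - ri * np.cos(w - ti))
                - 2 * np.arctan2(ri * np.sin(w + ti), 1 - ri * np.cos(w + ti)))
    return phi

def dphi_dr(w, r, th):
    """Eq. (6.12): sensitivity of a section's phase to its pole radius."""
    return (-2 * np.sin(w - th) / (r**2 - 2 * r * np.cos(w - th) + 1)
            - 2 * np.sin(w + th) / (r**2 - 2 * r * np.cos(w + th) + 1))

def dphi_dth(w, r, th):
    """Eq. (6.13): sensitivity to the pole angle (second term with + sign)."""
    return (-2 * r * (r - np.cos(w - th)) / (r**2 - 2 * r * np.cos(w - th) + 1)
            + 2 * r * (r - np.cos(w + th)) / (r**2 - 2 * r * np.cos(w + th) + 1))

def adapt_sweep(w, W, phi_target, r, th, mu_r, mu_th, rng):
    """One sweep of the updates (6.10)-(6.11); radii that leave the unit
    circle are re-randomized, as described in the text."""
    E = phi_target - phase_cascade(w, r, th)     # phase error, Eq. (6.11)
    for i in range(len(r)):
        r[i] -= 0.5 * mu_r * -np.sum(W * E * dphi_dr(w, r[i], th[i]))
        th[i] -= 0.5 * mu_th * -np.sum(W * E * dphi_dth(w, r[i], th[i]))
        if abs(r[i]) >= 1.0:
            r[i] = rng.uniform(0.2, 0.8)
    return r, th

# Finite-difference check of the analytic gradients against Eq. (6.8).
w = np.linspace(0.05, 1.0, 40)
eps = 1e-7
fd_r = (phase_cascade(w, [0.6 + eps], [0.9]) - phase_cascade(w, [0.6], [0.9])) / eps
fd_th = (phase_cascade(w, [0.6], [0.9 + eps]) - phase_cascade(w, [0.6], [0.9])) / eps
assert np.allclose(fd_r, dphi_dr(w, 0.6, 0.9), atol=1e-5)
assert np.allclose(fd_th, dphi_dth(w, 0.6, 0.9), atol=1e-5)
```

`arctan2` is used rather than a plain arctangent; for r < 1 the denominators 1 − r cos(·) are positive, so the two forms agree while the former is numerically safer.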

6.4.3 Results

For the combined subwoofer center channel response shown in Fig. 6.9, the r_i and θ_i
with M = 9 were adapted to obtain reasonable minimization of J(n). Furthermore, the
frequency-dependent weighting function, W(ω_l), for the above example was chosen
as unity for frequencies between 60 Hz and 125 Hz. The reason for this choice of
weighting terms can be readily seen from the domain of the Λ(e^{jω}) term of Fig. 6.11
and/or the domain of the “suckout” in Fig. 6.10.
The original phase difference function (φsub (ω) − φsat (ω))2 is plotted in Fig.
6.16 and the cosine term, cos(φsub (ω) − φsat (ω)) that adds incoherently is shown in
Fig. 6.12. Clearly, minimizing the phase difference (using the all-pass cascade in the
satellite channel) around the crossover region will minimize the spectral notch. The
resulting all-pass filtered phase difference function, (φsub (ω)−φsat (ω)−φAM (ω))2 ,
from the adaptation of ri (n) and θi (n) is shown in Fig. 6.17 thereby demonstrating
the minimization of the phase difference around the crossover. The resulting all-pass
filtered term, ΛF (ω), is shown in Fig. 6.18. Comparing Figs. 6.11 and 6.18, it can be
seen that the inhibition turns to an excitation to the net magnitude response around
the crossover region. Finally, Fig. 6.19 shows the resulting combined magnitude re-
sponse with the cascade all-pass filter in the satellite channel, and Fig. 6.20 shows
the third octave smoothed version of Fig. 6.19. A superimposed plot, comprising Fig.
6.20 and the original combined response of Fig. 6.10, is depicted in Fig. 6.21. Clearly,
an improvement of about 7 dB around the crossover can be seen.

6.5 Objective Function-Based Bass Management Filter Parameter Optimization
A typical equalization filter design process involves (i) measuring the loudspeaker
and room responses for each of the satellites and the subwoofer, (ii) storing these

responses, (iii) designing an equalization filter for each channel loudspeaker (viz.,
warping and LPC based, as shown in Fig. 6.22), and (iv) applying individual bass
management filters to each of the equalization filters in a multichannel audio system.
Quantitatively, as an example, the net subwoofer and satellite response at a lis-
tening position, as shown by Fig. 6.23, can be expressed as
\begin{aligned}
|H(e^{j\omega})|^2 &= |H_{\mathrm{sub}}(e^{j\omega})\tilde{H}_{\mathrm{sub}}^{-1}(e^{j\omega}) + H_{\mathrm{sat}}(e^{j\omega})\tilde{H}_{\mathrm{sat}}^{-1}(e^{j\omega})|^2 \qquad (6.14)\\
&= |H_{\mathrm{sub}}(e^{j\omega})\tilde{H}_{\mathrm{sub}}^{-1}(e^{j\omega})|^2 + |H_{\mathrm{sat}}(e^{j\omega})\tilde{H}_{\mathrm{sat}}^{-1}(e^{j\omega})|^2 \\
&\quad + 2|H_{\mathrm{sub}}(e^{j\omega})\tilde{H}_{\mathrm{sub}}^{-1}(e^{j\omega})||H_{\mathrm{sat}}(e^{j\omega})\tilde{H}_{\mathrm{sat}}^{-1}(e^{j\omega})|\cos\bigl(\phi_{\mathrm{sub}}(\omega) + \tilde{\phi}_{\mathrm{sub}}(\omega) - \phi_{\mathrm{sat}}(\omega) - \tilde{\phi}_{\mathrm{sat}}(\omega)\bigr)
\end{aligned}


where H̃sat^{-1}(e^{jω}) and H̃sub^{-1}(e^{jω}) are the bass managed equalization filters for the satellite
and subwoofer channel responses measured at a listening position in a room. The
phase responses of the subwoofer and satellite bass managed filters are given
by φ̃sub(ω) and φ̃sat(ω), respectively. Specifically, H̃sat^{-1}(e^{jω}), H̃sub^{-1}(e^{jω}), φ̃sub(ω),
and φ̃sat(ω) may be expressed as
\tilde{H}_{\mathrm{sat}}^{-1}(e^{j\omega}) = H_{\omega_c,N}^{hp}(e^{j\omega})\,\hat{H}_{\mathrm{sat}}^{-1}(e^{j\omega})
\tilde{H}_{\mathrm{sub}}^{-1}(e^{j\omega}) = H_{\omega_c,M}^{lp}(e^{j\omega})\,\hat{H}_{\mathrm{sub}}^{-1}(e^{j\omega}) \qquad (6.15)
\tilde{\phi}_{\mathrm{sat}}(\omega) = -\hat{\phi}_{\mathrm{sat}}(\omega) + \phi_{\omega_c,N}^{hp}(\omega)
\tilde{\phi}_{\mathrm{sub}}(\omega) = -\hat{\phi}_{\mathrm{sub}}(\omega) + \phi_{\omega_c,M}^{lp}(\omega)

where the “hat” above the frequency and phase responses of the subwoofer and
satellite equalization filters represents an approximation due to the lower-order
spectral modeling via LPC. As is evident from Eqs. (6.14) and (6.15), the crossover region
response of |H(e^{jω})|² can be further optimized through a proper choice of the bass
management filter parameters (ωc, N, M).

Fig. 6.16. Plot of the phase difference, (φsub(ω) − φsat(ω))², as a function of frequency.

Fig. 6.17. Plot of the all-pass filtered phase difference, (φsub(ω) − φsat(ω) − φAM(ω))², as
a function of frequency. Observe the reduction around the crossover (≈ 80 Hz).

Fig. 6.18. The influence of the all-pass filtered function ΛF(e^{jω}) on the combined magnitude
response.

Fig. 6.19. The combined magnitude response of the subwoofer and the satellite with a cascade
of all-pass filters.

Fig. 6.20. The 1/3 octave smoothed combined magnitude response of Fig. 6.19.

Fig. 6.21. A superimposed plot of the original subwoofer and satellite combined magnitude
response and the all-pass filter-based combined magnitude response demonstrating about
7 dB improvement around the crossover region (≈ 80 Hz).

An objective function that is particularly useful for characterizing the magnitude
response is the spectral deviation measure [94, 95]. Given that the effects of
the choice of the bass management parameters (viz., (ωc, N, M)) are bandlimited
around the crossover frequency, this measure is quite effective in predicting the
behavior of the resulting magnitude response around the crossover. The spectral
deviation measure, σH(ωc, N, M), which indicates the degree of flatness of the
magnitude spectrum, is defined as
 


\sigma_H(\omega_c, N, M) = \sqrt{\frac{1}{P}\sum_{i=0}^{P-1}\bigl(10\log_{10}|H(e^{j\omega_i})| - \Delta\bigr)^2} \qquad (6.16)

where \Delta = (1/P)\sum_{i=0}^{P-1} 10\log_{10}|H(e^{j\omega_i})|, |H(e^{j\omega_i})| can be found from Eq. (6.14),
and P is the number of frequency points selected around the crossover region. Specifically,
the smaller the \sigma_H(\omega_c, N, M) value, the flatter is the magnitude response around the
crossover region.

Fig. 6.22. Warping based equalization.



Fig. 6.23. Simplified block diagram for the net subwoofer and satellite channel signal at a
listener location.

A typical multichannel audio system, such as a home theater receiver, includes
a selectable finite integer set of crossover frequencies, typically in 10 Hz increments,
from 20 Hz through 150 Hz. Thus, although a near-optimal solution ωc∗ (and
N∗, M∗) can be found through a gradient descent optimization process by minimizing
the spectral deviation measure (e.g., via ∂σH(ωc, N, M)/∂ωc|ωc=ωc∗), this is unnecessarily
complicated. Clearly, the choice of the crossover frequency is limited to a
finite set of integers. Thus, a simple but effective means of selecting a proper
crossover frequency is to characterize the effect of the selected crossover frequency
on σH(ωc, N, M) in the crossover region. Similarly, the choice of N and M can
essentially be limited to a finite set of integers. Thus, the effect of varying
(ωc, N, M) jointly can be immediately observed on σH(ωc, N, M).
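The joint discrete search can be sketched as below. The Butterworth-style magnitude prototypes and delayed room responses are synthetic stand-ins (the book's actual bass management filters are IIR crossovers applied to measured responses), and all names are illustrative.

```python
import numpy as np

f = np.linspace(20.0, 300.0, 1200)               # frequency grid, Hz

def bm_mag(f, fc, order, kind):
    """Butterworth-style magnitude prototype standing in for the bass
    management low-pass ('lp') or high-pass ('hp') of the given order."""
    x = (f / fc) ** order
    return (x if kind == "hp" else 1.0) / np.sqrt(1.0 + (f / fc) ** (2 * order))

def sigma(net_mag, band):
    """Spectral deviation, Eq. (6.16), of a net magnitude response."""
    Ldb = 10.0 * np.log10(net_mag[band])
    return np.sqrt(np.mean((Ldb - Ldb.mean()) ** 2))

# Hypothetical room responses with distinct propagation delays, so that the
# phase interaction term influences the combined magnitude.
H_sub = bm_mag(f, 120, 4, "lp") * np.exp(-2j * np.pi * f * 0.010)
H_sat = bm_mag(f, 80, 3, "hp") * np.exp(-2j * np.pi * f * 0.007)

band = (f >= 40) & (f <= 200)                    # crossover region
grid = [(wc, N, M) for wc in range(50, 151, 10)
        for N in range(1, 6) for M in range(1, 5)]
best = min(grid, key=lambda p: sigma(
    np.abs(bm_mag(f, p[0], p[2], "lp") * H_sub
           + bm_mag(f, p[0], p[1], "hp") * H_sat), band))
# `best` plays the role of (wc*, N*, M*) in Eq. (6.17) for this synthetic setup.
```

The grid is small (11 x 5 x 4 = 220 candidates), which is why the exhaustive evaluation is preferable to gradient descent here.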

6.5.1 Results

The full-range subwoofer and satellite responses of Fig. 6.24 were used for obtaining
the corresponding equalization filters with the warping and LPC modeling method
of Fig. 6.22.
The integer bass management parameters to be applied to the equalization filters,
Ĥsat^{-1}(e^{jω}) and Ĥsub^{-1}(e^{jω}), were selected from the following intervals: ωc ∈
{50, . . . , 150} Hz, N ∈ {1, . . . , 5}, M ∈ {1, . . . , 4}. Subsequently, for a particular combination of
(ωc, N, M), the bass managed equalization filters, H̃sat^{-1}(e^{jω}) and H̃sub^{-1}(e^{jω}), were
applied to Hsub(e^{jω}) and Hsat(e^{jω}) to yield the equalized responses Hsub(e^{jω})H̃sub^{-1}(e^{jω})
and Hsat(e^{jω})H̃sat^{-1}(e^{jω}). The equalized subwoofer response was 1/3 octave
smoothed and level matched with the 1/3 octave smoothed equalized satellite
response. The resulting complex frequency response, H(e^{jω}), and the corresponding
magnitude response were obtained. At this point, the spectral deviation measure
σH (ωc , N, M ) was determined, for the given choice of (ωc , N, M ), in the crossover
region (chosen to be the frequency range of 40 Hz and 200 Hz given the choice of
ωc ). Finally, the best choice of bass management filter parameter set, (ωc∗ , N ∗ , M ∗ ),
is then that set which minimizes the spectral deviation measure in the crossover
region. Specifically,

(\omega_c^*, N^*, M^*) = \arg\min_{\omega_c, N, M}\, \sigma_H(\omega_c, N, M) \qquad (6.17)

Fig. 6.24. (a) 1/3 octave smoothed magnitude response of the subwoofer-based LRTF mea-
sured in a reverberant room; (b) 1/3 octave smoothed magnitude response of the satellite-based
LRTF measured in the same room.


The lowest spectral deviation measure, σH(ωc, N, M), was obtained for
(ωc∗, N∗, M∗) corresponding to (60 Hz, 3, 4). Observing the natural full-range,
approximately 18 dB/octave decay rate (below 60 Hz) of the satellite in Fig. 6.24(b), it is
evident that this choice of N = 3 (i.e., an 18 dB/octave roll-off applied to the satellite
speaker equalized response) will not cause the satellite speaker to be distorted. If necessary, in
the event that N is not sufficiently high, the next largest σH(ωc, N′, M) can always
be selected such that N′ > N. Of course, other signal-limiting mechanisms may be
employed in conjunction with the proposed approach, and these are beyond the scope
of this chapter. For this choice of the bass management filter parameters (i.e., (60 Hz,
3, 4)), the net magnitude response |H(ejω )|2 (in dB) is shown in Fig. 6.25. Clearly,
the variation in the crossover region (viz., 40 Hz through 200 Hz) is negligible, and
this is reflected by the smallest value found, σH(ωc, N, M) = 0.45. Thus, the
parameter set (60 Hz, 3, 4) forms the correct choice for the bass management filters
for the room responses of Fig. 6.24.
Further examples, as provided in Figs. 6.26 to 6.28, show the net magnitude
response |H(e^{jω})|² for different choices of (ωc, N, M) that produce a larger
σH(ωc, N, M). As can be seen, these “nonoptimal” integer choices of the bass
management filter parameters, as determined from the spectral deviation measure, cause
significant variations in the magnitude response in the crossover region.

Fig. 6.25. Net magnitude response |H(e^{jω})|² (dB) for (60 Hz, 3, 4) with
σH(0.0025π, 3, 4) = 0.45.

Fig. 6.26. Net magnitude response |H(ejω )|2 (dB) for (50 Hz, 4, 4) with
σH (0.0021π, 4, 4) = 0.61.

6.6 Multiposition Bass Management Filter Parameter Optimization

For multiposition bass management parameter optimization, an average spectral deviation
measure can be expressed as

\sigma_H^{\mathrm{avg}} = \frac{1}{L}\sum_{j=1}^{L} \sigma_{H_j}(\omega_c, N, M) \qquad (6.18)

where L is the total number of positions equalized during the multiposition equal-
ization step.
In a nutshell, the bass management parameters can be optimized by the following
steps: (i) perform multiple position equalization on raw responses (i.e., responses
without any bass management applied to them); (ii) apply the candidate bass
management filters, parameterized by (ωc, N, M) with ωc ∈ {40, . . . , 200} Hz,
N ∈ {2, . . . , 4}, M ∈ {3, . . . , 5}, to the equalized subwoofer and satellite responses;
(iii) perform subwoofer and satellite level setting using bandlimited noise and perceptual
C-weighting; (iv) determine the average spectral deviation measure, σH^{avg}, after
performing 1/3 octave smoothing for each of the net (i.e., combined subwoofer and
satellite) responses in the range of 40 Hz to 250 Hz; and (v) select the (ωc, N, M) that
minimizes σH^{avg}.
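The averaging in steps (iv)-(v) can be sketched as follows; the array layout and function name are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def avg_spectral_deviation(net_mags, band):
    """Eq. (6.18): average of the per-position deviations sigma_Hj.

    net_mags : (L, F) array of net (combined subwoofer + satellite)
               magnitude responses, one row per equalized listening
               position, assumed already bass managed, 1/3 octave
               smoothed, and level matched upstream.
    band     : indices of the crossover-region bins (e.g., 40-250 Hz).
    """
    devs = []
    for mag in net_mags:                         # one sigma_Hj per position
        Ldb = 10.0 * np.log10(mag[band])
        devs.append(np.sqrt(np.mean((Ldb - Ldb.mean()) ** 2)))
    return float(np.mean(devs))

# Flat responses at every position give a zero average deviation.
assert avg_spectral_deviation(np.ones((4, 128)), np.arange(128)) == 0.0
```

The candidate (ωc, N, M) whose set of L net responses minimizes this average is then selected, exactly as in the single-position case.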

6.6.1 Results

As a first example, the full-range subwoofer and satellite responses at L = 4 positions,
measured in a reverberant room with T60 ≈ 0.75 s, as shown in Fig. 6.29, were
used for obtaining the corresponding equalization filters with the pattern recognition,
warping, and LPC modeling method. The integer bass management parameters to be
applied to the equalization filters, Ĥsat^{-1}(e^{jω}) and Ĥsub^{-1}(e^{jω}), and the general steps for
determining the parameter set, (ωc, N, M), that minimizes σH^{avg} are described in the
preceding section. As a comparison, we also present the results of performing only
crossover frequency optimization [96], but for multiple positions, using the average
spectral deviation measure. The resulting equalized plots are shown in Fig. 6.30.

Fig. 6.27. Net magnitude response |H(e^{jω})|² (dB) for (130 Hz, 2, 1) with
σH(0.0054π, 2, 1) = 1.56.

Fig. 6.28. Net magnitude response |H(e^{jω})|² (dB) for (90 Hz, 5, 3) with
σH(0.0037π, 5, 3) = 2.52.

Fig. 6.29. An example of full-range subwoofer and satellite responses measured at four different
positions in a room with T60 ≈ 0.75 s.

Fig. 6.30. The equalized and bass management filter parameter optimized responses, where
the lighter (thin) curve corresponds to all-parameter optimization and the thick (darker) curve
corresponds to crossover frequency optimization over multiple positions.
Comparing the results using the full parameter optimization (lighter curve) with
the crossover frequency optimization over multiple positions, it can be seen that
all-parameter optimization flattens the magnitude response around the crossover region
(viz., 40 Hz through 250 Hz). Specifically, for example, in position 2 a lower-Q (i.e., broad)
notch around the crossover region obtained through crossover frequency optimization is
transformed to a high-Q (i.e., narrow width) notch by all-parameter optimization.
In addition, as shown via position 4, a very broad and high-amplitude undesirable
peak in the magnitude response, obtained from crossover frequency optimization, is
reduced in amplitude and narrowed through all-parameter optimization (red curve).
In fact, Toole and Olive [97] have demonstrated that, based on steady-state measurements,
low-Q resonances producing broad peaks in the measurements are more
easily heard than high-Q resonances producing narrow peaks of similar amplitude.
The crossover frequency optimization resulted in a crossover at 90 Hz with the minimum
of the average spectral deviation measure, σH^{avg}, being 0.98. The all-parameter
optimization resulted in the parameter set (ωc, N, M) corresponding to (80 Hz, 4, 5),
with σH^{avg} for this parameter set being minimum at 0.89. Also, a comparison between

Fig. 6.31. The equalized and bass management filter parameter optimized responses, where
the thin curve corresponds to all-parameter optimization and the thick curve corresponds to
unequalized responses with the standard bass management filters (80 Hz, 4, 5).

unequalized responses, with bass management set to the standard (80 Hz, 4, 5),
and the all-parameter optimized and equalized responses is shown in Fig. 6.31.

6.7 Spectral Deviation and Time Delay-Based Correction


The crossover region can be manipulated through the use of a simple delay in each
loudspeaker (i.e., nonsubwoofer) channel. Specifically, the output signal y(n) can
be expressed in terms of the input signal x(n), the satellite bass management filter
bm_sat(n), the subwoofer bass management filter bm_sub(n), the room responses
h_sat(n) and h_sub(n), and the delay of n_d samples (viz., δ(n − n_d)):

y(n) = bm_{\mathrm{sat}}(n) \otimes \delta(n - n_d) \otimes h_{\mathrm{sat}}(n) \otimes x(n) + bm_{\mathrm{sub}}(n) \otimes h_{\mathrm{sub}}(n) \otimes x(n) \qquad (6.19)

The frequency domain representation for the resulting response leads to

H(e^{j\omega}) = BM_{\mathrm{sat}}(e^{j\omega})\,e^{-j\omega n_d}\,H_{\mathrm{sat}}(e^{j\omega}) + BM_{\mathrm{sub}}(e^{j\omega})\,H_{\mathrm{sub}}(e^{j\omega}) \qquad (6.20)

whereas the system representation is shown in Fig. 6.32.



Fig. 6.32. System representation of time delay technique to correct crossover region response.

An objective function that is particularly useful for characterizing the magnitude
response variations is the spectral deviation measure,

\sigma_{|H|}(e^{j\omega}) = \sqrt{\frac{1}{D}\sum_{i=P_1}^{P_2}\bigl(10\log_{10}|H(e^{j\omega_i})| - \Delta\bigr)^2} \qquad (6.21)

where \Delta = (1/D)\sum_{i=P_1}^{P_2} 10\log_{10}|H(e^{j\omega_i})|, |H(e^{j\omega_i})| can be found from (6.20),
and D = P_2 - P_1 + 1 is the number of frequency points selected around the crossover
region. For the present simulations, because the bass management filters
that were selected had a crossover around 80 Hz, P_1 and P_2 were selected to be the bin
numbers corresponding to 40 Hz and 200 Hz, respectively (viz., for an 8192-length
response and a sampling rate of 48 kHz, the bin numbers corresponded to about 6
and 34 for 40 Hz and 200 Hz, respectively).
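The quoted bin numbers follow directly from k = ⌊f · N_FFT / f_s⌋:

```python
# Bin index of a frequency f for an Nfft-point response sampled at fs:
# k = floor(f * Nfft / fs).
Nfft, fs = 8192, 48000
P1 = 40 * Nfft // fs     # bin for 40 Hz
P2 = 200 * Nfft // fs    # bin for 200 Hz
print(P1, P2)            # → 6 34
```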
Thus, the process for selecting the best time delay n∗d is: (i) set n_d = 0 (this may
be relative to any delay used for time-aligning speakers such that the relative delays
between signals from the various channels to a listening position are approximately
zero); (ii) level match the subwoofer and satellite; (iii) determine (6.19); (iv) determine
(6.21); (v) set n_d = n_d + 1; (vi) repeat (ii) through (v) while n_d < N_d; and
(vii) select n∗d = arg min_{n_d} σ|H|(e^{jω}).
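The seven steps reduce to a small search loop. The sketch below uses illustrative names and idealized single-delay responses (not the measured data of the figures); it recovers a known inter-channel misalignment.

```python
import numpy as np

def best_delay(H_sat, H_sub, f, fs, band, Nd=200):
    """Steps (i)-(vii): try satellite delays nd = 0..Nd-1 (Eq. 6.20) and
    return the one minimizing the spectral deviation of Eq. (6.21).
    Level matching and bass management are assumed done upstream."""
    def sigma(mag):
        Ldb = 10.0 * np.log10(mag[band])
        return np.sqrt(np.mean((Ldb - Ldb.mean()) ** 2))
    devs = [sigma(np.abs(H_sat * np.exp(-2j * np.pi * f * nd / fs) + H_sub))
            for nd in range(Nd)]
    return int(np.argmin(devs))

# Toy check: if the satellite leads the subwoofer by 142 samples, the search
# recovers nd* = 142 (compare the 60 Hz case discussed below).
fs = 48000.0
f = np.linspace(40.0, 200.0, 160)
H_sub = np.exp(-2j * np.pi * f * 242.0 / fs)   # unit magnitude, 242-sample delay
H_sat = np.exp(-2j * np.pi * f * 100.0 / fs)   # unit magnitude, 100-sample delay
nd_star = best_delay(H_sat, H_sub, f, fs, band=slice(None))
assert nd_star == 142
```

At the correct delay the two branches add coherently across the whole band; every other candidate leaves a frequency-dependent phase offset and hence ripple that raises σ|H|.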
Care should be taken to ensure that (i) the delay n_d is not large enough to cause
a perceptible delay between audio and video frames, and (ii) the relative delays
between channels are not large enough to cause any imaging problems. Furthermore,
if n_d is set relative to time-aligned delays, then the termination condition can
be set as M_d < n_d < N_d (with M_d < 0 and N_d > 0), where small negative delays
in each channel are allowed as long as they are not large enough, relative to delays
in other channels, to influence imaging detrimentally. In this chapter we have
selected M_d = 0 and N_d = 200, which roughly translates to about a 4 ms delay at a
48 kHz sampling rate, and results are always presented for the one-loudspeaker-and-subwoofer
case. Future work (as explained in the context below) is in the direction of
joint crossover and time delay optimization so as to have minimal time delay offsets
between channels in a multichannel system.
Furthermore, this technique can be easily adapted to multiposition crossover cor-
rection (results of which are presented subsequently) by defining and optimizing over
an average spectral deviation measure given as

\sigma_{|H|}^{\mathrm{avg}}(e^{j\omega}) = \frac{1}{L}\sum_{j=1}^{L} \sigma_{|H_j|}(e^{j\omega}) \qquad (6.22)

where L is the total number of positions and σ|Hj | (ejω ) is the narrowband spectral
deviation measure at position j. Additionally, this technique can be cascaded with the
automatic crossover frequency finding method described in [98] (i.e., in conjunction
with a room equalization algorithm).

6.7.1 Results for Spectral Deviation and Time Delay-Based Crossover Correction

Figure 6.33 shows the full-range satellite and subwoofer response at a listening po-
sition, whereas Fig. 6.34 compares the bass managed response (dash-dot line), with
crossover at 60 Hz, with the spectral deviation-based time delay corrected crossover
region response. The optimal time delay found was n∗d = 142 samples at 48 kHz.
Figure 6.35 compares the correction being done in the crossover region using the
automatic time delay and spectral deviation-based technique but for a crossover of
70 Hz for the same speaker set and same position as that of Fig. 6.33. The optimal
time delay was 88 samples. A minimal 10 Hz change in the crossover thus altered the
required delay by 54 samples. One possible explanation is that the suckout for the
70 Hz case was less deep than for the 60 Hz case and hence needed a smaller time
delay for effective crossover correction. This
is further validated by selecting the crossover to be 90 Hz and observing the time
delay correction required. Figure 6.36 shows that a small amount of crossover region
response correction is achieved, for crossover of 90 Hz, given the optimal time delay
n∗d = 40 samples. This time delay is a further reduction of 48 samples over the 70
Hz case as the crossover region response is further optimized when the crossover
frequency is selected at 90 Hz. From the three crossover frequencies for the center-
sub case, 90 Hz is the best crossover frequency as it gives the least amount of suckout
in the crossover region and hence requires a very small time delay of just 40 samples.
Accordingly, it can be inferred that the time delay offsets between channels in
a multichannel setup can be kept at a minimum, but still provide crossover region
correction, by either of the following techniques: (i) first performing a crossover
frequency search by the crossover finding method to improve the crossover region
response for each channel loudspeaker and subwoofer response and then applying a
relatively smaller time delay correction to each satellite channel to further improve
the crossover response, or (ii) performing a multidimensional search for the best
choice of time delay and the crossover, simultaneously, using the spectral deviation
measure, so as to keep the time delay offsets between channels at a minimum.
Figure 6.37 shows the full-range subwoofer and left surround responses at a lis-
tening position, whereas Fig. 6.38 compares the bass managed response (dash-dot
line), with crossover at 120 Hz, with the spectral deviation-based time delay cor-
rected crossover region response. The optimal time delay $n_d^*$ was 73 samples at
48 kHz.

6.8 Summary
In this chapter we presented results that show the effect of a proper choice of
crossover frequency for improving low-frequency performance. Additional param-
eter optimization of the bass management filters is shown to yield improved per-
formance. A comparison of the results from crossover-only and all-parameter optimization of the bass management filters for multiposition equalization is presented.
As was shown, cascading an all-pass filter can provide further improvements to the
equalization result in the crossover region. Alternatively, time delay adjustments can
be made in each loudspeaker channel to correct the crossover region response.

Fig. 6.33. The individual full-range subwoofer and a center channel magnitude response mea-
sured at a listening position in a reverberant room.
154 6 Practical Considerations for Multichannel Equalization

Fig. 6.34. The bass managed combined response as well as time delay and spectral deviation
measure-based corrected crossover response (crossover frequency = 60 Hz).

Fig. 6.35. The bass managed combined response as well as time delay and spectral deviation
measure-based corrected crossover response (crossover frequency = 70 Hz).

Fig. 6.36. The bass managed combined response as well as time delay and spectral deviation
measure-based corrected crossover response for the left surround (crossover frequency = 90
Hz).

Fig. 6.37. The individual full-range subwoofer and a left surround channel magnitude response
measured at a listening position in a reverberant room.

Fig. 6.38. The bass managed combined response as well as time delay and spectral deviation
measure-based corrected crossover response for the left surround (crossover frequency = 120
Hz).
7 Robustness of Equalization to Displacement Effects: Part I

Traditionally, multiple listener room equalization is performed to improve sound
quality at all listeners, during audio playback, in a multiple listener environment
(e.g., movie theaters, automobiles, etc.). A typical way of doing multiple listener
equalization is through spatial averaging, where the room responses are averaged
spatially between positions and an inverse equalization filter is found from the spa-
tially averaged result. However, the equalization performance will be affected if there
is a mismatch between the position of the microphones (which are used for measur-
ing the room responses for designing the equalization filter) and the actual center
of listener head position (during playback). In this chapter, we present results of the
effects of microphone and listener mismatch on spatial average equalization perfor-
mance for frequencies above the Schroeder frequency. The results indicate that, for
the analyzed rectangular listener configuration, the region of effective equalization
depends on (i) distance of a listener from the source, (ii) amount of mismatch be-
tween the responses, and (iii) the frequency of the audio signal. We also present
some convergence analysis to interpret the results.

7.1 Introduction
A typical room is an acoustic enclosure that can be modeled as a linear system whose
behavior at a particular listening position is characterized by an impulse response.
The impulse response yields a complete description of the changes a sound signal
undergoes when it travels from a source to a receiver (microphone/listener). The sig-
nal at the receiver consists of direct path components, discrete reflections that arrive
a few milliseconds after the direct sound, as well as a reverberant field component. In
addition, it is well established that room responses change with source and receiver
locations in a room [11, 63].

© 2004 ASA. Reprinted, with permission, from S. Bharitkar and C. Kyriakakis, "Robustness of spatial average equalization: A statistical reverberation model approach," J. Acoust. Soc. Amer., 116:3491.

Fig. 7.1. Examples of room acoustical responses, having the direct and reverberant components, measured at two positions a few feet apart in a room.

Specifically, the time of arrival of the direct and multipath reflections and the energy of the reverberant component will vary from position to position. In other words, a room response at position $i$, $p_{f,i}$, can be expressed as $p_{f,i} = p_{f,d,i} + p_{f,rev,i}$, whereas the room response at position $j$, $p_{f,j}$, can be expressed as $p_{f,j} = p_{f,d,j} + p_{f,rev,j}$, where $p_{f,d,j}$ is the frequency response of the direct path component and $p_{f,rev,j}$ is the response of the multipath component. An example of time domain responses at two positions, displaced a few feet apart, in a room with
reverberation time of about 0.25 seconds, is shown in Fig. 7.1 along with the direct
component, early reflections, and late reverberant components. Figure 7.2 shows the
corresponding frequency response from 20 Hz to 20 kHz.
One of the goals in equalization is to minimize the spectral deviations (viz., cor-
recting the peaks and dips) found in the magnitude response through an equaliza-
tion filter. This correction of the room response significantly improves the quality of
sound played back through a loudspeaker system. In essence, the resulting system
formed from the combination of the equalization filter and the room response should
have a perceptually flat frequency response.
One of the important considerations is that the equalization filter has to be de-
signed such that the spectral deviations in the magnitude response (e.g., Fig. 7.2) are
minimized simultaneously for all listeners in the environment. Simultaneous equal-
ization is an important consideration because listening has evolved into a group ex-
perience (e.g., as in home theaters, movie theaters, and concert halls). An example
of performing only a single position equalization (by designing an inverse filter for
position 1) is shown in Fig. 7.3. The top plot shows the equalization result at position

Fig. 7.2. Magnitude responses of room responses of Fig. 7.1 showing different spectral devia-
tions (from flat) at the two listener positions.

Fig. 7.3. Magnitude responses, upon single position equalization, of responses of Fig. 7.2.
Specifically, the equalization filter is designed to correct for deviations at position 1, but the
equalized response at position 2 is degraded.

Fig. 7.4. Magnitude responses, upon spatial average equalization, of responses of Fig. 7.2.
Specifically, the equalization filter is designed to correct for deviations, on an average, at
positions 1 and 2.

1 (which shows a flat response under ideal filter design).1 However, the equalization
performance is degraded at position 2 with the use of this single position filter as can
be seen in the lower plot. For example, comparing Figs. 7.2 and 7.3, it can be seen
that the response around 50 Hz at position 2, after single position equalization, is at
least 7 dB below the response before equalization.
One method for providing simultaneous multiple listener equalization is spa-
tially averaging the measured room responses at different positions, for a given
loudspeaker, and stably inverting the result. The microphones are positioned, during
measurements, at the expected center of a listener’s head. An example of performing
spatial average equalization is shown in Fig. 7.4. Clearly, the spectral deviations are
significantly minimized for both positions through the spatial average equalization
filter.2
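The averaging-and-inversion step can be sketched as follows. The two synthetic "room responses" and the regularization floor below are hypothetical, and the inverse is magnitude-only rather than the full stable inversion used for the figures; the point is only that equalizing against the spatial average reduces the dB deviations at both positions simultaneously.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024

def toy_room_response():
    """Direct spike plus an exponentially decaying random 'reverb' tail."""
    h = np.zeros(n)
    h[0] = 1.0
    h += 0.05 * rng.standard_normal(n) * np.exp(-np.arange(n) / 200.0)
    return h

H1 = np.fft.rfft(toy_room_response())
H2 = np.fft.rfft(toy_room_response())

# Spatial average of the two magnitude responses, then a regularized
# magnitude-only inverse (the floor keeps the inversion stable).
avg = 0.5 * (np.abs(H1) + np.abs(H2))
eq = 1.0 / np.maximum(avg, 1e-3)

def deviation_db(H):
    """Std. dev. (dB) of a magnitude response about its mean level."""
    return np.std(20 * np.log10(np.abs(H) + 1e-12))

before = max(deviation_db(H1), deviation_db(H2))
after = max(deviation_db(H1 * eq), deviation_db(H2 * eq))
print(after < before)  # the worst-case deviation shrinks
```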
Although spatial average equalization is aimed at achieving uniform frequency
response coverage for all listeners, its performance is often limited due to (i) mis-
match between microphone measurement location and actual location for the center
of the listener head, or (ii) variations in listener locations (e.g., head movements).
In this chapter, we present a method for evaluating the robustness of spatial
averaging-based equalization, due to the introduction of variations in room responses

1 In practice, a low-pass filter with a large cutoff frequency (e.g., 10 kHz), depending on the direct-to-reverberant energy, is applied to the equalization filter to prevent the audio from sounding bright.
2 The filter was a finite impulse response filter of duration 8192 samples.

(generated either through (i) or (ii)), for rectangular listener arrangements relative to
a fixed sound source. The proposed approach uses a statistical description for the
reverberant field in the responses (viz., via the normalized correlation functions) in
a rectangular listener configuration for a rectangular room.3 A similar approach is
followed in [70] for determining variations in performance. However, this was done
with a single position equalization in mind and is focused for microphone array ap-
plications (e.g., sound source localization). Talantzis and Ward [99] used a similar
analysis for understanding the effect of source displacements, but this analysis was
also presented for a microphone array setup without spatial average equalization.
The advantages of the proposed approach are that (i) it is based on established theory of the statistical nature of reverberant sound fields [16]; (ii) it can be applied over a large frequency range above the Schroeder frequency for typical-size rooms, unlike modal equations, which are valid only at low frequencies with wavelengths greater than $(1/3)\min[L_x, L_y, L_z]$ [12]; and (iii) the computational complexity, due to the approximations, is low.
In the next section we introduce background necessary for the development of
the robustness analysis. Specifically, an introduction is provided for the determinis-
tic direct component, and the statistical reverberant field correlations. Subsequently
we present the mismatch measure for analyzing the effects of mismatch between
microphone (during measurement of room responses) and listener position (during
playback) with a spatial average equalizer. Additionally, convergence analysis of the
equalization mismatch error, for spatial average equalization, is presented at the end
of the section. Results based on simulations for typical rectangular listener arrange-
ments relative to a fixed source are presented for a rectangular configuration as this
is fairly common in large environments (e.g., movie theaters, concert halls) as well
as in typical home theater setups. The analysis can be extended to arbitrary listening
configurations.

7.2 Room Acoustics for Simple Sources


The sound pressure $p_{f,i}$ at location $i$ and frequency $f$ can be expressed as the sum of a direct field component, $p_{f,d,i}$, and a reverberant field component, $p_{f,rev,i}$:

$$p_{f,i} = p_{f,d,i} + p_{f,rev,i} \qquad (7.1)$$

The direct field component of the sound pressure, $p_{f,d,i}$, of a plane wave at a far-field listener location $i$, for a sound source of frequency $f$ located at $i_0$, can be expressed as [12]

$$p_{f,d,i} = -jk\rho c S_f\, g_f(i|i_0)\, e^{-j\omega t}, \qquad g_f(i|i_0) = \frac{1}{4\pi R}\, e^{jkR}, \qquad R^2 = |i - i_0|^2 \qquad (7.2)$$

where $p_{f,d}(i|i_0)$ is the direct component sound pressure amplitude, $S_f$ is the source strength, $k = 2\pi/\lambda$ is the wavenumber, $c = \lambda f$ is the speed of sound (343 m/s), and $\rho$ is the density of the medium (1.25 kg/m³ at sea level).

3 A rectangular room is considered because the assumptions for a statistical reverberant sound field have been verified for this shape of room, and in practice rectangular-shaped rooms are commonly found.
The normalized correlation function [100], which expresses a statistical relation between the sound pressures of the reverberant components at separate locations $i$ and $j$, is given by

$$\frac{E\{p_{f,rev,i}\, p^*_{f,rev,j}\}}{\sqrt{E\{p_{f,rev,i}\, p^*_{f,rev,i}\}\, E\{p_{f,rev,j}\, p^*_{f,rev,j}\}}} = \frac{\sin kR_{ij}}{kR_{ij}} \qquad (7.3)$$

where $R_{ij}$ is the separation between the two locations $i$ and $j$ relative to an origin, and $E\{\cdot\}$ is the expectation operator.

The reverberant field mean square pressure is defined as

$$E\{p_{f,rev,i}\, p^*_{f,rev,i}\} = \frac{4c\rho\Pi_a(1-\bar{\alpha})}{S\bar{\alpha}} \qquad (7.4)$$

where $\Pi_a$ is the power of the acoustic source, $\bar{\alpha}$ is the average absorption coefficient of the surfaces in the room, and $S$ is the surface area of the room.
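The $\sin(kR)/(kR)$ behavior of (7.3) is easy to evaluate numerically; the helper below is an illustrative function (not from the text) showing that the reverberant field decorrelates once the separation approaches half a wavelength.

```python
import numpy as np

def reverberant_correlation(f, r, c=343.0):
    """Normalized reverberant-field correlation sin(kR)/(kR) of Eq. (7.3)
    between two points separated by r metres, at frequency f (Hz)."""
    k = 2 * np.pi * f / c
    return np.sinc(k * r / np.pi)   # np.sinc(x) = sin(pi x)/(pi x)

f = 500.0
print(reverberant_correlation(f, 0.0))                           # 1.0 (no separation)
print(abs(reverberant_correlation(f, 343.0 / (2 * f))) < 1e-9)   # True (first zero at half a wavelength)
```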
The assumption of a statistical description (as given in (7.3) and (7.4)) for reverberant fields in rooms is justified if the following conditions are fulfilled [16]: (1) the linear dimensions of the room are large relative to the wavelength; (2) the average spacing of the resonance frequencies is smaller than one-third of their bandwidth (this condition is fulfilled in rectangular rooms at frequencies above the Schroeder frequency, $f_s = 2000\sqrt{T_{60}/V}$ Hz, where $T_{60}$ is the reverberation time in seconds and $V$ is the volume in m³); and (3) both source and microphone are in the interior of the room, at least a half-wavelength away from the walls.

Furthermore, under the conditions in [16], the direct and reverberant sound pressures are uncorrelated.
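The threshold in condition (2) is a one-line computation; the room values below are illustrative choices, not figures from the text.

```python
import math

def schroeder_frequency(t60, volume):
    """f_s = 2000 * sqrt(T60 / V), with T60 in seconds and V in m^3."""
    return 2000.0 * math.sqrt(t60 / volume)

# Hypothetical living-room-sized space: V = 50 m^3, T60 = 0.4 s.
print(round(schroeder_frequency(0.4, 50.0), 1))  # 178.9
```

Above roughly this frequency the statistical reverberant-field description of (7.3) and (7.4) applies; below it, the modal treatment of Chapter 8 is needed.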

7.3 Mismatch Analysis for Spatial Average Equalization


7.3.1 Analytic Expression for Mismatch Performance Function

A performance function, $\bar{W}_f$, that is used for analyzing the effects of mismatch, for spatial average equalization, of room responses is given as

$$\bar{W}_f = \frac{1}{N}\sum_{i=1}^{N}\epsilon_{f,i}(r)$$
$$\epsilon_{f,i}(r) = E\{|\tilde{p}_f(r)\,\bar{p}_f^{-1} - p_{f,i}\,\bar{p}_f^{-1}|^2\} \qquad (7.5)$$

In (7.5), $\epsilon_{f,i}(r)$ represents the equalization error in the $r$-neighborhood of the equalized location $i$ having response $p_{f,i}$ (the $r$-neighborhood is defined as all points at a distance $r$ from location $i$). The neighboring response, at a distance $r$ from location $i$, is denoted by $\tilde{p}_f(r)$, whereas the spatial average equalization response is denoted by $\bar{p}_f$. Thus, $\tilde{p}_f(r)$ is the response corresponding to the displaced center of head position of the listener (viz., with a displacement of $r$). To get an intermediate equalization error measure, $\epsilon_{f,i}(r)$, the expectation is performed over all neighboring locations at a distance $r$ from the equalized location $i$. Furthermore, the final performance function $\bar{W}_f$ is the average of all the equalization errors, $\epsilon_{f,i}(r)$, in the vicinity of the $N$ equalized locations. In essence, the displacement (distance) $r$ can be interpreted as a "mismatch parameter," because a room response measured at displacement $r$ will differ from the response measured at the nominal location $i$.

For simplicity, in our analysis we assume variations in responses due to displacements (or mismatch) in a horizontal plane (i.e., the $x$-$y$ plane). The analysis presented in this chapter can be extended to include displacements on a spherical surface. Thus, (7.5) can be simplified to yield

$$\epsilon_{f,i}(r) = E\left\{\left|\frac{\tilde{p}_f(r)\,N}{\sum_{j=1}^{N}p_{f,j}} - \frac{p_{f,i}\,N}{\sum_{j=1}^{N}p_{f,j}}\right|^2\right\} \qquad (7.6)$$

An approximate simplification for (7.5) can be done by using the Taylor series expansion [101]. Accordingly, if $g$ is a function of random variables $x_i$ with average values $E\{x_i\} = \bar{x}_i$, then $g(x_1, x_2, \ldots, x_n) = g(x)$ can be expressed as $g(x) = g(\bar{x}) + \sum_{i=1}^{n} g_i(\bar{x})(x_i - \bar{x}_i) + g(\hat{x})$, where $g(\hat{x})$ is a function of order 2 (i.e., all its partial derivatives up to the first order vanish at $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$). Thus, to a zeroth order of approximation, $E\{g(x)\} \approx g(\bar{x})$.

Hence, an approximation for (7.6) is given as

$$\epsilon_{f,i}(r) \approx \frac{N^2\, E\{\tilde{p}_f(r)\tilde{p}_f(r)^* - \tilde{p}_f(r)p^*_{f,i} - \tilde{p}_f(r)^*p_{f,i} + p_{f,i}p^*_{f,i}\}}{\sum_j\sum_k E\{p_{f,j}\,p^*_{f,k}\}} \qquad (7.7)$$

We use the following identities for determining the denominator of (7.7):

$$E\{p_{f,j}\,p^*_{f,k}\} = E\{p_{f,d,j}\,p^*_{f,d,k} + p_{f,rev,j}\,p^*_{f,rev,k}\} \qquad (7.8)$$
$$|kc\rho S_f|^2 = 4\pi\Pi_a c\rho \qquad (7.9)$$
$$E\{p_{f,d,j}\,p^*_{f,d,k}\} = \frac{\Pi_a c\rho}{4\pi R_jR_k}\, e^{jk(R_j-R_k)} \qquad (7.10)$$
$$E\{p_{f,rev,j}\,p^*_{f,rev,k}\} = \frac{4c\rho\Pi_a(1-\bar{\alpha})}{S\bar{\alpha}}\,\frac{\sin kR_{jk}}{kR_{jk}} \qquad (7.11)$$
$$R_{jk} = \sqrt{R_j^2 + R_k^2 - 2R_jR_k\cos\theta_{jk}} \qquad (7.12)$$

In summary, (7.8) is obtained by using (7.1) and the fact that the reverberant and direct field components of the sound pressure are uncorrelated; (7.9) is derived in [12, p. 311]; (7.10) is determined by using (7.2) and (7.9); and (7.11) is determined from (7.3) and (7.4). In (7.12), which is the cosine law, $\theta_{jk}$ is the angle, subtended at the source at $i_0$, between locations $j$ and $k$.

Thus, the denominator term in (7.7) is

$$\sum_j\sum_k E\{p_{f,j}\,p^*_{f,k}\} = \sum_j\sum_k\left[\frac{\Pi_a c\rho}{4\pi R_jR_k}\, e^{jk(R_j-R_k)} + \frac{4c\rho\Pi_a(1-\bar{\alpha})}{S\bar{\alpha}}\,\frac{\sin kR_{jk}}{kR_{jk}}\right] \qquad (7.13)$$

Now, the first numerator term in (7.7) is

$$E\{\tilde{p}_f(r)\tilde{p}_f(r)^*\} = E\{\tilde{p}_{f,d}(r)\tilde{p}_{f,d}(r)^* + \tilde{p}_{f,rev}(r)\tilde{p}_{f,rev}(r)^*\}$$
$$E\{\tilde{p}_{f,d}\,\tilde{p}^*_{f,d}\} = |k\rho cS_f|^2\, E\{g_f(\tilde{i}|i_0)\,g^*_f(\tilde{i}|i_0)\} = |k\rho cS_f|^2\, E\left\{\frac{1}{(4\pi)^2|\tilde{R}|^2}\right\} \qquad (7.14)$$

where $\tilde{R}$ is the distance from the source at $i_0$ to a point in the $r$-neighborhood of the equalized location $i$, determined by the cosine law (viz., $\tilde{R} = \sqrt{R_i^2 + r^2 - 2R_ir\cos\theta_i}$, where $\theta_i$ is the angle subtended at the source between location $i$ and the location in the $r$-neighborhood of location $i$). The result from applying the expectation can be found by averaging over all locations on a circle in the $r$-neighborhood of location $i$ (because for simplicity we have assumed mismatch in the horizontal, or $x$-$y$, plane). Thus,

$$E\left\{\frac{1}{|4\pi\tilde{R}|^2}\right\} = \frac{1}{2}\,\frac{1}{(4\pi)^2}\int_{-1}^{1}\frac{d(\cos\theta_i)}{R_i^2 + r^2 - 2R_ir\cos\theta_i} \qquad (7.15)$$
Simplifying (7.15) and substituting the result in (7.14) gives

$$E\{\tilde{p}_{f,d}(r)\tilde{p}_{f,d}(r)^*\} = \frac{|k\rho cS_f|^2}{2(4\pi)^2R_ir}\log\left|\frac{R_i+r}{R_i-r}\right| = \frac{\Pi_a\rho c}{8R_ir\pi}\log\left|\frac{R_i+r}{R_i-r}\right| \qquad (7.16)$$
$$E\{\tilde{p}_{f,rev}(r)\tilde{p}_{f,rev}(r)^*\} = \frac{4c\rho\Pi_a(1-\bar{\alpha})}{S\bar{\alpha}} \qquad (7.17)$$

The result in (7.16) is obtained by using (7.9), whereas (7.17) is a restatement of (7.4). Thus,

$$E\{\tilde{p}_f(r)\tilde{p}_f(r)^*\} = \frac{\Pi_a\rho c}{8R_ir\pi}\log\left|\frac{R_i+r}{R_i-r}\right| + \frac{4c\rho\Pi_a(1-\bar{\alpha})}{S\bar{\alpha}} \qquad (7.18)$$
The correlation in the direct-field component for the second term in the numerator of (7.7), $E\{\tilde{p}_{f,d}(r)\,p^*_{f,d,i}\}$, is

$$|k\rho cS_f|^2\,\frac{1}{2(4\pi)^2}\int_{-1}^{1}\frac{e^{jk\left(\sqrt{R_i^2+r^2-2R_ir\cos\theta_i}\,-\,R_i\right)}}{R_i\sqrt{R_i^2+r^2-2R_ir\cos\theta_i}}\,d\cos\theta_i \;\approx\; \frac{\Pi_a\rho c}{4\pi R_i^2}\,\frac{\sin kr}{kr} \qquad (7.19)$$

The reverberant field correlation for the second term in the numerator of (7.7) can be found using (7.3), and is

$$E\{\tilde{p}_{f,rev}(r)\,p^*_{f,rev,i}\} = \frac{4c\rho\Pi_a(1-\bar{\alpha})}{S\bar{\alpha}}\,\frac{\sin kr}{kr} \qquad (7.20)$$
The third numerator term in (7.7) can be found in a manner similar to the derivations of (7.19) and (7.20).

The last term in the numerator of (7.7) is computed to yield

$$E\{p_{f,i}\,p^*_{f,i}\} = \frac{\Pi_a\rho c}{4\pi R_i^2} + \frac{4\rho c\Pi_a(1-\bar{\alpha})}{S\bar{\alpha}} \qquad (7.21)$$

Equation (7.21) can be obtained by substituting $j = k = i$ in (7.10) and (7.11), respectively. Substituting the computed results into (7.7), and simplifying by canceling common terms in the numerator and the denominator, the resulting equalization error due to displacements (viz., mismatch in responses) is

$$\epsilon_{f,i}(r) \approx \frac{N^2}{\psi_1}\left[\frac{1}{8R_ir\pi}\log\left|\frac{R_i+r}{R_i-r}\right| + 2\psi_2 + \frac{1}{2\psi_3} - \left(\frac{1}{\psi_3} + 2\psi_2\right)\frac{\sin kr}{kr}\right] \qquad (7.22)$$

$$\psi_1 = \sum_j\sum_l\left[\frac{1}{4\pi R_jR_l}\,e^{jk(R_j-R_l)} + \psi_2\,\frac{\sin kR_{jl}}{kR_{jl}}\right]$$
$$\psi_2 = \frac{4(1-\bar{\alpha})}{S\bar{\alpha}}, \qquad \psi_3 = 2\pi R_i^2, \qquad R_{jl} = \sqrt{R_j^2 + R_l^2 - 2R_jR_l\cos\theta_{jl}}$$

Finally, substituting (7.22) into (7.5) yields the necessary expression for $\bar{W}_f$.

7.3.2 Analysis of Equalization Error


In this section, we present an analysis of the behavior of the equalization error at each listener. This analysis helps in understanding, theoretically, the degradation (from a "steady-state" perspective) of the equalization performance at different listener positions and at different frequencies.

Throughout the analysis we assume that $r < R_i$; that is, the mismatch between the microphone position and the center of the listener's head is small relative to the distance between the microphone and the source. Thus, in (7.22), $\log|(R_i+r)/(R_i-r)| \to 0$. Now, for $r/\lambda > 1$, the equalization error (7.22) converges to a steady-state value, $\epsilon^{ss}_{f,i}(r)$:

$$\epsilon^{ss}_{f,i}(r) \approx \frac{N^2}{\psi_1}\left(2\psi_2 + \frac{1}{2\psi_3}\right) = k_1\left(k_2 + \frac{1}{4\pi R_i^2}\right) \propto \frac{1}{R_i^2} \qquad (7.23)$$

because $\sin kr/kr \to 0$. This implies that listeners at larger distances from the source will have lower steady-state equalization errors than listeners closer to the source, for a given wavelength of sound. Primarily, the inverse relationship between $\epsilon^{ss}_{f,i}(r)$ and $R_i$ at steady state in (7.23) is due to the direct path sound field correlations (viz., the $1/2\psi_3$ term obtained from (7.21)) at position $i$.

7.4 Results
We simulated Eq. (7.22) for frequencies above the Schroeder frequency $f_s$ = 77 Hz (i.e., $T_{60}$ = 0.7 s, $V$ = 8 m × 8 m × 8 m).
In this setup, we simulated a rectangular arrangement of six microphones, with
a source in the front of the arrangement. Specifically, microphones 1 and 3 were at a
distance of 3 m from the source, microphone 2 was at 2.121 m, microphones 4 and
6 were at 4.743 m, and microphone 5 was at 4.242 m. The angles θ1k in (7.12) were
(45, 90, 18.5, 45, 71.62) degrees for (k = 2, . . . , 6), respectively. Thus, the distances
of the listeners from the source are such that R6 = R4 > R5 > R1 = R3 > R2 .
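Equation (7.22) can be evaluated directly for this geometry. The sketch below reconstructs six listener coordinates consistent with the quoted distances and angles (the exact layout is an inference), and assumes an average absorption coefficient $\bar{\alpha} = 0.3$ and surface area $S$ = 384 m² for the 8 m room; these two values are estimated via Sabine's formula rather than stated in the text.

```python
import numpy as np

c = 343.0
a = 3.0 / np.sqrt(2)                        # 2.121 m grid spacing; reproduces the
pos = np.array([[-a, a], [0.0, a], [a, a],  # quoted listener distances and angles
                [-a, 2 * a], [0.0, 2 * a], [a, 2 * a]])
R = np.linalg.norm(pos, axis=1)             # 3, 2.121, 3, 4.743, 4.242, 4.743 m

def eps_fi(f, r, i, alpha=0.3, S=384.0):
    """Equalization error of Eq. (7.22) at listener i for displacement r > 0."""
    k = 2 * np.pi * f / c
    psi2 = 4 * (1 - alpha) / (S * alpha)
    psi3 = 2 * np.pi * R[i] ** 2
    # psi1: double sum over listener pairs of direct + reverberant correlations
    Rjl = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    sinc = np.sinc(k * Rjl / np.pi)         # sin(kR)/(kR); equals 1 on the diagonal
    direct = np.exp(1j * k * (R[:, None] - R[None, :])) / (4 * np.pi * np.outer(R, R))
    psi1 = np.sum(direct + psi2 * sinc).real
    log_term = np.log((R[i] + r) / (R[i] - r)) / (8 * np.pi * R[i] * r)
    return (len(pos) ** 2 / psi1) * (log_term + 2 * psi2 + 1 / (2 * psi3)
                                     - (1 / psi3 + 2 * psi2) * np.sinc(k * r / np.pi))

# Steady state (r/lambda > 1): the closer listener 2 shows the larger error,
# matching the 1/R_i^2 dependence of Eq. (7.23).
print(eps_fi(1000.0, 0.7, 1) > eps_fi(1000.0, 0.7, 0))  # True
```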
The equalization error, $\epsilon_{f,i}(r)$, results are depicted for different listeners in Figs. 7.5 to 7.8 for four frequencies ($f$ = 500 Hz, 1 kHz, 5 kHz, and 10 kHz) as a function of $r/\lambda$, where the mismatch parameter is $0 \le r \le 0.7$ m ($r/\lambda = 0$ corresponds to the no-mismatch condition). Specifically, only the results for listeners 1 and 2
are shown in the top panels because listener 3 is symmetric relative to source/listener
2 (hence the results of listener 1 and 3 are identical). Similarly, only the results for
listeners 4 and 5 are shown in the bottom panels.
We observe the following.
1. It can be seen that the steady-state equalization error at listener 2 is higher than
that at listener 1 (top panel). This follows from Eq. (7.23) (because R1 = R3 > R2 ).
Similar results can be predicted for the equalization errors for listeners 4 and 5 (this
is not immediately obvious in the bottom panels, because R4 is close to R5 ).

Fig. 7.5. $\epsilon_{f,i}(r)$ for the listeners at different distances from the source; $r/\lambda = 0$ corresponds to the optimal position, $f$ = 500 Hz.

Fig. 7.6. $\epsilon_{f,i}(r)$ for the listeners at different distances from the source, $f$ = 1 kHz.

2. Furthermore, the nonsteady-state equalization region, for a given equalization error, is larger (better) for listeners farther from the source. For example, the equalization region is a circle of radius $0.025\lambda$ for listener 2, whereas it is $0.04\lambda$ for listener 1, at $\epsilon_{f,i}(r) = -10$ dB and $f$ = 500 Hz. This effect is dominant at lower frequencies but not easily noticeable at higher frequencies (as can be seen from the initial rise of the error towards a peak value before reaching a steady-state value).

Fig. 7.7. $\epsilon_{f,i}(r)$ for the listeners at different distances from the source, $f$ = 5 kHz.

Fig. 7.8. $\epsilon_{f,i}(r)$ for the listeners at different distances from the source, $f$ = 10 kHz.
3. The equalization error shows a $\mathrm{sinc}(2r/\lambda)$ dependence after the initial peak (as
emphasized in Fig. 7.6). This dependence arises from the finite correlation of the
reverberant field before it reaches a negligible value at steady-state.
Finally, Fig. 7.9 summarizes the average equalization error (i.e., $\bar{W}_f$ of (7.5)), over all listeners, for frequencies beyond $f_s$ and for the mismatch parameter $r$ ranging from 0 m to 0.7 m. This composite measure weighs the equalization error at all positions equally, and shows that the performance degrades at all frequencies with increasing mismatch or displacement. Also, the degradation for a small displacement $r$ (of the order of 0.1 m) is larger at higher frequencies. For example, the slope of the $\bar{W}_f$ curves in the frequency region around 200 Hz is lower than the slopes of the curves around 10 kHz.
Alternate measures with nonuniform weighting, depending on the “importance” of
a listening position, may be used instead. Thus, such a measure could potentially be
used to give an overall picture during comparisons to other approaches of multiple
listener equalization.

7.5 Summary
In this chapter, we analyzed the performance of spatial average equalization, in a multiple listener environment, used during sound playback. As is well known, room equalization at multiple positions allows for high-quality sound playback in

Fig. 7.9. $\bar{W}_f$ for various mismatch parameters and frequencies between 20 Hz and 20 kHz.

the room. However, as is typically the case in room equalization, the microphone po-
sitions during measurement of the room response will not necessarily correspond to
the center of head of the listener leading to a frequency-dependent degradation due
to mismatch between the measured response and the actual response corresponding
to the center of listener head during playback. Several interesting observations can
be made from the results, including: (i) the influence of frequency and distance on
the size of equalization region, (ii) the steady-state equalization error being depen-
dent on the distance of the listener from the source, and (iii) the dependence of the
reverberant field correlation on the equalization error. Future goals can be directed
to using the proposed method for comparing different multiple listener equalization
techniques in terms of their robustness to response mismatch.
8 Robustness of Equalization to Displacement Effects: Part II

In a multiple listener environment, equalization may be performed through magnitude response spatial averaging at expected listener positions. However, the performance of averaging-based equalization, at the listeners, will be affected when
there is a mismatch between microphone and listener positions. In this chapter, we
present a modal analysis approach, targeted at low frequencies, to map mismatch to
an equalization performance metric. Specifically, a closed-form expression is pro-
vided that predicts the equalization performance in the presence of mismatch. The
results, which are particularly valid at lower frequencies where standing wave modes
of the room are dominant, indicate that magnitude average equalization performance
depends on (i) the amount of displacement/mismatch, and (ii) the frequency compo-
nent in the modal response. We have provided validation of the theoretical results,
thereby indicating the usefulness of the proposed analytic approach for measuring
equalization performance due to mismatch effects. We also demonstrate the impor-
tance of average equalization over single listener equalization when considering mis-
match/displacement effects.

8.1 Introduction
In this chapter, we propose a statistical approach, using modal equations, for evaluating the robustness of magnitude response average equalization to variations in room responses (arising from microphone/listener position mismatch or from listener movement). Specific results are obtained for a particular listener arrangement in a rectangular room. Modal equations have been used in
the analysis because they accurately model the magnitude response at low frequen-
cies. As is well known, dominant modes, in this low-frequency region, are relatively
harder to equalize than at higher frequencies. In the next section, we introduce the
necessary background used in the development of the proposed robustness analysis.
The subsequent section is devoted to the development of the robustness analysis for
spatial average-based equalization. Results based on simulations for a typical rectan-
gular listener arrangement relative to a fixed source and validation of the theoretical
analysis are presented.

8.2 Modal Equations for Room Acoustics


The Green's function derived from the wave theory of sound fields in an enclosure is given by [11, 12] as

$$p_\omega(q_l) = jQ\omega\rho_0\sum_n\frac{p_n(q_l)\,p_n(q_o)}{K_n(k^2-k_n^2)} = jQ\omega\rho_0\sum_{n_x=0}^{N_x-1}\sum_{n_y=0}^{N_y-1}\sum_{n_z=0}^{N_z-1}\frac{p_n(q_l)\,p_n(q_o)}{K_n(k^2-k_n^2)}$$
$$n = (n_x, n_y, n_z); \qquad k = \omega/345; \qquad q_l = (x_l, y_l, z_l)$$
$$k_n = \pi\left[\left(\frac{n_x}{L_x}\right)^2 + \left(\frac{n_y}{L_y}\right)^2 + \left(\frac{n_z}{L_z}\right)^2\right]^{1/2}$$
$$\int_V p_n(q_l)\,p_m(q_l)\,dV = \begin{cases}K_n & n = m\\ 0 & n \neq m\end{cases} \qquad (8.1)$$

where the eigenfunctions $p_n(q_l)$ can be assumed to be orthogonal to each other under certain conditions, with the point source at $q_o$. The modal equations in (8.1) are valid for wavelengths $\lambda$ where $\lambda > (1/3)\min[L_x, L_y, L_z]$ [12]. At these low frequencies, only a few standing waves are excited, so that the series terms in (8.1) converge quickly.
For a rectangular enclosure with dimensions $(L_x, L_y, L_z)$ and $q_o = (0, 0, 0)$, the eigenfunctions $p_n(q_l)$ and eigenvalues $K_n$ in (8.1) are

$$p_n(q_l) = \cos\left(\frac{n_x\pi x_l}{L_x}\right)\cos\left(\frac{n_y\pi y_l}{L_y}\right)\cos\left(\frac{n_z\pi z_l}{L_z}\right)$$
$$p_n(q_o) = 1$$
$$K_n = \int_0^{L_x}\cos^2\left(\frac{n_x\pi x_l}{L_x}\right)dx\int_0^{L_y}\cos^2\left(\frac{n_y\pi y_l}{L_y}\right)dy\int_0^{L_z}\cos^2\left(\frac{n_z\pi z_l}{L_z}\right)dz = \frac{L_xL_yL_z}{8} = \frac{V}{8} \qquad (8.2)$$

The eigenfunction distribution in the $z = 0$ plane, for a room of dimensions 6 m × 6 m × 6 m and tangential mode $(n_x, n_y, n_z) = (3, 2, 0)$, is shown in Fig. 8.1. The large deviation in the eigenfunction distribution, for different modes in the room, necessitates a multiple listener room equalization method.
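The mode shape of (8.2) behind Fig. 8.1 can be reproduced in a few lines; the grid resolution below is an arbitrary choice, while the room dimensions and mode indices are those of the figure.

```python
import numpy as np

Lx = Ly = 6.0                 # room of Fig. 8.1 (6 m per side)
nx, ny = 3, 2                 # tangential mode (3, 2, 0); the z factor is cos(0) = 1

x = np.linspace(0.0, Lx, 121)
y = np.linspace(0.0, Ly, 121)
X, Y = np.meshgrid(x, y, indexing="ij")

# Eigenfunction of Eq. (8.2) sampled in the z = 0 plane.
p = np.cos(nx * np.pi * X / Lx) * np.cos(ny * np.pi * Y / Ly)

# Pressure extremes of +/-1 (maxima at the walls) and large variation
# across the plane -- the spread that motivates multiposition equalization.
print(p[0, 0], p.min(), p.max())  # 1.0 -1.0 1.0
```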

8.3 Mismatch Analysis with Spatial Average Equalization


8.3.1 Spatial Averaging for Multiple Listener Equalization

The magnitude response averaging method is popular for performing spatial equalization over a wide area in a room. The magnitude response spatial averaging process can be expressed in terms of the modal equations (8.1) as

$$p_{\omega,avg} = \frac{1}{N}\sum_{l=1}^{N}|p_\omega(q_l)| \qquad (8.3)$$

where $N$ is the number of positions that are to be equalized in the room.

Fig. 8.1. The eigenfunction distribution, for a tangential mode (3,2,0) in the $z = 0$ plane, over a room of dimensions 6 m × 6 m × 6 m.

The spatial equalizer, $p_{\omega,avg}^{-1}$, filters the audio signal before it is transmitted by a loudspeaker into the room. A block diagram of the multiple listener equalization process, using averaging, is shown in Fig. 8.2.

8.3.2 Equalization Performance Due to Mismatch


An intermediate performance function, $W_\omega^{(i)}(\epsilon)$, that is used for analyzing the robustness of spatial average equalization to room response variations is given as

$$W_\omega^{(i)}(\epsilon) = E\{|p_\omega(\nu_\epsilon^{(i)})\,p_{\omega,avg}^{-1} - p_\omega(q_i)\,p_{\omega,avg}^{-1}|^2\} \qquad (8.4)$$

where $p_\omega(\nu_\epsilon^{(i)})$ is the pressure at a location $\nu_\epsilon$ in the neighborhood of the equalized position $i$ having pressure $p_\omega(q_i)$ (the $\epsilon$-neighborhood is defined as all points at a distance $\epsilon$ from location $i$), $E\{\cdot\}$ denotes the statistical expectation operator, and $\omega = 2\pi c/\lambda$ (where $c$ = 345 m/s).

The intermediate performance measure in (8.4) is defined in such a manner that when the displacement $\epsilon$ about position $i$ (whose response $p_\omega(q_i)$ is originally used for determining the spatially averaged equalization filter $p_{\omega,avg}^{-1}$) is zero, then $W_\omega^{(i)}(\epsilon) = 0$. Thus, the performance measure is computed as an average square error between the response at the equalized location and the response at a displaced location a distance $\epsilon$ from the equalized location.

Finally, using the intermediate performance function, a generalized average performance measure is expressed as $W_\omega(\epsilon) = (1/N)\sum_{i=1}^{N}W_\omega^{(i)}(\epsilon)$.
Fig. 8.2. Spatial average equalization for $N$ listeners in a room.

For simplicity, we assume variations in responses due to displacements (or mismatch) only in the horizontal ($x$-$y$) plane. The analysis can easily be extended to include mismatch in three dimensions. Thus, simplification of (8.4) leads to

$$W_\omega^{(i)}(\epsilon) = \frac{N^2}{\left(\sum_{l=1}^{N}|p_\omega(q_l)|\right)^2}\Big[\underbrace{E\{p_\omega(\nu_\epsilon^{(i)})\,p_\omega^*(\nu_\epsilon^{(i)})\}}_{\mathrm{I}} - \underbrace{E\{p_\omega^*(\nu_\epsilon^{(i)})\}\,p_\omega(q_i)}_{\mathrm{II}} - \underbrace{E\{p_\omega(\nu_\epsilon^{(i)})\}\,p_\omega^*(q_i)}_{\mathrm{III}} + \underbrace{|p_\omega(q_i)|^2}_{\mathrm{IV}}\Big] \qquad (8.5)$$

We only need to compute the statistics associated with Terms (I), (II), and (III) (the terms within the expectations) in (8.5), because Term (IV) is a deterministic quantity.

Now, $E\{p_\omega(\nu_\epsilon^{(i)})\,p_\omega^*(\nu_\epsilon^{(i)})\}$ is the average over all locations along a circle of radius $\epsilon$ from the $i$th listener location. Assuming the source, all listeners, and each of the listener displacements are in the same $z$-plane (viz., $z = 0$), then (I) in (8.5) can be simplified as


8.3 Mismatch Analysis with Spatial Average Equalization 175
   
$$p_\omega(\nu^{(i)}) = \frac{j8Q\omega\rho_0}{V}\sum_n \frac{\cos\!\left(\frac{n_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{n_y\pi\phi_y^{(i)}}{L_y}\right)}{k^2 - k_n^2} \qquad (8.6)$$

$$E\{p_\omega(\nu^{(i)})p_\omega^*(\nu^{(i)})\} = \sum_{n,m}|\psi_1|^2\,(1/\psi_2)\,\psi_3 \qquad (8.7)$$

$$\psi_1 = \frac{8Q\omega\rho_0}{V}, \qquad \psi_2 = (k^2 - k_n^2)(k^2 - k_m^2)$$

$$\psi_3 = E\left\{\cos\!\left(\frac{n_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{n_y\pi\phi_y^{(i)}}{L_y}\right)\cos\!\left(\frac{m_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{m_y\pi\phi_y^{(i)}}{L_y}\right)\right\} \qquad (8.8)$$

Now, with $\phi_x^{(i)} = x_i + \epsilon\cos\theta$ and $\phi_y^{(i)} = y_i + \epsilon\sin\theta$,

$$E\left\{\cos\!\left(\frac{n_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{n_y\pi\phi_y^{(i)}}{L_y}\right)\cos\!\left(\frac{m_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{m_y\pi\phi_y^{(i)}}{L_y}\right)\right\}$$
$$= \frac{1}{2\pi}\int_0^{2\pi}\cos\!\left(\frac{n_x\pi(x_i+\epsilon\cos\theta)}{L_x}\right)\cos\!\left(\frac{n_y\pi(y_i+\epsilon\sin\theta)}{L_y}\right)\cos\!\left(\frac{m_x\pi(x_i+\epsilon\cos\theta)}{L_x}\right)\cos\!\left(\frac{m_y\pi(y_i+\epsilon\sin\theta)}{L_y}\right)d\theta. \qquad (8.9)$$
Equation (8.9) can be solved numerically, for example with the MATLAB trapz function. However, we found an approximate closed-form expression to be computationally much faster. The following expressions were derived from standard trigonometric formulae, using the first two terms in the polynomial expansion of the cosine function and the first term in the polynomial expansion of the sine function, because $(\epsilon/L_x, \epsilon/L_y, \epsilon/L_z) \ll 1$. Thus,
 
$$E\left\{\cos\!\left(\frac{n_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{n_y\pi\phi_y^{(i)}}{L_y}\right)\cos\!\left(\frac{m_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{m_y\pi\phi_y^{(i)}}{L_y}\right)\right\} = \frac{1}{2\pi}(A + B + C) \qquad (8.10)$$

where

$$A = \pi\cos\!\left(\frac{n_x\pi x_i}{L_x}\right)\cos\!\left(\frac{n_y\pi y_i}{L_y}\right)\cos\!\left(\frac{m_x\pi x_i}{L_x}\right)\cos\!\left(\frac{m_y\pi y_i}{L_y}\right)\Big[2 - \epsilon_y^2 v_y - \epsilon_x^2 v_x + \tfrac{3}{4}\big(\epsilon_x^4 u_x^2 + \epsilon_y^4 u_y^2\big) - \tfrac{1}{8}\big(\epsilon_x^2\epsilon_y^4 u_y^2 v_x + \epsilon_x^4\epsilon_y^2 u_x^2 v_y\big) + \tfrac{1}{4}\epsilon_x^2\epsilon_y^2 v_x v_y + \tfrac{3}{64}\epsilon_x^4\epsilon_y^4 u_x^2 u_y^2\Big]$$

$$B = \pi\,\epsilon_y^2 u_y\cos\!\left(\frac{n_x\pi x_i}{L_x}\right)\cos\!\left(\frac{m_x\pi x_i}{L_x}\right)\sin\!\left(\frac{n_y\pi y_i}{L_y}\right)\sin\!\left(\frac{m_y\pi y_i}{L_y}\right)\Big[2 - 0.5\,\epsilon_x^2\big(m_x^2 + n_x^2\big) - 0.5\,\epsilon_x^4 u_x^2\Big]$$

$$C = \pi\,\epsilon_y^2\epsilon_x^2 u_x u_y\sin\!\left(\frac{n_x\pi x_i}{L_x}\right)\sin\!\left(\frac{n_y\pi y_i}{L_y}\right)\sin\!\left(\frac{m_x\pi x_i}{L_x}\right)\sin\!\left(\frac{m_y\pi y_i}{L_y}\right) + \pi\,\epsilon_x^2 u_x\sin\!\left(\frac{n_x\pi x_i}{L_x}\right)\sin\!\left(\frac{m_x\pi x_i}{L_x}\right)\cos\!\left(\frac{m_y\pi y_i}{L_y}\right)\cos\!\left(\frac{n_y\pi y_i}{L_y}\right)\Big[2 - 0.5\,\epsilon_y^2\big(n_y^2 + m_y^2\big) - 0.5\,\epsilon_y^4 u_y^2\Big] \qquad (8.11)$$

where

$$\epsilon_x = \frac{\pi\epsilon}{\sqrt{2}\,L_x}, \quad \epsilon_y = \frac{\pi\epsilon}{\sqrt{2}\,L_y}, \quad u_x = n_x m_x, \quad u_y = n_y m_y, \quad v_x = m_x^2 + n_x^2, \quad v_y = m_y^2 + n_y^2 \qquad (8.12)$$

Thus (8.10) can be substituted in (8.7), and subsequently in (8.5), to determine Term I.
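The circular average in (8.9) can be evaluated numerically with the trapezoid rule, as the text notes for MATLAB's trapz; the following is a NumPy sketch along those lines (the mode indices, coordinates, and room dimensions are illustrative assumptions, not the book's code):

```python
import numpy as np

def theta_average(nx, ny, mx, my, xi, yi, eps, Lx, Ly, K=4096):
    """Trapezoid-rule evaluation of the circular average in Eq. (8.9):
    the expectation of the four-cosine product over a circle of radius eps."""
    theta = np.linspace(0.0, 2.0 * np.pi, K + 1)
    phx = xi + eps * np.cos(theta)   # displaced x coordinate, phi_x
    phy = yi + eps * np.sin(theta)   # displaced y coordinate, phi_y
    f = (np.cos(nx * np.pi * phx / Lx) * np.cos(ny * np.pi * phy / Ly)
         * np.cos(mx * np.pi * phx / Lx) * np.cos(my * np.pi * phy / Ly))
    # Composite trapezoid rule (the quadrature MATLAB's trapz performs);
    # dividing by 2*pi turns the integral into an average.
    dtheta = theta[1] - theta[0]
    integral = dtheta * (0.5 * f[0] + f[1:-1].sum() + 0.5 * f[-1])
    return integral / (2.0 * np.pi)

# With zero displacement the average must reduce to the plain product of
# cosines evaluated at (xi, yi).
val0 = theta_average(2, 1, 1, 3, xi=1.0, yi=2.0, eps=0.0, Lx=6.0, Ly=6.0)
```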
Now Terms (II) and (III) in (8.5) can be combined to give

$$-E\{p_\omega^*(\nu^{(i)})\}\,p_\omega(q_i) - E\{p_\omega(\nu^{(i)})\}\,p_\omega^*(q_i) = -2\left|\frac{8Q\omega\rho_0}{V}\right|^2\sum_{m,n}\frac{p_m(q_i)\,E\left\{\cos\!\left(\frac{n_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{n_y\pi\phi_y^{(i)}}{L_y}\right)\right\}}{(k^2 - k_m^2)(k^2 - k_n^2)} \qquad (8.13)$$

Again using $\phi_x^{(i)} = x_i + \epsilon\cos\theta$ and $\phi_y^{(i)} = y_i + \epsilon\sin\theta$, we have

$$E\left\{\cos\!\left(\frac{n_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{n_y\pi\phi_y^{(i)}}{L_y}\right)\right\} = \frac{1}{2\pi}\int_0^{2\pi}\cos\!\left(\frac{n_x\pi(x_i+\epsilon\cos\theta)}{L_x}\right)\cos\!\left(\frac{n_y\pi(y_i+\epsilon\sin\theta)}{L_y}\right)d\theta \qquad (8.14)$$

Thus, upon again using the fact that $(\epsilon/L_x, \epsilon/L_y, \epsilon/L_z) \ll 1$, we can solve (8.14) as

$$E\left\{\cos\!\left(\frac{n_x\pi\phi_x^{(i)}}{L_x}\right)\cos\!\left(\frac{n_y\pi\phi_y^{(i)}}{L_y}\right)\right\} = \frac{1}{2\pi}\cos\!\left(\frac{n_x\pi x_i}{L_x}\right)\cos\!\left(\frac{n_y\pi y_i}{L_y}\right)\left[2\pi - \pi\left(\epsilon_x^2 n_x^2 + \epsilon_y^2 n_y^2\right) + \frac{\pi}{4}\,\epsilon_y^2\epsilon_x^2 n_x^2 n_y^2\right] \qquad (8.15)$$

Substituting (8.15) in (8.13) and subsequently into (8.5) gives Terms II and III.
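The quality of the small-displacement approximation in (8.15) can be checked against direct numerical integration of (8.14). A NumPy sketch under illustrative values (the particular indices and coordinates are assumptions, not from the text):

```python
import numpy as np

def circular_average(nx, ny, xi, yi, eps, Lx, Ly, K=4096):
    # Direct trapezoid-rule evaluation of Eq. (8.14).
    th = np.linspace(0.0, 2.0 * np.pi, K + 1)
    f = (np.cos(nx * np.pi * (xi + eps * np.cos(th)) / Lx)
         * np.cos(ny * np.pi * (yi + eps * np.sin(th)) / Ly))
    dth = th[1] - th[0]
    return dth * (0.5 * f[0] + f[1:-1].sum() + 0.5 * f[-1]) / (2.0 * np.pi)

def closed_form(nx, ny, xi, yi, eps, Lx, Ly):
    # Eq. (8.15), using eps_x and eps_y as defined in Eq. (8.12).
    ex = np.pi * eps / (np.sqrt(2.0) * Lx)
    ey = np.pi * eps / (np.sqrt(2.0) * Ly)
    bracket = (2.0 * np.pi - np.pi * (ex**2 * nx**2 + ey**2 * ny**2)
               + (np.pi / 4.0) * ex**2 * ey**2 * nx**2 * ny**2)
    return (np.cos(nx * np.pi * xi / Lx) * np.cos(ny * np.pi * yi / Ly)
            * bracket / (2.0 * np.pi))

num = circular_average(2, 3, 1.1, 2.3, eps=0.1, Lx=6.0, Ly=6.0)
approx = closed_form(2, 3, 1.1, 2.3, eps=0.1, Lx=6.0, Ly=6.0)
```

For displacements that are small relative to the room dimensions, the two agree to within the quartic terms dropped in the expansion.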

8.4 Results
8.4.1 Magnitude Response Spatial Averaging

Simulation of Eq. (8.5) was performed for a room of dimensions 6 m × 6 m × 6 m with six positions that were equalized by magnitude response averaging. The six listening position coordinates, relative to the source at (0 m, 0 m, 0 m), correspond to (−2.121 m, −2.121 m, 0 m); (0 m, −2.121 m, 0 m); (2.121 m, −2.121 m, 0 m); (−2.121 m, −4.242 m, 0 m); (0 m, −4.242 m, 0 m); and (2.121 m, −4.242 m, 0 m). Figure 8.3 shows the configuration that was tested for six listener positions (depicted with asterisks) sitting in front of a source (depicted by a circle). Results are obtained for the lower 1/3 octave frequencies because equalization in this region is of greater importance (as large-amplitude standing waves in this region are relatively difficult to correct). The frequencies used for the simulations were (25, 31.5, 40, 50, 63, 80, 125, 160, 200) Hz, where the lower 1/3 octave frequencies were 25 Hz, 31.5 Hz, and 40 Hz; the middle 1/3 octave frequencies were 50 Hz, 63 Hz, and 80 Hz; and the remainder comprised the upper 1/3 octave frequencies for the set of frequencies being considered.
Figure 8.4 shows the magnitude of the sound pressures, from 25 Hz up to 200 Hz at 1/3 octave frequencies, that can be expected for each of the listener positions. It should be noted that even though $f = 200$ Hz $\Rightarrow \lambda = c/f \approx 1.73$ m $< (1/3)(6) = 2$ m, we do not expect erroneous results at this frequency, because the wavelength at this frequency is not significantly offset from the wavelength limit imposed by the condition $\lambda > (1/3)\min[L_x, L_y, L_z]$. As can be seen, some listener positions exhibit

Fig. 8.3. Simulated setup for a six-position displacement effects analysis in a rectangular con-
figuration in front of a source.

Fig. 8.4. Magnitude of the sound pressures, from 25 Hz up to 200 Hz at 1/3 octave frequencies,
that can be expected for each of the listener positions.

the same response, because these positions are symmetrically located relative to the source, thereby making the cosine products in the eigenfunction equation (8.2) equal.
Figure 8.5 shows the spatial average of the magnitude responses, whereas Fig. 8.6 shows the equalized responses at the six positions based on direct frequency-domain inversion. It is clear that these equalized responses will be affected by (i) microphone/listener mismatch (e.g., when a microphone is used for measuring the response at a position, and the actual center of the listener's head is at a different position) and/or (ii) listener head displacement, through the variations in the eigenfunction $p_n(q_l)$.

8.4.2 Computation of the Quantum Numbers

To get fairly accurate results, it is important to determine a reasonable limit for the summation in Eq. (8.1) (viz., $N_x$, $N_y$, and $N_z$) beyond which the addition of further terms will not affect the result generated by (8.1).
For a rectangular room, observe that the amplitude of the numerator in (8.1) is bounded by unity (because, as per (8.2), the eigenfunctions are a product of co-sinusoids). The term $1/(k^2 - k_n^2)$ will show a peak at integer values of the quantum numbers $n_x$, $n_y$, $n_z$ for a given frequency of interest $f = c/\lambda$. Figure 8.7 shows a 3-D plot of the value of $1/(k^2 - k_n^2)$ as a function of $n_x$, $n_y$ for a 1/3 octave frequency of 63 Hz. From the figure it is clear that the summation,

Fig. 8.5. Spatial average of the magnitude responses.

$\sum_{n_x=0}^{N_x-1}\sum_{n_y=0}^{N_y-1} p_n(q_l)/(k^2 - k_n^2)$, will be accurate if the first ten integers of $n_x$, $n_y$

Fig. 8.6. Equalized responses at the six positions.



Fig. 8.7. Effect of quantum numbers on the summation of (8.1).

are used, because $1/(k^2 - k_n^2) \approx 0$ for $n_x > 10$ and $n_y > 10$ at wavelength $\lambda = 5.4$ m.$^1$
For the present, we have confirmed that the first 10 integers for the quantum number tuple, $n_x$, $n_y$, $n_z$, are sufficient for all the lower 1/3 octave frequencies under consideration to accurately model $\sum_{n_x}\sum_{n_y}\sum_{n_z} p_n(q_l)/(k^2 - k_n^2)$.
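This truncation argument is easy to check numerically; a small sketch using the chapter's values ($c = 345$ m/s, a 6 m room, the 63 Hz band), with the $z$ quantum number omitted as in the horizontal-plane analysis:

```python
import numpy as np

c, f = 345.0, 63.0          # chapter's speed of sound and a 1/3 octave band
Lx = Ly = 6.0               # room dimensions from Sect. 8.4.1
k = 2.0 * np.pi * f / c     # acoustic wavenumber at 63 Hz

def modal_weight(nx, ny):
    """Modal weighting 1/(k^2 - k_n^2) for a rigid-wall rectangular room
    (z quantum number omitted, matching the horizontal-plane analysis)."""
    kn2 = (nx * np.pi / Lx) ** 2 + (ny * np.pi / Ly) ** 2
    return 1.0 / (k ** 2 - kn2)

w_near = abs(modal_weight(2, 2))     # near the dominant-mode region
w_trunc = abs(modal_weight(10, 10))  # at the truncation limit Nx = Ny = 10
```

At the truncation limit the weighting has decayed by roughly two orders of magnitude relative to the low-order modes, consistent with Fig. 8.7.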

8.4.3 Theoretical Results


The theoretical results obtained for $W_\omega(\epsilon) = (1/6)\sum_{i=1}^{6}W_\omega^{(i)}(\epsilon)$ using the analysis presented are shown in Figs. 8.8 to 8.10.
Figures 8.8 to 8.10 reveal the effect of displacement or mismatch on the performance of averaging equalizers for the present listener setup and room dimensions. In every situation, as expected, the performance starts to degrade as the displacement increases. Furthermore, the degradation, as a function of displacement, is smallest for the lower 1/3 octave frequencies and largest for the upper 1/3 octave frequencies. Moreover, the degradation rate (i.e., the amount of displacement that causes a 10 dB degradation in $W_\omega(\epsilon)$) is roughly similar for all frequencies.

$^1$ Even though the plot shows significant attenuation of the $1/(k^2 - k_n^2)$ term at $n_x > 6$ and $n_y > 6$, we have chosen the limits $N_x$ and $N_y$ liberally, to account for the negligible, but nonzero, additional terms.

Fig. 8.8. $W_\omega(\epsilon)$ versus displacement for lower 1/3 octave frequencies.

8.4.4 Validation

This section validates the results obtained using the closed-form expressions from the previous section. For this we did the following, for a given 1/3 octave frequency.

Fig. 8.9. $W_\omega(\epsilon)$ versus displacement for middle 1/3 octave frequencies.



Fig. 8.10. $W_\omega(\epsilon)$ versus displacement for upper 1/3 octave frequencies.

1. Determine the sound pressures, $p_\omega(q_i)$, using (8.1) for the six positions ($i = 1, 2, \ldots, 6$).
2. Determine the average of the sound pressure magnitudes at the six positions.
3. Determine the inverse of the average, $p_{\omega,\mathrm{avg}}^{-1}$.
A. For each position $i$ ($i = 1, 2, \ldots, 6$):
A.i. Compute $-p_\omega(q_i)p_{\omega,\mathrm{avg}}^{-1}$.
A.ii. Generate 250 positions on a circle at a displacement of $\epsilon$, with the center being the listener position $i$.
A.iii. For the given displacement $\epsilon$, determine the sound pressures at each of the 250 positions, using (8.1), for the position $i$. The sound pressure at each of the 250 displaced positions can be expressed as $p_\omega(\nu^{(i)})$ (see Eq. (8.3)).
A.iv. For each of the 250 displaced positions compute $|p_\omega(\nu^{(i)})p_{\omega,\mathrm{avg}}^{-1} - p_\omega(q_i)p_{\omega,\mathrm{avg}}^{-1}|^2$.
A.v. Compute the average of $|p_\omega(\nu^{(i)})p_{\omega,\mathrm{avg}}^{-1} - p_\omega(q_i)p_{\omega,\mathrm{avg}}^{-1}|^2$ over all 250 positions. This effectively computes the expectation in (8.4) to obtain $W_\omega^{(i)}(\epsilon)$.
B. Determine $W_\omega(\epsilon)$ (in dB) by using $W_\omega(\epsilon) = (1/6)\sum_{i=1}^{6}W_\omega^{(i)}(\epsilon)$.
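The validation procedure above can be sketched in a few lines. Here a toy modal sum stands in for the full Eq. (8.1) — source constants are dropped and a small loss term is added to avoid modal poles — so the listener coordinates and resulting numbers are illustrative only:

```python
import numpy as np

def pressure(pos, k, L=6.0, N=10):
    """Toy stand-in for the modal sum of Eq. (8.1): first N x/y quantum
    numbers of a rigid-wall square room; source constants are dropped and
    a small imaginary loss term keeps the sum finite near resonance."""
    x, y = pos
    p = 0.0 + 0.0j
    for nx in range(N):
        for ny in range(N):
            kn2 = (nx * np.pi / L) ** 2 + (ny * np.pi / L) ** 2
            p += (np.cos(nx * np.pi * x / L) * np.cos(ny * np.pi * y / L)
                  / (k ** 2 - kn2 + 1e-3j))
    return 1j * p

def W_avg_dB(positions, k, eps, n_circle=250):
    # Steps 1-3: pressures at the positions, magnitude average, inverse.
    p = np.array([pressure(q, k) for q in positions])
    inv_avg = 1.0 / np.mean(np.abs(p))
    th = 2.0 * np.pi * np.arange(n_circle) / n_circle
    W = []
    for qi, pi in zip(positions, p):
        # Steps A.ii-A.v: average squared error over a circle of radius eps.
        disp = np.array([pressure((qi[0] + eps * np.cos(t),
                                   qi[1] + eps * np.sin(t)), k)
                         for t in th])
        W.append(np.mean(np.abs(disp * inv_avg - pi * inv_avg) ** 2))
    # Step B: average over listeners, in dB (tiny offset guards log10(0)).
    return 10.0 * np.log10(np.mean(W) + 1e-300)

pts = [(1.5, 1.5), (3.0, 1.5), (4.5, 1.5),
       (1.5, 3.0), (3.0, 3.0), (4.5, 3.0)]
k63 = 2.0 * np.pi * 63.0 / 345.0
```

As expected from the analysis, the measure vanishes at zero displacement and grows as the displacement increases.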
Figures 8.11 to 8.13 show the results for all the frequencies. By comparing Figs. 8.8 to 8.10 with Figs. 8.11 to 8.13, it can be seen that the plots are quite similar, thus confirming the validity of the proposed closed-form expressions for characterizing the performance of spatial-average equalization under mismatch/displacement effects.

Fig. 8.11. Validation of the analytic solution for $W_\omega(\epsilon)$ versus displacement for lower 1/3 octave frequencies.

8.4.5 Magnitude Response Single-Listener Equalization

In this section we present some results obtained from single listener equalization. It
is well known that single listener equalization may cause significant degradations,

Fig. 8.12. Validation of the analytic solution for $W_\omega(\epsilon)$ versus displacement for middle 1/3 octave frequencies.

Fig. 8.13. Validation of the analytic solution for $W_\omega(\epsilon)$ versus displacement for upper 1/3 octave frequencies.

in the frequency response at other listening positions. In fact, the degradation at the other positions could end up being worse than if the single position were not equalized at all [102]. Thus, the goal of this section is to further demonstrate that, besides degrading the frequency response at other listening positions, single listener equalization causes significant degradation of equalization performance in the presence of mismatch.
Figure 8.14 shows the equalization performance $W_\omega(\epsilon)$, at 40 Hz, if either position 2 or position 6 (see Fig. 8.3) was equalized, and Fig. 8.15 shows the equalization performance at 160 Hz. Specifically, the results in Figs. 8.14 and 8.15 were obtained by replacing $p_{\omega,\mathrm{avg}}$ with either $|p_\omega(q_2)|$ or $|p_\omega(q_6)|$ in (8.4).
It can be immediately observed from Figs. 8.14 and 8.15 that the equalization performance depends on the position being equalized. In this particular listening setup, the equalization performance is biased favorably towards position 2. Of course, this bias is introduced by a "favorable" weighted eigenfunction distribution (the weights being the denominator term in (8.1)), and because there is generally no a priori information on this distribution, there is no way of knowing which position will provide the "most favorable" equalization performance. Thus, a safe equalization choice combines the modal equations (or room responses) at the expected listener positions to obtain good equalization performance for the room.

Fig. 8.14. Single listener equalization results for f = 40 Hz, where equalization is done for
position 6 (dashed line) or position 2 (solid line).

8.5 Summary
In this chapter we presented a statistical approach, using modal equations, for evaluating the robustness of equalization based on magnitude response averaging to variations in room responses, for a realistic listener arrangement relative to a source. The simulations were performed for a six-listener setup with a simple source in a cubic room.
Clearly, there is a degradation in the equalization performance due to displace-
ment effects. Furthermore, this degradation is different for different frequencies (viz.,
generally smaller for relatively lower frequencies as compared to higher frequen-
cies). We have also experimentally confirmed the validity of the proposed closed-
form solution for measuring degradation performance.
Furthermore, we also demonstrated the importance of average equalization over
single listener equalization when considering mismatch/displacement effects.
Finally, an interesting future research direction is the formulation of a percep-
tually motivated performance function and evaluation of the robustness using this
measure.

Fig. 8.15. Single listener equalization results for f = 160 Hz, where equalization is done for
position 6 (dashed line) or position 2 (solid line).
9
Selective Audio Signal Cancellation

Selectively canceling signals at specific locations within an acoustical environment with multiple listeners is of significant importance for home theater, automobile, teleconferencing, office, industrial, and other applications. The traditional noise cancellation approach is impractical for such applications because it requires secondary sources to "anti-phase" the primary source, or sensors to be placed on the listeners. In this chapter we present an alternative method for signal cancellation that preprocesses the acoustical signal with a filter known as the eigenfilter [103, 104]. We examine the theoretical properties of such filters, and investigate the performance (gain) and tradeoff issues such as spectral distortion. The sensitivity of the performance to the duration of the room impulse response (reverberation) modeled in the eigenfilter is also investigated.

9.1 Introduction
Integrated media systems are envisioned to have a significant impact on the way
media, such as audio, are transmitted to people in remote locations. In media ap-
plications, although a great deal of ongoing research has focused on the problem of
delivering high-quality audio to a listener, the problem of delivering appropriate au-
dio signals to multiple listeners in the same environment has not yet been adequately
addressed.
In this chapter we focus on one aspect of this problem that involves presenting
an audio signal at selected directions in the room, while simultaneously minimizing
the signal at other directions. For example, in home theater or television viewing
applications a listener in a specific location in the room may not want to listen to the
audio signal being transmitted, whereas another listener at a different location would
prefer to listen to the signal. Consequently, if the objective is to keep one listener in

© 2003 IEEE. Reprinted, with permission, from S. Bharitkar and C. Kyriakakis, "Selective signal cancellation for multiple-listener audio applications using eigenfilters", IEEE Transactions on Multimedia, 5(3):329–338.

a region with a reduced sound pressure level, then one can view this problem as that
of signal cancellation in the direction of that listener. Similar applications arise in the
automobile (e.g., when only the driver would prefer to listen to an audio signal), or
any other environment with multiple listeners in which only a subset wish to listen
to the audio signal.
Several methods have been proposed in the literature to lower the signal level ei-
ther globally or in a local space within a region. Elliott and Nelson [105] proposed a
global active power minimization technique for reducing the time-averaged acoustic
pressure from a primary source in an enclosure, using a set of secondary source distri-
butions. This least squares-based technique demonstrated that reduction in potential
energy (and therefore sound pressure) can be achieved if the secondary sources are
separated from the primary source by a distance which is less than half the wave-
length of sound at the frequency of interest. It was suggested that this method can be
employed to reduce the cockpit noise in a propeller-powered aircraft. Similarly, Ross
[106] suggested the use of a filter that can minimize the signal power in the lobby of
a building due to a generator outside the lobby by blocking the dominant plane wave
mode with a loudspeaker. The reader is referred to several other interesting tutorial
papers that have been published in active noise control [107, 108, 109]. Other ex-
amples could include head-mounted reference sensors using adaptive beamforming
techniques [110].
In this chapter, the problem of signal cancellation is tackled by designing objec-
tive functions (criteria) that aim at reducing the sound pressure levels of signals in
predetermined directions. A first objective criterion is designed for maximizing the
difference in signal power between two different listener locations that have differ-
ent source and receiver response characteristics. Thus, one application of this sys-
tem lies in an environment having conflicting listening requirements, such as those
mentioned earlier (e.g., automobiles, home environment). The filter, known as the
eigenfilter, that is derived by optimizing the objective function, operates on the raw
signal before being linearly transformed by the room responses in the direction of
the listeners. Such filters aim at increasing the relative gain in signal power between
the two listeners with some associated tradeoffs such as: (i) spectral distortion that
may arise from the presence of the eigenfilter, and (ii) the sensitivity of the filter
to the length of the room impulse response (reverberation). Further issues that can
be researched, and which are beyond the scope of this chapter, include human per-
ception of loudness, as well as perceptual aspects, such as coloration, and speech
intelligibility.
The organization of this chapter is as follows. In the next section, we derive the
required eigenfilter from a proposed objective function, and prove some of the theo-
retical properties of such filters. We provide experimental results for the performance
(and tradeoff) of the eigenfilters in two situations: (i) using a synthesized room im-
pulse response with a speech excitation, and (ii) using an actual room impulse re-
sponse with a stochastic excitation. We also investigate the performance differences
that are observed when using a minimum-phase model for the room response. We
conclude the chapter by discussing some future research directions for selective sig-
nal cancellation using eigenfilters.

9.2 Traditional Methods for Acoustic Signal Cancellation


We divide the existing methods for acoustic signal cancellation as belonging to either
(a) physical interfaces for acoustical signal cancellation, or (b) loudspeaker-based
interfaces for acoustical signal cancellation in relatively large enclosures. Before in-
troducing the different methods, we define two broad terms for signal cancellation.
Definition 1. An active sound control technique is a method for attenuating an un-
wanted acoustical signal (disturbance) by the introduction of controllable “secondary
sources”, whose outputs are arranged to interfere destructively with the disturbance.
Definition 2. A passive sound control technique is a method for attenuating an
unwanted acoustical signal by the introduction of physical barriers of certain surface
density [127].
All of the methods discussed in this chapter belong to one or both of the aforementioned broad categories of acoustic signal cancellation. Typically, the physical interfaces in (a) above are part of the passive sound control strategy, whereas the loudspeaker-based interfaces are part of the active sound control strategy.

9.2.1 Passive Techniques

Simple Cotton Wad

This is a well-known and by far the cheapest passive sound control method for ab-
sorbing unwanted audio signals. In this method, an uninterested listener places a
cotton ball inside each ear to limit the intensity (increase the attenuation) of the
sound signal that enters the ear canal and strikes the eardrum. The disadvantage of this method is that the attenuation provided by the cotton wad decreases as frequency decreases, so it is not well suited to passively canceling acoustic signals at low frequencies. Moreover, the insertion of the cotton ball can cause discomfort.

Ear Defenders

Ear defender [128] is a term used to designate a device that introduces attenuation
of sound between a point outside the head and the eardrum. There are two types,
namely, (i) the cushion type, and (ii) the insert type. The cushion type is similar to
a pair of headphones with soft cushion ear pads. The cushion types are heavy and
cumbersome. The insert type is a form of a plug that is pushed into the ear canal.
Soft plastics and synthetic rubbers are the commonly used materials for the insert
type defenders. A good ear defender will introduce an attenuation of 30 to 35 dB
from 60 Hz to 8 kHz. However, the insertion may cause discomfort. Furthermore,
they need to be custom made depending on the size of the ear.
190 9 Selective Audio Signal Cancellation

Acoustic Barriers

Simply put, an acoustic barrier is a glorified wall. The aim of building a barrier is
to redirect acoustic signal power generated by a source away from an uninterested
listener. To be effective at this, the barrier must be constructed from “heavy” material
(i.e., having a high surface density). Clearly, this is prohibitive in a room or an
automobile. Moreover, a reasonably sized wall provides something of the order of 10
dB of attenuation in acoustic signal pressure levels. Attenuation of 20 dB or more is
almost impossible to achieve with a simple barrier.

Sound Absorption Materials

Sound absorption can be achieved by introducing porosity in a material. Examples of simple sound absorbers are clothing material and an open window. Another example of a sound absorber (albeit an ineffective one) is a wall [11, p. 139]. The absorption of sound energy by the wall manifests as vibrational energy of the wall, which is then reradiated to the outside. The absorption $\alpha$ for a simple wall is given as

$$\alpha = \left(\frac{2c\rho_0}{M\omega}\right)^2 \qquad (9.1)$$

where $M$ denotes the mass per unit area of the wall, $\omega = 2\pi f$ is the angular frequency in rad/s, $c$ is the speed of sound in air, and $\rho_0$ is the static density of air.
Clearly, lower frequencies are better absorbed relative to high frequencies.
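The frequency dependence implied by (9.1) is easy to verify numerically; a quick sketch in which the wall surface density $M$ and the air constants are illustrative assumptions, not values from the text:

```python
import math

def wall_absorption(f_hz, M=20.0, c=343.0, rho0=1.21):
    """Eq. (9.1): alpha = (2 c rho0 / (M omega))^2 for a limp wall.
    M (kg/m^2), c (m/s), and rho0 (kg/m^3) are illustrative values."""
    omega = 2.0 * math.pi * f_hz
    return (2.0 * c * rho0 / (M * omega)) ** 2

a_low = wall_absorption(63.0)      # low-frequency absorption
a_high = wall_absorption(1000.0)   # high-frequency absorption
```

Since $\alpha \propto 1/\omega^2$, the absorption ratio between two frequencies is simply the inverse square of the frequency ratio.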

9.2.2 Active Techniques

Presence of Secondary Loudspeakers

There are several variants of this approach, each containing at least one secondary loudspeaker. Historically, the fundamental concept was first presented in a patent granted to Lueg [130], wherein Lueg suggested using a loudspeaker for canceling a one-dimensional acoustic wave, where a source generates a primary acoustic waveform $p(x,t) = p(x)e^{j\omega t}$,$^1$ expressed as an instantaneous sound pressure, in a duct (the solid line indicates the primary waveform). A microphone located farther downstream in the duct detects the acoustic signal. The output from the microphone is used to drive a loudspeaker, a secondary source, after being manipulated by a controller. The output from the loudspeaker is another acoustic signal $s(x,t) = s(x)e^{j\omega t}$, indicated by the dotted line. The loudspeaker is positioned and the controller is designed in a manner such that the secondary source $s(x,t)$ generates a signal that is of the same amplitude but opposite in phase as the primary source. That is,
$^1$ The decomposition of the acoustic wave into its time-dependent and frequency-dependent components has its origins in the solution to the one-dimensional wave equation, $\partial^2 p(x,t)/\partial x^2 - \partial^2 p(x,t)/c_0^2\,\partial t^2 = 0$, with $c_0$ being the speed of the acoustic wave in the given medium.

s(x) = −p(x) (9.2)


The two acoustic signals are essentially designed to interfere destructively, which significantly attenuates the sound wave propagating downstream of the secondary source in the duct. However, what happens upstream of the secondary source is a completely different issue. It can easily be shown that the resulting magnitude of the pressure upstream ($x < 0$) of the secondary loudspeaker (assuming a secondary loudspeaker located at $x = 0$), when $p(x) = Ae^{-jkx}$ ($k = 2\pi/\lambda$ is the wavenumber and $\lambda$ is the wavelength), is $|r(x)| = |p(x) + s(x)| = 2A|\sin(kx)|$. Thus, the resulting absolute pressure upstream is twice the primary amplitude whenever $x/\lambda = -0.5(n + 1/2)$, $n \in \{0, 1, 2, \ldots\}$. For this reason global control strategies are used to compensate for such effects when using secondary speakers, assuming one wishes to control tonal disturbances over a large region.
Active control strategies for three-dimensional wavefronts emphasize optimizing
some objective criteria such as total power (sum of powers of primary and secondary
wavefronts). The interested reader is referred to [129] for details of various optimiza-
tion approaches, using multiple secondary speakers, in active sound control.

9.2.3 Parametric Loudspeaker Array


Recently, new technologies in loudspeaker design have resulted in a parametric approach employing a grid of transducers that generate ultrasonic signals, yielding a very narrow loudspeaker directivity pattern. The phrases "Audio Spotlight devices" or "HyperSonic Sound systems" are sometimes used when referring to them.$^2$ In essence, these loudspeakers are able to focus a narrow "beam" of sound in the direction of a specific listener.

9.3 Eigenfilter Design for Conflicting Listener Environments


9.3.1 Background
In this chapter, we primarily address the issue of designing eigenfilters for single source and dual listener environments (see Fig. 9.1). It is well established from linear system theory that

$$y_i(n) = \sum_{k=0}^{P-1} h_i(k)x(n-k) + v_i(n), \qquad i = 1, 2 \qquad (9.3)$$

where $x(n)$ is the primary signal transmitted by a source, such as a loudspeaker; $y_i(n)$ is the signal received at listener $R_i$; $h_i$ is the room transmission characteristic or room impulse response (modeled as a finite impulse response) between the source and listener $R_i$; and $v_i$ is additive (ambient) noise at listener $R_i$. In a reverberant environment, due to multipath effects, the room responses vary with even small changes in the source and receiver locations [119, 63, 11], and in general $h_1(n) \neq h_2(n)$.
$^2$ The interested reader is directed to http://www.atcsd.com, the site of American Technology Corporation, for a white paper on this technology.

Fig. 9.1. The source and receiver model.

One method of modifying the transmitted primary signal $x(n)$ is to preprocess the source signal by a filter, called the eigenfilter, before transmitting it through the environment.

9.3.2 Determination of the Eigenfilter


Under our assumption of modeling the listeners as point receivers we can set up
the problem as shown in Fig. 9.1, where wk ; k = 0, 1, . . . , M − 1 represent the
coefficients of the finite impulse response filter to be designed. For this problem,
we assume that the receivers are stationary (i.e., the room impulse response for a
certain (C, R) is time-invariant and linear, where C and R represent a source and a
receiver), and the channel (room) impulse response is deterministic at the locations of
the two listeners. The listening model is then simply related to (9.3), but the resulting
transmitted primary signal is now filtered by wk . Thus, the signal yi (n) at listener
Ri , with the filter wk present, is

M −1
yi (n) = hi (n) ⊗ wk x(n − k) + vi (n) i = 1, 2 (9.4)
k=0

where $\otimes$ represents the convolution operation. With this background, we view the signal cancellation problem as a gain maximization problem (between two arbitrary receivers); we can state the performance criterion as

$$J(n) = \max_{w}\left[\frac{1}{2}\frac{\sigma^2_{y_2(n)}}{\sigma^2_{v_2(n)}} - \frac{\lambda}{2}\left(\frac{\sigma^2_{y_1(n)}}{\sigma^2_{v_1(n)}} - \psi\right)\right] \qquad (9.5)$$

in which we would like to maximize the signal-to-noise ratio (or signal power) in the direction of listener 2, while keeping the power towards listener 1 constrained at $10^{\psi_{\mathrm{dB}}/10}$ (where $\psi_{\mathrm{dB}} = 10\log_{10}\psi$). In (9.5), $\sigma^2_{y_i(n)}/\sigma^2_{v_i(n)}$ denotes the transmitted signal to ambient noise power at listener $R_i$, with $y_i(n)$ as defined in (9.4). The quantity $\lambda$ is the well-known Lagrange multiplier.
9.3 Eigenfilter Design for Conflicting Listener Environments 193

It is interesting to see that, when x(n) and v(n) are mutually uncorrelated, the
two terms in the objective function (9.5) are structurally related to the mutual infor-
mation between the source and listeners R2 and R1 , respectively, under Gaussian
noise assumptions [103].
Now observe that

$$y_1(n) = h_1(n) \otimes \sum_{k=0}^{M-1} w_k x(n-k) + v_1(n) \qquad (9.6)$$

where $h_1(n)$ is the room response in the direction of the listener labeled 1. Let $w = (w_0, w_1, \ldots, w_{M-1})^T$ and $x(n) = (x(n), x(n-1), \ldots, x(n-M+1))^T$; then (9.6) can be expressed as

$$y_1(n) = h_1(n) \otimes w^T x(n) + v_1(n) = h_1(n) \otimes z(n) + v_1(n) = \sum_{p=0}^{L-1} h_1(p)z(n-p) + v_1(n) \qquad (9.7)$$

where $z(n) = w^T x(n)$. We assume that the zero mean noise and signal are real and statistically independent (and uncorrelated in the Gaussian case). In this case the signal power in the direction of listener 1 is

$$\sigma^2_{y_1(n)} = E\left\{\sum_{p=0}^{L-1}\sum_{q=0}^{L-1} h_1(p)h_1(q)z(n-p)z^T(n-q)\right\} + \sigma^2_{v_1(n)} = \sum_{p=0}^{L-1}\sum_{q=0}^{L-1} h_1(p)h_1(q)\left(w^T R_x(p,q)w\right) + \sigma^2_{v_1(n)} \qquad (9.8)$$

where $w \in \mathbb{R}^M$, $R_x(p,q) \in \mathbb{R}^{M\times M}$, and

$$R_x(p,q) = E\{x(n-p)x^T(n-q)\}, \qquad x(n-l) = (x(n-l), \ldots, x(n-l-M+1))^T \qquad (9.9)$$

Similarly,

$$\sigma^2_{y_2(n)} = \sum_{p=0}^{S-1}\sum_{q=0}^{S-1} h_2(p)h_2(q)\left(w^T R_x(p,q)w\right) + \sigma^2_{v_2(n)} \qquad (9.10)$$

Solving $\nabla_w J(n) = 0$ provides the set of optimal tap coefficients. Hence, from (9.5), (9.8), and (9.10), we obtain

$$\frac{\partial J(n)}{\partial w} = \frac{1}{\sigma^2_{v_2(n)}}\sum_{p=0}^{S-1}\sum_{q=0}^{S-1} h_2(p)h_2(q)R_x(p,q)w^* - \frac{\lambda}{\sigma^2_{v_1(n)}}\sum_{p=0}^{L-1}\sum_{q=0}^{L-1} h_1(p)h_1(q)R_x(p,q)w^* = 0 \qquad (9.11)$$

where $w^*$ denotes the optimal coefficients. Let

$$A = \sum_{p=0}^{S-1}\sum_{q=0}^{S-1} h_2(p)h_2(q)R_x(p,q), \qquad B = \sum_{p=0}^{L-1}\sum_{q=0}^{L-1} h_1(p)h_1(q)R_x(p,q) \qquad (9.12)$$

By assuming equal ambient noise powers at the two receivers (i.e., $\sigma^2_{v_2(n)} = \sigma^2_{v_1(n)}$), (9.11) can be written as

$$\left.\frac{\partial J(n)}{\partial w}\right|_{w=w^*} = (B^{-1}A - \lambda I)w^* = 0 \qquad (9.13)$$

The reason for arranging the optimality condition in this fashion is to demonstrate that the maximization is in the form of an eigenvalue problem (i.e., the eigenvalues corresponding to the matrix $B^{-1}A$), with the eigenvectors being $w^*$. There are in general $M$ distinct eigenvalues for the $M \times M$ matrix $B^{-1}A$, with the largest eigenvalue corresponding to the maximization of the ratio of the signal powers between receiver 2 and receiver 1. The optimal filter that yields this maximization is given by

$$w^* = e_{\lambda_{\max}}[B^{-1}A] \qquad (9.14)$$

where $e_{\lambda_{\max}}[B^{-1}A]$ denotes the eigenvector corresponding to the maximum eigenvalue $\lambda_{\max}$ of $B^{-1}A$. An FIR filter whose impulse response corresponds to the elements of an eigenvector is called an eigenfilter [115, 8]. Finally, the gain between the two receiver locations can be expressed as

$$G_{\mathrm{dB}} = 10\log_{10}\frac{\sigma^2_{y_2(n)}}{\sigma^2_{y_1(n)}} = 10\log_{10}\frac{w^{*T}Aw^*}{w^{*T}Bw^*} \qquad (9.15)$$

Clearly, it can be seen from (9.14) that the optimal filter coefficients are determined by the channel responses between the source and the two listeners. The only degree of freedom for the eigenfilter is its order $M$.
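For a white, unit-variance input, the entries of $R_x(p,q)$ are $E\{x(n-p-a)x(n-q-b)\} = \delta(p+a-q-b)$, so $A$ and $B$ in (9.12) reduce to autocorrelation (Toeplitz) matrices of the two room responses, and (9.14) becomes a small eigenproblem. A minimal NumPy sketch with toy three-tap responses (illustrative assumptions, not the book's responses or code):

```python
import numpy as np

def corr_matrix(h, M):
    """Entry (a, b) of Sum_p Sum_q h(p) h(q) R_x(p, q) for a white,
    unit-variance input x: it reduces to the impulse-response
    autocorrelation r_h(a - b), a Toeplitz matrix."""
    C = np.zeros((M, M))
    for a in range(M):
        for b in range(M):
            lag = a - b
            C[a, b] = sum(h[p] * h[p + lag]
                          for p in range(len(h)) if 0 <= p + lag < len(h))
    return C

def eigenfilter(h1, h2, M):
    """Eqs. (9.12)-(9.14): dominant eigenvector of B^{-1} A."""
    A = corr_matrix(h2, M)
    B = corr_matrix(h1, M) + 1e-9 * np.eye(M)  # ridge keeps B invertible
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    return w, A, B

def gain_dB(w, A, B):
    # Eq. (9.15): signal-power ratio between receivers 2 and 1.
    return 10.0 * np.log10((w @ A @ w) / (w @ B @ w))

# Toy three-tap "room responses" (illustrative, not from the book).
h1 = np.array([1.0, 0.6, 0.2])    # toward listener 1 (to be attenuated)
h2 = np.array([0.8, -0.5, 0.3])   # toward listener 2 (to be preserved)
w_opt, A, B = eigenfilter(h1, h2, M=8)
```

The resulting `w_opt` would prefilter $x(n)$ before it reaches the room, as in Fig. 9.1; by construction it maximizes the power ratio in (9.15) over all length-$M$ filters.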
Fundamentally, by recasting the signal cancellation problem as a gain maximization problem, we aim to introduce a gain of G dB between two listeners, R1 and R2. This G dB gain is equivalent to virtually positioning listener R1 at a distance \sqrt{10^{G_{dB}/10}} times the distance of listener R2 from a fixed sound source C.³ This is depicted in Fig. 9.2, where R1 (solid head) experiences the signal power levels that he would expect if he were positioned at the distance \sqrt{10^{G_{dB}/10}} (indicated by the dotted head).

³ Strictly speaking, in the free field, the gain based on the inverse square law is expressed as Q = 10 \log_{10}(r_1^2/r_2^2) (dB), where r_1, r_2 are the radial distances of listeners R1 and R2 from the source.
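The design in (9.14) reduces to a small eigenvalue problem once A and B are formed. The sketch below is an illustration, not the authors' code: the white-noise input, the channel lengths, and the random channel taps are all assumptions made for the example. It builds the blocks R_x(p, q) for a WSS input from its autocorrelation sequence, forms A and B per (9.12), and returns the eigenfilter (9.14) together with the gain (9.15):

```python
import numpy as np

def rx_block(rx, M, lag):
    # R_x(p, q) for a WSS input: element (i, j) equals rx((q - p) + (j - i)),
    # so each block depends only on lag = q - p and is Toeplitz.
    idx = lag + (np.arange(M)[None, :] - np.arange(M)[:, None])
    return rx[np.abs(idx)]

def design_eigenfilter(h1, h2, rx, M):
    # A, B as in (9.12); w* as in (9.14); gain as in (9.15).
    A = sum(h2[p] * h2[q] * rx_block(rx, M, q - p)
            for p in range(len(h2)) for q in range(len(h2)))
    B = sum(h1[p] * h1[q] * rx_block(rx, M, q - p)
            for p in range(len(h1)) for q in range(len(h1)))
    evals, evecs = np.linalg.eig(np.linalg.solve(B, A))   # eigenproblem for B^{-1}A
    k = np.argmax(evals.real)
    w = evecs[:, k].real
    w /= np.linalg.norm(w)                                # unit norm (Property 8)
    gain_db = 10 * np.log10((w @ A @ w) / (w @ B @ w))    # gain (9.15)
    return w, gain_db, evals.real[k]

# Toy example: white input (rx = delta) and short random channel responses.
rng = np.random.default_rng(0)
h1, h2 = rng.standard_normal(12), rng.standard_normal(12)
rx = np.zeros(64); rx[0] = 1.0
w, gain_db, lam = design_eigenfilter(h1, h2, rx, M=8)
```

By construction the achieved gain equals 10 log10 λmax, and, per Theorem 9.3 below, the resulting eigenfilter is linear phase: w(m) = ±w(M − 1 − m).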

Fig. 9.2. The effect of gain maximization.

9.3.3 Theoretical Properties of Eigenfilters

Some interesting properties of the proposed eigenfilter emerge under wide-sense sta-
tionary (WSS) assumptions. In this section we derive some properties of eigenfilters
for selective signal cancellation, which we then use in a later section.
In signal processing applications, the statistics (ensemble averages) of a stochas-
tic process are often independent of time. For example, quantization noise exhibits
constant mean and variance, whenever the input signal is “sufficiently complex.”
Moreover, it is also assumed that the first-order and second-order probability density
functions (PDFs) of quantization noise are independent of time. These conditions
impose the constraint of stationarity. Because we are primarily concerned with sig-
nal power, which is characterized by the first-order and second-order moments (i.e.,
mean and correlation), and not directly with the PDFs, we focus on the wide-sense
stationarity aspect. It should be noted that in the case of Gaussian processes, wide-
sense stationarity is equivalent to strict-sense stationarity, which is a consequence of
the fact that Gaussian processes are completely characterized by the mean and vari-
ance. Below, we provide some definitions, properties, and a basic theorem pertaining
to eigenfilter structure for WSS processes.
Property 1: For WSS processes x(n) and y(n) with finite variances, the matrix R_x(p,q) is Toeplitz, and the gain (9.15) can be expressed as

G_{dB} = 10 \log_{10} \frac{\int_{2\pi} |W^*(e^{j\omega})|^2 |H_2(e^{j\omega})|^2 S_x(e^{j\omega})\, d\omega}{\int_{2\pi} |W^*(e^{j\omega})|^2 |H_1(e^{j\omega})|^2 S_x(e^{j\omega})\, d\omega} \qquad (9.16)

where the elements r_x(k) of R_x(k) and S_x(e^{j\omega}) form a Fourier transform pair, and h_1(n) and h_2(n) are stable responses. Moreover, because we are focusing on real processes in this chapter, the matrix R_x(k) is a symmetric matrix, with

r_x(k) = r_x(-k) \qquad (9.17)

Property 2: Toeplitz matrices belong to the class of persymmetric matrices. A p × p persymmetric matrix Q satisfies the following relation [115]:

Q = JQJ \qquad (9.18)

where J is the exchange matrix, having unit elements along the northeast-to-southwest (anti-) diagonal and zeros elsewhere. Premultiplying (postmultiplying) a matrix by J exchanges the rows (columns) of the matrix.
The eigenfilter design in the WSS case requires the inversion of a scaled Toeplitz
matrix (via the room response), and multiplication of two matrices. We investigate
these operations briefly through the following properties.
Property 3: Scaling a persymmetric matrix Q by a constant c leaves its persymmetry unaltered. This can be easily seen as follows:

J(cQ)J = c\, JQJ = cQ \qquad (9.19)

Property 4: A linear combination of persymmetric matrices yields a persymmetric matrix:

J[c_1 Q_1 + c_2 Q_2]J = c_1 JQ_1 J + c_2 JQ_2 J = c_1 Q_1 + c_2 Q_2 \qquad (9.20)

Hence, from the above properties, the matrices A and B (in (9.12)) are persymmetric.
Property 5: The inverse of a persymmetric matrix is persymmetric:

Q = JQJ \;\Rightarrow\; Q^{-1} = (JQJ)^{-1} = J^{-1} Q^{-1} J^{-1} = J Q^{-1} J \qquad (9.21)

(using J^{-1} = J).

Property 6: The product of persymmetric matrices is persymmetric:

Q_1 Q_2 = (JQ_1 J)(JQ_2 J) = J Q_1 Q_2 J

where we have used the fact that JJ = J^2 = I. Thus, B^{-1}A is persymmetric.
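These closure properties are easy to check numerically. The following is a small illustrative sketch (the Toeplitz test matrices and the diagonal shift used to guarantee invertibility are arbitrary choices, not part of the chapter's derivation), with J as the exchange matrix:

```python
import numpy as np

M = 6
J = np.fliplr(np.eye(M))              # exchange matrix: ones on the anti-diagonal

def persymmetric(Q):
    return np.allclose(Q, J @ Q @ J)  # the defining relation Q = JQJ, Eq. (9.18)

def toeplitz(c):                      # symmetric Toeplitz matrix from first column c
    return np.array([[c[abs(i - j)] for j in range(M)] for i in range(M)])

rng = np.random.default_rng(0)
Q1, Q2 = toeplitz(rng.standard_normal(M)), toeplitz(rng.standard_normal(M))

assert persymmetric(Q1) and persymmetric(Q2)              # Toeplitz => persymmetric (Property 2)
assert persymmetric(3.0 * Q1)                             # scaling (Property 3)
assert persymmetric(2.0 * Q1 - 0.5 * Q2)                  # linear combination (Property 4)
assert persymmetric(np.linalg.inv(Q1 + 5.0 * np.eye(M)))  # inverse (Property 5)
assert persymmetric(Q1 @ Q2)                              # product (Property 6)
```

The last two checks mirror the chain of reasoning above: since A and B are persymmetric, so are B⁻¹ and hence B⁻¹A.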

Theorem 9.1. The roots of the eigenfilter corresponding to a distinct maximum eigenvalue lie on the unit circle for a Toeplitz R_x(p,q) = R_x(k).

Proof: Because the matrix B^{-1}A is persymmetric, by Properties 2 to 6 a proof similar to the one given in [113] applies.

Property 7 [112]: If Q is persymmetric with distinct eigenvalues, then Q has ⌈p/2⌉ symmetric eigenvectors and ⌊p/2⌋ skew-symmetric eigenvectors, where ⌈x⌉ (⌊x⌋) denotes the smallest (largest) integer greater (less) than or equal to x.
A persymmetric matrix is, in general, not symmetric about the main diagonal; hence its eigenvectors need not be mutually orthogonal. However, in light of the present theory we can prove the following theorem.

Theorem 9.2. Skew-symmetric and symmetric eigenvectors of persymmetric matrices are orthogonal to each other.

Proof: Let

V_1 = \{w : Jw = w\}
V_2 = \{w : Jw = -w\} \qquad (9.22)

Now, for \nu_1 \in V_1,

J\nu_1 = \nu_1 \qquad (9.23)

so with \nu_2 \in V_2 we have

\nu_2^T J \nu_1 = \nu_2^T \nu_1 \qquad (9.24)

But

J\nu_2 = -\nu_2 \;\Rightarrow\; \nu_2^T J = -\nu_2^T \qquad (9.25)

using the fact that J^T = J. Substituting (9.25) into (9.24) results in

-\nu_2^T \nu_1 = \nu_2^T \nu_1 \;\Rightarrow\; \nu_2^T \nu_1 = 0 \qquad (9.26)

which proves the theorem.


Property 8: From the unit-norm property of eigenfilters (\|w^*\|_2 = 1) and Parseval's relation, we have

\int_{2\pi} |W^*(e^{j\omega})|^2\, d\omega = 2\pi \qquad (9.27)

Property 9 [112]: The eigenvectors associated with B^{-1}A satisfy either

Jw = \begin{cases} w & \text{(symmetric)} \\ -w & \text{(skew-symmetric)} \end{cases} \qquad (9.28)

Theorem 9.3. The optimal eigenfilter (9.14) is a linear-phase FIR filter having a constant phase and group delay (symmetric case), or a constant group delay (skew-symmetric case).

Proof:

w^*(m) = \begin{cases} w^*(M-1-m) & \text{(symmetric)} \\ -w^*(M-1-m) & \text{(skew-symmetric)} \end{cases} \quad m = 0, 1, \ldots, M-1 \qquad (9.29)

because J, in Property 9, reverses the order of the elements of the optimal eigenfilter.
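Properties 7 and 9 and Theorems 9.2 and 9.3 can be illustrated on a symmetric Toeplitz matrix (persymmetric by Property 2). The matrix below is an arbitrary random example chosen for the sketch, not one of the chapter's A or B matrices:

```python
import numpy as np

M = 8
J = np.fliplr(np.eye(M))
rng = np.random.default_rng(1)
c = rng.standard_normal(M)
Q = np.array([[c[abs(i - j)] for j in range(M)] for i in range(M)])  # symmetric Toeplitz

evals, evecs = np.linalg.eigh(Q)
sym  = [w for w in evecs.T if np.allclose(J @ w,  w, atol=1e-8)]     # Jw = w
skew = [w for w in evecs.T if np.allclose(J @ w, -w, atol=1e-8)]     # Jw = -w

assert len(sym) == len(skew) == M // 2     # Property 7 (M even, distinct eigenvalues)
for ws in sym:                             # Theorem 9.2: the two families are orthogonal
    for wk in skew:
        assert abs(ws @ wk) < 1e-10
```

Each eigenvector satisfies w(m) = ±w(M − 1 − m), i.e., it is the impulse response of a linear-phase FIR filter, as Theorem 9.3 states for the optimal eigenfilter.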


In the following section we discuss the results of the designed eigenfilter for a
speech source.

9.4 Results

The number of degrees of freedom for the eigenfilter in (9.14) is its order M. Several factors affect the performance (gain): (i) the choice of the modeled duration (S, L) of the room responses in (9.12); (ii) the choice of the impulse response (i.e., whether it is minimum-phase or nonminimum-phase); and (iii) variations in the room response due to listener (or head) position changes. We study (i) and (ii) in the present chapter under the assumption L = S. The choices of filter order and modeled impulse response duration affect the gain (9.15) and the distortion (defined later in this section) of the signal at the microphones. A shorter response duration used for designing the eigenfilter reduces the operations needed to compute the eigenfilter, but may degrade performance. In summary, the length of the room response (reverberation) modeled in the design of the eigenfilter affects the performance, and this variation in performance is referred to as the sensitivity of the eigenfilter to the length of the room response.

9.4.1 Eigenfilter Performance as a Function of Filter Order M

In this experiment, the excitation x(n) was a segment of a speech signal obtained from [120]. The speech was the unvoiced fricative /S/ as in “sat,” obtained from a male subject, and is shown in Fig. 9.3.
As is well known, this sound is obtained by exciting a locally time-invariant,
causal, stable vocal tract filter by a stationary uncorrelated white noise sequence,
which is independent from the vocal tract filter [114]. The stability of the vocal tract

Fig. 9.3. The speech signal segment for the unvoiced fricative /S/ as in sat.

Fig. 9.4. Impulse responses for the front and back positions.

filter is essential, as it guarantees the stationarity of the sequence x(n) [122]. The
impulse responses were generated synthetically from the room acoustics simulator
software [123]. The estimation of these responses was based on the image method
(geometric modeling) of reflections created by ideal omnidirectional sources, and received by ideal omnidirectional receivers [61]. For the present scenario the modeled room had dimensions 15 m × 10 m × 4 m. The source speaker was at (1 m, 1 m, 1 m) from a reference northwest corner. The impulse response for the “front” microphone, located at (4.9 m, 1.7 m, 1 m) relative to the reference, was denoted h2(n),
and the “back microphone” located at (4.5 m, 6.4 m, 1 m) had impulse response
measurement h1 (n). The two responses are plotted as positive pressure amplitudes
in Fig. 9.4 (ignoring the initial delay). This situation is similar to the case for listen-
ers in an automobile, where the front left speaker is active, and the relative gain to be
maximized is between the front driver and the back passenger.
A plot of the gain (9.15) as a function of the filter order for the aforementioned
signal and impulse responses is shown in Fig. 9.5.
Firstly, a different microphone positioning will require a new simulation for computing (9.14), and for determining the resulting performance. Secondly, larger duration filters increase the gain, but affect the signal characteristics at the receiver in the form of distortion. Basically, a distortion measure is an assignment of a nonnegative number to a pair of quantities to assess their fidelity. According to Gray et al. [124], a distortion measure should satisfy the following properties: (1) it must be meaningful, in that small and large distortions between the two quantities correspond to good and bad subjective quality; (2) it must be tractable, and easily tested via mathematical analysis; and (3) it must be computable (the actual distortions in a real system can be efficiently computed).

Fig. 9.5. Eigenfilter performance (gain) as a function of the eigenfilter order M.

The proposed distortion measure is evaluated in terms of an L_p (p = 1) norm on (−π, π) [125] and models the variation in the received spectrum at listener 2,¹ due to the presence of the eigenfilter, over the natural event, that of the absence of the filter. We use the L_1 norm due to its ease of analysis and computation for the current problem. Before presenting the results for the distortion against filter order, we prove that the average spectrum error E_M (stated in terms of the spectral local matching property [121]) is constant for any eigenfilter order.
Theorem 9.4. The spectrum error E_M, defined in terms of the spectral match, is

E_M = \left\| \frac{S_{\hat{y}}(e^{j\omega})}{S_y(e^{j\omega})} \right\|_1 = 1, \quad \forall M \qquad (9.30)

for an Mth-order eigenfilter, where

S_{\hat{y}}(e^{j\omega}) = |H_2(e^{j\omega})|^2 |W_M(e^{j\omega})|^2 S_x(e^{j\omega}) = |W_M(e^{j\omega})|^2 S_y(e^{j\omega}) \qquad (9.31)

Here S_{\hat{y}}(e^{j\omega}) and S_y(e^{j\omega}) are the spectra associated with the presence and absence of the eigenfilter, respectively (an equivalent model is shown in Fig. 9.6), and W_M(e^{j\omega}) = \sum_{i=0}^{M-1} w_i e^{-j\omega i}.
Proof: From the L_1 definition, we have

E_M = \int_{-\pi}^{\pi} \left| \frac{S_{\hat{y}}(e^{j\omega})}{S_y(e^{j\omega})} \right| \frac{d\omega}{2\pi} \qquad (9.32)

¹ The evaluation of the distortion at listener 1 is not important, because the intention is to “cancel” the signal in her direction.

Fig. 9.6. Equivalent spectral model in the direction of listener 2 using the eigenfilter wk .

From (9.27), (9.31), and (9.32) it can be seen that

E_M = \int_{-\pi}^{\pi} |W_M(e^{j\omega})|^2 \frac{d\omega}{2\pi} = 1 \qquad (9.33)

It is interesting to observe that a similar result can be established for the linear
prediction spectral matching problem [121]. Also, when the FIR eigenfilter is of the
lowest order with M = 1, and w0 = 1, then the impulse response of the eigenfilter
is w(n) = δ(n), and E1 is unity (observe that with w(n) = δ(n) we have h2 (n) ⊗
δ(n) = h2 (n)).
An interpretation of (9.33) is that irrespective of the filter order (M > 1), the
average spectral ratio is unity, which means that in terms of the two spectra, Sŷ (ejω )
will be greater than Sy (ejω ) in some regions, and less in other regions, such that
(9.33) holds.
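The invariance (9.33) follows directly from Parseval's relation, and can be checked on a dense frequency grid. In the sketch below, the random unit-norm filter is an arbitrary stand-in for an eigenfilter (which is unit-norm by Property 8); the filter order and grid size are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 32, 4096                      # filter order, dense FFT grid
w = rng.standard_normal(M)
w /= np.linalg.norm(w)               # unit norm, as for an eigenfilter (Property 8)

W = np.fft.fft(w, N)                 # samples of W_M(e^{jw}) around the unit circle
E_M = np.mean(np.abs(W) ** 2)        # grid version of (1/2pi) ∫ |W_M|^2 dw, Eq. (9.33)
```

Parseval makes the grid average exact rather than approximate: (1/N) Σ_k |W_k|² = Σ_m |w(m)|² = 1 for any filter order M.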
The log-spectral distortion d_M(S_{\hat{y}}(e^{j\omega}), S_y(e^{j\omega})) for an eigenfilter of order M on an L_1 space is defined as

d_M(S_{\hat{y}}(e^{j\omega}), S_y(e^{j\omega})) = \left\| \log S_y(e^{j\omega}) - \log S_{\hat{y}}(e^{j\omega}) \right\|_1
= \left\| \log S_{\hat{y}}(e^{j\omega}) / S_y(e^{j\omega}) \right\|_1
= \left\| \log |W_M(e^{j\omega})|^2 \right\|_1
= \int_{-\pi}^{\pi} \left| \log |W_M(e^{j\omega})|^2 \right| \frac{d\omega}{2\pi} \qquad (9.34)

It can be easily shown that d_M(S_{\hat{y}}(e^{j\omega}), S_y(e^{j\omega})) ≥ 0, with equality achieved when the eigenfilter is of unit order with w_0 = 1. In Fig. 9.7, we have computed the distortion (9.34), using standard numerical integration algorithms, as a function of the filter order for the present problem. Figure 9.8 summarizes the results from Fig. 9.5 and Fig. 9.7 through the gain-distortion constellation diagram. Thus, depending on whether a certain amount of distortion is allowable, we can choose a certain point in the constellation (distortionless performance is obtained for the point located along the positive ordinate axis in the constellation).
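The log-spectral distortion (9.34) can likewise be evaluated on an FFT grid in place of a general numerical integrator. The sketch below is illustrative only; the FFT length and the spectral floor guarding the logarithm (needed wherever W_M has nulls) are assumptions:

```python
import numpy as np

def log_spectral_distortion(w, n_fft=8192, floor=1e-12):
    # d_M of (9.34): average of |log |W_M(e^{jw})|^2| over the unit circle.
    W2 = np.abs(np.fft.fft(w, n_fft)) ** 2
    return np.mean(np.abs(np.log(np.maximum(W2, floor))))

# Unit-order filter w(n) = delta(n): |W_M|^2 = 1 everywhere, so d_1 = 0.
d1 = log_spectral_distortion(np.array([1.0]))

# Any other unit-norm filter boosts some bands and attenuates others
# (the interpretation of (9.33)), and therefore incurs d_M > 0.
rng = np.random.default_rng(3)
w = rng.standard_normal(16)
w /= np.linalg.norm(w)
d16 = log_spectral_distortion(w)
```

This matches the equality condition noted above: the distortion vanishes only for the trivial unit-order eigenfilter.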
Clearly, there is an improvement in the gain-to-distortion ratio with increasing filter order (e.g., from Fig. 9.8, M = 400 gives a gain-to-distortion ratio of 10^{1.6}/9.8 ≈ 4, whereas M = 250 gives a gain-to-distortion ratio of 3). Also, for example, with filter order M = 400, the relative gain between the two locations is as much as 16 dB. This ideally corresponds to a virtual position of listener 1, for whom the sound cancellation is relevant, at a distance four times as far from a fixed source as the other listener (listener 2).

Fig. 9.7. Eigenfilter distortion as a function of the eigenfilter order M .

9.4.2 Performance Sensitivity as a Function of the Room Response Duration

From Eqs. (9.12), (9.14), and (9.15) we see that the eigenfilter performance can be affected by (i) the room response duration modeled in the eigenfilter design, as well as (ii) the nature of the room response (i.e., whether it is characterized by an equivalent minimum-phase model).

Fig. 9.8. Gain-to-distortion constellation space. Distortionless performance is obtained along the positive ordinate axis.

In summary, a short duration room response, if used in (9.12) for determining (9.14), will reduce the computational requirements for designing the eigenfilter. However, this could reduce the performance, because the eigenfilter does not use all the information contained in the room responses. This introduces a performance tradeoff. The question, then, is whether an eigenfilter (9.14) can be designed with short duration room responses (for savings in computation) in the A and B matrices in (9.12) without degrading the performance (9.15). Of course, care should be taken in evaluating the performance: the A and B matrices in (9.15) should contain the full duration room responses.
To understand this performance tradeoff, we design the eigenfilter of length M < L (L being the actual duration of the room impulse responses in the two directions), based on windowing both room responses with a rectangular window of duration P < L. We then analyze the performance (9.15) of the filter with increasing room response length. The goal of this experiment is to determine whether we can design an eigenfilter with sufficiently short room responses (in (9.14)) without compromising the performance. To answer this question, the following procedure is adopted.
(a) Design the eigenfilter \hat{w}^* ∈ ℝ^{M×1} for a shortened room response duration P < L:

\hat{w}^* = e_{\lambda_{\max}}[\hat{B}^{-1}\hat{A}] \qquad (9.35)

with

\hat{A} = \sum_{p=0}^{P-1}\sum_{q=0}^{P-1} h_2(p)\, h_2(q)\, R_x(p,q)

\hat{B} = \sum_{p=0}^{P-1}\sum_{q=0}^{P-1} h_1(p)\, h_1(q)\, R_x(p,q), \qquad M \le P < L \qquad (9.36)

where the hat above the matrices in (9.36) denotes an approximation to the true quantities in (9.12), and the corresponding eigenfilter (9.35) is the resulting approximation (due to the reduced duration P < L) to (9.14). We have included the constraint M ≤ P < L to keep the order of the eigenfilter low (reduced processing) for a given real room response duration L = 8192, as explained below.

(b) Evaluate the performance (9.15) of the filter with the true matrices A and B of (9.12) containing the full duration room responses.
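For a white input, R_x(p, q) collapses so that A and B reduce to Toeplitz matrices built from the autocorrelations of h2 and h1, which makes the two-step procedure above easy to sketch. The decaying random responses below are synthetic stand-ins for measured room responses, and the white-input simplification is an assumption made for the example:

```python
import numpy as np

def corr_matrix(h, M):
    # For a white input, A (resp. B) in (9.12)/(9.36) reduces to the Toeplitz
    # matrix of the response autocorrelation r_h(k) = sum_n h(n) h(n + k).
    r = np.correlate(h, h, mode="full")[len(h) - 1:]
    r = np.concatenate([r, np.zeros(max(0, M - len(r)))])
    return np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])

def gain_db(w, A, B):
    return 10 * np.log10((w @ A @ w) / (w @ B @ w))       # Eq. (9.15)

rng = np.random.default_rng(4)
L, M = 1024, 64
envelope = np.exp(-np.arange(L) / 150.0)                  # toy exponentially decaying RIRs
h1, h2 = rng.standard_normal(L) * envelope, rng.standard_normal(L) * envelope
A_full, B_full = corr_matrix(h2, M), corr_matrix(h1, M)   # step (b): true matrices

gains = {}
for P in (64, 256, L):                                    # step (a): shortened durations P
    A_hat, B_hat = corr_matrix(h2[:P], M), corr_matrix(h1[:P], M)   # Eq. (9.36)
    evals, evecs = np.linalg.eig(np.linalg.solve(B_hat, A_hat))     # Eq. (9.35)
    w = evecs[:, np.argmax(evals.real)].real
    gains[P] = gain_db(w / np.linalg.norm(w), A_full, B_full)
```

Because the filter designed on the full-duration responses maximizes the Rayleigh quotient in (9.15) exactly, its evaluated gain upper-bounds that of every truncated design, which is the tradeoff studied in this section.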
We consider the performance when we select the responses according to (a) h_i(n) = h_{i,min}(n) ⊗ h_{i,ap}(n), and (b) h_i(n) = h_{i,min}(n); i = 1, 2; where h_{i,min}(n) and h_{i,ap}(n) are the minimum-phase and all-pass components of the room responses. The impulse responses h_1(n) and h_2(n) (comprising 8192 points) were obtained in a highly reverberant room from the same microphones.
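The decomposition h(n) = h_min(n) ⊗ h_ap(n) can be computed with the standard homomorphic (real-cepstrum) method. The sketch below is one common construction, not necessarily the one used for these experiments; the FFT length and the spectral floor are assumptions:

```python
import numpy as np

def minimum_phase(h, n_fft=None):
    # Real-cepstrum construction of h_min: fold the anticausal part of the
    # cepstrum of log|H| onto the causal side, then exponentiate. The result
    # has |H_min| = |H| on the FFT grid; the excess (all-pass) phase is removed.
    n = n_fft or 8 * len(h)                   # generous zero-padding limits cepstral aliasing
    logmag = np.log(np.maximum(np.abs(np.fft.fft(h, n)), 1e-12))
    c = np.fft.ifft(logmag).real              # real cepstrum
    fold = np.zeros(n)
    fold[0], fold[n // 2] = c[0], c[n // 2]
    fold[1:n // 2] = 2.0 * c[1:n // 2]        # double the causal part
    return np.fft.ifft(np.exp(np.fft.fft(fold))).real   # h_min, length n_fft

rng = np.random.default_rng(5)
h = rng.standard_normal(256) * np.exp(-np.arange(256) / 60.0)   # toy decaying response
h_min = minimum_phase(h)
```

The magnitude responses of h and h_min agree on the grid; only the phase differs, so substituting h_min for h in (9.36) changes the eigenfilter through phase alone.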

Fig. 9.9. M = 64; (a) P = 64; (b) P = 128; (c) P = 512.

Impulse Response h_i(n) = h_{i,min}(n) ⊗ h_{i,ap}(n); i = 1, 2

In Fig. 9.9, we show the performance of the eigenfilter design as a function of the length of the impulse response. The length of the FIR filter was M = 64. The performance in each subplot is shown as a function of the impulse response increments, where we chose ∆P ∈ {0} ∪ {2^k : k ∈ [7, 12], k ∈ I}, with I denoting the set of integers. Thus, Fig. 9.9(a) represents an eigenfilter of length M = 64 designed with the duration P of the windowed impulse response set to 64 (after removing the pure delay). The second performance evaluation, marked by an asterisk, is at P + ∆P = 64 + 2^7 = 192. In Fig. 9.10 and Fig. 9.11, we show the sensitivity of the eigenfilter for filter lengths M = 128 and M = 256 for various windowed room impulse responses.
From the figures, we confirmed better gain performance with increased filter length. By considering a larger duration room impulse response in the eigenfilter design, we lower the gain somewhat but improve its evenness (flatness). Ideally, we want a small filter length (relative to the length of the room responses) with a large gain and uniform performance (low sensitivity to the length of the room impulse response).

Impulse Response hi (n) = hi,min (n); i = 1, 2

In Figs. 9.12 to 9.14, we show the performance of the eigenfilter for various windowed room responses and with different filter lengths. The performance (in terms of uniformity and level of the gain) is better than that of the nonminimum-phase impulse response model. We need to investigate this difference in the future.

Fig. 9.10. M = 128; (a) P = 128; (b) P = 256; (c) P = 512.

9.5 Summary
There is a proliferation of integrated media systems that combine multiple audio and
video signals to achieve tele-immersion among distant participants. One of the key
aspects that must be addressed is the delivery of the appropriate sound to each local

Fig. 9.11. M = 256; (a) P = 256; (b) P = 512.



Fig. 9.12. Performance for minimum-phase room impulse response models. M = 64; (a)
P = 64; (b) P = 128; (c) P = 512.

participant in the room. In addition, sound intended for other participants or originating from noise sources must be canceled. In this chapter we presented a technique for canceling audio signals using a novel approach based on information theory.

Fig. 9.13. Performance for minimum-phase room impulse response models. M = 128; (a) P = 128; (b) P = 256; (c) P = 512.

Fig. 9.14. Performance for minimum-phase room impulse response models. M = 256; (a) P = 256; (b) P = 512.

We refer to this technique as the eigenfilter method, because the filter is derived by maximizing the relative power between the two listeners in an acoustic enclosure. We also derived some of its theoretical properties (e.g., linear phase). For fixed
room responses, we investigated (i) the tradeoff between performance (gain) and distortion, and (ii) the sensitivity of the performance to the modeled room impulse response duration. Our findings, for the present channel conditions, indicate that increasing the filter order improves the gain-to-distortion ratio. Thus, depending on the application, a suitable filter order may be chosen from the gain-to-distortion constellation diagram or from the sensitivity results. Furthermore, our findings for a particular scenario indicate that by extracting the minimum-phase component we obtain better performance (in terms of uniformity and level of the gain) than with the nonminimum-phase impulse response model.
In summary, this chapter addressed a fairly new application area, and clearly not all questions have been answered. Hence, future directions include research in the following areas.
(a) The distortion measure that is introduced in Eq. (9.34) is easy to compute
and is well known in the literature. Of course speech intelligibility is affected by
a change in the frequency spectrum (this change in the spectrum is computed in
the form of the distortion measure), and large changes will result in a degradation
in speech intelligibility. To determine how large is large, as a next step one could
perform speech intelligibility tests for consonants, for example, using a “confusion
matrix”.
(b) Investigation of the characteristics of gain zones, regions in space around the microphones which have a gain improvement of at least 10 dB in SPL, and viewing them from the acoustical physics viewpoint. Also, the evaluation of loudness (which is frequency-dependent) criteria using eigenfilters is a topic for research.
(c) Performing psychoacoustic/subjective measurements. In this chapter, we have addressed the effects of prefiltering an audio signal objectively, through the spectral distortion measure. Subjective (double-blind) listening tests need to be performed to investigate the perceptual coloration of the transmitted signals.
(d) Investigation of the effects on gain-to-distortion of designing LPC filters to approximate the room transfer functions.
(e) Evaluation of alternative objective functions (viz., those that minimize the SPL at one position while keeping the sound quality at the other positions as high as possible).
References

1. Mitra S (2001), Digital Signal Processing: A Computer Based Approach. McGraw-Hill.


2. Oppenheim A, Schafer R (1989), Discrete Time Signal Processing. Prentice-Hall.
3. Porat B (1996), A Course in Digital Signal Processing, John Wiley & Sons.
4. Churchill R, Brown J (1989), Complex Variables and Applications, McGraw-Hill.
5. Rabiner L, Crochiere R (1983), Multirate Digital Signal Processing, Prentice-Hall.
6. Vaidyanathan PP (1993), Multirate Systems and Filter Banks, Prentice-Hall.
7. Mitra S, Kaiser JF (1993), Handbook for Digital Signal Processing, John Wiley & Sons.
8. Haykin S (1996), Adaptive Filter Theory, Prentice-Hall.
9. Hayes MH (1996), Statistical Digital Signal Processing and Modeling, John Wiley &
Sons.
10. Rabiner L, Juang B-H (1993), Fundamentals of speech recognition, Prentice-Hall.
11. Kuttruff H (1991), Room Acoustics, Elsevier Applied Science.
12. Morse PM, Ingard KU (1986), Theoretical Acoustics, Princeton Univ. Press.
13. Widrow B., Hoff ME Jr. (1960), IRE WESCON Conv. Rec., Part 4:96–104.
14. Schroeder MR (1979), J. Acoust. Soc. Amer., 66:497–500.
15. Cook RK, Waterhouse RV, Berendt RD, Edelman S, and Thompson MC (1955), J.
Acoust. Soc. Amer., 27(6):1072–1077.
16. Schroeder MR (1962), J. Acoust. Soc. Amer., 34(12):1819–1823.
17. Schroeder MR (1975), J. Acoust. Soc. Amer., 57:149–150.
18. Müller S, Massarani P (2001), J. Audio Eng. Soc., 49(6):443–471.
19. Dunn C, Hawksford MO (1993), J. Audio Eng. Soc., 41:314–335.
20. Farina A (Apr. 2000), 108th Conv. of Audio Eng. Soc. (preprint 5093).
21. Stan G-B, Embrechts J-J, and Archambeau D (2002), J. Audio Eng. Soc., 50(4):249–262.
22. Fletcher H, Munson WA (1933), J. Acoust. Soc. Amer., 5:82–108.
23. Robinson DW, Dadson RS (1956), J. Appl. Phys., 7:166–181.
24. Intl. Org. for Standardization (1987), ISO-226.
25. Stevens SS (1972), J. Acoust. Soc. Amer., 51:575–601.
26. Moore BCJ (2000), An Introduction to the Psychology of Hearing, Academic Press.
27. Stephens SDG (1973), J. Sound Vib., 37:235–246.
28. Fletcher H (1940), Rev. Mod. Phys., 12:47–65.
29. Patterson RD, Moore BCJ (1986), in Frequency Selectivity in Hearing (Ed. Moore BCJ),
Academic Press.
30. Hamilton PM (1957), J. Acoust. Soc. Amer., 29:506–511.
31. Greenwood DD (1961), J. Acoust. Soc. Amer., 33:484–501.

32. Patterson RD (1976), J. Acoust. Soc. Amer., 59:640–654.


33. Gierlich HW (1992), Appl. Acoust., 36:219–243.
34. Wightman FL, Kistler DJ (1989), J. Acoust. Soc. Amer., 85(2):858–867.
35. Wightman FL, Kistler DJ (1989), J. Acoust. Soc. Amer., 85(2):868–878.
36. Bauer BB (1961), J. Audio Eng. Soc., 9(2):148–151.
37. Moller H (1992), Appl. Acoust., 36:171–218.
38. Asano F, Suzuki Y, and Sone T (1990), J. Acoust. Soc. Amer., 88(1):159–168.
39. Middlebrooks JC, Green DM (1991), Annu. Rev. Psychol., 42:135–159.
40. Hebrank J, Wright D (1974), J. Acoust. Soc. Amer., 56(6):1829–1834.
41. Schroeder MR, Atal BS (1963), IEEE Conv. Record, 7:150–155.
42. Schroeder MR, Gottlob D, and Siebrasse KF (1974), J. Acoust. Soc. Amer., 56:1195–
1201.
43. Cooper DH, Bauck J (1989), J. Audio Eng. Soc., 37(1/2):3–19.
44. Nelson PA, Hamada H, and Elliott SJ (1992), IEEE Trans. Signal Process., 40:1621–
1632.
45. Lim J-S, Kyriakakis C (2000), 109th Conv. of Audio Eng. Soc. (preprint 5183).
46. Cooper DH, Bauck J (1996), J. Audio Eng. Soc., 44:683–705.
47. Itakura F, Saito S (1970), Elect. and Comm. in Japan, 53A:36–43.
48. Laroche J, Meillier JL (1994), IEEE Trans. on Speech and Audio Proc., 2.
49. Pozidis H, Petropulu AP (1997), IEEE Trans. on Sig. Proc., 45:2977–2993.
50. Moulines E, Duhamel P, Cardoso JF, and Mayrargue S (1995), IEEE Trans. on Sig. Proc.,
43:516–525.
51. Widrow B, Walach E (1995), Adaptive Inverse Control, Prentice-Hall.
52. Mouchtaris A, Lim J-S, Holman T, and Kyriakakis C (1998), Proc. IEEE Mult. Sig. Proc.
Wkshp. (MMSP ’98).
53. Bershad NJ, Feintuch PL (1986), IEEE Trans. Acoust., Speech Sig. Proc., ASSP-34:452–
461.
54. Ferrara ER Jr. (1980), IEEE Trans. Acoust., Speech Sig. Proc., ASSP-28(4):474–475.
55. Bershad NJ, Feintuch PL (1986), IEEE Trans. Acoust., Speech Sig. Proc., ASSP-34:452–
461.
56. Widrow B, McCool JM (1976), IEEE Trans. Antennas and Propagation, AP-24(5):615–
637.
57. Narayan SS, Peterson AM, and Narasimha MJ (1983), IEEE Trans. Acoust. Speech. Sig.
Proc., ASSP-31(3):609–615.
58. Horowitz LL, and Senne KD (1981), IEEE Trans. Circuits and Syst., CAS-28(6):562–576.
59. Narayan SS, Peterson AM, and Narasimha MJ (1983), IEEE Trans. Acoust. Speech. Sig.
Proc., ASSP-31(3):609–615.
60. Bharitkar S, Kyriakakis C (2003), Proc. 37th IEEE Asilomar Conf. on Sig. Syst. Comp.,
1:546–549.
61. Allen JB, Berkley DA (1979), J. Acoust. Soc. Amer., 65:943–950.
62. Weiss S, Rice G, and Stewart RW (1999), IEEE Wkshp. Appl. Sig. Proc. Audio and
Acoust., 203–206.
63. Mourjopoulos J (1985), J. Sound & Vib., 102(2):217–228.
64. Elliott SJ, Nelson PA (1989), J. Audio Eng. Soc., 37(11):899–907.
65. Mourjopoulos J (1994), J. Sound & Vib., 43(11).
66. Haneda Y, Makino S, and Kaneda Y (1994), IEEE Trans. on Speech and Audio Proc.,
2(2):320–328.
67. Miyoshi M, Kaneda Y (1988), IEEE Trans. Acoust. Speech and Signal Proc., 36(2):145–
152.

68. Haneda Y, Makino S, and Kaneda Y (1997), IEEE Trans. on Speech and Audio Proc.,
5(4):325–333.
69. Neely S, Allen J (1979), J. Acoust. Soc. Amer., 66(1):165–169.
70. Radlović B, Kennedy R (2000), IEEE Trans. on Speech and Audio Proc., 8(6):728–737.
71. Karjalainen M, Piirilä E, Järvinen A, and Huopaniemi J (1999), J. Audio Eng. Soc., 47(1/2):15–31.
72. Karjalainen M, Härmä A, Laine UK, and Huopaniemi J (1997), Proc. 1997 IEEE Wkshp.
on Appl. Signal Proc. Audio and Acoust. (WASPAA ’97).
73. Härmä A, Karjalainen M, Savioja L, Välimäki V, Laine UK, and Huopaniemi J (2000),
J. Audio Eng. Soc., 48(11):1011–1031.
74. Chang PR, Lin CG, and Yeh BF (1994), J. Acoust. Soc. Amer., 95(6):3400–3408.
75. Mourjopoulos J, Clarkson P, and Hammond J (1982), Proc. ICASSP, 1858–1861.
76. Bezdek J (1981), Pattern recognition with fuzzy objective function algorithms, Plenum.
77. Dunn JC (1973), J. Cybern., 3:32–57.
78. Xie XL, Beni G (1991), IEEE Trans. on Pattern Analysis and Mach. Intelligence, 3:841–
846.
79. Pal NR, Bezdek JC (1995), IEEE Trans. on Fuzzy Syst., 3(3):370–379.
80. Markel JD, Gray, AH Jr. (1976), Linear Prediction of Speech, Springer-Verlag.
81. Alku P, Bäckström T (2004), IEEE Trans. on Speech and Audio Proc., 12(2):93–99.
82. Oppenheim A, Johnson D, and Steiglitz K (1971), Proc. IEEE, 59:299–301.
83. Smith JO, Abel JS (1999), IEEE Trans. on Speech and Audio Proc., 7(6):697–708.
84. Zwicker E, Fastl H (1990), Psychoacoustics: Facts and Models, Springer-Verlag.
85. Fukunaga K (1990), Introduction to Statistical Pattern Recognition, Academic Press.
86. Sammon, JW Jr. (1969), IEEE Trans. on Computers., C-18(5):401–409.
87. Kohonen T (1997), Self-Organizing Maps, Springer.
88. Torgerson WS (1952), Psychometrika, 17:401–419.
89. Young G, Householder AS (1938), Psychometrika, 3:19–22.
90. Pȩkalska E, Ridder D, Duin RPW, and Kraaijveld MA (1999), Proc. ASCI’95 (5th An-
nual Int. Conf. of the Adv. School for Comput. & Imag.), 221–228.
91. Woszczyk W (1982), Proc. of 72nd AES Conv., preprint 1949.
92. Lipshitz S, Vanderkooy J (1981), Proc. of 69th AES Conv., preprint 1801.
93. Thiele N (2001), Proc. of 108th AES Conv., preprint 5106.
94. Bharitkar S, Kyriakakis C (2003), IEEE Wkshp. on Appl. Signal Proc. Audio and Acoust.
(WASPAA ’97).
95. Radlović B, Kennedy R (2000), IEEE Trans. on Speech and Audio Proc., 8(6):728–737.
96. Bharitkar S, Kyriakakis C (2005), Proc. IEEE Conf. on Multimedia and Expo.
97. Toole FE, Olive SE (1988), J. Audio Eng. Soc., 36(3):122–141.
98. Bharitkar S, Kyriakakis C (2005), Proc. 13th Euro. Sig. Proc. Conf. (EUSIPCO).
99. Talantzis F, Ward DB (2003), J. Acoust. Soc. Amer., 114:833–841.
100. Cook RK, Waterhouse RV, Berendt RD, Edelman S, and Thompson MC (1955), J.
Acoust. Soc. Amer., 27(6):1072–1077.
101. Kendall M, Stuart A (1976), The Advanced Theory of Statistics, Griffin.
102. Bharitkar S (2004), Digital Signal Processing for Multi-channel Audio Equalization and
Signal Cancellation, Ph.D Thesis, University of Southern California, Los Angeles (CA).
103. Bharitkar S, Kyriakakis C (2000), Proc. IEEE Conf. on Mult. and Expo.
104. Bharitkar S, Kyriakakis C (2000), Proc. IEEE Int. Symp. on Intell. Signal Proc. and
Comm. Syst.
105. Nelson PA, Curtis ARD, Elliott SJ, and Bullmore AJ (1987), J. Sound and Vib.,
117(1):1–13.

106. Ross CF (1981), J. Sound and Vib., 74(3):411–417.


107. Williams JEF (1984), Proc. Royal Soc. of London, A395:63–88.
108. Elliott SJ, Nelson PA (1993), IEEE Signal Proc. Mag., 12–35.
109. Guicking D (1990), J. Acoust. Soc. Amer., 87:2251–2254.
110. Buckley KM (1987), IEEE Trans. Acoust., Speech, and Sig. Proc., ASSP-35:249–266.
111. American National Standards Methods for the Calculation of the Articulation Index,
S3.5-1969 (American National Standards Institute, New York.)
112. Cantoni A, Butler P (1976), IEEE Trans. on Comm., 24(8):804–809.
113. Robinson E (1967), Statistical Communication and Detection, Griffin.
114. Rabiner L, Gold B (1993), Theory and Application of Digital Signal Processing,
Prentice-Hall.
115. Makhoul J (1981), IEEE Transactions on Acoust., Speech, and Sig. Proc., ASSP-
29:868–872.
116. Yule GU (1927), Philos. Trans. Royal Soc. London, A226:267–298.
117. Söderström T, Stoica P (1983), Instrumental Variable Methods for System Identification,
Springer-Verlag.
118. Schroeder MR (1954), Acustica, 4:594–600.
119. Doak PE (1959), Acustica, 9(1):1–9.
120. Childers DG (2000), Speech Processing and Synthesis Toolboxes, John Wiley.
121. Makhoul J (1975), Proc. IEEE, 63(4):561–580.
122. Orfanadis SJ (1985), Optimum Signal Processing, Macmillan.
123. Mourjopoulos J (2000), Room Acoustics Simulator v1.1, Wireless Communications
Laboratory, University of Patras, Greece.
124. Gray RM, Buzo A, Gray AH Jr., and Matsuyama Y (1980), IEEE Trans. on Acoust.
Speech and Sig. Proc., ASSP-28(4):367–376.
125. Ash RB (1972), Real Analysis and Probability, Academic Press.
126. Hayes M (1996), Statistical Digital Signal Processing and Modeling, John Wiley.
127. Snyder SD (2000), Active Noise Control Primer, Springer.
128. Olson HF (1957), Acoustical Engineering, Van Nostrand.
129. Elliott SJ (2001), Signal Processing for Active Control, Academic Press.
130. Lueg P (1936), Process for silencing sound oscillations, US. Pat. No. 2,043,416.
Index

z-transform, 14
z-plane, 16
5.1 system, 126
A-weighting, 67
Absorption coefficient, 55
Acoustic wave equation, 50
Active noise control, 188
Active power minimization, 188
Adaptive filters, 35
Aliasing, 18
All-pass coefficient, 108
All-pass filters, 37, 134
All-pass systems, 12
All-pass warping, 108
All-pole filter, 44
Anti-aliasing filter, 20
Audio signal cancellation, 188
Autoregressive and moving average (ARMA) process, 8
Axial mode, 54
B-weighting, 67
Bark, 71
Bark scale, 108
Bartlett window, 31
Bass management, 125
Bilinear transform, 24
Binaural sound, 76
Blackman window, 33
Butterworth filter, 11, 37, 126
C-weighting, 67
Chebyshev filter, 38
Circular convolution, 24
Cluster analysis, 105
Cluster validity index, 107
Cochlea, 66
Contralateral transfer function, 84
Convolution, 7
Critical bandwidth, 71
Crossover frequency, 125, 130
Crosstalk canceller, 76, 83
Decay curves, 56
Decibel, 10
Decimation, 20, 22
Deconvolution, 63
Delta function, 4
Desired response specification, 27
DFT and DTFT relationship, 23
Direct field component, 60, 161
Directivity function, 50
Discrete Fourier transform (DFT), 23
Discrete-time Fourier transform, 9
Distortion measure, 199
Ear, 65
Eigenfilter, 187, 192
Eigenfilters, 191
Eigenfrequencies, 54
Eigenfunction, 52, 172
Eigenvalue, 52, 172
Eigenvector, 194
Elliptic filter, 40
Equal loudness contours, 66
Equalization error, 163, 165
Equalization filter, 99
Equalization robustness, 157, 171, 172
Equivalent rectangular bandwidth (ERB), 70
Expander, 21
FDAF-LMS, 86
Filter design, 27
Finite impulse response (FIR) filter, 27
FIR filter advantages, 28
FIR filter disadvantages, 28
FIR least squares filter, 29
FIR linear phase filters, 29, 197
FIR Type 1 linear phase filter, 29
FIR Type 2 linear phase filter, 29
FIR Type 3 linear phase filter, 29
FIR Type 4 linear phase filter, 29
FIR windows, 30
Frequency domain adaptive filter (FDAF), 86
Frequency response, 9
Frequency selectivity, 70
Fricatives, 198
Fuzzy c-means, 105
Fuzzy clustering, 105
Gain maximization, 192
Green's function, 51, 172
Hamming window, 31
Hann window, 31
Head-related transfer function (HRTF), 83
IIR filters, 36
Impulse response, 7
Infinite duration impulse response (IIR) filter, 27
Interpolation, 22
Inverse filter, 101
Inverse repeated sequence (IRS), 63
Inverse square law, 194
Ipsilateral transfer function, 84
Kaiser window, 33
Linear phase systems, 14
Linear predictive coding (LPC) filter, 44, 78, 107
Linear system, 6
Linear- and time-invariance, 6
Log-spectral distortion, 201
Loudness, 68
Loudness level, 66
Loudness perception, 66
Loudspeaker and room impulse response, 61
Low-pass filter, 27
Magnitude response, 10
Magnitude response averaging, 172
MATLAB, 8
Maximally flat, 28
Maximum length sequence, 61, 62
Membership function, 105
Microphone and listener position mismatch, 157
Minimax error, 28
Minimization of Lp error, 28
Minimum audible field, 67
Minimum-phase systems, 11
Mismatch analysis, 171, 173
Mismatch parameter, 163, 168
Modal analysis, 171
Modal equation, 51, 172
Modeling delay, 85
Multichannel surround, 75
Multiple position equalization, 100, 103, 160
Nyquist rate, 18, 20
Oblique mode, 54
Parametric filter, 40
Parseval's relation, 197
Phase equalization, 134
Phase interaction between subwoofer and satellite, 132
Phase interaction between subwoofer and satellite response, 128
Phase response, 10
Phon, 67
Pole-zero representation, 11
Psychoacoustically motivated warping, 108
Psychoacoustics, 65
Quantum numbers, 178
Reconstruction, 16, 18
Rectangular window, 31
Reference pressure, 51
Region of convergence (ROC), 15
Rendering binaural signals, 76
Rendering filter, 83
Resampling, 22
Reverberant field component, 60, 161
Reverberation time, 54
Room equalization, 99
Room image model, 121, 199
Room impulse response, 56, 61, 100, 191
Room reflection model, 101
Room transfer function, 100
Root-mean-square (RMS) average equalization, 121, 172
Root-mean-square (RMS) averaging, 104
Sammon map, 110
Sampling, 4, 16, 17
Sampling frequency, 4
Sampling period, 4
Sampling rate increase, 21
Sampling rate reduction, 19
Schroeder frequency, 60, 161, 162, 166
Schroeder integrated impulse response method, 56
Shelving filter, 40
Signal power, 193
Sinc function, 168
Sinc interpolation, 19
Single position equalization, 102
Sone, 68
Sound intensity, 54
Sound power, 55
Sound pressure, 50
Sound pressure amplitude distribution in a room, 52
Sound pressure level, 51
Sound propagation, 49
Spatial average equalization, 160
Spatial averaging, 157, 172
Spectral deviation measure, 130
Spectral deviations, 158
Speed of sound, 49
SPL metering, 67
Standing wave, 51
Sweep signal, 63
Tangential mode, 54
Target curve, 101
Taylor series expansion, 163
Time delay-based crossover correction, 150
Time integration, 68
Time-invariance, 6
Transaural audio, 76
Transfer function decomposition, 13
Transfer function representation, 10
Upconversion, 75
Virtual microphone signal synthesis, 77
Warped room impulse responses, 112
Warping coefficient, 108
Wavelength of sound, 50
Wavenumber, 50
Xie–Beni index for cluster analysis, 107