Real-Time DSP Implementation of Audio Crosstalk Cancellation Using Mixed Uniform Partitioned Convolution
VenkataRao
NVK Mahalakshmi
Abstract
High fidelity sound reproduction requires long filters for audio crosstalk cancellation. When such
long filters are implemented on real-time DSP processors, the conventional overlap save technique
demands high computational power and introduces a long processing delay. To overcome these
problems, a mixed uniform partitioned convolution technique is proposed. The method is derived
by combining uniform partitioned convolution with the mixed filtering technique. With the
proposed method, audio crosstalk cancellation can be performed even with filters on the order of
ten thousand taps, with fewer computations and a short processing delay. The technique was
implemented on a 32-bit floating point DSP processor, and the design uses efficient memory
management to reduce computational complexity. A computational comparison with conventional
methods shows that the proposed technique is very efficient for long filters.
Keywords: Convolution, Crosstalk Cancellation, FFT, Mixed filtering, Partitioned Convolution,
Overlap Save Method.
1. INTRODUCTION
3D audio systems have the potential to be used in many spatial audio applications such as home
theatre entertainment, gaming, teleconferencing and remote control. To reproduce realistic
spatial audio, a 3D audio system must be able to reproduce the spatial reverberation
characteristics and the spatial audio pattern at the desired locations. This can be achieved by
binaural synthesis and audio crosstalk cancellation (CTC). In 1983, head related transfer
function (HRTF) technology was developed to transform the sound field of a particular location
to the head by convolving the sound with an appropriate pair of HRTFs. Headphones have
excellent spatial characteristics such as channel separation and equalization, but they are
inconvenient and somewhat cumbersome when several listeners are enjoying the audio together. An
alternative to HRTF technology is a conventional stereo loudspeaker system located directly in
front of the listener. In this case, transmission path equalization is obtained by inverting the
acoustic transfer function matrix between the two loudspeakers and the two ears of the listener.
This is called crosstalk cancellation, and it is required in particular to cancel the unwanted
crosstalk from each speaker to the opposite ear [1][2][3][4].
Signal Processing: An International Journal (SPIJ), Volume (6) : Issue (4) : 2012
118
To obtain such equalization of the transmission path, the impulse responses of the 2x2 system
inversion matrix (haa(n), hab(n), hba(n), hbb(n) as shown in Fig. 1) may last for several hundred
milliseconds, which requires FIR filters with thousands of coefficients [5]. Implementing such
long filters on real-time DSP processors therefore requires considerable computational power. To
overcome these complexity issues, it is essential to develop new implementation techniques
without compromising performance.
Time domain convolution is the original and best known method, but it is generally not preferred
for long filters because of its high computational cost. The overlap save and overlap add
methods, on the other hand, are frequency domain methods that handle the computational
complexity of real-time implementation efficiently. In these methods, the FFT length is derived
as N = L+M-1, where L and M are the frame size and filter length respectively, and N must be
rounded up to a power of 2 because of the FFT. For L=256 and M=8192, N becomes 8447 and is chosen
as 16384 by appending zeros. The appended zeros increase the FFT size and hence the
computational power. In addition, they delay the output response by at least M samples
[5][6][7].
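As a worked instance of the FFT-length rule above (the symbol N_FFT is introduced here only for illustration and does not appear elsewhere in the paper):

```latex
N = L + M - 1 = 256 + 8192 - 1 = 8447, \qquad
2^{13} = 8192 < 8447 \le 2^{14} = 16384
\;\Rightarrow\; N_{\mathrm{FFT}} = 16384, \qquad
N_{\mathrm{FFT}} - N = 16384 - 8447 = 7937 \ \text{appended zeros}.
```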
To overcome the delay issue, Vetterli proposed in 1988 a running convolution based on multi-rate
methods, in which the impulse response is divided repeatedly by bi-orthonormal filter banks until
the FFT size reaches a minimum, for example the frame size L. After the bi-orthonormal filtering,
the required interpolation is applied to obtain the final filtered signal. This solves the delay
issue, but the method suffers from computational complexity because filtering has to be
performed for every sub filter bank, and it involves more buffering of data [8][9].
Uniform partitioned convolution is a technique that resolves both the computational complexity
and the delay issues. However, if it is applied individually to each filter of Fig. 1, the
implementation complexity is huge and the internal DSP memory may not hold all the required
buffers, particularly for long filters [10][11][12][13][14][15][16][17].
To avoid such problems, this paper combines uniform partitioned convolution with mixed filtering
and presents the result as a new algorithm that reduces computational complexity as well as
processing delay. With efficient memory management and the properties of the FFT, the proposed
technique is a very good choice for audio CTC with long filters.
This paper is organized as follows. Section 2 provides the review of mixed filtering and uniform
partitioned convolution. Later the combination of these two techniques is explained as proposed
$$Y(k) = Y_a(k) + j\,Y_b(k) = X_a(k)\,H_a(k) + X_b(k)\,H_b(k) \tag{1}$$
where
$$H_a(k) = H_{aa}(k) + j\,H_{ab}(k), \qquad H_b(k) = H_{ba}(k) + j\,H_{bb}(k)$$
The computational complexity of equation (1) includes one FFT computation with decomposition
(to evaluate Xa(k) and Xb(k) ), complex frequency multiplication and one IFFT. The real and
imaginary components of IFFT output yield ya(n) and yb(n) respectively [5].
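Equation (1) can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the implementation used in this paper: the function names (fft, ifft, split_spectra, mixed_filter) are hypothetical, the FFT is a plain recursive radix-2 routine, all sequences are assumed to have the same power-of-two length, and the filtering is circular (no overlap save framing).

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(X):
    """Inverse FFT with the 1/N normalisation applied once."""
    return [v / len(X) for v in fft(X, inverse=True)]

def split_spectra(X):
    """Recover Xa(k), Xb(k) from X = FFT(xa + j*xb); valid for real xa, xb."""
    n = len(X)
    Xa = [(X[k] + X[-k % n].conjugate()) / 2 for k in range(n)]
    Xb = [(X[k] - X[-k % n].conjugate()) / 2j for k in range(n)]
    return Xa, Xb

def mixed_filter(xa, xb, haa, hab, hba, hbb):
    """Circular 2x2 filtering as in Eq. (1): one FFT, one IFFT for both outputs."""
    X = fft([complex(a, b) for a, b in zip(xa, xb)])
    Xa, Xb = split_spectra(X)
    Ha = fft([complex(p, q) for p, q in zip(haa, hab)])   # Ha = Haa + j*Hab
    Hb = fft([complex(p, q) for p, q in zip(hba, hbb)])   # Hb = Hba + j*Hbb
    y = ifft([Xa[k] * Ha[k] + Xb[k] * Hb[k] for k in range(len(X))])
    return [v.real for v in y], [v.imag for v in y]       # ya(n), yb(n)

# With identity direct paths and zero cross paths the outputs
# reproduce the inputs up to rounding error.
delta, zero = [1.0, 0.0, 0.0, 0.0], [0.0] * 4
ya, yb = mixed_filter([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0],
                      delta, zero, zero, delta)
```

The real part of the IFFT output carries xa*haa + xb*hba and the imaginary part carries xa*hab + xb*hbb, which is why a single complex IFFT serves both channels.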
2.2 Uniform Partitioned Convolution
In this method, the impulse response is uniformly partitioned into short segments, the overlap
save method is applied to each partitioned impulse response, and the outputs of all partitioned
filters are added to yield the convolved output. Fig. 2 shows the signal processing involved in
this method [10][11].
Let h(n) of length M be the impulse response and let the frame length be L. Fig. 2A shows the
time delay line filter. Let h0(n), h1(n),..., hm-1(n), where m = M/L, be the partitioned impulse
responses, obtained by dividing the impulse response into segments of length L. Fig. 2B shows
the application of the overlap save method to each partitioned impulse response, where the
FFT/IFFT size is equal to 2L. After the first frame is processed, L samples of the IFFT output
are transmitted as the filtered output. For the 2nd frame, the first frame is delayed by L
samples and provided as input to the 2nd partitioned filter. The overlap save method is now
applied to both partitioned filters and the IFFT outputs are summed to yield the filtered output
of the 2nd frame. This process continues up to the last partitioned filter. Instead of computing
an FFT and an IFFT for every time delayed frame and every frequency multiplied output, it is
better to optimize the structure with a single FFT and a single IFFT, as shown in Fig. 2C,
simply by delaying the FFT outputs. When the 2nd frame arrives, the FFT output of the 1st frame
becomes the input to the 2nd partitioned filter. The complex outputs of all partitioned
frequency multipliers are added and a single IFFT is applied to the complex sum. From the IFFT
output, L samples are transmitted as the overall filtered output.
An FFT size of 2L means that L zero samples are appended, so the processing delay is L samples
in the worst case, whereas the overlap save method applied to the original impulse response
produces a delay of at least M samples. Hence this method provides less delay than the overlap
save method.
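The single-FFT, single-IFFT structure of Fig. 2C can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the DSP implementation: upc_filter and the helper names are hypothetical, L is assumed to be a power of two, and the frequency-domain delay line is kept as a simple Python list (newest frame first) instead of a circular buffer.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(X):
    return [v / len(X) for v in fft(X, inverse=True)]

def upc_filter(x, h, L):
    """Filter real x with real h using uniform partitions of length L."""
    m = -(-len(h) // L)                              # ceil(M / L) partitions
    h = list(h) + [0.0] * (m * L - len(h))           # pad h to a multiple of L
    # Each partition is zero-padded to 2L before its FFT (overlap save).
    H = [fft(h[i * L:(i + 1) * L] + [0.0] * L) for i in range(m)]
    fdl = [[0j] * (2 * L) for _ in range(m)]         # delayed input spectra
    prev = [0.0] * L                                 # previous input frame
    out = []
    for start in range(0, len(x), L):
        frame = list(x[start:start + L])
        frame += [0.0] * (L - len(frame))
        X = fft([complex(v) for v in prev + frame])  # one FFT per frame
        fdl = [X] + fdl[:-1]                         # delay spectra by one frame
        acc = [0j] * (2 * L)
        for Xi, Hi in zip(fdl, H):                   # partition i sees frame t-i
            for k in range(2 * L):
                acc[k] += Xi[k] * Hi[k]
        y = ifft(acc)                                # one IFFT per frame
        out.extend(v.real for v in y[L:])            # last L samples are valid
        prev = frame
    return out[:len(x)]
```

For example, upc_filter(x, h, 4) with an 8-tap h uses two partitions and an FFT size of 8, and should agree with direct time domain convolution of x and h truncated to len(x) samples, up to rounding error.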
The computational complexity of this method is as follows. For each frame, one FFT and one IFFT
of size 2L are required. Each frequency multiplier has length 2L, and there are m = M/L such
multipliers, so 2L.M/L = 2M complex multiplications are needed. All frequency multiplier outputs
have to be added before being provided as input to the IFFT, and hence 2L(m-1) = 2(M - L)
complex additions are required.
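As a worked instance of these counts, take the frame length L = 256 used later in the experiments and, as an illustrative choice, M = 8192:

```latex
m = \frac{M}{L} = \frac{8192}{256} = 32, \qquad
2L \cdot m = 512 \cdot 32 = 16384 = 2M, \qquad
2L\,(m-1) = 512 \cdot 31 = 15872 = 2(M - L).
```

The FFT and IFFT size stays at 2L = 512 regardless of M.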
3. PROPOSED ALGORITHM
The proposed algorithm combines both methods described in Section 2. To begin, let us partition
the impulse responses ha(n) & hb(n) of equation (1). The time domain equivalents of Ha(k) &
Hb(k) are haa(n) + j hab(n) & hba(n) + j hbb(n) respectively. The length of each of these
impulse responses is M. By partitioning them into m parts, where m = M/L, the resultant
partitioned impulse responses are given by
$$\begin{aligned}
h_a(n) &= \{\, h_{a,0}(n),\ h_{a,1}(n),\ h_{a,2}(n),\ \ldots,\ h_{a,m-1}(n) \,\} \\
       &= \{\, h_{aa,0}(n) + j\,h_{ab,0}(n),\ h_{aa,1}(n) + j\,h_{ab,1}(n),\ \ldots,\ h_{aa,m-1}(n) + j\,h_{ab,m-1}(n) \,\} \\
h_b(n) &= \{\, h_{b,0}(n),\ h_{b,1}(n),\ h_{b,2}(n),\ \ldots,\ h_{b,m-1}(n) \,\} \\
       &= \{\, h_{ba,0}(n) + j\,h_{bb,0}(n),\ h_{ba,1}(n) + j\,h_{bb,1}(n),\ \ldots,\ h_{ba,m-1}(n) + j\,h_{bb,m-1}(n) \,\}
\end{aligned}$$
Each partitioned sequence now has length L. As per the overlap save method, the FFTs of these
partitioned responses are found by appending L zeros to each of them. The frequency equivalents
Ha,0(k), Ha,1(k),... Ha,m-1(k) and Hb,0(k), Hb,1(k),... Hb,m-1(k) are obtained in this way. Once
the partitioned FFT coefficients are found, the rest of the algorithm applies the uniform
partitioned convolution approach to equation (1). Instead of using two IFFTs, it is better to
apply a single IFFT to the complex frequency sum [Xa(k)Ha(k)+Xb(k)Hb(k)]. The IFFT output
provides the outputs ya(n) & yb(n) in complex form.
The z-domain equivalent of equation (1) is given by
$$\begin{aligned}
Y(z) &= Y_a(z) + j\,Y_b(z) = X_a(z)\,H_a(z) + X_b(z)\,H_b(z) \\
     &= X_a(z)\left[ H_{a,0}(z) + z^{-L} H_{a,1}(z) + \cdots + z^{-(m-1)L} H_{a,m-1}(z) \right]
      + X_b(z)\left[ H_{b,0}(z) + z^{-L} H_{b,1}(z) + \cdots + z^{-(m-1)L} H_{b,m-1}(z) \right] \\
     &= X_a(z) \sum_{i=0}^{m-1} z^{-iL} H_{a,i}(z) + X_b(z) \sum_{i=0}^{m-1} z^{-iL} H_{b,i}(z) \\
     &= \sum_{i=0}^{m-1} \left\{ \left[ X_a(z)\, z^{-iL} \right] H_{a,i}(z) + \left[ X_b(z)\, z^{-iL} \right] H_{b,i}(z) \right\}
\end{aligned} \tag{2}$$

$$y(n) = y_a(n) + j\,y_b(n) = \mathcal{Z}^{-1}\left\{ \sum_{i=0}^{m-1} \left[ X_a(z)\, z^{-iL} H_{a,i}(z) + X_b(z)\, z^{-iL} H_{b,i}(z) \right] \right\} \tag{3}$$
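The decomposition in equation (2) can be checked numerically in the time domain: filtering with the full response equals the sum of the partition outputs, each delayed by iL samples. The short Python sketch below illustrates this for one real filter; the names conv and partitioned_conv are hypothetical, and the filter length is assumed to be a multiple of L.

```python
def conv(x, h):
    """Direct linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y

def partitioned_conv(x, h, L):
    """Sum over i of conv(x, h_i) delayed by i*L samples (the z^{-iL} terms)."""
    m = len(h) // L                          # assumes len(h) is a multiple of L
    y = [0.0] * (len(x) + len(h) - 1)
    for i in range(m):
        yi = conv(x, h[i * L:(i + 1) * L])   # output of partition i
        for n, v in enumerate(yi):
            y[n + i * L] += v                # apply the delay of i*L samples
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0]
h = [1.0, -1.0, 2.0, 0.0, 3.0, 1.0]
same = partitioned_conv(x, h, 2) == conv(x, h)
```

The same identity holds separately for each of the four filters haa, hab, hba and hbb, which is what allows the delays to be moved onto the input spectra in equation (2).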
The block diagram of the proposed algorithm is shown in Fig. 3. The steps involved in the
proposed algorithm are as follows.
a. Partition the long impulse responses and find the complex frequency equivalents as stated
above.
b. Receive the 1st frames of inputs xa(n) & xb(n) and store them in memory buffers of length
2L each.
Note: Initially the memory buffers contain zeros, and the filling of input frames into the
memory buffers is based on the overlap save method.
c. Find the FFT of the overlapped 1st frame and store the complex FFT outputs, Xa(k) & Xb(k),
in separate buffers.
Note: Reference [7] could be followed to find Xa(k) & Xb(k).
d. Perform complex frequency multiplication between the frequency partitioned coefficients and
the frequency delayed input frames. Each time one frequency multiplication is performed, the
resultant complex output is added to the previous multiplier output so that the complex sum
is provided as input to the IFFT.
e. Evaluate step (d) for both inputs xa(n) & xb(n) and add the corresponding complex sums
to yield [Xa(k)Ha(k)+Xb(k)Hb(k)].
f. Apply the IFFT to the output of step (e) and transmit the real & imaginary parts of the
complex IFFT output as ya(n) & yb(n) respectively.
Note: As per the overlap save method, only the L valid samples of the IFFT output are
transmitted.
g. Repeat steps (b) to (f) for each new frame.
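Steps (a) to (g) can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the SHARC implementation: the function names are hypothetical, L is assumed to be a power of two, the filter length M and the input length are assumed to be multiples of L, and plain Python lists stand in for the circular buffers described later.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(X):
    return [v / len(X) for v in fft(X, inverse=True)]

def split_spectra(X):
    """Xa(k), Xb(k) from X = FFT(xa + j*xb); valid for real xa, xb."""
    n = len(X)
    Xa = [(X[k] + X[-k % n].conjugate()) / 2 for k in range(n)]
    Xb = [(X[k] - X[-k % n].conjugate()) / 2j for k in range(n)]
    return Xa, Xb

def mixed_upc(xa, xb, haa, hab, hba, hbb, L):
    """Mixed uniform partitioned convolution; returns (ya, yb)."""
    m = len(haa) // L
    def part_fft(hr, hi):
        # Step (a): partition hr + j*hi, append L zeros, take FFTs of size 2L.
        return [fft([complex(p, q) for p, q in
                     zip(hr[i * L:(i + 1) * L], hi[i * L:(i + 1) * L])]
                    + [0j] * L) for i in range(m)]
    Ha, Hb = part_fft(haa, hab), part_fft(hba, hbb)
    fdl_a = [[0j] * (2 * L) for _ in range(m)]     # delayed Xa(k) frames
    fdl_b = [[0j] * (2 * L) for _ in range(m)]     # delayed Xb(k) frames
    prev = [0j] * L
    ya, yb = [], []
    for start in range(0, len(xa), L):
        # Step (b): overlap save input block of length 2L, xa + j*xb.
        frame = [complex(a, b) for a, b in
                 zip(xa[start:start + L], xb[start:start + L])]
        # Step (c): one FFT plus decomposition, stored in separate buffers.
        Xa, Xb = split_spectra(fft(prev + frame))
        fdl_a, fdl_b = [Xa] + fdl_a[:-1], [Xb] + fdl_b[:-1]
        # Steps (d), (e): accumulate Xa*Ha + Xb*Hb over all partitions.
        acc = [0j] * (2 * L)
        for i in range(m):
            for k in range(2 * L):
                acc[k] += fdl_a[i][k] * Ha[i][k] + fdl_b[i][k] * Hb[i][k]
        # Step (f): single IFFT; keep only the last L (valid) samples.
        y = ifft(acc)
        ya.extend(v.real for v in y[L:])
        yb.extend(v.imag for v in y[L:])
        prev = frame
    return ya, yb
```

Under these assumptions, ya should match xa*haa + xb*hba and yb should match xa*hab + xb*hbb computed by direct linear convolution, up to rounding error.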
FIGURE 3: Block diagram of Mixed Uniform Partitioned Convolution to obtain CTC outputs
Computational complexity

Computation   | Complex multiplications | Complex additions          | Remarks
--------------+-------------------------+----------------------------+------------------------------------------
Xa(k), Xb(k)  | 0.5 O(2L)               | O(2L) + 2.2L = O(2L) + 4L  | FFTs of xa(n) & xb(n) could be found with
              |                         |                            | a single FFT and decomposition [7]
Xa(k)Ha(k)    | 2L.M/L = 2M             | 2(M-L)                     | Each partitioned frequency multiplication
Xb(k)Hb(k)    | 2L.M/L = 2M             | 2(M-L)                     | requires 2L multiplications. There are M/L
              |                         |                            | such partitions and hence 2M complex
              |                         |                            | multiplications are needed. All partitioned
              |                         |                            | multiplier outputs are to be added, which
              |                         |                            | requires 2(M-L) complex additions
Y(k)          | -                       | 2L                         |
y(n)          | 0.5 O(2L)               | O(2L)                      |
When the 2nd frame arrives, the FFT values of the 1st frame need not be disturbed. The FFT of
the 2nd frame is stored in the 2L locations preceding those of the 1st frame. The buffers are
accessed in circular fashion and the complex frequency multiplication is performed in the same
way as explained earlier with the associated partitioned complex frequency coefficients. This
procedure is repeated for every new frame received.
By assuming real & imaginary buffers as circular buffers, intermediate copying routines could be
minimized for complex frequency multiplication as well as FFT and IFFT evaluation.
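The circular storage of frame spectra described above can be sketched as a small Python class. This is only an illustration of the indexing idea: the class name FrameRing and its methods are hypothetical, a single complex list stands in for the separate real and imaginary buffers, and no SHARC circular-addressing hardware is modelled.

```python
class FrameRing:
    """Keeps the spectra of the m most recent frames in one flat buffer."""

    def __init__(self, m, size):
        self.m, self.size = m, size          # m slots of `size` bins (size = 2L)
        self.data = [0j] * (m * size)
        self.head = 0                        # slot holding the newest frame

    def push(self, spectrum):
        """Store a new frame in the slot just before the current newest one."""
        if len(spectrum) != self.size:
            raise ValueError("spectrum must have exactly `size` bins")
        # Step back one slot with wrap-around; older frames are not moved.
        self.head = (self.head - 1) % self.m
        base = self.head * self.size
        self.data[base:base + self.size] = spectrum

    def delayed(self, i):
        """Spectrum of the frame received i frames ago (i = 0 is the newest)."""
        base = ((self.head + i) % self.m) * self.size
        return self.data[base:base + self.size]

ring = FrameRing(m=3, size=2)
for frame in ([1, 1], [2, 2], [3, 3], [4, 4]):
    ring.push(frame)
# Partition i is multiplied with ring.delayed(i); the oldest frame
# ([1, 1]) has been overwritten by the fourth push.
```

Because only the head index moves, no spectrum is copied when a new frame arrives, which is the saving in intermediate copy routines that the text refers to.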
FIGURE 4: Efficient memory management in SHARC processors for Mixed Uniform Partitioned
Convolution.
3.3 Efficient Complex Frequency Multiplication
The sum of complex multiplications, i.e. [Xa(k)Ha(k)+Xb(k)Hb(k)], was implemented efficiently
on the SHARC processor with multiply and add instructions executed in parallel with data move
operations. This piece of code is provided here with 8 instructions inside the loop. The loop
counter size is 2L, which means that the code handles one partitioned filter and is called m
times to cover all partitioned filters. The following notation is used for easy
understanding.
Xa(k) = a1 + jb1
Ha(k) = c1 + jd1
Xb(k) = a2 + jb2
Hb(k) = c2 + jd2
In the above piece of code, y1 & y2 are buffers of size 2L each, which hold the real and
imaginary outputs obtained by summing the complex frequency multiplication outputs of all
partitioned filters. These two buffers are provided as inputs to the IFFT evaluation.
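The arithmetic performed in that loop can be illustrated in Python using the notation a1, b1, c1, d1, a2, b2, c2, d2, y1, y2 from the text. This is a sketch of the arithmetic only and not the SHARC assembly listing; the function name cmac is hypothetical, and the instruction-level parallelism of the DSP is not modelled.

```python
def cmac(a1, b1, c1, d1, a2, b2, c2, d2, y1, y2):
    """Accumulate Xa*Ha + Xb*Hb for one partition into y1 (real), y2 (imag).

    All arguments are equal-length lists of 2L real values:
    Xa = a1 + j*b1, Ha = c1 + j*d1, Xb = a2 + j*b2, Hb = c2 + j*d2.
    """
    for k in range(len(y1)):
        # Real part: (a1*c1 - b1*d1) + (a2*c2 - b2*d2)
        y1[k] += a1[k] * c1[k] - b1[k] * d1[k] + a2[k] * c2[k] - b2[k] * d2[k]
        # Imaginary part: (a1*d1 + b1*c1) + (a2*d2 + b2*c2)
        y2[k] += a1[k] * d1[k] + b1[k] * c1[k] + a2[k] * d2[k] + b2[k] * c2[k]

# One bin: (1 + 2j)(3 + 4j) + (5 + 6j)(7 + 8j) = -18 + 92j
y1, y2 = [0.0], [0.0]
cmac([1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0], y1, y2)
```

Calling cmac once per partition, with y1 and y2 cleared at the start of each frame, leaves the complete complex sum in the two buffers ready for the IFFT.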
Filter Length, M | Mega Peak Cycle count
-----------------+----------------------
512              | 0.02137
1024             | 0.02628
1536             | 0.0312
2048             | 0.03611
2560             | 0.04103
3072             | 0.04594
3584             | 0.05086
4096             | 0.05577
4608             | 0.06069
5120             | 0.0656
5632             | 0.07052
6144             | 0.07543
6656             | 0.08035
7168             | 0.08526
7680             | 0.09018
8192             | 0.09509
8704             | 0.18577
9216             | 0.19573
9728             | 0.20579
10240            | 0.21565
10752            | 0.22561
11264            | 0.23762
11776            | 0.24698
12288            | 0.25532
12800            | 0.26490
13312            | 0.27398
13824            | 0.28536
14336            | 0.29621
14848            | 0.30453
15360            | 0.31762
15616            | 0.32203
16128            | 0.33018
16384            | 0.33987
TABLE 2: Proposed Algorithm - Mega Peak Cycle counts for frame length, L=256 and variable filter lengths.
FIGURE 5: Computational complexity comparison. X-axis represents filter length, M. Y-axis represents
Mega Peak cycle count. The above results are valid for frame length of L=256.
The cycle counts were measured on the SHARC processor by varying the filter length for a fixed
block size, L=256. Table 2 shows the computational complexity details along with that of Mixed
filtering with overlap save method. Fig. 5 compares the computation cycles of Mixed filtering
with overlap save method (results taken from Reference [7]) and of the proposed method.
In the graph, one can observe an increase in mega peak cycle count at a filter length of
M=8704 (from 0.09509 at M=8192 to 0.18577 in Table 2). For filter lengths of M=8704 and above,
the internal DSP memory cannot hold all the required buffers, such as the delayed FFT values of
the input and the FFT coefficients. To handle this, the contents of these buffers are written
to an external storage device such as SDRAM, from which they can be read whenever needed using
Direct Memory Access. To avoid spending too many cycles on this memory transfer, the DMA was
performed in the background, in parallel with the DSP core processing. Because of this
additional processing, some extra cycles are needed beyond the actual signal processing, and
hence the mega peak cycle count increases for these cases.
Similarly, the mega peak cycle count of Mixed filtering with overlap save method is constant
over some ranges of filter length, for example M=2048 to 3584. This is expected because in all
these cases the FFT length is 4096 and all the operations depend on this length. The cost of
the overlap save method is therefore set by the padded FFT length rather than by the filter
length itself, which is where the proposed method has an advantage over it.
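The plateau can be reproduced with a few lines of Python that compare the FFT length needed by overlap save on the full filter with the fixed FFT length of the partitioned approach. The helper names next_pow2 and fft_sizes are hypothetical and are used only for this illustration.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

def fft_sizes(M, L):
    """(FFT length for overlap save on the full filter, FFT length with partitions)."""
    return next_pow2(L + M - 1), 2 * L

L = 256
for M in range(2048, 4097, 512):
    full, part = fft_sizes(M, L)
    print(f"M={M:5d}  overlap save FFT={full:5d}  partitioned FFT={part}")
```

For M from 2048 to 3584 the first value stays at 4096 and then doubles at M=4096, while the partitioned FFT length stays at 2L = 512 throughout.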
The experimental results clearly indicate that the proposed method provides greater savings in
computational complexity when the efficient design explained in sub-sections 3.2 and 3.3 is
followed. The advantage of the proposed algorithm is that the FFT computational complexity is a
function of the frame length rather than of the coefficient length, as it is in the overlap save
method. By segmenting the filtering process into M/L parts, one can achieve attractive
computational savings and also savings in memory usage.
6. REFERENCES
[1] M. Otani and S. Ise, Fast calculation system specialized for head-related transfer function
based on boundary element method, Journal of Acoustical Society of America, Vol. 119, No. 5,
2006, pp. 2589-2598.
[2] Ole Kirkeby, Per Rubak, Philip A. Nelson and Angelo Farina, Design of Crosstalk
cancellation Networks by using Fast deconvolution in AES 15, May 1999, pp. 9900-9905.
[3] Tobias Lentz and Oliver Schmitz, Adaptive Cross-talk cancellation system for a moving
listener in AES 21st International Conference Proc., June 2002, Paper No. 00134.
[4] Lin Wang, Fuliang Yin and Zhe Chen, A Stereo Crosstalk cancellation system based on
common-acoustical pole/zero model, AES, August 2010.
[5]
[6] John G. Proakis and Dimitris G. Manolakis, Digital Signal Processing: Principles, Algorithms
and Applications, 3rd Edition, pp. 430-476.
[7]
[8] Jason R. VandeKieft, Computational improvements to linear convolution with multi-rate
filtering methods, April 30, 1998, http://mue.music.miami.edu/thesis/jvandekieft/jvtitle.htm.
[9] M. Vetterli, Running FIR and IIR Filters using Multi-rate Filter Banks, IEEE Transactions on
Acoustics, Speech and Signal Processing, Vol. 36, No. 5, May 1988.
rd
Edition, published on
[10] Eric Battenberg and Rimas Avizienis, Implementing Real-time Partitioned Convolution
Algorithms on Conventional Operating Systems, Proc. of 14th Int. Conference on Digital
Audio Effects, Paris, France, Sept 19-23, 2011.
[11] Anders Torger and Angelo Farina, Real-time Partitioned Convolution for Ambiophonic
Surround Sound, IEEE Workshop on Applications of Digital Signal Processing to Audio and
Acoustics 2001, New Paltz, New York, W2001-4.
[12] Guillermo Garcia, Optimal Filter Partition for efficient Convolution with short
input/output delay in AES 113th International Conference Proc., October 2002, pp. 2660.
[13] W. G. Gardner, Efficient Convolution without input-output delay, Journal of AES, Vol. 43,
No. 3, 1995, pp. 127-136.
[14] J. Hurchalla, A time distributed FFT for efficient low latency convolution, AES Convention
129, November 2010, Paper No. 8257.
[15] J. Hurchalla, Low latency convolution in one dimension via two dimensional convolutions - An
intuitive approach, AES Convention 125, October 2008, Paper No. 7634.
[16] E. Armelloni, C. Giottoli and A. Farina, Implementation of Real-time partitioned convolution
on a DSP board, IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics, October 19-22, 2003, New Paltz, NY.
[17] Eric Battenberg, David Wessel & Juan Colmenares, Advances in the Parallelization
of Music and Audio Applications, ebookbrowse.com/wessel-parlab-retreat-winter-2010-pptd59199484
[18] Analog Devices Inc., ADSP-214xx SHARC Processor Hardware Reference Manual, Rev 0.3,
Part Number 82-000469-01, July 27, 2010.