Correlation
Our study of signal processing systems has been dominated by the concept of convolution, and we have somewhat neglected its close relative, the correlation. While formally similar (in fact convolution by a symmetric FIR filter can be considered a correlation as well), the way one should think about the two is different. Convolution is usually between a signal and a filter; we think of it as a system with a single input and stored coefficients. Crosscorrelation is usually between two signals; we think of a system with two inputs and no stored coefficients. The difference may be only in our minds, but nonetheless this mind-set influences the way the two are most often used. Although somewhat neglected, we weren't able to get this far without mentioning correlations at all. We have already learned that crosscorrelation is a measure of similarity between two signals, while autocorrelation is a measure of how similar a signal is to itself. In Section 5.6 we met the autocorrelation for stochastic signals (which are often quite unlike themselves), and in Section 6.13 we used the crosscorrelation between input and output signals to help identify an unknown system.

Correlations are the main theme that links together the present chapter. We first motivate the concept of correlation by considering how to compare an input signal to a reference signal. We find that the best signal detector is the correlator. After formally defining both crosscorrelation and autocorrelation and calculating some examples, we prove the important Wiener-Khintchine theorem, which relates the autocorrelation to the power spectral density (PSD). Next we compare correlation with convolution and discover that the optimal signal detector can be implemented as a matched filter. The matched filter was invented for radar and a digression into this important application is worthwhile. The matched filter is good for signal detection, but for cleaning up a partially unknown signal we need the Wiener filter, which is also based on correlations.
There is also a close connection between correlation and prediction. Linear predictive coding is crucial in speech processing, and we present it here in preparation for our later studies. The Wiener-Khintchine theorem states that correlations are second-order entities. Although these are sufficient for a wide variety of tasks, we end this chapter with a short introduction to the more general higher-order signal processing.
9.1 Signal Comparison and Detection
A signal detector is a device that alerts us when a desired signal appears. Radar and sonar operate by transmitting a signal and detecting its return after having been reflected by a distant target. The return signal is often extremely weak in amplitude, while interference and noise are strong. In order to be able to reliably detect the presence of the return signal we employ a signal detector whose output is maximized when a true reflection appears. Similar signal detectors are employed in telephony call progress processing, medical alert devices, and in numerous other applications. Envision a system with a single input that must sound an alarm when this input consists of some specified signal. It is important not to miss any events even when the signal is weak compared to the noise, but at the same time we don't want to encourage false alarms (reporting detection when the desired signal was not really there). In addition, we may need to know as accurately as possible precisely when the expected signal arrived. The signal to be detected may be as simple as a sinusoid of given frequency, but is more often a rather complex, but known, signal.

It is evident that signal detection is closely related to signal comparison, the determination of how closely a signal resembles a reference signal. Signal comparison is also a critically important element in its own right, for example, in digital communications systems. In the simplest of such systems one of several basic signals is transmitted every T seconds and the receiver must determine which. This can be accomplished by building signal detectors for each of the basic signals and choosing the signal whose respective detector's output is the highest. A more complex example is speech recognition, where we may build detectors for a multitude of different basic sounds and convert the input audio into a string of best matches. Generalization of this technique to images produces a multitude of further applications, including optical character recognition.
From these examples we see that comparison and detection are essentially the same. The simplest detector is implemented by comparing the output of a comparator to a threshold. Complex detectors may employ more sophisticated decision elements, but still require the basic comparison mechanism to function. Signal detection and comparison are nontrivial problems due to the presence of noise. We know how to build filters that selectively enhance defined frequency components as compared to noise; but how do we build a system that selectively responds to a known but arbitrary reference signal? Our first inclination would be to subtract the input signal s_n from the desired reference r_n, thus forming an error signal e_n = r_n − s_n. Were the error signal to be identically zero, this would imply that the input precisely matches the reference, thus triggering the signal detector or maximizing the output of the signal comparator. However, for an input signal contaminated by noise, s_n = r_n + ν_n, we cannot expect the instantaneous error to be identically zero, but the lower the energy of the error signal the better the implied match. So a system that computes the energy of the difference signal is a natural comparator.

This idea of using a simple difference is a step in the right direction, but only the first step. The problem is that we have assumed that the input signal is simply the reference signal plus additive noise; and this is too strong an assumption. The most obvious reason for this discrepancy is that the amplitude of the input signal is usually arbitrary. The strength of a radar return signal depends on the cross-sectional area of the target, the distance from the transmitter to the target and the target to the receiver, the type and size of the radar antenna, etc. Communications signals are received after path loss, and in the receiver probably go through several stages of analog amplification, including automatic gain control. A more reasonable representation of the input signal is
    s_n = A r_n + ν_n    (9.1)
where A is some unknown gain parameter. In order to compare the received signal s_n with the reference signal r_n it is no longer sufficient to simply form the difference; instead we now have to find a gain parameter g such that r_n − g s_n is minimized. We can then use the energy of the resulting error signal
    E_g = Σ_n (r_n − g s_n)²    (9.2)
as the final match criterion. How can we find this g? Assuming for the moment that there were no noise at all, the error could be made to vanish exactly by choosing g = 1/A; in general we choose the g that minimizes the error energy, by differentiating and setting the derivative to zero

    ∂/∂g Σ_n (r_n − g s_n)² = 0

Expanding the square,

    E_g = Σ_n r_n² − 2g Σ_n r_n s_n + g² Σ_n s_n² = E_r − 2g C_rs + g² E_s    (9.3)

and the minimum is attained at g = C_rs / E_s.
Here E_r is the energy of the reference signal, E_s is the energy of the input signal, and C_rs = Σ_n r_n s_n is the crosscorrelation between the reference and the input. Among all input signals of given energy the correlation is maximal exactly when the energy of the difference signal is minimal. Now, from equation (9.1) we can deduce that
    g² = Σ_n r_n² / Σ_n s_n² = E_r / E_s

and hence

    C_rs = √(E_r E_s)    (9.4)
in the absence of noise. When the input signal does not precisely match the reference, due to distortion or noise, we have |C_rs| < √(E_r E_s). The crosscorrelation C_rs is thus an easily computed quantity that compares the input signal to the reference, even when the amplitudes are not equal. A comparator can thus be realized by simply computing the correlation, and a signal detector can be implemented by comparing it to √(E_r E_s) (e.g., requiring the correlation to exceed some given fraction of √(E_r E_s)). Unfortunately we have not yet considered all that happens to the reference signal before it becomes an input signal. In addition to the additive noise and unknown gain, there will also usually be an unknown time shift. For communications signals we receive a stream of signals to compare, each offset by an unknown time delay. For the radar signal the time delay derives
from the round-trip time of the signal from the transmitter to the target and back, and is precisely the quantity we wish to measure. When there is a time shift, a reasonable representation of the input signal is
    s_n = A r_{n+m} + ν_n
where A is the gain and m < 0 the time shift parameter. In order to compare the received signal s_n with the reference signal r_n we can no longer simply compute a single crosscorrelation; instead we now have to find the time shift parameter m such that

    C_rs(m) = Σ_n r_n s_{n−m}

is maximal. How do we find m? The only way is to compute the crosscorrelation C_rs(m) for all relevant time shifts (also called time lags) m and choose the maximal one. It is this lag-dependent crosscorrelation that we formally define in the next section.
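To make the procedure concrete, here is a minimal sketch of the comparator just described, written in Python with NumPy (neither of which appears in the original text); the function name, the 0.8 threshold, and the test signal are illustrative assumptions. It computes the crosscorrelation against the reference at every candidate lag, normalizes by √(E_r E_s), and reports the lag at which the normalized correlation is largest.

    import numpy as np

    def detect_by_correlation(s, r, threshold=0.8):
        """Scan all lags of the reference r against the input s.

        The input is assumed to contain the reference with unknown gain,
        delay, and additive noise; the threshold is an illustrative choice."""
        N, L = len(s), len(r)
        Er = np.sum(r**2)                          # reference energy
        best_lag, best_c = None, 0.0
        for m in range(N - L + 1):
            seg = s[m:m+L]                         # input segment at lag m
            Es = np.sum(seg**2)                    # segment energy
            Crs = np.dot(r, seg)                   # crosscorrelation at lag m
            c = Crs / np.sqrt(Er * Es + 1e-12)     # normalized correlation
            if c > best_c:
                best_lag, best_c = m, c
        return best_c >= threshold, best_lag, best_c

    # Example: a delayed, attenuated, noisy copy of the reference
    rng = np.random.default_rng(0)
    r = np.sin(2 * np.pi * 0.05 * np.arange(64))
    s = 0.3 * np.concatenate([np.zeros(100), r, np.zeros(100)])
    s += 0.1 * rng.standard_normal(len(s))
    print(detect_by_correlation(s, r))             # lag near 100, correlation near 1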
EXERCISES

9.1.1 Formulate the concept of correlation in the frequency domain starting from spectral difference and taking into account an arbitrary gain of the spectral distribution. What happens if we need to allow an arbitrary spectral shift?

9.1.2 Give a complete algorithm for the optimal detection of a radar return s_n given that the transmitted signal r_n was sent at time T_1, returns are expected to be received before time T_2, and the correlation is required to be at least γ. Note that you can precompute E_r and compute E_s and C_rs(m) in one loop.

9.1.3 Design an optimal detector for the V.34 probe signal introduced in exercise 2.6.4. The basic idea is to perform a DFT and implement a correlator in the frequency domain by multiplying the spectrum by a comb with 21 pass-bands (of suitable bandwidth). However, note that this is not independent of signal strength. You might try correcting this defect by requiring the correlation to be over 80% of the total signal energy, but this wouldn't work properly since, e.g., answer tone (a pure 2100 Hz tone) would trigger it, being one of the frequencies of the probe signal. What is wrong? How can this problem be solved?
9.2 Crosscorrelation and Autocorrelation
The time has come to formally define correlation.

Definition: crosscorrelation
The crosscorrelation between two real signals x and y is given by
    C_xy(τ) ≡ ∫_{−∞}^{∞} x(t) y(t−τ) dt       (A)
    C_xy(m) ≡ Σ_{n=−∞}^{∞} x_n y_{n−m}         (D)    (9.5)
There is an important special case, called autocorrelation, when y is taken to be x. It might seem strange to compare a signal with itself, but the lag in equation (9.5) means that we are actually comparing the signal at different times. Thus autocorrelation can assist in detecting periodicities.

Definition: autocorrelation
The autocorrelation of a real signal s is given by

    C_s(τ) ≡ ∫_{−∞}^{∞} s(t) s(t−τ) dt         (A)
    C_s(m) ≡ Σ_{n=−∞}^{∞} s_n s_{n−m}           (D)    (9.6)

and its normalized version by

    c_s(τ) ≡ C_s(τ) / C_s(0)                    (A)
    c_s(m) ≡ C_s(m) / C_s(0)                    (D)    (9.7)

where τ or m is called the lag.
These definitions are consistent with those of Section 5.6 for the case of stationary ergodic signals. In practice we often approximate the autocorrelation of equation (5.22) by using equation (9.6) but with the sum only over a finite amount of time. The resulting quantity is called the empirical autocorrelation. The correlation is also somewhat related to the covariance matrix of vector random variables, and strongly related to the convolution, as will be discussed in the next section. Before discussing properties of the correlations, let's try calculating a few. The analog rectangular window
    s(t) = 1    for |t| < 1
    s(t) = 0    otherwise
Figure 9.1: The autocorrelation of an analog rectangularly shaped signal. In (A) the signal is depicted while the autocorrelation is in (B). Note that the autocorrelation is symmetric and has its maximal value at the origin.
has the autocorrelation

    C_s(τ) = ∫_{−∞}^{∞} s(t) s(t−τ) dt = 2 − |τ|    for |τ| ≤ 2    (9.8)

(and zero for |τ| > 2), as depicted in Figure 9.1.B. In that figure we see several features that are readily shown to be more general. The autocorrelation is symmetric around time lag zero, and it takes on its maximum value at lag zero, where it is simply the energy E_s. The autocorrelation is also wider than the original signal, but attacks and decays more slowly. Had we used an inverted rectangle (which differs from the original signal by a phase shift)

    s(t) = −1    for |t| < 1
    s(t) = 0     otherwise

we would have found the same autocorrelation. Indeed the generalization of autocorrelation to complex signals,
    C_s(τ) ≡ ∫_{−∞}^{∞} s*(t) s(t−τ) dt    (9.9)
can be shown to be phase blind (unchanged by multiplying s by a common phase factor). What is the autocorrelation of the periodic square wave? Generalizing our previous result we can show that the autocorrelation is a periodic
triangular wave of the same period. This too is quite general: the autocorrelation of a periodic signal is periodic with the same period; and since the lag-zero autocorrelation is a global maximum, all lags that are multiples of the period have globally maximal autocorrelations. This fact is precisely the secret behind using autocorrelation for determining the period of a periodic phenomenon. One looks for the first peak at nonzero lag in the autocorrelation as an indication of the period. The same idea can be used for finding Fourier components as well; each component contributes a local peak to the autocorrelation. As our final example, let's try a digital autocorrelation. The signal b_n is assumed to be zero except for n = 1 ... 13, where it takes on the values ±1

    ..., 0, 0, +1, +1, +1, +1, +1, −1, −1, +1, +1, −1, +1, −1, +1, 0, 0, ...    (9.10)
Its autocorrelation is easily computed to be C(0) = 13, C(m) = 0 for odd m in the range −13 < m < 13, C(m) = 1 for even nonzero m in this range, and all other autocorrelations are zero. We see that the autocorrelation is indeed maximal at m = 0 and symmetric, and in addition the highest nonzero-lag correlations are only 1. Signals consisting of ±1 values with this last property (i.e., with maximal nontrivial autocorrelation of 1 or less) are called Barker codes, and are useful for timing and synchronization. There is no known way of generating Barker codes and none longer than this one are known. The definitions for autocorrelation or crosscorrelation given above involve integrating or summing over all times, and hence are not amenable to computation in practice. In any case we would like to allow signals to change behavior with time, and thus would like to allow correlations that are defined for finite time durations. The situation is analogous to the problem that led to the definition of the STFT, and we follow the same tactic here. Assuming a rectangular window of length N, there are N terms in the expression for the zero lag, but only N − 1 terms contribute to the lag 1 correlation s_1 s_0 + s_2 s_1 + ... + s_{N−1} s_{N−2}, and only N − m terms in the lag m sum. So we define the short-time autocorrelation
    C_s(m) ≡ (1/(N−m)) Σ_{n=m}^{N−1} s_n s_{n−m}    (9.11)
where now the zero lag is the power rather than the energy. This quantity is often called the unbiased empirical autocorrelation when it is looked upon as a numerical estimate of the full autocorrelation.
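As a small illustration (a sketch assuming Python and NumPy, which the book does not use; the helper names are mine), the unbiased empirical autocorrelation of equation (9.11) can be computed directly from a buffer and checked against the Barker code example above.

    import numpy as np

    def empirical_autocorrelation(s, max_lag):
        """Unbiased empirical autocorrelation: divide the lag-m sum by N-m."""
        N = len(s)
        return np.array([np.dot(s[m:], s[:N-m]) / (N - m)
                         for m in range(max_lag + 1)])

    # The 13-bit Barker code of equation (9.10); without the 1/(N-m)
    # normalization the zero lag gives 13 and the nonzero lags give 0 or 1.
    barker13 = np.array([+1,+1,+1,+1,+1,-1,-1,+1,+1,-1,+1,-1,+1], dtype=float)
    raw = np.array([np.dot(barker13[m:], barker13[:13-m]) for m in range(13)])
    print(raw)                                   # [13, 0, 1, 0, 1, ..., 0, 1]
    print(empirical_autocorrelation(barker13, 4))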
EXERCISES

9.2.1 What is the connection between autocorrelation defined here for deterministic signals and the autocorrelation we earlier defined for stochastic signals (equation (5.22))?

9.2.2 What is the crosscorrelation between a signal s(t) and the impulse δ(t)?

9.2.3 Compute and draw the crosscorrelation between two analog rectangular signals of different widths.

9.2.4 Compute and draw the crosscorrelation between two analog triangular signals.

9.2.5 Show that C_yx(m) = C_xy(−m).

9.2.6 Prove that the autocorrelation is symmetric and takes its maximum value at the origin, where it is the energy. Show that |c_xy(m)| ≤ 1.

9.2.7 Can you find Barker codes of length 5, 7, and 11? What are their autocorrelations?

9.2.8 What is the proper generalization of crosscorrelation and autocorrelation to complex signals? (Hint: The autocorrelation should be phase independent.)

9.2.9 Prove that the autocorrelation of a periodic signal is periodic with the same period.

9.2.10 Prove that zero mean symmetric signals have zero odd lag autocorrelations.

9.2.11 Assume y_n = x_{n+1}. What are the connections between C_xy(m), C_x(m) and C_y(m)?

9.2.12 Derive the first few autocorrelation values for s_n = A sin(ωn + φ).

9.2.13 Generalize the previous exercise and derive the following expression for the general autocorrelation of the sinusoid.

    C_s(m) = (A²/2) cos(ωm)
9.3 The Wiener-Khintchine Theorem
The applications of correlation that we have seen so far derive from its connection with the difference between two signals. Another class of applications originates in the relationship between autocorrelation and power spectrum (see Section 4.5), a relationship known as the Wiener-Khintchine theorem.
The PSD of a signal is the absolute square of its FT, but it can also be considered to be the FT of some function. Parseval's relation tells us that integrating the PSD over all frequencies is the same as integrating the square of the signal over all times, so it seems reasonable that the iFT of the PSD is somehow related to the square of the signal. Could it be that the PSD is simply the FT of the signal squared? The DC term works because of Parseval, but what about the rest? We don't have to actually integrate or sum to find out since we can use the connection between convolution and FT of a product, FT(xy) = X ∗ Y (equation (4.18) or (4.46)). Using the signal s for both x and y we see that the FT of s²(t) is S ∗ S = ∫ S(ω−Ω) S(Ω) dΩ, which is not quite the PSD |S|² = S*S = S(−ω)S(ω) (for real signals), but has an additional integration. We want to move this integration to the time side of the equation, so let's try s ∗ s. From equation (4.19) or (4.47) we see that the FT of s ∗ s is S²(ω), which is even closer, but has both frequency variables positive, instead of one positive and one negative. So we need something very much like s ∗ s but with some kind of time variable inversion; that sounds like the autocorrelation! So let's find the FT of the autocorrelation

    FT(C_s(t)) = ∫_{−∞}^{∞} C_s(t) e^{−iωt} dt
               = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} s(τ) s(τ−t) dτ ) e^{−iωt} dt
               = ∫_{−∞}^{∞} s(τ) e^{−iωτ} dτ  ∫_{−∞}^{∞} s(u) e^{+iωu} du    (substituting u = τ − t)
               = S(ω) S(−ω) = |S(ω)|²
The PSD at last! We have thus proven the following celebrated theorem.

The Wiener-Khintchine Theorem
The autocorrelation C_s(t) and the power spectral density |S(ω)|² are an FT pair.
Although we proved the theorem for deterministic analog signals, it is more general. In fact, in Section 5.7 we used the Wiener-Khintchine theorem as the definition of spectrum for random signals.
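The theorem is easy to verify numerically. The following sketch (Python and NumPy assumed; the test signal is an arbitrary illustrative choice) compares the DFT of the circular autocorrelation of a finite digital signal with its squared magnitude spectrum; the two agree to rounding error.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 256
    s = np.sin(2 * np.pi * 0.1 * np.arange(N)) + 0.5 * rng.standard_normal(N)

    # Circular autocorrelation C(m) = sum_n s[n] s[(n-m) mod N]
    C = np.array([np.dot(s, np.roll(s, m)) for m in range(N)])

    psd_direct = np.abs(np.fft.fft(s))**2       # |S(k)|^2
    psd_from_C = np.real(np.fft.fft(C))         # FT of the autocorrelation

    print(np.allclose(psd_direct, psd_from_C))  # True (up to rounding)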
As a corollary to the theorem we can again prove that the autocorrelation is phase blind, that is, independent of the spectral phase. Two signals with the same power spectral density but different spectral phase will have the same autocorrelation function, and hence an infinite number of signals have the same autocorrelation. Methods of signal analysis that are based on autocorrelation can not differentiate between such signals, no matter how different they may look in the time domain. If we need to differentiate between such signals we need to use the higher-order statistics of Section 9.12.
EXERCISES
9.3.1 The period of a pure sinusoid is evident as a peak in the autocorrelation and hence its frequency is manifested as a peak in the power spectrum. This is the true basis for the connection between autocorrelation and PSD. What can you say about the autocorrelation of a general periodic signal? What is the autocorrelation of the sum of two sinusoidal components? Can you see the PSD connection?

9.3.2 Express and prove the Wiener-Khintchine theorem for digital signals.
9.3.3 Generalize the Wiener-Khintchine theorem by finding the FT of the crosscorrelation of two signals x(t) and y(t).
9.4 The Frequency Domain Signal Detector
Simply observing the input signal in the time domain is not a very sensitive method of detecting low-SNR signals, a fact made obvious by looking back at Figure 2.9. Since correlation is a method for detecting weak signals, and correlation is related to spectrum by the Wiener-Khintchine theorem, there should be a way of exploiting the frequency domain for signal detection. In Section 5.3 we saw how to reduce noise by averaging it out. This would seem to be a purely time domain activity, but there is a frequency domain connection. To see this, consider the simplest case, that of a pure sinusoid in noise. For averaging to optimally reinforce the signal we must first ensure that all the time intervals commence at precisely the same phase in a period, an operation called time registration. Without registration the signal cancels out just like the noise; with inaccurate registration the signal is only partially reinforced. If we wish to take successive time intervals, accurate registration requires the intervals to be precise multiples of the
sinusoid's basic period. Thus signal emphasis by averaging requires precise knowledge of the signal's frequency. Now let's see how we can emphasize signals working directly in the frequency domain. In a digital implementation of the above averaging each time interval corresponds to a buffer of samples. Assume that the period is L samples and let's use a buffer with exactly k periods. We start filling up the buffer with the input signal consisting of signal plus noise. Once the buffer is filled we return to its beginning, adding the next signal sample to that already there. Performing this addition M times increases the sinusoidal component by M but the noise component only by √M (see exercise 5.3.1). Hence the SNR, defined as the ratio of the signal to noise energies, is improved by M. This SNR increase is called the processing gain.

How many input samples did we use in the above process? We filled the buffer of length kL exactly M times; thus N = kLM input samples were needed. We can use a buffer with length corresponding to any integer number of periods k, but the N input signal samples are used most efficiently when the buffer contains a single cycle k = 1. This is because the processing gain M = N/(kL) will be maximal for a given N when k = 1. However, it is possible to do even better! It is possible to effectively reduce the buffer to a single sample such that M = N, and obtain the maximal processing gain of N. All we have to do is to downmix the signal to DC, by multiplying by a complex exponential and low-pass filtering. The noise will remain zero mean while the sinusoid becomes a complex constant, so that averaging as in Section 6.6 cancels out the noise but reinforces the constant signal. Now, as explained in Section 13.2, this complex downmixing can be performed using the DFT. So by performing a DFT the energy in the bin corresponding to the desired signal frequency increases much faster than all the other bins. In the frequency domain interpretation the processing gain is realized due to the signal being concentrated in this single bin, while the white noise is spread out over N bins. Thus were the signal and noise energies initially equal, the ratio of the energy in the bin corresponding to the signal frequency to that of the other bins would be N, the same processing gain deduced from time domain arguments. So we see that our presumption based on the Wiener-Khintchine theorem was correct; the frequency domain interpretation is indeed useful in signal detection. Although we discussed only the simple case of a single pure sinusoid, it is relatively easy to extend the ideas of this section to more general signals by defining distinctive spectral signatures. Instead of doing this we will return to the time domain and see how to build there a signal detection system for arbitrary signals.
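The processing gain is easy to see numerically. In this sketch (Python and NumPy assumed; N, the bin index, and the use of a complex exponential at an exact bin center are illustrative assumptions) the signal and noise have equal power at the input, and after an N-point DFT the signal bin exceeds the average noise bin by a factor on the order of N.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 4096
    k0 = 100                              # signal frequency, centered on a bin
    n = np.arange(N)

    # Unit-power complex exponential plus unit-power complex white noise,
    # so the input SNR is 1 (0 dB).
    signal = np.exp(2j * np.pi * k0 * n / N)
    noise = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    x = signal + noise

    power = np.abs(np.fft.fft(x))**2
    gain = power[k0] / np.mean(np.delete(power, k0))
    print(gain, "is on the order of N =", N)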
EXERCISES
9.4.1 Express the processing gain in decibels when the DFT is performed using a 2^m point FFT.

9.4.2 In the text we tacitly assumed the signal frequency to be precisely at a bin center. If this is not the case a window function w_n (see Section 13.4) must be employed. Show that with a window the signal energy is enhanced by (Σ_n w_n)² while the noise energy is increased by Σ_n w_n², thus resulting in a processing gain of the ratio of these two expressions.

9.4.3 Build a detector for a signal that consists of the equally weighted sum of two sinusoids. Is it worthwhile taking the phases into account? What if the signal is the weighted sum of the two sinusoids?

9.4.4 Extend the technique of the previous exercise and build a DFT-based detector for a completely general signal.
9.5 Correlation and Convolution
Although we have not mentioned it until now, you have no doubt noticed the similarity between the expression for digital crosscorrelation in equation (9.5) and that for convolution in equation (6.13). The only difference between them is that in correlation both indices run in the same direction, while in convolution they run in opposite directions. Realizing this, we can now realize our signal comparator as a filter. The filter's coefficients will be the reference signal reversed in time, as in equation (2.16). Such a filter is called a matched filter, or a correlator. The name matched filter refers to the fact that the filter coefficients are matched to the signal values, although in reverse order. What is the frequency response of the matched filter? Reversing a signal in time results in frequency components FT(s(−t)) = S(−ω), and if the signal is real this equals S*(ω), so the magnitude of the FT remains unchanged but the phase is reversed. From the arguments of Section 9.1 the correlator, and hence the theoretically identical matched filter, is the optimum solution to the problem of detecting the appearance of a known signal s_n contaminated by additive white noise x_n = s_n + ν_n. Can we extend this idea to optimally detect a signal in colored noise? To answer this question recall the joke about the mathematician who wanted a
cup of tea. Usually he would take the kettle from the cupboard, fill it with water, put it on the fire, and when the water boiled, pour it into a cup and drop in a tea bag. One day he found that someone had already boiled the water. He stared perplexed at the kettle and then smiled. He went to the sink, poured out the boiling water, returned the kettle to the cupboard and declared triumphantly: The problem has been reduced to one we know how to solve. How can we reduce the problem of a signal in colored noise to the one for which the matched filter is the optimal answer? All we have to do is filter the contaminated signal xn by a filter whose frequency response is the inverse
of this noise spectrum. Such a filter is called a whitening filter, because it flattens the noise spectrum. The filtered signal x'_n = s'_n + ν'_n now contains an additive white noise component ν'_n, and the conditions required for the matched filter to be optimal are satisfied. Of course the reference signal s'_n is no longer our original signal s_n; but finding the matched filter for s'_n is straightforward.
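A minimal matched-filter sketch (Python and NumPy assumed; the reference pulse, noise level, and names are illustrative): since the filter coefficients are the time-reversed reference, convolving the input with them computes the crosscorrelation at every lag, and the output peaks where the reference appears.

    import numpy as np

    def matched_filter(x, ref):
        """Convolve x with the time-reversed reference; the output peaks
        where the reference best overlaps the input."""
        h = ref[::-1]                   # matched filter = reversed reference
        return np.convolve(x, h, mode='valid')

    rng = np.random.default_rng(3)
    ref = np.sin(2 * np.pi * 0.07 * np.arange(50)) * np.hanning(50)
    x = np.concatenate([np.zeros(200), ref, np.zeros(200)])
    x += 0.5 * rng.standard_normal(len(x))

    y = matched_filter(x, ref)
    print("reference starts near sample", np.argmax(y))   # about 200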
EXERCISES
9.5.1 Create a sinusoid and add Gaussian white noise of equal energy. Recover the sinusoid by averaging. Experiment with inaccurate registration. Now recover the sinusoid by a DFT. What advantages and disadvantages are there to this method? What happens if the frequency is inaccurately known?

9.5.2 Build a matched filter to detect the HPNA 1.0 pulse (see exercise 7.7.4). Try it out by synthesizing pulses at random times and adding Gaussian noise. HPNA 1.0 uses PPM where the information is in the pulse position. How precisely can you detect the pulse's time of arrival?

9.5.3 Compare the time domain matched filter with a frequency domain detector based on the FFT algorithm. Consider computational complexity, processing delay, and programming difficulty.
9.6 Application to Radar
Matched filters were invented in order to improve the detection of radar returns. We learned the basic principles of radar in Section 5.3 but were limited to explaining relatively primitive radar processing techniques. With
our newly acquired knowledge of matched filters we can now present improved radar signals and receivers. Radar pulses need to have as much energy as possible in order to increase the probability of being detected, and thus should be long in duration. In order to increase a radar's range resolution we prefer narrow pulses, since it's hard to tell when exactly a wide pulse arrives. How can we resolve this conflict of interests? The basic idea is to use a wide pulse but to modulate it (i.e., to change its characteristics with time). The output of a filter matched to this modulation can be made to be very short in duration, but containing all the energy of the original pulse. To this end some radars vary their instantaneous frequency linearly with time over the duration of the pulse, a technique known as FM chirp. We demonstrate in Figure 9.2 the improvement chirp can bring in range resolution. The pulse in Figure 9.2.A is unmodulated and hence the matched filter can do no better than to lock onto the basic frequency. The output of such a matched filter is the autocorrelation of this pulse, and is displayed in Figure 9.2.B. Although theoretically there is a maximum corresponding to the perfect match when the entire pulse is overlapped by the matched filter, in practice the false maxima at shifts corresponding to the basic period
Figure 9.2: The autocorrelation of pulses with and without chirp. In (A) a pulse with constant instantaneous frequency is depicted, and its wide autocorrelation is displayed in (B). In (C) we present a pulse with frequency chirp; its much narrower autocorrelation is displayed in (D).
make it difficult to determine the precise TOA. In contrast the chirped pulse of Figure 9.2.C does not match itself well at any nontrivial shifts, and so its autocorrelation (Figure 9.2.D) is much narrower. Hence a matched filter built for a chirped radar pulse will have a much more precise response.

Chirped frequency is not the only way to sharpen a radar pulse's autocorrelation. Barker codes are often used because of their optimal autocorrelation properties, and the best way to embed a Barker code into a pulse is by changing its instantaneous phase. Binary Phase Shift Keying (BPSK), to be discussed in Section 18.13, is generated by changing a sinusoidal signal's phase by 180°, or equivalently multiplying the sinusoid by −1. To use the 13-bit Barker code we divide the pulse width into 13 equal time intervals, and assign a value ±1 to each. When the Barker code element is +1 we transmit +sin(ωt), while when it is −1 we send −sin(ωt). This Barker BPSK sharpens the pulse's autocorrelation by a factor of 13.

Not all radars utilize pulses; a Continuous Wave (CW) radar transmits continuously with constant amplitude. How can range be determined if echo arrives continuously? Once again by modulating the signal, and if we want constant amplitude we can only modulate the frequency or phase (e.g., by chirp or BPSK). Both chirp and BPSK modulation are popular for CW radars, with the modulation sequence repeating over and over again without stopping. CW radars use LFSR sequences rather than Barker codes for a very simple reason. Barker codes have optimal linear autocorrelation properties, while maximal-length LFSR sequences can be shown to have optimal circular autocorrelation characteristics. Circular correlation is analogous to circular convolution; instead of overlapping zero when one signal extends past the other, we wrap the other signal around periodically. A matched filter that runs over a periodically repeated BPSK sequence essentially reproduces the circular autocorrelation.
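In the spirit of Figure 9.2, the following sketch (Python and NumPy assumed; the carrier frequency, chip length, and other parameters are illustrative choices) compares the autocorrelation of an unmodulated pulse with that of the same pulse BPSK-modulated by the 13-bit Barker code. One chip away from the main peak the Barker-coded pulse has already decorrelated, while the plain pulse has not.

    import numpy as np

    barker13 = np.array([+1, +1, +1, +1, +1, -1, -1, +1, +1, -1, +1, -1, +1], float)
    samples_per_chip = 16                       # an integer number of carrier periods
    n = np.arange(13 * samples_per_chip)
    carrier = np.sin(2 * np.pi * 0.25 * n)      # period of 4 samples

    plain_pulse = carrier
    bpsk_pulse = carrier * np.repeat(barker13, samples_per_chip)

    def normalized_autocorrelation(p):
        c = np.correlate(p, p, mode='full')
        return c / c.max()                      # zero-lag peak normalized to 1

    zero_lag = len(n) - 1                       # index of the main peak
    lag = samples_per_chip                      # one chip away from the peak
    print(abs(normalized_autocorrelation(plain_pulse)[zero_lag + lag]))  # close to 1
    print(abs(normalized_autocorrelation(bpsk_pulse)[zero_lag + lag]))   # close to 0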
EXERCISES
9.6.1 Plot, analogously to Figure 9.2, the autocorrelation of a pulse with Barker code BPSK.

9.6.2 What is the circular autocorrelation of a pulse with a 13-bit Barker code?
9.6.3 What is the difference between coherent and incoherent pulse radars? In what way are coherent radars better?
9.7 The Wiener Filter
The matched filter provides the optimum solution to the problem of detecting the arrival of a known signal contaminated by noise; but correlation-based filters are useful for other problems as well, for example, removing noise from an unknown signal. If the signal is known in the matched filter problem, then why do we need to clean it up? The reason is that the signal may be only partially known, and we must remove noise to learn the unknown portion. In one common situation we expect a signal from a family of signals and are required to discover which specific signal was received. Or we might know that the signal is a pure sinusoid, but be required to measure its precise frequency; this is the case for Doppler radars which determine a target's velocity from the Doppler frequency shift.

Let's see how to build a filter to optimally remove noise and recover a signal. Our strategy is straightforward. It is simple to recover a sufficiently strong signal in the presence of sufficiently weak noise (i.e., when the SNR is sufficiently high). When the SNR is low we will design a filter to enhance it; such a filter's design must take into account everything known about the signal and the noise spectra. Before starting we need some notation. For simplicity we observe the spectrum from DC to some frequency F. We will denote the original analog signal in time as s(t) and in frequency as S(f). We will call its total energy E_s. We denote the same quantities for the additive noise ν(t), V(f), and E_v, respectively. These quantities are obviously related by Parseval's relation, E_s = ∫ s²(t) dt = ∫ |S(f)|² df and similarly for E_v,
and if the noise is white then we further define its constant power spectral density to be V_0 = E_v/F watts per Hz. The overall signal-to-noise ratio is the ratio of the energies

    SNR = E_s / E_v    (9.12)

but we can define time- and frequency-dependent SNRs as well, for example

    SNR(f) = |S(f)|² / |V(f)|²    (9.13)
Finally, the observed signal is the sum of the signal plus the noise

    x(t) = s(t) + ν(t)        X(f) = S(f) + V(f)    (9.14)
We'll start with the simple case of a relatively pure sinusoid of frequency f_0 in white noise. The signal PSD consists of a single narrow line (and its negative frequency conjugate), while the noise PSD is a constant V_0; accordingly the SNR is E_s/(V_0 F). What filter will optimally detect this signal given this noise? Looking at the frequency-dependent SNR we see that the signal stands out above the noise at f_0; so it makes sense to use a narrow band-pass filter centered on the sinusoid's frequency f_0. The narrower the filter bandwidth BW, the less noise energy is picked up, so we want BW to be as small as possible. The situation is depicted in Figure 9.3.A where we see the signal PSD represented as a single vertical line, the noise as a horizontal line, and the optimum filter as the smooth curve peaked around the signal. The signal-to-noise ratio at the output of the filter

    SNR_out = E_s / (V_0 BW)    (9.15)
is greater than that at the input by a factor of F/BW. For small BW this is a great improvement in SNR and allows us to detect the reference signal even when buried in very high noise levels. Now let's complicate matters a bit by considering a signal with two equal spectral components, as in Figure 9.3.B. Should we use a filter that captures both spectral lines or be content with observing only one of them? The two-component filter will pass twice the signal energy but twice the noise energy as well. However, a filter that matches the signal spectrum may enhance the time-dependent SNR; the two signal components will add constructively at some time, and by choosing the relative phases of the filter components we can make this peak occur whenever we want. Also, for finite times the noise spectrum will have local fluctuations that may cause a false alarm in a single filter, but the probability of that happening simultaneously in both filters is much smaller. Finally, the two-component filter can differentiate better between the desired signal and a single frequency sinusoid masquerading as the desired signal. Were one of the frequency components to be more prominent than the other, we would have to compensate by having the filter response H(f) as
Figure 9.3: Expected behavior of an optimum filter in the frequency domain. In all the figures we see the PSD of the reference signal and noise, as well as the Wiener filter. The various cases are discussed in the text.
depicted in Figure 9.3.C. This seems like the right thing to do, since such a filter emphasizes frequencies with high SNR. Likewise Figure 9.3.D depicts what we expect the optimal filter to look like for the case of two equal signal components, but non-white noise. How do we actually construct this optimum filter? It's easier than it looks. From equation (9.14) the spectrum at the filter input is S(f) + V(f), so the filter's frequency response must be

    H(f) = S(f) / (S(f) + V(f))    (9.16)

in order for the desired spectrum S(f) to appear at its output. This frequency response was depicted in Figure 9.3. Note that we can think of this filter as being built of two parts: the denominator corresponds to a whitening filter, while the numerator is matched to the signal's spectrum. Unlike the whitening filter that we met in the matched filter detector, here the entire signal plus noise must be whitened, not just the noise. This filter is a special case of the Wiener filter derived by Norbert Wiener during World War II for optimal detection of radar signals. It is a special case because we have been implicitly assuming that the noise and signal are
uncorrelated. When the noise can be correlated to the signal we have to be more careful. This is not the first time we have attempted to find an unknown FIR filter. In Section 6.13 we found that the hard system identification problem for FIR filters was solved by the Wiener-Hopf equations (6.63). At first it seems that the two problems have nothing in common, since in the Wiener filter problem only the input is available, the output being completely unknown (otherwise we wouldn't need the filter), while in the system identification case both the input and output were available for measurement! However, neither of these statements is quite true. Were the output of the Wiener filter completely unspecified the trivial filter that passes the input straight through would be a legitimate solution. We do know certain characteristics of the desired output, namely its spectral density or correlations. In the hard system identification problem we indeed posited that we intimately knew the input and output signals, but the solution does not exploit this much detail. Recall that only the correlations were required to find the unknown system.

So let's capitalize on our previous results. In our present notation the input is x_n = s_n + ν_n and the desired output s_n. We can immediately state the Wiener-Hopf equations in the time domain

    C_sx(m) = Σ_k h_k C_x(m − k)

so that given C_sx and C_x we can solve for h, the Wiener filter in the time domain. To compare this filter with our previous results we need to transfer the equations to the frequency domain, using equation (4.47) for the FT of a convolution

    P_sx(ω) = H(ω) P_x(ω)

Here P_sx(ω) is the FT of the crosscorrelation between s(t) and x(t), and P_x(ω) is the PSD of x(t) (i.e., the FT of its autocorrelation). Dividing, we find the full Wiener filter

    H(ω) = P_sx(ω) / P_x(ω)    (9.17)

For uncorrelated noise P_sx(ω) = P_s(ω) and P_x(ω) = P_s(ω) + P_v(ω), and so the full Wiener filter reduces to equation (9.16). The Wiener filter only functions when the signals being treated are stationary (i.e., P_sx and P_x are not functions of time). This restriction too can be lifted, resulting in the Kalman filter, but any attempt at explaining its principles would lead us too far astray.
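A frequency-domain sketch of the Wiener filter for uncorrelated signal and noise (Python and NumPy assumed; the test signal, and the use of single-realization periodograms in place of true PSDs, are illustrative simplifications): form H = P_s/(P_s + P_v), the uncorrelated-noise form of equation (9.17), and apply it to the noisy observation.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 2048
    t = np.arange(N)
    s = np.sin(2 * np.pi * 0.02 * t) + 0.5 * np.sin(2 * np.pi * 0.07 * t)
    v = rng.standard_normal(N)
    x = s + v

    # In practice P_s and P_v must be known or estimated; here we "cheat"
    # and compute them from the clean signal and noise to show the principle.
    Ps = np.abs(np.fft.fft(s))**2
    Pv = np.abs(np.fft.fft(v))**2
    H = Ps / (Ps + Pv)                      # Wiener filter for uncorrelated noise

    s_hat = np.real(np.fft.ifft(H * np.fft.fft(x)))
    print(np.mean((x - s)**2), ">", np.mean((s_hat - s)**2))   # MSE is reduced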
EXERCISES

9.7.1 Assume that the signal s(t) has constant PSD in some range but the noise ν(t) is narrow-band. Explain why we expect a Wiener filter to have a notch at the disturbing frequency.
9.7.2 An alternative to the SNR is the signal-plus-noise-to-noise ratio S+NNR. Why is this ratio of importance? What is the relationship between the overall S+NNR and SNR? What is the relationship between the Wiener filter and the frequency-dependent S+NNR and SNR?
9.8 Correlation and Prediction
A common problem in DSP is to predict the next signal value s_n based on the values we have observed so far. If s_n represents the closing value of a particular stock on day n the importance of accurate prediction is obvious. Less obvious is the importance of predicting the next value of a speech signal. It's not that I impolitely do not wish to wait for you to finish whatever you have to say; rather the ability to predict the next sample enables the compression of digitized speech, as will be discussed at length in Chapter 19. Any ability to predict the future implies that less information needs to be transferred or stored in order to completely specify the signal. If the signal s is white noise then there is no correlation between its value s_n and its previous history (i.e., C_s(m) = 0 for all m ≠ 0), and hence no prediction can improve on a guess based on single sample statistics. However, when the autocorrelation is nontrivial we can use past values to improve our predictions. So there is a direct connection between correlation and prediction; we can exploit the autocorrelation to predict what the signal will most probably do.

The connection between correlation and prediction is not limited to autocorrelation. If two signals x and y have a nontrivial crosscorrelation this can be exploited to help predict y_n given x_n. More generally, the causal prediction of y_n could depend on previous y values, x_n, and previous x values. An obvious example is when the crosscorrelation has a noticeable peak at lag m, and much information about y_n can be gleaned from x_{n−m}. We can further clarify the connection between autocorrelation and signal prediction with a simple example. Assume that the present signal value s_n depends strongly on the previous value s_{n−1} but only weakly on older values. We further assume that this dependence is linear, s_n ≈ b s_{n−1} (were we to
take s_n ≈ b s_{n−1} + c we would be forced to conclude c = 0 since otherwise the signal would diverge after enough time). Now we are left with the problem of finding b given an observed signal. Even if our assumptions are not very good, that is, even if s_n does depend on still earlier values, and/or the dependence is not really linear, and even if s_n depends on other signals as well, we are still interested in finding that b that gives the best linear prediction given only the previous value

    ŝ_n = b s_{n−1}    (9.18)

What do we mean by best prediction? The best definition of best is for the Mean Squared Error (MSE)

    d_n² = (s_n − ŝ_n)² = s_n² − 2b s_n s_{n−1} + b² s_{n−1}²

to be as small as possible, on the average. We are now in familiar territory. Assuming the signal to be time-invariant we average over all time

    ⟨d_n²⟩ = ⟨s_n²⟩ − 2b ⟨s_n s_{n−1}⟩ + b² ⟨s_{n−1}²⟩ = (1 + b²) C_s(0) − 2b C_s(1)

and then differentiate and set equal to zero. We find that the optimal linear prediction is

    b = c_s(1) = C_s(1) / C_s(0)    (9.19)

the normalized autocorrelation coefficient for lag 1. Substituting this back into the expression for the average square error, we find

    ⟨d_n²⟩ = C_s(0) − C_s²(1) / C_s(0)    (9.20)
so that the error vanishes when the lag 1 correlation equals the energy.
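A quick numerical check of equation (9.19) (Python and NumPy assumed; the first-order autoregressive test signal and its coefficient 0.9 are illustrative choices): the estimate b = C_s(1)/C_s(0) recovers the generating coefficient, and the prediction error is far smaller than the signal power.

    import numpy as np

    rng = np.random.default_rng(5)
    N = 10000
    true_b = 0.9
    s = np.zeros(N)
    for n in range(1, N):                       # first-order autoregressive signal
        s[n] = true_b * s[n-1] + rng.standard_normal()

    C0 = np.dot(s, s) / N                       # lag-0 autocorrelation (power)
    C1 = np.dot(s[1:], s[:-1]) / N              # lag-1 autocorrelation
    b = C1 / C0                                 # equation (9.19)
    err_pred = np.mean((s[1:] - b * s[:-1])**2)
    err_none = np.mean(s[1:]**2)
    print(b, err_pred, err_none)                # b near 0.9, err_pred much smaller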
EXERCISES
9.8.1 Wiener named his book The Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications. Wiener's extrapolation is what we have called prediction. What did he mean by interpolation and smoothing?
9.8.2 Find the optimal linear prediction coefficients when two lags are taken into account.

    ŝ_n = b_1 s_{n−1} + b_2 s_{n−2}
9.9 Linear Predictive Coding
Signal coding, that is, compression of the amount of information needed to represent a signal, is an important application of DSP. To see why, consider the important application of digital speech. A bandwidth of 4 KHz is required so we must sample at 8000 samples per second; with 16 bit samples this requires 128 Kb/s, or just under 1 MB of data every minute. This data rate cannot be transferred over a telephone connection using a modem (the fastest telephone-grade modems reach 56 Kb/s) and would even be a tremendous strain on storage facilities. Yet modern speech compression techniques (see Chapter 19) can reduce the required rate to 8 Kb/s or less with only barely noticeable quality degradation. Let's call the signal to be compressed s_n. If s is not white noise then it is at least partially linearly predictable based on its M previous values

    s_n = G e_n + Σ_{m=1}^{M} b_m s_{n−m}    (9.21)
Here e_n is the portion of the signal not predictable based on the signal's own history, G is an arbitrarily introduced gain, and b_m are called the Linear Predictive Coding (LPC) coefficients. Note that most people use a for these coefficients, but we reserve a for FIR coefficients; some people use a minus sign before the sum (i.e., use what we call β coefficients). Equation (9.21) has a simple interpretation; the signal s_n is obtained by filtering the unpredictable signal e_n by an all-pole filter with gain G and coefficients b_m. The e_n is called the excitation signal since it excites the filter into operation. Since the filter is all-pole it enhances certain excited frequencies; these amplified frequencies are responsible for the non-flat spectrum and nontrivial autocorrelation of predictable signals. For speech signals (see Section 11.3) the excitation e_n is the glottal excitation; for voiced speech (e.g., vowels) this is a periodic set of pulses created by the vocal chords, while for unvoiced speech (e.g., h) it is a noise-like signal created by constricting the passage of air. For both cases the mouth and nasal cavities act as a filter, enhancing frequencies according to their geometry. In order to compress the signal we need an algorithm for finding the M + 1 parameters G and b_m given a buffer of N samples of the signal s_0 ... s_{N−1}. Looking carefully at equation (9.21) we note a problem. There are too many unknowns. In order to uniquely determine the coefficients b_m we need to know both the observed speech signal s_n and the excitation e_n. Unfortunately, the latter signal is usually inaccessible; for speech signals
obtaining it would require swallowing a microphone so that it would be close to the vocal chords and before the vocal tract. We thus venture forth under the assumption that the excitation is identically zero. This is true most of the time for a pulse train excitation, only erring for those time instants when the impulse appears. It is obviously not a good approximation for many other cases. Under the assumption of zero excitation we get the homogeneous recursion
    s_n = Σ_{m=1}^{M} b_m s_{n−m}    (9.22)
for which s = 0 (the zero signal) is a solution. It is the only solution if the excitation was truly always zero; but due to the IIR nature of the filter, other possibilities exist if the excitation was once nonzero, even if zero during the duration of the present buffer. For speech the excitation is not truly zero, so even when we find the coefficients bm we can only approximately predict the next signal value.
    ŝ_n = Σ_{m=1}^{M} b_m s_{n−m}    (9.23)

The difference between the true signal value and this prediction is the residual

    r_n = s_n − ŝ_n = Σ_{m=0}^{M} β_m s_{n−m}    (9.24)

(where β_0 ≡ 1 and β_m = −b_m), and the correct LPC coefficients minimize this residual. Note that the residual is obtained by FIR filtering the input signal, with the filter coefficients being precisely β_m. This all-zero filter is usually called the LPC analysis filter and it is the inverse filter of the LPC synthesis filter that synthesizes the speech from the excitation (see Figure 9.4). The analysis filter is also called the LPC whitening filter, the residual being much whiter than the original speech signal, since the linear predictability has been removed. There is another way of looking at the residual signal. Rather than taking no excitation and treating the residual as an error signal, we can pretend that there is excitation but take the error to be precisely zero. What must the excitation be for s_n to be the correct signal value? Comparing equations (9.24) and (9.21) we see that r_n = G e_n, the residual is simply the excitation amplified by the gain. Thus when analyzing voiced speech we see that the residual is usually small but displays peaks corresponding to the vocal chord pulses.
Figure 9.4: LPC synthesis and analysis filters. The synthesis filter synthesizes the signal s_n from the excitation e_n, while the analysis filter analyzes the incoming signal s_n and outputs the residual error signal r_n. The synthesis and analysis filters are inverse systems to within a gain.
One final remark regarding the residual. In speech compression terminology the residual we defined is called the open-loop residual. It can be calculated only if the original speech samples s_n are available. When decompressing previously compressed speech these samples are no longer available, and we can only attempt to predict the present signal value based on past predicted values. It is then better to define the closed-loop residual

    r̃_n = s_n − Σ_{m=1}^{M} b_m ŝ_{n−m}
and minimize it instead. Returning to our mission, we wish to find coefficients b_m that minimize the residual of equation (9.24). In order to simultaneously minimize the residual r_n for all times of interest n, we calculate the MSE

    E = Σ_n r_n² = Σ_n ( s_n − Σ_{m=1}^{M} b_m s_{n−m} )²    (9.25)

and minimize it with respect to the b_m (m = 1 ... M). This minimization is carried out by setting all M partial derivatives equal to zero

    ∂E/∂b_k = −2 Σ_n ( s_n − Σ_{m=1}^{M} b_m s_{n−m} ) s_{n−k} = 0    (9.26)
Moving the sum on n inside we can rewrite these

    Σ_{m=1}^{M} b_m Σ_n s_{n−m} s_{n−k} = Σ_n s_n s_{n−k}    (9.27)

in which the signal enters only via autocorrelations C_s

    Σ_{m=1}^{M} C_s(k − m) b_m = C_s(k)        k = 1 ... M    (9.28)

These are, of course, the Yule-Walker equations for the LPC coefficients. The sum in the autocorrelations should run over all times n. This is problematic for two reasons. First, we are usually only given an input signal buffer of length N, and even if we are willing to look at speech samples outside this buffer, we cannot wait forever. Second, many signals including speech are stationary only for short time durations, and it is only sensible to compute autocorrelations over such durations. Thus we must somehow limit the range of times taken into account in the autocorrelation sums. This can be done in two ways. The brute-force way is to artificially take all signal values outside the buffer to be zero for the purposes of the sums. A somewhat more gentle variant of the same approach uses a window function (see Section 13.4) that smoothly reduces the signal to zero. The second way is to retain the required values from the previous buffer. The first way is called the autocorrelation method and is by far the most popular; the second is called the covariance method and is less popular due to potential numerical stability problems.

The autocorrelation method allows the sum in the MSE to be over all times, but takes all signal values outside the buffer s_0 ... s_{N−1} to be zero. Since the residual r_n in equation (9.24) depends on M + 1 signal values, it can only be nonzero for n = 0 ... N + M − 1. Accordingly, the MSE is
    E = Σ_{n=0}^{N+M−1} r_n²

and the sums appearing in the normal equations become true autocorrelations

    C_s(m, l) ≡ Σ_n s_{n−m} s_{n−l} = C_s(|l − m|)

Writing equation (9.28) in matrix form

    | C_s(0)    C_s(1)    C_s(2)  ...  C_s(M−1) |   | b_1 |   | C_s(1) |
    | C_s(1)    C_s(0)    C_s(1)  ...  C_s(M−2) |   | b_2 |   | C_s(2) |
    |   ...       ...       ...   ...    ...    | · | ... | = |  ...   |    (9.29)
    | C_s(M−1)  C_s(M−2)  ...     ...  C_s(0)   |   | b_M |   | C_s(M) |
we see that the matrix is symmetric and Toeplitz. In the next section we will study a fast method for solving such equations. The MSE in the covariance method is taken to be
    E = Σ_{n=0}^{N−1} r_n²
and here we don't assume that the signal was zero for n < 0. We must thus access N + M signal values, including M values from the previous buffer. Equations (9.27) are still correct, but now the sums over n no longer lead to genuine autocorrelations due to the limits of the sums being constrained differently.
    C_s(m, l) ≡ Σ_{n=0}^{N−1} s_{n−m} s_{n−l} = C_s(l, m)
In particular C_s, although symmetric, is no longer a function of |l − m|, but rather a function of l and m separately. Writing these equations in matrix form we get a matrix that is symmetric but not Toeplitz.
    | C_s(1,1)  C_s(1,2)  C_s(1,3)  ...  C_s(1,M) |
    | C_s(1,2)  C_s(2,2)  C_s(2,3)  ...  C_s(2,M) |
    | C_s(1,3)  C_s(2,3)  C_s(3,3)  ...  C_s(3,M) |    (9.30)
    |   ...       ...       ...     ...    ...    |
    | C_s(1,M)  C_s(2,M)  C_s(3,M)  ...  C_s(M,M) |
The fast methods of solving Toeplitz equations are no longer available, and the Cholesky decomposition (equation (A.94)) is usually employed. Since general covariance matrices are of this form this method is called the covariance method, although no covariances are obviously present. For N >> M the difference between using N samples and using N + M samples becomes insignificant, and the two methods converge to the same solution. For small buffers the LPC equations can be highly sensitive to the boundary conditions and the two methods may produce quite different results.
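The autocorrelation method is easy to state in code. The sketch below (Python and NumPy assumed; the test signal, order, and the use of a generic linear solver are illustrative choices) builds the Toeplitz system of equation (9.29) and solves it directly; the Levinson-Durbin recursion of the next section does the same job in O(M²) operations.

    import numpy as np

    def lpc_autocorrelation(s, M):
        """LPC coefficients b_1..b_M by the autocorrelation method:
        form C_s(0..M) with the signal taken as zero outside the buffer,
        then solve the Toeplitz Yule-Walker system (9.29)."""
        N = len(s)
        C = np.array([np.dot(s[m:], s[:N-m]) for m in range(M + 1)])
        R = np.array([[C[abs(i - j)] for j in range(M)] for i in range(M)])
        b = np.linalg.solve(R, C[1:M+1])
        E = C[0] - np.dot(b, C[1:M+1])          # residual energy (exercise 9.9.3)
        return b, E

    # Illustrative test: a signal synthesized by a known all-pole filter
    rng = np.random.default_rng(6)
    e = rng.standard_normal(4000)
    true_b = [1.3, -0.8]                        # s_n = 1.3 s_{n-1} - 0.8 s_{n-2} + e_n
    s = np.zeros(len(e))
    for n in range(2, len(s)):
        s[n] = true_b[0]*s[n-1] + true_b[1]*s[n-2] + e[n]

    print(lpc_autocorrelation(s, 2))            # b close to (1.3, -0.8)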
EXERCISES
9.9.1 What is the approximation error for the covariance method?

9.9.2 Equation (9.22) predicts s_n based on the M previous values s_{n−1}, s_{n−2}, ..., s_{n−M} and is called the forward predictor. We can also predict (postdict?) s_{n−M} based on the next M values s_{n−M+1}, ..., s_{n−1}, s_n. This surprising twist on LPC is called backward linear prediction. Modify equation (9.22) for this case (call the coefficients c_m). What is the residual?

9.9.3 Show that the MSE error can be written E = Σ_n s_n² − Σ_{m=1}^{M} b_m Σ_n s_n s_{n−m} and thus for the autocorrelation method E = C_s(0) − Σ_{m=1}^{M} b_m C_s(m).

9.9.4 Show that assuming the input to be an impulse G δ_n, the gain is given by the error as given in the previous exercise.

9.9.5 Use the LPC method to predict the next term in the sequence 1, α, α², α³, ... for various 0 < α < 1. Repeat for α > 1. Does the LPC method always correctly predict the next signal value?
9.10 The Levinson-Durbin Recursion
Take an empty glass soft-drink bottle and blow over its mouth. Now put a little water in the bottle and blow again. The frequency produced is higher since the wavelength that resonates in the cavity is shorter (recall our discussion of wavelength in Section 7.9). By tuning a collection of bottles you can create a musical instrument and play recognizable tunes. The bottle in this experiment acts as a filter that is excited by breath noise. Modeling the bottle as a simple cylinder, the frequency it enhances is uniquely determined by its height. What if we want to create a signal containing two different frequencies? One way would be to blow over two different bottles separately (i.e., to place the filters in parallel). From our studies of filters we suspect that there may be a way of putting the filters in series (cascade) as well, but putting two cylinders one after the other only makes a single long cylinder. In order to get multiple frequencies we can use cylinders of different cross-sectional areas, the resonant frequencies being determined by the widths rather than the heights. If we send a sound wave down a pipe that consists of a sequence of cylinders of different cross-sectional areas A_i, at each interface a certain amount of acoustic energy continues to travel down the pipe while some is reflected back toward its beginning. Let's send a sinusoidal acoustic wave
down a pipe consisting of two cylinders. Recalling from Section 7.9 that traveling waves are functions of x − vt, we can express the incoming wave for this one-directional case as follows

    ψ_inc(x, t) = A sin(x − vt)    (9.31)
The reflected wave in the first cylinder will be a sinusoid of the same frequency but traveling in the opposite direction and reduced in amplitude

    ψ_ref(x, t) = k A sin(x + vt)    (9.32)

where the reflection coefficient k is the fraction of the wave that is reflected. Since the energy is proportional to the signal squared, the fraction of the wave's energy that is reflected is k², while the wave energy that continues on to the second cylinder is whatever remains

    E_2 = (1 − k²) E_1    (9.33)
Now for a little physics. The ψ for sound waves can represent many different physical quantities (e.g., the average air particle displacement, the air particle velocity, the pressure). We'll assume here that it represents the velocity. Physically this velocity must be continuous across the interface between the two sections, so at the interface the following must hold (ψ_2 being the wave transmitted into the second cylinder)

    ψ_inc(x, t) + ψ_ref(x, t) = ψ_2(x, t)
The derivative of the velocity is the acceleration, which is proportional to the force exerted on the air particles. The pressure, defined as the force per unit area, must be continuous at the interface, implying that the following must hold there

    ( ψ_inc(x, t) − ψ_ref(x, t) ) / A_1 = ψ_2(x, t) / A_2

Substituting the amplitudes gives (1 + k)/A_1 = (1 − k)/A_2, and rearranging we find an expression for the reflection coefficient in terms of the cross-sectional areas

    k = (A_1 − A_2) / (A_1 + A_2)
Lets check to see that this result is reasonable. If the second cylinder shrinks to zero area (closing off the pipe) then k = 1 and the wave is entirely reflected, as it should be. If there really is no interface at all (i.e., Al = AZ) then k = 0 and no energy is reflected. If AZ >> A1 then k = -1, which seems unreasonable at first; but an open-ended pipe has zero pressure at its end, and so the wave reflects but with a phase reversal. It isnt hard to generalize our last result to a pipe with many sections. The reflection coefficient at the interface between section i and section i + 1 is Ai - Ai+1 ki = (9.34) Ai + Ai+1 What does all this have to do with solving the Yule-Walker equations for the LPC coefficients in the autocorrelation method? The LPC coefficients b, are not the only way of describing an all-pole system; the area ratios, the reflection coefficients, and many others (including an interesting set to be discussed in the next section) can be used instead. Since all of these parameter sets contain exactly the same information, it follows that we can derive any set from any other set. Many of the parameter sets are related by linear transformations, and hence the conversion is equivalent to multiplying by a matrix. We will now show that the connection between the reflection and LPC coefficients can be expressed as a recursion that is the most efficient way of deriving both. How can equation (9.29) be solved recursively? For simplicity well drop the subscript identifying the signal, but we have to add superscripts identifying the recursion depth. The first case is simple (for further simplicity we have dropped the subscript) C(0) bl] = C(1) and its MSE is El = C(0) - bl]lC(l) = C(O)(l - kf) --+
where we have defined k_1 \equiv b_1^{[1]}. Let's assume we have already solved the mth case

\begin{pmatrix} C(0) & C(1) & \cdots & C(m-1) \\ C(1) & C(0) & \cdots & C(m-2) \\ \vdots & & & \vdots \\ C(m-1) & C(m-2) & \cdots & C(0) \end{pmatrix}
\begin{pmatrix} b_1^{[m]} \\ b_2^{[m]} \\ \vdots \\ b_m^{[m]} \end{pmatrix}
= \begin{pmatrix} C(1) \\ C(2) \\ \vdots \\ C(m) \end{pmatrix}

and let's write this C^{[m]} b^{[m]} = c^{[m]}. We are now interested in the (m+1)th case

\begin{pmatrix} C(0) & \cdots & C(m-1) & C(m) \\ \vdots & & \vdots & \vdots \\ C(m-1) & \cdots & C(0) & C(1) \\ C(m) & \cdots & C(1) & C(0) \end{pmatrix}
\begin{pmatrix} b_1^{[m+1]} \\ \vdots \\ b_m^{[m+1]} \\ b_{m+1}^{[m+1]} \end{pmatrix}
= \begin{pmatrix} C(1) \\ \vdots \\ C(m) \\ C(m+1) \end{pmatrix}

where delimiters separating the last row and last column divide the equations into two parts: the first m rows

C^{[m]} \begin{pmatrix} b_1^{[m+1]} \\ \vdots \\ b_m^{[m+1]} \end{pmatrix} + b_{m+1}^{[m+1]} \begin{pmatrix} C(m) \\ \vdots \\ C(1) \end{pmatrix} = c^{[m]}    (9.35)

and the last row

\begin{pmatrix} C(m) & C(m-1) & \cdots & C(1) \end{pmatrix} \begin{pmatrix} b_1^{[m+1]} \\ \vdots \\ b_m^{[m+1]} \end{pmatrix} + b_{m+1}^{[m+1]} C(0) = C(m+1)    (9.36)
Now multiply equation (9.35) by the inverse of the autocorrelation matrix of the mth iteration, (C^{[m]})^{-1}, and use the results of that iteration:

\begin{pmatrix} b_1^{[m+1]} \\ \vdots \\ b_m^{[m+1]} \end{pmatrix} + b_{m+1}^{[m+1]} (C^{[m]})^{-1} \begin{pmatrix} C(m) \\ \vdots \\ C(1) \end{pmatrix} = (C^{[m]})^{-1} c^{[m]} = b^{[m]}
The reversed vector appearing here can be written J c^{[m]}, where J is the exchange matrix

J = \begin{pmatrix} 0 & 0 & \cdots & 0 & 1 \\ 0 & 0 & \cdots & 1 & 0 \\ \vdots & & & & \vdots \\ 0 & 1 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \end{pmatrix}

and noting that it commutes with symmetric Toeplitz matrices (so that (C^{[m]})^{-1} J c^{[m]} = J (C^{[m]})^{-1} c^{[m]} = J b^{[m]}), we can finally write the following recursion for the LPC coefficients

\begin{pmatrix} b_1^{[m+1]} \\ \vdots \\ b_m^{[m+1]} \end{pmatrix} = (I - k_{m+1} J)\, b^{[m]}    (9.37)
where k_m \equiv b_m^{[m]}. In the statistics literature the k variables are called partial correlation or PARCOR coefficients, since they can be shown to measure the correlation between the forward and backward prediction errors (see exercise 9.9.2). Later we will show that they are exactly the reflection coefficients. Were we to know k_{m+1}, this recursion would produce all the other new b^{[m+1]} given the old b^{[m]}. So we have reduced the problem of finding the LPC coefficients to the problem of finding the PARCOR coefficients. Yet it is obvious from equation (9.36) that the converse is also true: k_{m+1} = b_{m+1}^{[m+1]} can be derived from the lower b^{[m+1]} coefficients. So let's derive a recursion for the k's and try to eliminate the b's. First we rewrite equation (9.36) as

\begin{pmatrix} C(1) & C(2) & \cdots & C(m) \end{pmatrix} J \begin{pmatrix} b_1^{[m+1]} \\ \vdots \\ b_m^{[m+1]} \end{pmatrix} + k_{m+1} C(0) = C(m+1)
Now we substitute the b^{[m+1]} vector from equation (9.37)

\begin{pmatrix} C(1) & C(2) & \cdots & C(m) \end{pmatrix} J (I - k_{m+1} J)\, b^{[m]} + k_{m+1} C(0) = C(m+1)

and finally solve for k_{m+1} (noting that J^2 = I)

k_{m+1} = \frac{C(m+1) - (c^{[m]})^T J\, b^{[m]}}{C(0) - (c^{[m]})^T b^{[m]}} = \frac{C(m+1) - (c^{[m]})^T J\, b^{[m]}}{E_m}

identifying the MSE in the denominator. After following all the above the reader will have no problem proving that the MSE obeys the simplest recursion of all.

E_{m+1} = (1 - k_{m+1}^2) E_m    (9.38)

Let's now group together all the recursive equations into one algorithm that computes the k and b coefficients for successively higher orders until we reach the desired order M.

Given the signal autocorrelations C(0) through C(M):
    E_0 ← C(0)
    for m ← 1 to M
        k_m ← ( C(m) - Σ_{i=1}^{m-1} b_i^{[m-1]} C(m-i) ) / E_{m-1}
        b_m^{[m]} ← k_m
        for i ← 1 to m-1
            b_i^{[m]} ← b_i^{[m-1]} - k_m b_{m-i}^{[m-1]}
        E_m ← (1 - k_m^2) E_{m-1}
To see how the algorithm works, let's run through it for the case of two coefficients,

\begin{pmatrix} C(0) & C(1) \\ C(1) & C(0) \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} C(1) \\ C(2) \end{pmatrix}

The first iteration gives

k_1 = b_1^{[1]} = \frac{C(1)}{C(0)}, \qquad E_1 = C(0)(1 - k_1^2) = \frac{C^2(0) - C^2(1)}{C(0)}

and the second

k_2 = \frac{C(2) - b_1^{[1]} C(1)}{E_1} = \frac{C(2) C(0) - C^2(1)}{C^2(0) - C^2(1)}, \qquad b_2^{[2]} = k_2, \qquad b_1^{[2]} = b_1^{[1]} (1 - k_2)

and we have found the desired coefficients b_1 and b_2. We finish the section by fulfilling our promise to show that the k are the reflection coefficients. If we implement the LPC analysis filter (the FIR filter that converts the signal into the residual) as a multistage lattice filter, then equation (9.38) tells us how the energy of the residual decreases from stage to stage. Comparing this with equation (9.33) completes the identification.
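The recursion is short enough to state as running code. Here is a minimal sketch in Python/NumPy (the function name and calling conventions are my own, not the text's) implementing the algorithm listed above; the two nested loops are what make the cost O(M^2) rather than O(M^3).

    import numpy as np

    def levinson_durbin(C, M):
        """Solve the Yule-Walker equations for M LPC coefficients by the
        Levinson-Durbin recursion, given autocorrelations C[0] .. C[M].
        Returns the LPC coefficients b[1..M], the PARCOR (reflection)
        coefficients k[1..M], and the final mean squared error E_M."""
        b = np.zeros(M + 1)        # index 0 unused, to mirror the text's indexing
        k = np.zeros(M + 1)
        E = C[0]                   # E_0 = C(0)
        for m in range(1, M + 1):
            # k_m = ( C(m) - sum_{i=1}^{m-1} b_i^[m-1] C(m-i) ) / E_{m-1}
            k[m] = (C[m] - np.dot(b[1:m], C[m - 1:0:-1])) / E
            b_prev = b.copy()
            b[m] = k[m]                                  # b_m^[m] = k_m
            for i in range(1, m):
                b[i] = b_prev[i] - k[m] * b_prev[m - i]  # b_i^[m] = b_i^[m-1] - k_m b_{m-i}^[m-1]
            E *= 1.0 - k[m] ** 2                         # equation (9.38)
        return b[1:], k[1:], E

    # Two-coefficient example: reproduces k_1 = C(1)/C(0) and
    # k_2 = (C(2)C(0) - C^2(1)) / (C^2(0) - C^2(1)) derived above.
    C = np.array([2.0, 1.0, 0.5])
    print(levinson_durbin(C, 2))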
EXERCISES
9.10.1 Prove equation (9.34) for a pipe with multiple sections, taking into account the reflected wave from the next interface.
9.10.2 Transmission lines have both voltage and current traveling waves, the ratio between the voltage and current being the impedance Z. At a splice where the impedance changes a reflected wave is generated. Express the reflection coefficient in terms of the impedances. Explain the limiting cases of shorted and open-circuited cables.
9.10.3 Prove equation (9.38) for the MSE.
9.10.4 Solve the three-coefficient problem on paper using the Levinson-Durbin recursion.
9.10.5 Show that the complexity of the Levinson-Durbin algorithm is O(M^2), rather than O(M^3) as for non-Toeplitz systems.
9.10.6 Levinson originally solved the more general problem of the equations T x = y, where T is Toeplitz but y is unrelated to T. Generalize the recursion to solve this problem. (Hint: You will need another set of recursions.) How much more computationally complex is the solution?
9.11 Line Spectral Pairs
Another set of parameters that contains exactly the same information as the LPC coefficients is the Line Spectral Pair (LSP) frequencies. To introduce them we need to learn a mathematical trick that can be performed on the polynomial in the denominator of the LPC system function. A polynomial of degree M

a(x) = \sum_{m=0}^{M} a_m x^m = a_0 + a_1 x + a_2 x^2 + \dots + a_{M-2} x^{M-2} + a_{M-1} x^{M-1} + a_M x^M

is called palindromic if a_m = a_{M-m}, i.e.,

a_0 = a_M, \quad a_1 = a_{M-1}, \quad a_2 = a_{M-2}, \quad etc.

and antipalindromic if a_m = -a_{M-m}, i.e.,

a_0 = -a_M, \quad a_1 = -a_{M-1}, \quad a_2 = -a_{M-2}, \quad etc.
so 1 + 2x + x^2 is palindromic, while 1 - x + x^2 - x^3 is antipalindromic. It is not hard to show that the product of two palindromic or two antipalindromic polynomials is palindromic, while the product of an antipalindromic polynomial with a palindromic one is antipalindromic. We will now prove that every real polynomial that has all of its zeros on the unit circle is either palindromic or antipalindromic. The simplest cases are x + 1 and x - 1, which are obviously palindromic and antipalindromic, respectively. Next consider a second degree polynomial with a pair of complex conjugate zeros on the unit circle.
(x - e^{i\phi})(x - e^{-i\phi}) = x^2 - (e^{i\phi} + e^{-i\phi})\, x + e^{i\phi} e^{-i\phi} = x^2 - 2\cos(\phi)\, x + 1
This is obviously palindromic. Any real polynomial that has k pairs of complex conjugate zeros on the unit circle will be the product of k palindromic polynomials, and thus palindromic. If in addition it has -1 as a zero it remains palindromic (since x + 1 is palindromic), while if it has +1 as a zero it becomes antipalindromic (since x - 1 is antipalindromic). This completes the proof. The converse of this statement is not necessarily true; not every palindromic polynomial has all its zeros on the unit circle. The idea behind the
LSPs is to define palindromic and antipalindromic polynomials that do obey the converse rule. Let's see how this is done. Any arbitrary polynomial a(x) can be written as the sum of a palindromic polynomial p(x) and an antipalindromic polynomial q(x)

a_m = \frac{1}{2}(p_m + q_m)

where

p_m = a_m + a_{M-m}, \qquad q_m = a_m - a_{M-m}    (9.39)
(if M is even the middle coefficient appears in p_m only). When we are dealing with polynomials that have their constant term equal to unity, we would like the polynomials p(x) and q(x) to share this property. To accomplish this we need only pretend for a moment that a(x) is a polynomial of degree M + 1 and use the above equation with a_{M+1} = 0:

a_m = \frac{1}{2}(p_m + q_m)

where

p_m = a_m + a_{M+1-m}, \qquad q_m = a_m - a_{M+1-m}    (9.40)

The polynomials p(x) and q(x) defined in this way are of degree M + 1.
Figure 9.5: The zeros of a polynomial and of its palindromic and antipalindromic components. The Xs are the zeros of a randomly chosen tenth order polynomial (constrained to have its zeros inside the unit circle). The circles and diamonds are the zeros of the p(z) and q(z). Note that they are all on the unit circle and are intertwined.
In terms of the polynomials themselves

a(x) = \frac{1}{2}(p(x) + q(x))

where p(x) and q(x) are the palindromic and antipalindromic polynomials of degree M + 1 built according to equation (9.40), and it is not hard to show that if all the zeros of a(x) are inside the unit circle, then all the zeros of p(x) and of q(x) are on the unit circle. Furthermore, the zeros of p(x) and q(x) are intertwined, i.e., between every two zeros of p(x) there is a zero of q(x) and vice versa. Since these zeros are on the unit circle they are uniquely specified by their angles. For the polynomial in the denominator of the LPC frequency response these angles represent frequencies, and are called the LSP frequencies. Why are the LSP frequencies a useful representation of the all-pole filter? The LPC coefficients are not a very homogeneous set, the higher-order b_m being more sensitive than the lower-order ones. LPC coefficients do not quantize well; small quantization error may lead to large spectral distortion. Also the LPC coefficients do not interpolate well; we can't compute them at two distinct times and expect to accurately predict them in between. The zeros of the LPC polynomial are a better choice, since they all have the same physical interpretation. However, finding these zeros numerically entails a complex two-dimensional search, while the zeros of p(x) and q(x) can be found by simple one-dimensional search techniques. In speech applications it has been found empirically that the LSP frequencies quantize well and interpolate better than all other parameters that have been tried.
EXERCISES
9.11.1 Let's create a random polynomial of degree M by generating M + 1 random numbers and using them as coefficients. We can now find the zeros of this polynomial and plot them in the complex plane. Verify empirically the hard-to-believe fact that for large M most of the zeros are close to the unit circle (except for large negative real zeros). Change the distribution of the random number generator. Did anything change? Can you explain why?
9.11.2 Prove that if all the zeros of a(x) are inside the unit circle, then all the zeros of p(x) and of q(x) are on the unit circle. (Hint: One way is to write the p and q polynomials as a(x)(1 ± h(x)) where h(x) is an all-pass filter.) Prove that the zeros of p(x) and q(x) are intertwined. (Hint: Show that the phase of the all-pass filter is monotonic, and alternately becomes π (a zero of p) and 0 (a zero of q).)
9.11.3 A pipe consisting of M + 1 cylinders that is completely open or completely closed at the end has its last reflection coefficient k_{M+1} = ±1. How does this relate to the LSP representation?
9.11.4 Generate random polynomials and find their zeros. Now build p(x) and q(x) and find their zeros. Verify that if the polynomial's zeros are inside the unit circle, then those of p and q are on the unit circle. Is there a connection between the angles of the polynomial's zeros and those of the LSPs?
9.11.5 The Greek mathematician Apollonius of Perga discovered that given two points z_1 and z_2 in the plane, the locus of points whose distances to z_1 and z_2 are in a fixed ratio is a circle (except when the ratio is fixed at one, when it is a straight line). Prove this theorem. What is the connection to LSPs?
9.12 Higher-Order Signal Processing
The main consequence of the Wiener-Khintchine theorem is that most of the signal processing that we have learned is actually only power spectrum processing. For example, when we use frequency selective filters to enhance signals we cannot discriminate between signals with the same power spectrum but different spectral phase characteristics. When we use correlations to solve system identification problems, we are really only recovering the square of the frequency response. We have yet to see methods for dealing with signals with non-Gaussian distributions or non-minimum-phase attributes of systems. In this section we will take a brief look at a theory of signal processing that does extend beyond the power spectrum. We will assume that our signals are stochastic and stationary and accordingly use the probabilistic interpretation of correlations, first introduced in Section 5.6. There we defined the moment functions, definitions we repeat here in slightly modified form.
M_s^{[k]}(m_1, m_2, \dots, m_{k-1}) \equiv \langle s_n\, s_{n+m_1} \cdots s_{n+m_{k-1}} \rangle

The kth-order moment function of the digital stationary stochastic signal s is the average of the product of k signal values, at time lags defined by the moment function's parameters. The first-order moment function is simply M_s^{[1]} = \langle s_n \rangle, the signal's average (DC) value. The second-order moment function is
the autocorrelation (recall the probabilistic interpretation of autocorrelation of equation (5.22)). The third-order moment function

M_s^{[3]}(m_1, m_2) = \langle s_n\, s_{n+m_1} s_{n+m_2} \rangle

is a new entity. For a stochastic signal that takes on only the values 0 and 1, its interpretation is clear: it is the probability that the signal takes on the value 1 at all three times n, n + m_1, and n + m_2. If both m_1 and m_2 are very large we expect the third moment to equal the mean cubed, while if m_1 is small enough for there to be nontrivial correlations, but m_2 still large, then we expect a slightly more complex expression.
However, the third moment can be significantly different from this as well. For instance, a signal s_n generated by a nonlinear combination of a driving noise signal v_n with its values delayed by m_1 and m_2 will have a nontrivial third moment function at just these lags.
Similarly the fourth and higher moment functions give the probability that the signal takes on 1 values at four or more times. In practice, interpretation of numeric moment function data is complex because of the contributions from lower-order moments, as in equations (9.42) and (9.43). For example, if 0 and 1 are equally probable, we expect to observe 1 at two different times with a probability of one-quarter; only deviations from this value signify that there is something special about the lag between the two times. Likewise, to really understand how connected four different times are, we must subtract from the fourth moment function all the contributions from the third-order moments, but these in turn contain portions of second-order moments and so on. The way to escape this maze of twisty little passages is to define cumulants. The exact definition of the cumulant is a bit tricky since we have to keep track of all possible groupings of the time instants that appear in the moment function. For this purpose we use the mathematical concept of a partition, which is a collection of nonempty sets whose union is a given set. For example, in the third moment there are three time instants n_1 = n, n_2 = n + m_1, and n_3 = n + m_2, and these can be grouped into five different partitions: P_1 = {(n_1, n_2, n_3)}, P_2 = {(n_1), (n_2, n_3)}, P_3 = {(n_2), (n_1, n_3)}, P_4 = {(n_3), (n_1, n_2)}, and P_5 = {(n_1), (n_2), (n_3)}. We'll use the symbol S_{ij} for the jth set of partition P_i (e.g., S_{21} = (n_1) and S_{22} = (n_2, n_3)), N_i for the number of such sets (N_1 = 1, N_2 = N_3 = N_4 = 2, and N_5 = 3), and N_{ij} for the number of elements in a set (e.g., N_{11} = 3, N_{51} = 1). We can now define the cumulant

C_s^{[k]} = \sum_i (-1)^{N_i - 1} (N_i - 1)! \prod_{j=1}^{N_i} M_s^{[N_{ij}]}(S_{ij})    (9.44)
where the sum is over all possible partitions of the k time instants. It will be convenient to have a special notation for the signal with its DC component removed, \tilde{s} \equiv s - \langle s \rangle. The first few cumulants can now be expressed as follows. As expected,
C_s^{[1]} = M_s^{[1]} = \langle s_n \rangle

C_s^{[2]}(m) = M_s^{[2]}(m) - (M_s^{[1]})^2 = \langle \tilde{s}_n \tilde{s}_{n+m} \rangle

C_s^{[3]}(m_1, m_2) = M_s^{[3]}(m_1, m_2) - M_s^{[1]} \left[ M_s^{[2]}(m_1) + M_s^{[2]}(m_2) + M_s^{[2]}(m_2 - m_1) \right] + 2 (M_s^{[1]})^3 = \langle \tilde{s}_n \tilde{s}_{n+m_1} \tilde{s}_{n+m_2} \rangle

while the fourth-order cumulant is

C_s^{[4]}(m_1, m_2, m_3) = \langle \tilde{s}_n \tilde{s}_{n+m_1} \tilde{s}_{n+m_2} \tilde{s}_{n+m_3} \rangle - C_s^{[2]}(m_1) C_s^{[2]}(m_3 - m_2) - C_s^{[2]}(m_2) C_s^{[2]}(m_3 - m_1) - C_s^{[2]}(m_3) C_s^{[2]}(m_2 - m_1)
which is somewhat more complex. For the special case of a zero mean signal and m_1 = m_2 = m_3 = 0, C^{[2]} is the variance, C^{[3]} the skew, and C^{[4]} the kurtosis. Other than their interpretability, the cumulants are advantageous due to their convenient characteristics. The most important of these, and the reason they are called cumulants, is their additivity: for statistically independent signals x and y

C_{x+y}^{[k]}(m_1, m_2, \dots, m_{k-1}) = C_x^{[k]}(m_1, m_2, \dots, m_{k-1}) + C_y^{[k]}(m_1, m_2, \dots, m_{k-1})
It is easy to see that this characteristic is not shared by the moment functions. Another nice feature is their blindness to DC components: for k > 1

C_{s+a}^{[k]}(m_1, m_2, \dots, m_{k-1}) = C_s^{[k]}(m_1, m_2, \dots, m_{k-1})

where a is any constant. Like the moments, cumulants are permutation blind,

C_s^{[k]}(m_{\sigma_1}, m_{\sigma_2}, \dots, m_{\sigma_{k-1}}) = C_s^{[k]}(m_1, m_2, \dots, m_{k-1})

for any permutation \sigma of the lags, and they behave simply under scaling of the signal by a constant g

C_{gs}^{[k]}(m_1, m_2, \dots, m_{k-1}) = g^k\, C_s^{[k]}(m_1, m_2, \dots, m_{k-1})
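Under an ergodicity assumption the ensemble averages above can be replaced by time averages over a single long realization. The following minimal sketch (Python/NumPy; the estimators and all names are my own) estimates second- and third-order cumulants this way and illustrates two of the properties just listed, blindness to a DC offset and the vanishing of higher-order cumulants for Gaussian data (discussed next).

    import numpy as np

    def cumulant2(s, m):
        """Estimate C2_s(m) = <s~_n s~_{n+m}> by a time average (ergodicity assumed)."""
        st = s - s.mean()
        n = len(st) - m
        return np.mean(st[:n] * st[m:m + n])

    def cumulant3(s, m1, m2):
        """Estimate C3_s(m1, m2) = <s~_n s~_{n+m1} s~_{n+m2}>."""
        st = s - s.mean()
        n = len(st) - max(m1, m2)
        return np.mean(st[:n] * st[m1:m1 + n] * st[m2:m2 + n])

    rng = np.random.default_rng(1)
    gauss = rng.normal(size=200_000)
    skewed = rng.exponential(size=200_000) - 1.0   # zero mean, but skewed

    print(cumulant3(gauss, 0, 0))                  # ~ 0 for Gaussian data
    print(cumulant3(skewed, 0, 0))                 # ~ 2, the skew of the exponential
    print(cumulant2(skewed + 7.0, 3), cumulant2(skewed, 3))   # equal: blind to DC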
If the signal is symmetrically distributed then all odd-order cumulants vanish. If a signal is Gaussianly distributed all cumulants above the second order vanish. Higher-order spectra are defined in analogy with the Wiener-Khintchine theorem. Just as the spectrum is the FT of C_s^{[2]}(m), the bispectrum is defined to be the two-dimensional FT of C_s^{[3]}(m_1, m_2), and the trispectrum the three-dimensional FT of C_s^{[4]}(m_1, m_2, m_3). It can be shown that for signals with finite energy, the general polyspectrum is given by a product of FTs.

S_s^{[k]}(\omega_1, \omega_2, \dots, \omega_{k-1}) = S(\omega_1) S(\omega_2) \cdots S(\omega_{k-1})\, S^*(\omega_1 + \omega_2 + \dots + \omega_{k-1})
Now that we have defined them, we can show that cumulants are truly useful. Assume that we have a non-Gaussian signal distorted by Gaussian noise. Standard signal processing does not take advantage of the higher-order statistics of the signal, and can only attempt to separate the signal from the noise in the power spectral domain. However, cumulants of the third and higher orders of the noise will vanish exactly, while those of the signal will not, thus providing a more powerful tool for recovery of such a signal. For example, higher-order matched filters can be used as sensitive detectors of the arrival of non-Gaussian signals in Gaussian noise. We know from Section 8.1 that intermodulation products are produced when two sinusoids enter a nonlinearity. Assume we observe several frequency components in the output of a possibly nonlinear system; is there any way to tell if they are intermodulation frequencies rather than independent signals that happen to be there? The fingerprint of the phenomenon is that intermodulation products are necessarily phase coupled to the inputs; but such subtle phase relations are lost in classical correlation-based analysis. By using higher-order cumulants intermodulation frequencies can be identified and the precise nature of system nonlinearities classified. In Sections 6.12 and 6.13 we saw how to perform correlation-based system identification when we had access to a system's input and output. Sometimes we may desire to identify a system, but can only observe its output. Amazingly, this problem may be tractable if the input signal is non-Gaussian. For example, if the unknown system is an N-tap FIR filter,
y_n = \sum_{m=0}^{N-1} h_m x_{n-m} + v_n
the input x is zero mean but with nonzero third-order cumulant, and the output y is observed contaminated by additive Gaussian (but not necessarily white) noise v, then the system's impulse response can be derived solely from the output's third-order cumulants,

h_m = \frac{C_y^{[3]}(N-1, m)}{C_y^{[3]}(N-1, 0)}

(with the normalization h_0 = 1). This amazing result is due to the input's third-order cumulant (assumed nonzero) appearing in both the numerator and the denominator and hence cancelling out, and can be generalized to higher-order cumulants if needed. A related result is that cumulant techniques can be used for blind equalization, that is, constructing the inverse of an unknown distorting system, without access to the undistorted input.
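A minimal numerical sketch of this recovery (Python/NumPy; the estimator, the test filter, and all names are my own, and a long data record is assumed so that the cumulant estimates are reliable): drive an unknown FIR filter with zero-mean noise having a nonzero third-order cumulant, add independent Gaussian noise, and recover the taps from the output's third-order cumulants alone.

    import numpy as np

    def third_cumulant(y, t1, t2):
        """Estimate C3_y(t1, t2) = <y~_n y~_{n+t1} y~_{n+t2}> by a time average."""
        yt = y - y.mean()
        n = len(yt) - max(t1, t2)
        return np.mean(yt[:n] * yt[t1:t1 + n] * yt[t2:t2 + n])

    rng = np.random.default_rng(3)
    h = np.array([1.0, 0.6, -0.4])              # the unknown FIR system (h_0 = 1)
    N = len(h)

    x = rng.exponential(size=1_000_000) - 1.0   # zero mean, nonzero third-order cumulant
    v = 0.5 * rng.normal(size=x.size)           # additive Gaussian observation noise
    y = np.convolve(x, h)[:x.size] + v          # we observe only the output y

    # Output-only recovery: h_m is proportional to C3_y(N-1, m), since
    # C3_y(N-1, m) = gamma3_x * h_0 * h_{N-1} * h_m and the input cumulant cancels.
    h_est = np.array([third_cumulant(y, N - 1, m) for m in range(N)])
    h_est /= h_est[0]                           # normalize so that h_0 = 1
    print(h_est)                                # close to [1.0, 0.6, -0.4]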
EXERCISES
9.12.1 Find all the partitions of four time instants and express C_s^{[4]}(m_1, m_2, m_3) in terms of moments.
9.12.2 Consider the three systems (0 < a, b < 1)

y_n = x_n - (a + b)\, x_{n-1} + a b\, x_{n-2}
y_n = x_n - (a + b)\, x_{n+1} + a b\, x_{n+2}
y_n = (1 + a b)\, x_n - a\, x_{n+1} - b\, x_{n-1}
What are the system functions for these systems? Which system is minimum phase, which maximum phase, and which mixed phase? Take x_n to be a zero mean stationary white noise signal, with \langle x_n x_{n+m} \rangle = \delta_{m,0} and \langle x_n x_{n+m_1} x_{n+m_2} \rangle = \delta_{m_1,0}\, \delta_{m_2,0}. Show that the output signals from all three systems have the same autocorrelations. Prove that for all three systems the same frequency response is measured. Why is this result expected? Show that the third-order moments are different.
9.12.3 Prove equation (9.45).
9.12.4 There is another way of defining cumulants. Given the k signal values s_n, s_{n+m_1}, \dots, s_{n+m_{k-1}}, we posit k dummy variables \omega_0, \omega_1, \dots, \omega_{k-1} and define the following function, known as the characteristic function:

\Phi(\omega_0, \omega_1, \dots, \omega_{k-1}) = \langle e^{i(\omega_0 s_n + \omega_1 s_{n+m_1} + \dots + \omega_{k-1} s_{n+m_{k-1}})} \rangle

The cumulants are the coefficients of the Taylor expansion of the logarithm of this function. Derive the first few cumulants according to this definition and show that they agree with those in the text. Derive the additivity property from this new definition.
9.12.5 In the text we mentioned the application of higher-order signal processing to the identification of intermodulation products. Let \varphi_1, \varphi_2, and \varphi_3 be independent uniformly distributed random variables and define two stochastic signals

s_n^{(1)} = \cos(\omega_1 n + \varphi_1) + \cos(\omega_2 n + \varphi_2) + \cos((\omega_1 + \omega_2) n + \varphi_1 + \varphi_2)
s_n^{(2)} = \cos(\omega_1 n + \varphi_1) + \cos(\omega_2 n + \varphi_2) + \cos((\omega_1 + \omega_2) n + \varphi_3)

each of which has three spectral lines, the highest frequency being the sum of the lower two. The highest component of s^{(1)} could be an intermodulation product since it is phase-locked with the other two, while that of s^{(2)} is an unrelated signal. Show that both signals have the same autocorrelation and power spectrum, but differ in their third-order cumulants.
Bibliographical Notes
Matched filters are covered in most books on communications theory, e.g., [242, 95]. Wiener's first expositions of the Wiener-Khintchine theorem were in mathematical journals [276], but he later wrote an entire book on his discoveries [277]. The co-discoverer of the theorem was Aleksandr Khintchine (or Khinchin), whose Mathematical Foundations of Information Theory was translated into English from the original Russian in 1957. The second volume of Norbert Wiener's autobiography [280] has fascinating background information on Wiener's work at MIT during the World War II years. His 1942 report, entitled Extrapolation, Interpolation and Smoothing of Stationary Time Series, was suppressed because of possible military applications, and finally released only in 1949 [278]. Even though written to be more understandable than the former paper, its mathematics, more familiar to physicists than engineers, was so difficult for the latter audience that it was commonly called the yellow peril. Levinson both explained Wiener's results to a wider audience [146] and translated the formalism to the digital domain. While accomplishing this second task he invented his recursion [147], although digital hardware capable of computing it did not exist at the time. The invention of LPC is due to Bishnu Atal of Bell Labs [10], who was mainly interested in its use for compression of speech [9]. The LSP frequencies are due to Itakura of NTT Research Labs [109] (but don't bother checking the original reference, it's only an abstract). Higher-order signal processing is the subject of a book [181] and numerous review articles [173, 182]. [33] discusses partitions in a simple way, and includes source code for computing the number of partitions of n objects. Cumulants were introduced in statistics by Fisher in the 1930s and were in use in physics at about the same time. The idea of higher-order spectra as the FT of cumulants dates back to Kolmogorov, but the nomenclature polyspectra is due to Tukey. The use of cumulants for output-only system identification is due to Georgios Giannakis [72]. A few references to the extensive literature on applications of cumulants include noise cancellation [49]; system identification [73, 65]; blind equalization [235, 236]; and signal separation [286, 287, 108].