A VLSI DSP Chip for Real Time Iterative Deconvolution
Abstract: The design of a CMOS VLSI chip to perform real time iterative deconvolution is presented. A bit-level systolic array architecture for convolution, multiplication and addition is used to implement the circuit. A brief introduction to the iterative deconvolution problem is also given.
Introduction
In many situations, a measurement process produces measurement data which may not agree with what is known to be true about the data. Spectroscopic and chromatographic systems offer good examples of this fact. Output from these systems is in the form of Gaussian shaped peaks which may have poor resolution or an insufficient signal-to-noise ratio, thus prohibiting proper interpretation of the measurements. Numerical deconvolution can be used to enhance data from a given measurement process. In particular, relaxation based iterative deconvolution methods that use constraints based on a priori knowledge of the measured data (e.g., minimum and maximum peak amplitudes, or the absence of negative data values) have been used successfully to restore severely corrupted chromatographic and spectroscopic data.[1,2] These techniques are computationally intensive. Therefore, a CMOS integrated circuit is proposed which would implement a class of iterative deconvolution algorithms in real time.
Theory
In spectroscopic and chromatographic measurement systems, the blurring of features can be modeled by considering the additive noise and the impulse response of the system. This can be written mathematically as follows

$$ g = h * x + n \tag{1} $$

where $g$ is the system output, $x$ is the ideal system output, $h$ is the impulse response function of the system, $n$ is the additive noise of the system and the $*$ operator denotes convolution. Note that $g$, $h$, $x$ and $n$ are functions of time, $t$, or index, $i$. We can find an estimate of $x$ by using Fourier transform techniques as follows

$$ \hat{x} = F^{-1}\{G/H\} \tag{2} $$

where $\hat{x}$ is an estimate of $x$, $G$ and $H$ are the Fourier transforms of $g$ and $h$ respectively, and $F^{-1}$ is the inverse Fourier transform operator. Unfortunately, for bandlimited processes, $H$ will have zero values for some frequencies while $G$ may have finite values (possibly negative) for the same frequencies due to noise. Thus, $\hat{x}$ will contain spurious values.

Another approach to the deconvolution is an iterative method. Four possible approaches have been investigated and are summarized here.[1,3] The first method was developed by Van Cittert and produces estimates by using a correction factor to adjust the kth estimate of $x$. The (k+1)th estimate of $x$ is given by the following

$$ \hat{x}^{k+1} = \hat{x}^{k} + (g - h * \hat{x}^{k}) \tag{3} $$

If it is known a priori that the data has no negative values, constraints can be incorporated into Eqn (3) to take advantage of this information. This is done as follows

$$ \hat{x}^{k+1} = p\hat{x}^{k} + (g - h * p\hat{x}^{k}) \tag{4} $$

where

$$ p = \begin{cases} 1, & \hat{x}^{k} > 0 \\ 0, & \hat{x}^{k} < 0 \end{cases} \tag{5} $$

A similar technique modifies $(g - h * \hat{x}^{k})$ using a relaxation function that causes corrections to $\hat{x}$ when $\hat{x}$ is negative. It is as follows

$$ \hat{x}^{k+1} = \hat{x}^{k} + r(\hat{x}^{k})(g - h * \hat{x}^{k}) \tag{6} $$

where

$$ r(\hat{x}^{k}) = \ldots \tag{7} $$

While these methods reduce or eliminate negative artifacts, they may also eliminate important information due to the truncation caused by $p$ and $r(\hat{x}^{k})$. Perhaps a better method would be the one proposed by Jansson, where successive estimates are constrained to occur between upper and lower amplitude bounds. The constraining operation is done in a gradual fashion without truncating any information. This method is as follows

$$ \hat{x}^{k+1} = \hat{x}^{k} + r(\hat{x}^{k})(g - h * \hat{x}^{k}) \tag{8} $$

with

$$ r(\hat{x}^{k}) = b\left(1 - \frac{2}{c}\left|\hat{x}^{k} - \frac{c}{2}\right|\right) \tag{9} $$

where $b$ is a constant and $c$ is the upper amplitude bound. Eqn (9) is plotted in Figure 1.

In order to implement the broadest class of iterative deconvolution algorithms, equations (3, 4, 6, 8) have been rewritten as follows

$$ \hat{x}^{k+1} = D\hat{x}^{k} + Eg + F(h * \hat{x}^{k}) \tag{10} $$

where $D$, $E$, and $F$ are general operators which incorporate the desired constraints and relaxation operations.
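The general update of Eqn (10) can be illustrated in software. The following is a minimal sketch, not the chip's implementation: it assumes the classic Van Cittert update (new estimate = old estimate + (g - h*estimate), so D = 1, E = 1, F = -1) and the Jansson update with the relaxation function r = b(1 - (2/c)|x - c/2|) (so D = 1, E = r, F = -r). The function names, the choice of the measured data as the starting estimate, and the centred, cropped convolution are all assumptions made for illustration.

```python
def convolve_same(h, x):
    """Discrete convolution of x with h, cropped to len(x) and centred on h's midpoint."""
    n, m = len(x), len(h)
    half = m // 2
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j in range(m):
            k = i + half - j
            if 0 <= k < n:
                acc += h[j] * x[k]
        out[i] = acc
    return out


def jansson_relaxation(value, b, c):
    """r = b * (1 - (2/c) * |x - c/2|): equals b at x = c/2 and 0 at x = 0 and x = c."""
    return b * (1.0 - (2.0 / c) * abs(value - c / 2.0))


def iterate(g, h, D, E, F, iterations):
    """Repeat x <- D*x + E*g + F*(h*x), with D, E, F evaluated per sample from x."""
    x = list(g)  # assumed starting estimate: the measured data itself
    for _ in range(iterations):
        hx = convolve_same(h, x)
        x = [D(xi) * xi + E(xi) * gi + F(xi) * hxi
             for xi, gi, hxi in zip(x, g, hx)]
    return x


def van_cittert(g, h, iterations):
    """Unconstrained update: D = 1, E = 1, F = -1."""
    return iterate(g, h, lambda v: 1.0, lambda v: 1.0, lambda v: -1.0, iterations)


def jansson(g, h, iterations, b, c):
    """Relaxed update: D = 1, E = r(x), F = -r(x)."""
    def r(v):
        return jansson_relaxation(v, b, c)
    return iterate(g, h, lambda v: 1.0, r, lambda v: -r(v), iterations)
```

For example, blurring a two-peak test signal with `h = [0.25, 0.5, 0.25]` via `convolve_same` and passing the result to `van_cittert` should move the estimate back toward the original peaks as the iteration count grows. Passing D, E and F as per-sample functions mirrors the idea of three separately programmable operators, which is why one loop can serve several algorithms.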
Figure 1: Plot of the relaxation function $r(\hat{x}^{k})$.
IC Design
The chosen architecture for the implementation of Eqn (10) is that of systolic arrays. In particular, the use of bit-level systolic arrays for the convolution circuit and for the multipliers allows for a high degree of pipelining and thus an increased system throughput. The high throughput is required in order that the chip operate in real time. The operating environment of the chip is given schematically in Figure 2.
Figure 2: Set-up for real time iterative deconvolution.
The computer in Figure 2 is used to program the DSP chip with the operating information it will need to perform the deconvolution. The information provided to the chip would include the impulse response function, h(i), of the measurement system, which deconvolution algorithm to use, the number of iterations to perform, etc. Once programmed, the chip would receive the digitized input g(t), perform the programmed iterations and then output the estimate of x. This could in turn be converted back to analog form or be used in its digital form for further processing. The following is a short description of the major parts of the chip. The components of the IC and the general overall layout of the chip are given in Figure 3.
Convolver
The convolver constitutes the heart of the design, and it was decided early in the design to use a bit-level systolic array approach to implement the convolution term of Eqn (10). The architecture for this array processor is based upon the work done by McCanny and McWhirter.[4] A diagram of the array operation is shown in Figure 4. Basically, the array operates as follows: the data words, h and x, enter the array from the top and from the left respectively and propagate through the array of processors. A result vector, y, propagates through the array in the opposite direction to x. As the bits of the data words interact, they form the sum-of-products

$$ y(n) = \sum_{i=0}^{N-1} h(i)\,x(n-i) \tag{11} $$

which represents the convolution. The result then exits the array on the left hand side. Notice that the bits of the data words enter the array in a time skewed fashion. This is required for the proper operation of the array and is achieved by inputting parallel words into a wedge-shaped array of latches (not shown) which delays the bits in the proper fashion before they reach the array boundaries. The result, y, also exits the array in a skewed fashion and can be passed through another wedge-shaped set of latches to produce a parallel output word. Both the convolver and the multipliers are pipelined devices. This means that the circuits can operate on continuous streams of input data and, after a latency of several clock cycles, can produce a result on every cycle of the clock. The clock frequency is then determined by the delay of an individual cell in the array. Preliminary simulations of the convolver processor cells indicate that the array should run at more than 50 MHz.
Figure 4: Convolver array and operation.
Multipliers
The multipliers used in the design are also based on the work
of McCanny and McWhirter [5] and are also bit-level systolic arrays. The architecture for the multiplier is shown in Figure 5. The multiplier works as follows: the nth pair of numbers to be multiplied enters the array along the top edges of the diagram. Note that the two upper triangular regions in Figure 5 are merely latches and serve to skew, in time, the input bits before they reach the boundaries of the array, which is contained within the diamond-shaped region. These latch networks are similar to those mentioned in the discussion of the convolver above. This skew ensures that each individual bit of a(n) meets every individual bit of b(n). Each partial product of the form $a_{k-i}(n)\,b_i(n)$ is formed on one of the cells within the kth vertical column, and the kth bit of the product
is formed by letting these partial products accumulate as $s_k(n)$ passes down that column. The sum and carry bits which enter the boundaries of the array are set to zero. Each carry generated in the array is passed to the processor cell to the left of the cell where it was generated. The triangular group of cells to the lower left of the diamond region is used to add to the product any residual carry bits which might be generated by the cells along the lower left boundary of the diamond-shaped region.
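The arithmetic carried out by the bit-level multiplier and convolver can be sketched in software. The example below is an illustration under stated assumptions, not a model of the McCanny and McWhirter arrays: it ignores the time skew and cell-to-cell data flow, treats operands as unsigned integers (the number representation is not given in the text), reads the partial products in column k as the 1-bit ANDs of bit k-i of a with bit i of b, and takes the convolver's sum-of-products to be the standard convolution sum of h(i) x(n-i). The function names are hypothetical.

```python
def to_bits(value, width):
    """Little-endian list of the low `width` bits of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


def bit_level_multiply(a, b, width):
    """Multiply two unsigned width-bit integers using 1-bit ANDs and column sums."""
    a_bits, b_bits = to_bits(a, width), to_bits(b, width)
    product_bits = []
    carry = 0
    for k in range(2 * width):               # one column per product bit
        column = carry
        for i in range(width):
            j = k - i
            if 0 <= j < width:
                column += a_bits[j] & b_bits[i]   # partial product in column k
        product_bits.append(column & 1)      # kth bit of the product
        carry = column >> 1                  # carries move to the more significant column
    return sum(bit << k for k, bit in enumerate(product_bits))


def bit_level_convolve(h, x, n, width):
    """Sum over i of h[i] * x[n - i], with every product formed bit by bit."""
    return sum(bit_level_multiply(h[i], x[n - i], width)
               for i in range(len(h)) if 0 <= n - i < len(x))


print(bit_level_multiply(13, 11, 4))  # 143
```

In the hardware the column sums are spread across many cells and clock cycles, so each cell only ever handles single bits. The sketch collapses that into an inner loop to show that the result is the ordinary product.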
ALUs
The three ALUs shown in Figure 3 are used to compute the constraints used in the various deconvolution algorithms. The possible operating modes of each of the ALUs are shown in Table 1.

Table 1: Operating modes of the ALUs, by algorithm used.

ALU      Operating modes
D-ALU    1;  1 or 0
E-ALU    b;  1;  b or 0;  r(x^k) (eq. 7)
F-ALU    -b;  -1 or 0;  -b or 0;  -r(x^k) (eq. 7)
The ALUs may be the critical path with respect to overall system speed if they cannot be designed to operate as fast as the systolic arrays used in the convolver, multipliers and the adder. A bit-slice ALU is being considered at the present stage of the design.
Memory
The memory shown in Figure 3 will be used to store the programming data loaded from the computer in Figure 2, the estimate of x, and other housekeeping data used by the chip while operating. The memory will be an appropriate amount of RAM and its associated control circuitry.
Finite State Machine
The finite state machine (FSM) found in Figure 3 will be used to control the process of programming and operating the chip. It will be implemented using a PLA and registers.
Adder
The adder found in the design will be a pipelined structure in order to keep up with the continuous stream of products coming from the multiplier arrays. The proposed implementation would also be in the form of a systolic array, as shown in Figure 6. The adder would operate as follows: the input words enter the top of the array as the inputs a, b, and CIN of the processor cells. The first row of cells computes the sum and carry out of the bits present at its inputs. The SUM is passed vertically down to the cell below while COUT is passed diagonally to the cell below and to the left. Then the next row computes the SUM and COUT and passes them to the row below. This continues until all rows are finished and the result of the addition exits the bottom of the array. Carry bits which exit the array on the left can be used to generate an overflow signal.
Conclusions
Although the final design of the chip is not yet complete, it is believed that the initial goal of real time deconvolution using a CMOS VLSI circuit is feasible. The simulations done to date have shown that the systolic array portions of the chip will yield satisfactory throughput for millions of iterations per second. The bottleneck seen at this time may be the ALU responsible for computing the relaxation function of Eqn (7). Further improvements to the basic design will include the ability to cascade the chips to allow for larger word sizes and possibly multi-dimensional signal processing.
Acknowledgments
The authors would like to thank the Bell South Foundation and the University of Tennessee Measurement and Control Center for their funding of this research.
References
[1] P. A. Jansson, Deconvolution With Applications in Spectroscopy. New York: Academic Press, 1984, ch. 3-4, pp. 67-134.
[2] P. B. Crilly, Numerical Deconvolution of Gas Chromatographic Peaks Using Jansson's Method, J. Chemometrics, vol. 1, pp. 79-90, April 1987.
[3] P. B. Crilly, A Comparative Study of Relaxation Based Iterative Deconvolution Methods, in Proceedings of the Twenty-second Southeastern Symposium on System Theory, 1990, pp. 545-549.
[4] J. V. McCanny and J. G. McWhirter, Some Systolic Array Developments in the United Kingdom, Magazine, vol. 20, pp. 58-59, July 1987.
[5] Signal Processing Functions Using 1-Bit Systolic Arrays, Electronics Letters, vol. 18, no. 6, pp. 241-243, March 1982.