High Speed Low Level Image Processing on FPGAs Using Distributed Arithmetic
Abstract. A method of calculating the 3x3 mask convolution required for many low level image processing algorithms is presented, which is well suited for implementing on reconfigurable FPGAs. The approach is based on distributed arithmetic, and uses the symmetry of weights present in many filter masks to save on circuit space, at the expense of requiring custom circuitry for each set of coefficients. A four product example system has been implemented on a Xilinx XC6216 device to demonstrate the suitability of this reconfigurable architecture for image processing systems.
Introduction
The spatial convolution process uses a weighted average of an input pixel and its immediate neighbours to calculate an output pixel value. For a 512x512 image, the number of multiply-accumulate operations needed to perform a 3x3 mask convolution is around 2.4 million. At a frame rate of 25 frames/sec, real-time processing requires 59 million operations per second. Normally, to meet this computational requirement, fast dedicated parallel multipliers and accumulators are needed. We consider the use of current FPGAs for performing these multiply accumulate operations; in particular, we examine the capabilities of the Xilinx XC6200 series devices. It is clear that fast parallel multipliers are not suitable due to their large size. Instead, we consider the use of distributed arithmetic, and investigate current methods of implementing distributed arithmetic elements on the 6200 series devices.
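The operation counts quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes one multiply-accumulate per mask tap for every pixel of the frame and ignores any special treatment of image borders. The variable names are illustrative.

```python
# Operation count for a 3x3 mask convolution on a 512x512 image at 25 frames/sec.
# Assumption: every pixel produces one output, border handling is ignored.
WIDTH, HEIGHT = 512, 512
MASK_TAPS = 3 * 3
FRAME_RATE = 25

pixels_per_frame = WIDTH * HEIGHT                # output pixels per frame
macs_per_frame = pixels_per_frame * MASK_TAPS    # one MAC per mask tap per pixel
macs_per_second = macs_per_frame * FRAME_RATE
pixels_per_second = pixels_per_frame * FRAME_RATE

print(macs_per_frame)     # 2359296  (about 2.4 million)
print(macs_per_second)    # 58982400 (about 59 million)
print(pixels_per_second)  # 6553600  (about 6.5 million results per second)
```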
Multiply accumulate (MAC) structures can be implemented on devices with limited space by using Distributed Arithmetic (DA) techniques [1,2]. Partial products are stored in look-up tables (LUTs), which are accessed sequentially and summed by a single accumulator.
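As an illustration of the principle, the following is a minimal software model of a bit-serial DA multiply accumulate. It is a sketch under stated assumptions rather than a description of the circuit: pixels are taken to be unsigned, bits are processed least significant first, and the function names `build_da_lut` and `da_mac` are hypothetical.

```python
def build_da_lut(weights):
    """Partial-product table: entry `addr` holds the sum of the weights
    whose corresponding address bit is 1."""
    n = len(weights)
    return [sum(w for i, w in enumerate(weights) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_mac(pixels, weights, pixel_bits=8):
    """Bit-serial multiply accumulate for unsigned pixels (assumption)."""
    lut = build_da_lut(weights)
    acc = 0
    for bit in range(pixel_bits):              # least significant bit first
        addr = 0
        for i, p in enumerate(pixels):
            addr |= ((p >> bit) & 1) << i      # one bit taken from each pixel
        acc += lut[addr] << bit                # weighted add in the single accumulator
    return acc


pixels = [200, 17, 255, 3]
weights = [3, -1, 2, 5]
print(da_mac(pixels, weights))                        # 1108
print(sum(p * w for p, w in zip(pixels, weights)))    # 1108, direct MAC for comparison
```

Note that the table for n independent weights has 2^n entries, which is what motivates the reduction described next.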
In implementing masks where several weights are the same, it is possible to exploit symmetry to reduce the size of the partial product LUTs. This can be performed by accumulating the inputs prior to accessing the look-up table using a serial adder [3]. At the expense of slightly larger LUTs, it is also possible to use full width addition, which results in faster circuits; this is the approach used here. Also, since all weighted results are produced at the same time, this unifies the control process for filters containing different symmetries. An example block structure for a 3x3 mask convolution process is shown below in figure 1.
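[Figure 1: block structure for a 3x3 mask convolution. The mask has individual weights w0, w2, w4, w6, w8 at the corner and centre positions and a common weight w at the four remaining positions. Pixels P1, P3, P5, P7 feed an adder followed by the constant multiply accumulate block, giving w(P1+P3+P5+P7); the remaining pixels feed a multiply accumulate block.]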
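The shared-weight path can be modelled in the same style. In this sketch the four pixel bits of each bit plane are added at full width before the table is addressed, so the table needs only one entry per possible sum (0 to 4) instead of one entry per bit pattern. Unsigned pixels and the name `shared_weight_mac` are assumptions made for illustration.

```python
def shared_weight_mac(pixels, w, pixel_bits=8):
    """Compute w * sum(pixels) with a pre-adder in front of the LUT."""
    lut = [w * s for s in range(len(pixels) + 1)]   # 5 entries for 4 pixels
    acc = 0
    for bit in range(pixel_bits):
        s = sum((p >> bit) & 1 for p in pixels)     # adder: number of 1 bits, 0..4
        acc += lut[s] << bit                        # DA accumulator
    return acc


pixels = [10, 20, 30, 40]            # stands in for P1, P3, P5, P7
print(shared_weight_mac(pixels, 3))  # 300
print(len([3 * s for s in range(5)]), "entries instead of", 2 ** 4)
```

The 5-entry table here corresponds to the 5xM LUT used in the example system.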
Design Implementation
A four product example system using distributed arithmetic has been produced to illustrate the method of implementing LUT-based circuits on the XC6200. The design can be used to calculate the inner products of a mask for four common coefficients, and is shown in figure 1 as the constant multiply accumulate block. The XC6200 series devices offer a fine-grained sea of gates, with dynamic partial reconfiguration and a dedicated microprocessor interface [4]. Each cell can implement any 2 input function, with a register if required. Unlike other RAM based Xilinx devices, no use of the RAM embedded on the device by other parts of the device has yet been demonstrated. Thus, to implement storage on the XC6200 series devices, it is necessary to explicitly design circuits capable of being addressed and producing the appropriate data word. This is of great significance when implementing LUTs on the XC6200.
In implementing the constant multiply accumulate block, the most important aspect is the implementation of the look-up table; the other components are quite simple to design. Briefly, the adder prior to the look-up table provides the sum of 4 input bits, and occupies 9 cells. For a system using 8 bit pixel values and 4 bit weights, the DA accumulator occupies a bounding box of size 5 x 12, though the routing within the box is fairly sparse.
A number of methods are suitable for implementing look-up tables on the XC6200, depending on whether the contents of the look-up tables are known at synthesis time. If the contents are to be changed to any arbitrary value at run time without any software support for pre-computing values, then a series of loadable registers and multiplexors must be used. This will clearly use a lot of the FPGA real estate; for example, a 256 entry look-up table storing 8 bit values would consume all the area of a 6216 device. If the contents of the LUT are known at synthesis time, smaller implementations are possible. In [5,6], a technique of embedding the bits of the look-up table in the logic gate available in the 6200 series cell is presented. We have used this technique to implement LUTs by constructing a generator which accepts the required stored values as input and produces a LOLA description [7] of the required circuit, complete with placement information. This can subsequently be processed by the Lola Development System tools to produce configuration data, as shown in figure 2. LOLA is in effect used by the generator as an intermediate language, which can then be further processed by device specific tools, in a similar manner to the bytecode used by Java interpreters.
[Figure 2: C-code definition of LUT contents -> LUT Generator -> Lola description of LUT circuit -> Lola Development System -> XC6200 config datastream]
Fig. 2. Design flow for LUTs
To clarify the flow used in [5,6], an example will be discussed. Consider a LUT required to store the 4 unsigned values <13,9,2,4>. Clearly the address of the LUT will be 2 bits (a1a0), and four bits will correctly represent the output (d3d2d1d0). Firstly, a table is constructed showing the binary pattern required for each of the output bits. The required logic function to implement each output bit given the two input bits is identified by reading down the column for each bit, as shown in table 1.
Table 1. Example of LUT construction
a1a0      d3    d2             d1       d0
00        1     1              0        1
01        1     0              0        1
10        0     0              1        0
11        0     1              0        0
Function  ~a1   ~(a1 xor a0)   a1*~a0   ~a1
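The construction of table 1 can be checked with a short script. This is a sketch of the column-reading step only, not the LOLA generator itself; the helper names `bit_columns` and `read_lut` are hypothetical, and the dictionary `TABLE_1` simply writes out the four functions from the table for comparison.

```python
def bit_columns(values, width):
    """For each output bit, list its value at addresses a1a0 = 00, 01, 10, 11."""
    if len(values) != 4:
        raise ValueError("a two-bit address selects exactly four entries")
    return {bit: tuple((v >> bit) & 1 for v in values) for bit in range(width)}


def read_lut(columns, a1, a0):
    """Evaluate every per-bit function at (a1, a0) and reassemble the word."""
    addr = (a1 << 1) | a0
    return sum(col[addr] << bit for bit, col in columns.items())


# Functions listed in table 1, written out for comparison.
TABLE_1 = {
    3: lambda a1, a0: 1 - a1,          # ~a1
    2: lambda a1, a0: 1 - (a1 ^ a0),   # ~(a1 xor a0)
    1: lambda a1, a0: a1 & (1 - a0),   # a1 * ~a0
    0: lambda a1, a0: 1 - a1,          # ~a1
}

values = [13, 9, 2, 4]
columns = bit_columns(values, width=4)
for bit in (3, 2, 1, 0):
    print(f"d{bit}", columns[bit])

# Each column is the truth table of one 2-input function, i.e. one cell.
matches_table = all(
    TABLE_1[bit](a1, a0) == columns[bit][(a1 << 1) | a0]
    for bit in columns for a1 in (0, 1) for a0 in (0, 1)
)
print(matches_table)
```

Reading the LUT back with `read_lut(columns, a1, a0)` for the four addresses returns the stored values 13, 9, 2, 4.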
Since the logic of a 6200 cell can implement any 2 input function, LUTs containing any bit pattern can be constructed this way. Implementing LUTs requiring more bits per entry is simply a matter of duplicating the process for as many additional bits as are required. Implementing LUTs with more than 4 entries requires multiplexors on the outputs of multiple LUTs using the additional address bits as selectors, as discussed in [5]. The LUTs produced by the generator match the sizes reported in [5]. For example, a 64x4 bit LUT requires 124 cells. In general, for a LUT of size $2^k \times 4m$, the cell usage is given by $m(2^k + 4(2^{k-2} - 1))$.
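The cell usage formula can be evaluated directly. The sketch below also shows one way to arrive at the same count, under the assumption (ours, though consistent with the formula) that each output bit uses one cell per four-entry LUT plus one cell per 2:1 multiplexor in a binary selection tree. The function names are illustrative.

```python
def lut_cells(k, m):
    """Cell usage m(2^k + 4(2^(k-2) - 1)) for a 2^k-entry LUT with 4m output bits."""
    if k < 2 or m < 1:
        raise ValueError("expects k >= 2 address bits and m >= 1")
    return m * (2 ** k + 4 * (2 ** (k - 2) - 1))


def lut_cells_by_structure(k, m):
    """Same count from an assumed structure: per output bit, 2^(k-2)
    four-entry LUT cells plus a tree of 2^(k-2) - 1 multiplexor cells."""
    base_luts = 2 ** (k - 2)
    muxes = base_luts - 1
    return 4 * m * (base_luts + muxes)


print(lut_cells(6, 1))               # 124, the 64x4 bit example
print(lut_cells_by_structure(6, 1))  # 124
print(lut_cells(2, 1))               # 4, one cell per output bit as in table 1
```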
In the case of a 5xM LUT, as is required in the example system, a second LUT containing the value when a2 is set could be used together with the multiplexor. Since the value is independent of a0 and a1, this LUT would simply contain hardwired 1s and 0s. It is possible to use constant propagation to reduce the storage requirements by merging these hardwired units with the logic of the multiplexors. Since each of the required functions for multiplexing in a 0 or 1 can be implemented using a single cell, this saves a third of the required area. For instance, a 5x8 bit LUT can be implemented in 8 cells.
To evaluate the performance of the four product example system, an XC6200 FPGA card was used, which allows data to be written and read over a system PCI bus. The pixel data was transferred to and from internal registers using the FastMap interface of the 6216 present on the board. Though it is possible for user logic to determine that a register has been accessed using the FastMap interface, and hence start working [8], a separate software controlled clock was used. Using 8 bit pixels, two write accesses were required (1 for pixel data, 1 for the clock) before the result could be read back. Using the functions supplied in the board's API for transferring data to and from columns on the XC6216 device, the performance was limited to 60,000 results per second on a 100 MHz Pentium machine. By pre-computing target memory addresses and re-writing the assembly code using the API as a base, 700,000 results per second could be calculated. As noted in [8], this could be further improved by using the SRAM on the board. However, extra circuitry would be required to control the transfers to and from the SRAM. This result is an order of magnitude short of the 6.5 million pixel results per second required to process 25 512x512 frames every second in real time. The problem is now I/O bound by the performance of the PCI interface.
However, due to the small size of the system, it is possible to process a number of pixels simultaneously, and since the mask slides over the target image, this will actually reduce the amount of I/O required, and hence speed up the overall execution.
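The I/O saving from overlapping mask positions can be estimated with a simple model. This is a sketch under our own assumptions, not a measurement: n horizontally adjacent 3x3 windows are computed from one shared block of 3 x (n + 2) input pixels, and no pixels are retained on the device between blocks. The function name is hypothetical.

```python
def pixels_transferred_per_result(n):
    """Input pixels written per output pixel when n adjacent 3x3 windows
    share one 3 x (n + 2) block of input pixels (assumed model)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 3 * (n + 2) / n


for n in (1, 2, 4, 8):
    print(n, pixels_transferred_per_result(n))
# 1 9.0
# 2 6.0
# 4 4.5
# 8 3.75
```

Under this model the transfer cost per result falls from 9 pixels towards 3 as more windows are processed together.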
Changing the weights can be performed in two ways, which on the XC6200 series are actually rather similar. Considering first the address bus format for the XC6216 device, described in [4], if mode(1:0) is set to 00, this allows access to the cell configuration and the state of the cell register. If a register and multiplexor method is used to implement the look-up tables, the new weights are loaded into the registers by writing into the region of the device's memory map which maintains the state of the cell register. This is performed by setting the column offset to 11 and writing to the appropriate columns. If the bits of the look-up tables are embedded within logic functions, as described in section 4, new weights are loaded by writing pre-computed configuration data into internal cell routing registers for each cell. This is performed by setting the column offset to 01 or 10, and writing to the appropriate row and columns. The configuration data must be pre-computed using the generator described in section 4, to give the correct logic functions.
For image processing systems requiring adaptive changes to weights, pre-computing configuration data is impractical. In addition, the configuration overhead is significant. For image processing systems requiring a small number of changes to the weights, the pre-computing of configuration data is not such a problem. Also, techniques are available for compressing the configuration bitstream to minimize the reconfiguration overhead [9].
As well as the changing of individual weights within a mask, it is important to consider changing the symmetry of the mask. In the worst case, each mask value is different, requiring a single large LUT to hold the partial products. Currently, we are investigating techniques for minimizing the reconfiguration overhead between different mask symmetries, and hope to have results shortly.
Distributed arithmetic techniques can successfully be implemented on devices without explicit support for embedded storage. By using techniques to save on LUT entries, it is possible to reduce the space required for the LUTs, without affecting the performance of the system. It is possible to exploit partial reconfiguration to change weight values, and in the future, we wish to quantify the effects of changing the symmetry of masks in terms of the reconfiguration overhead.
References
1. Jaggernauth, J., Loui, A.C.P., Venetsanopoulos, A.N.: Real-Time Image Processing by Distributed Arithmetic Implementation of Two-Dimensional Digital Filters, IEEE Trans. ASSP, Vol. ASSP-33, No. 6, pp. 1546-155, Dec. 1985
2. Goslin, G.R.: Using Xilinx FPGAs to design custom digital signal processing devices, available at http://www.xilinx.com/appnotes/dspx5dev.htm
3. Mintzer, L.: FIR Filters with Field-Programmable Gate Arrays, Journal of VLSI Signal Processing, Vol. 6, pp. 119-127, 1993
4. Xilinx Inc.: XC6200 field programmable gate arrays, available at http://www.xilinx.com/partinfo/6200.pdf
5. Xilinx Inc.: A fast constant coefficient multiplier for the XC6200, available at http://www.xilinx.com/xapp/xapp082.pdf
6. Duncan, A., Kean, T.: DES keybreaking, encryption and decryption on the XC6216, Proc. 6th Annual IEEE Symposium on Custom Computing Machines, 1998
7. Wirth, N.: Digital circuit design, Springer-Verlag Berlin Heidelberg, 1995
8. Singh, S., Slous, R.: Accelerating Adobe Photoshop with reconfigurable logic, Proc. 6th Annual IEEE Symposium on Custom Computing Machines, 1998
9. Hauck, S., Li, Z., Schwabe, E.: Configuration compression for the Xilinx XC6200 FPGA, Proc. 6th Annual IEEE Symposium on Custom Computing Machines, 1998