A Low-Power Reconfigurable Data-Flow Driven DSP System: Motivation and Background
Marlene Wan, Hui Zhang, Martin Benes, Jan Rabaey
Berkeley Wireless Research Center, EECS Department, University of California, Berkeley
ABSTRACT - Reconfigurable architectures have emerged as a promising implementation platform for providing high-flexibility, high-performance, and low-power solutions for future wireless embedded devices. We discuss in detail a reconfigurable data-flow driven architecture, including its computation model, communication mechanism, and implementation. We also describe a set of software tools developed to perform automatic mapping from algorithms to the architecture and to evaluate the performance and energy of the resulting mapping. Finally, we present results on digital signal processing and wireless communication algorithms that show the energy efficiency of the system and the effectiveness of the tools. Our system shows more than one order of magnitude of improvement in energy efficiency when compared to low-power programmable processors.
The effectiveness of the system is demonstrated by mapping wireless communication and signal processing algorithms to the architecture.

Figure 1. Reconfigurable Digital Signal Processor Design Flow: algorithm optimization, architecture selection (ASIC? programmable DSP? reconfigurable?), and implementation optimization, supported by mapping and estimation based on a hardware component library and an architecture description; data-flow kernels* map onto the reconfigurable architecture while high-level control runs on a microprocessor. (*Kernel: computation-intensive operations within an algorithm, often corresponding to data-flow computations in nested loops.)
In our realization of the architecture, the data-flow driven satellites are medium to fine grained according to the definition of [5]. The functionality of the satellites is divided into three categories: source, computation, and memory. To support adaptive computations without reconfiguration, such as changing the vector length or the number of taps for the computation satellites, we have developed a minimum-overhead mechanism for passing data structures (scalar, vector, and matrix). Each computation satellite needs to be configured for the data structures it consumes and produces (e.g., vectors to scalar for a MAC, shown in Figure 2). The source satellites generate tokens indicating the end of the data structure in parallel with the corresponding data.
Figure 2. Data-flow Driven Operations of the Satellites (e.g., a MAC consumes n-element vectors and produces one scalar; the end-of-vector token is sent in parallel with the last data element).

In order to support dedicated links between satellites without reconfiguration overhead and global control, data steering elements are embedded in the reconfigurable network. In general, data steering elements fall into three categories: static (data goes in a fixed direction between reconfiguration periods), statically scheduled (data goes in directions instructed by programs configured at reconfiguration time), and dynamically determined (data is annotated with the direction). Only the first two are supported by our realization of the architecture template, because dynamic data steering imposes too large an energy overhead for the granularity of the computational satellites.

Currently, the data-flow driven computation is implemented using globally asynchronous, locally synchronous clocking. A general handshaking scheme has been developed, and a library of satellites has been designed using the scheme [6]. Address generators, input ports (with data from the microprocessor), and FPGAs can serve as sources and are in charge of generating end-of-data-structure tokens. Implementation issues for low-power reconfigurable interconnection networks are addressed in detail in [7].
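To make the token-passing mechanism described above concrete, the following minimal C++ sketch shows how a MAC-style computation satellite could consume two tagged token streams and emit one scalar per end-of-vector token. The Token structure and class names are illustrative only and are not the actual hardware or library interface.

// Minimal sketch (not the actual satellite interface): tokens carry an
// end-of-vector flag in parallel with the data, as described above.
#include <cstdint>
#include <iostream>

struct Token {
    int32_t data;
    bool    end_of_vector;   // asserted by the source satellite with the last element
};

// A MAC satellite configured for "vector x vector -> scalar": it accumulates
// products until an end-of-vector token arrives, then emits one scalar result.
class MacSatellite {
    int64_t acc = 0;
public:
    bool consume(const Token& a, const Token& b, int64_t& result) {
        acc += static_cast<int64_t>(a.data) * b.data;
        if (a.end_of_vector) {          // vector boundary: flush the accumulator
            result = acc;
            acc = 0;
            return true;                // one scalar token produced
        }
        return false;
    }
};

int main() {
    MacSatellite mac;
    int32_t x[4] = {1, 2, 3, 4}, y[4] = {5, 6, 7, 8};
    int64_t out = 0;
    for (int i = 0; i < 4; ++i) {
        Token a{x[i], i == 3}, b{y[i], i == 3};   // the source tags the last element
        if (mac.consume(a, b, out))
            std::cout << "dot product = " << out << "\n";   // prints 70
    }
}

Because the vector length is encoded in the token stream rather than in the satellite configuration, the same configured satellite can process vectors of different lengths without reconfiguration.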
Performance Models
Power, delay, and area models have been developed for an extensive library of satellite modules designed at the University of California, Berkeley. Latency and analytical models of the effective switching capacitance (C_eff) of the modules are derived from circuit-level simulation. In this section we introduce only the energy models, since the characterization of area and timing is well understood. Because our models are used for high-level architecture selection, some degree of inaccuracy is acceptable; therefore, a white-noise signal distribution is assumed instead of statistical signal modeling when obtaining C_eff.
Energy_SAT = C_eff * V_dd^2    (Eq. 1)
Another important contributor to power consumption in our architecture is the reconfigurable interconnect. A methodology exists [7] to optimize domain-specific reconfigurable interconnect architectures such that their energy and performance are close to ASIC implementations. Therefore, ASIC-based interconnect power estimation is used for the reconfigurable interconnect base cost: the average wire length (and thus switching capacitance, C_ave) between satellites is predicted based on the area of the modules needed for an application. A preliminary dynamic switching element has also been designed and its power model characterized [6]. The energy for a satellite-to-satellite link is therefore as follows:
Energy_link = (C_ave + M * C_switch) * V_dd^2    (Eq. 2)
The parameter M specifies the number of dynamic switches required on the particular link, which is known at synthesis time.
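As a concrete illustration of these models, the short sketch below evaluates the satellite energy of Eq. 1 and a link cost of the form (C_ave + M * C_switch) * V_dd^2 for Eq. 2; all capacitance and voltage values are placeholders rather than characterized library data.

// Illustrative use of the energy models (Eq. 1 and Eq. 2). All numeric values
// below are placeholders, not characterized library data.
#include <cstdio>

// Eq. 1: energy per satellite operation.
double satellite_energy(double c_eff_farads, double vdd_volts) {
    return c_eff_farads * vdd_volts * vdd_volts;
}

// Eq. 2: energy per satellite-to-satellite transfer over a link with M dynamic switches.
double link_energy(double c_ave_farads, int m_switches, double c_switch_farads,
                   double vdd_volts) {
    return (c_ave_farads + m_switches * c_switch_farads) * vdd_volts * vdd_volts;
}

int main() {
    const double vdd = 1.0;            // placeholder supply voltage
    const double c_eff_mac = 2.0e-12;  // placeholder C_eff for a MAC satellite (2 pF)
    const double c_ave = 0.5e-12;      // placeholder average wire capacitance
    const double c_switch = 0.1e-12;   // placeholder per-switch capacitance
    std::printf("E_sat  = %.3e J\n", satellite_energy(c_eff_mac, vdd));
    std::printf("E_link = %.3e J\n", link_energy(c_ave, /*M=*/2, c_switch, vdd));
}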
Simulation Tool
Based on the realization of the architecture template, a simulation environment has been developed to provide an application-specific simulator in a style similar to [8]. Since computation is mapped to clusters of satellites, an object-oriented intermediate form based on the concepts of modules (heterogeneous satellites) and queues (links between satellites) is created. A mapped kernel is constructed by
building a netlist using the module and queue library (Figure 3). In order to facilitate verification and performance feedback, wrappers are placed around all modules and queues so that modules can be modeled as concurrent processes and queues as synchronized objects. Energy and time stamps are also associated with each module and queue so that performance data can be collected. An application-specific simulator is automatically instantiated once a netlist is specified.
Figure 3. An Intermediate Form Specification for a Computation Kernel

Currently, the intermediate form is implemented in C++ using the Solaris thread library [9] (other common thread libraries can be switched in easily). Common satellite processors (such as the MAC/multiply processor, ALU processor, memory and address generator, etc.) and data-steering modules have been incorporated into our module library.
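The following sketch conveys the flavor of the module/queue intermediate form, using the standard C++ thread facilities in place of the Solaris thread library; the class names, wrapper structure, and energy bookkeeping are illustrative rather than the tool's actual API.

// Sketch of the module/queue intermediate form: modules as concurrent processes,
// queues as synchronized objects, each carrying energy stamps. Names are illustrative.
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>

// A queue models a link between satellites: a synchronized object that also
// accumulates an energy stamp per transfer (an Eq. 2 cost would be charged here).
template <typename T>
class LinkQueue {
    std::deque<T> buf;
    std::mutex mtx;
    std::condition_variable cv;
public:
    double energy = 0.0;                   // accumulated link energy
    void put(const T& v, double e_link) {
        std::lock_guard<std::mutex> lk(mtx);
        buf.push_back(v);
        energy += e_link;
        cv.notify_one();
    }
    T get() {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return !buf.empty(); });
        T v = buf.front();
        buf.pop_front();
        return v;
    }
};

// A module models a satellite: a concurrent process that consumes from input
// queues, performs its operation, and charges its own energy stamp (Eq. 1).
struct AdderModule {
    LinkQueue<int>& in_a;
    LinkQueue<int>& in_b;
    LinkQueue<int>& out;
    double energy;
    void run(int n_tokens, double e_op, double e_link) {
        for (int i = 0; i < n_tokens; ++i) {
            int s = in_a.get() + in_b.get();
            energy += e_op;
            out.put(s, e_link);
        }
    }
};

int main() {
    LinkQueue<int> a, b, y;
    AdderModule add{a, b, y, 0.0};
    std::thread t([&] { add.run(3, /*e_op=*/1e-12, /*e_link=*/2e-13); });
    for (int i = 0; i < 3; ++i) { a.put(i, 2e-13); b.put(10 * i, 2e-13); }
    for (int i = 0; i < 3; ++i) std::cout << y.get() << " ";   // prints 0 11 22
    t.join();
    std::cout << "\nmodule energy = " << add.energy << " J\n";
}

Instantiating a simulator then amounts to building such a netlist of module and queue objects and starting one thread per module.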
Synthesis Tool
To relieve the burden of manually mapping algorithms to the architecture, we provide a synthesis tool that translates an algorithm (specified in a subset of C) into a direct-mapped implementation on the architecture. The output is the computation specified in the intermediate form, from which the kernel performance and energy can be collected dynamically. In addition, for algorithms with nested loops of constant loop length, energy and performance information is also analyzed statically to avoid the overhead of simulation. The algorithm is compiled into the Stanford University Intermediate Format (SUIF) and then converted into a hierarchical control/data flow graph (CDFG [10]) representation. The current conversion from SUIF to CDFG exposes all scalar dependencies and preserves all WAW, RAW, and WAR dependencies in array accesses. The current synthesis tool allocates arrays of the same name to a particular memory and each
operation node in the CDFG to a hardware unit (Figure 4 shows an example of a mapped kernel). This assignment of operations gives rate-optimal execution of each computational node in the CDFG. The address generator program for each memory is generated based on the sequence of address expressions and the corresponding loop iterations (an end of loop indicates an end of a vector). By merging the corresponding fan-outs of each memory read node and computational node in the CDFG, a dedicated data steering element is generated for each output port. As shown in the example of Figure 4, while all other links are static, the output of memory Y1 (y1 has 4 read nodes in the algorithm) is statically scheduled by a program: data from memory Y1 is first broadcast to the MAC satellites, but after an end-of-vector (corresponding to Loop1) the direction of the data is changed to the multiplier.

Figure 4. Direct Mapping from C to Data-flow Driven Implementation
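For illustration, a hypothetical kernel in the style of the tool's C-subset input is shown below (it is not the kernel of Figure 4): each array would map to a memory with an address generator, the inner loop body would map to a MAC satellite, and the end of each inner loop would produce an end-of-vector token on the corresponding link.

/* Hypothetical kernel in the style of the C-subset input (not the kernel of
 * Figure 4). Arrays map to memories with address generators; the inner loop
 * maps to a MAC satellite; the end of each inner loop becomes an end-of-vector
 * token on the corresponding link. */
#define N 16
#define M 4

void kernel(const int x[M][N], const int y1[N], int out[M])
{
    for (int i = 0; i < M; i++) {        /* Loop1: one scalar result per pass   */
        int acc = 0;
        for (int j = 0; j < N; j++)      /* Loop2: vector consumed by the MAC   */
            acc += x[i][j] * y1[j];      /* y1 is re-read on every pass, so its */
        out[i] = acc;                    /* memory output port is broadcast     */
    }
}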
Static performance estimation for loops with constant loop length is also provided so that the overhead of simulation can be avoided. For a hierarchical CDFG, the total energy can be computed by performing a tree search on the graph in O(E) time:
TotalEnergy = IterationNum * Σ_comp Energy_comp
In the equation, comp is either a satellite (Eq. 1), a link between satellites (Eq. 2), or another CDFG hierarchy.
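A minimal sketch of this static estimate is shown below, assuming a simple tree representation of the hierarchical CDFG; the node structure and the numeric energies are illustrative only.

// Sketch of the static energy estimate for a hierarchical CDFG: each node is a
// satellite (Eq. 1), a link (Eq. 2), or a nested loop hierarchy, and the total is
// IterationNum times the sum over its components. The structure is illustrative.
#include <iostream>
#include <memory>
#include <vector>

struct CdfgNode {
    double self_energy = 0.0;                       // Eq. 1 or Eq. 2 cost for a leaf
    int iteration_num = 1;                          // loop count for a hierarchy node
    std::vector<std::unique_ptr<CdfgNode>> children;
};

// One pass over the tree, linear in the number of components.
double total_energy(const CdfgNode& n) {
    double sum = n.self_energy;
    for (const auto& c : n.children)
        sum += total_energy(*c);
    return n.iteration_num * sum;
}

int main() {
    // Hypothetical kernel: an inner loop of 16 MAC operations plus a link,
    // nested in an outer loop of 4 iterations with one multiply per pass.
    auto inner = std::make_unique<CdfgNode>();
    inner->iteration_num = 16;
    auto mac = std::make_unique<CdfgNode>();  mac->self_energy = 2.0e-12;
    auto link = std::make_unique<CdfgNode>(); link->self_energy = 0.7e-12;
    inner->children.push_back(std::move(mac));
    inner->children.push_back(std::move(link));

    CdfgNode outer;
    outer.iteration_num = 4;
    auto mul = std::make_unique<CdfgNode>();  mul->self_energy = 1.5e-12;
    outer.children.push_back(std::move(inner));
    outer.children.push_back(std::move(mul));

    std::cout << "estimated kernel energy = " << total_energy(outer) << " J\n";
}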
Since the synthesis tool performs a direct mapping of the CDFG to hardware, the latency of the implementation is characterized by the longest path and the iteration period bound of the CDFG. The longest path is calculated by performing a topological sort of the CDFG (O(E)), and the iteration period bound is calculated in O(V E log E) [12][13].
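The sketch below illustrates the longest-path part of this computation with a standard topological-sort traversal over a small hypothetical DAG; the graph, node delays, and function names are not taken from the tool.

// Longest path through a DAG (node-weighted) via topological sort, O(V + E).
// The adjacency structure and delays below are illustrative only.
#include <algorithm>
#include <iostream>
#include <queue>
#include <vector>

double longest_path(const std::vector<std::vector<int>>& adj,
                    const std::vector<double>& delay) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> indeg(n, 0);
    for (const auto& outs : adj)
        for (int v : outs) ++indeg[v];

    std::queue<int> ready;
    std::vector<double> dist(n);
    for (int u = 0; u < n; ++u) {
        dist[u] = delay[u];                       // path consisting of the node alone
        if (indeg[u] == 0) ready.push(u);
    }
    double best = 0.0;
    while (!ready.empty()) {
        int u = ready.front(); ready.pop();
        best = std::max(best, dist[u]);
        for (int v : adj[u]) {                    // relax successors in topological order
            dist[v] = std::max(dist[v], dist[u] + delay[v]);
            if (--indeg[v] == 0) ready.push(v);
        }
    }
    return best;
}

int main() {
    // Tiny example: 0 -> 1 -> 3 and 0 -> 2 -> 3, with node 2 slower than node 1.
    std::vector<std::vector<int>> adj = {{1, 2}, {3}, {3}, {}};
    std::vector<double> delay = {1.0, 2.0, 5.0, 1.0};
    std::cout << "longest path = " << longest_path(adj, delay) << "\n";   // prints 7
}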
CASE STUDIES
We demonstrate the low-energy nature of the system and the effectiveness of the tools' performance feedback by using the performance information in several architecture selection processes. All energy and performance models of the satellite modules and interconnects are based on physical implementations designed in a 0.25 μm technology. The reconfigurable interconnect is characterized in detail in [7], and the preliminary overhead of the steering elements is included as well.
Energy on ARM8 (2.5 V):
  Dot_product          11550 μJ
  FIR                   5690 μJ
  VectorSumScalarMul    4800 μJ
  Compute_Code          1550 μJ
  IIR                    390 μJ
CONCLUSION
We have presented a low-power reconfigurable data-flow driven digital signal processing system, described the architecture concept in detail, and shown the energy efficiency of the architecture in the case studies. The examples in the case studies also illustrate how the tools introduced in this paper allow rapid architecture selection and serve as the basis for future optimizations. Building on the ideas introduced in this paper, future work will include algorithm-level transformations (loop transformation and parallelism), implementation optimizations, and more application mappings in adaptive filtering for the wireless communication domain.
ACKNOWLEDGEMENTS
The authors would like to acknowledge DARPA's support of the Pleiades project (DABT-63-96-C-0026) and all the Pleiades members for their input on the research topics discussed in this article.
REFERENCES
[1] G. R. Goslin, "A Guide to Using Field Programmable Gate Arrays for Application-Specific Digital Signal Processing Performance," Proceedings of SPIE, vol. 2914, pp. 321-331.
[2] A. Abnous et al., "Evaluation of a Low-Power Reconfigurable DSP Architecture," Proceedings of the Reconfigurable Architectures Workshop, Orlando, FL, USA, March 1998.
[3] M. Goel and N. R. Shanbhag, "Low-Power Reconfigurable Signal Processing via Dynamic Algorithm Transformations (DAT)," Proceedings of the Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, November 1998.
[4] A. Abnous and J. Rabaey, "Ultra-Low-Power Domain-Specific Multimedia Processors," Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, CA, USA, October 1996.
[5] P. Lieverse, E. F. Deprettere, A. C. J. Kienhuis, and E. A. de Kock, "A Clustering Approach to Explore Grain-Sizes in the Definition of Weakly Programmable Processing Elements," 1997 IEEE Workshop on Signal Processing Systems: Design and Implementation, pp. 107-120, De Montfort University, Leicester, UK, November 3-5, 1997.
[6] M. Benes, Master's Thesis, University of California at Berkeley, 1999.
[7] H. Zhang, M. Wan, V. George, and J. Rabaey, "Interconnect Architecture Exploration for Low-Energy Reconfigurable Single-Chip DSPs," Proceedings of WVLSI, Orlando, FL, USA, April 1999.
[8] B. Kienhuis, E. Deprettere, K. Vissers, and P. van der Wolf, "An Approach for Quantitative Analysis of Application-Specific Dataflow Architectures," Proc. 11th Int. Conf. on Application-Specific Systems, Architectures and Processors, Zurich, Switzerland, July 14-16, 1997.
[9] SunSoft Press, Solaris Multithreaded Programming Guide.
[10] J. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, "Fast Prototyping of Datapath-Intensive Architectures," IEEE Design & Test of Computers, vol. 8, no. 2, pp. 40-51, June 1991.
[11] D. Messerschmitt, "Breaking the Recursive Bottleneck," in Performance Limits in Communication Theory and Practice, Kluwer Academic Publishers, 1988.
[12] S.-H. Huang and J. M. Rabaey, "An Integrated Framework for Optimizing Transformations," Proceedings of VLSI Signal Processing IX, pp. 263-272.
[13] C. Leiserson and F. Rose, "Optimizing Synchronous Circuitry by Retiming," Third Caltech Conf. on VLSI, March 1983.
[14] D. Lidsky and J. Rabaey, "Early Power Exploration - a World Wide Web Application," Proceedings of the Design Automation Conference, Las Vegas, NV, June 1996.
[15] N. Zhang, "Implementation Issues in a Wideband Receiver Using Multiuser Detection," Master's Thesis, University of California at Berkeley, 1998.
[16] W. Lee et al., "A 1-V Programmable DSP for Wireless Communications," IEEE Journal of Solid-State Circuits, vol. 32, no. 11, pp. 1766-1776, November 1997.
[17] Gerson and M. Jasiuk, "Vector Sum Excited Linear Prediction (VSELP) Speech Coding at 8 kbps," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 461-464, April 1990.
[18] M. Wan, Y. Ichikawa, D. Lidsky, and J. Rabaey, "An Energy Conscious Methodology for Early Design Exploration of Heterogeneous DSPs," Proceedings of the Custom Integrated Circuits Conference, Santa Clara, CA, USA, May 1998.