Data Flow Graph Mapping Techniques of Computer Architecture with Data Driven Computation Model

Branislav Madoš, Anton Baláž

Department of Computers and Informatics, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Letná 9, 042 00 Košice, Slovakia, e-mail: {branislav.mados, anton.balaz}@tuke.sk
Abstract — This article introduces an architecture of a computer with a data driven computation model based on the principles of tile computing, a modern approach to multi-core microprocessor design whose basic principle is the layout of cores in a bi-directional mesh with an interconnecting communication network. Special attention is paid to the description of the data flow graph mapping technique and the multi-mapping techniques proposed within this architecture. The hardware implementation of the computer's prototype, using the Xilinx Spartan 3 PCIe Starter board with a Spartan 3 AN FPGA chip, is also described. The architecture is developed at the Department of Computers and Informatics, Faculty of Electrical Engineering and Informatics, Technical University of Košice. The work is one of the results reached within projects VEGA 1/4071/07 and APVV 0073-07, being solved at the Department of Computers and Informatics, Faculty of Electrical Engineering and Informatics, Technical University of Košice.
I. INTRODUCTION

The exponentially increasing number of transistors integrated on a chip, with increasing density of integration in accordance with Moore's law, provides the opportunity to increase the performance of microprocessors. Conventional superscalar microprocessor architecture cannot simply be scaled up in line with this trend, and this situation results in the challenging problem of how to maintain the proportion between the quantity of integrated transistors and the performance of the chip. The disproportion between communication performance and computation performance intensifies with the miniaturization of transistors, because of the side effect of the relative lengthening of on-chip wiring. Centralized control of the microprocessor intensifies the problem, and a memory model using cache memory causes extensive associative searches. The increasing circuit complexity of superscalar microprocessors, with longer design times, complex design tests and an increasing number of design bugs, is another negative aspect of the effort to increase performance by integrating more transistors into a single processor core. Mainstream multi-core designs of superscalar processors are trying to address these challenges, and commercially available microprocessors integrate two or four cores into a single chip.
II. TILE COMPUTING

A pushing trend in the architecture of multi-core microprocessors is the integration of tens of cores in a tiled configuration. Tile computing, introduced in [1], can be characterized by the multiple use of processing elements, memory elements, input/output elements and various types of interconnecting networks. Proposed architectures use control driven computation models, some of them with VLIW, as well as data driven computation models. A representative of general purpose tile computing microprocessors is Tile64, designed by Tilera Corporation and described in [2]. Tile64 integrates 64 general purpose cores called tiles, each integrating L1 and L2 cache. Tiles are arranged in an 8 × 8 bi-dimensional mesh using an interconnecting network with 31 Tbps data throughput. The chip utilizes a 3-way VLIW pipeline for instruction level parallelism. Each tile is able to independently run an operating system, or multiple tiles are able to run a multiprocessing operating system. The performance of the chip at 700 MHz is 443 × 10^9 operations per second (443 BOPS). The Intel TeraScale Processor is designed under the Tera-Scale Computing Research Program and is described in [3]. The TeraScale Processor integrates 80 cores in a bi-dimensional mesh with an 8 × 10 organization. Implemented in 65 nm process technology, the chip integrates 100 million transistors on a 275 mm2 die. The processor performs 4 FLOP per cycle per core, and at 4.27 GHz delivers a theoretical peak performance of 1.37 TFLOPS in single precision. The instruction set architecture consists of 16 instructions, and the TeraScale Processor uses a 96-bit VLIW. Each core runs its own program, and the interconnection network transfers data and coordination instructions for the execution of programs between cores via message passing. A representative of DSP microprocessors with tile organization is VGI. Its 64 cores are organized in an 8 × 8 bi-dimensional mesh, and each core contains approximately 30 000 transistors. VGI utilizes the dataflow paradigm of computation.
Whereas the described architectures integrate tens of cores, a new approach to microprocessor design, spatial computing, with representatives such as TRIPS, RAW, SmartMemories, nanoFabrics or WaveScalar, is heading towards the integration of hundreds or even thousands of simple cores or processing elements (PE), often arranged along with memory elements in a grid [4][5][6][7].
III. DATA FLOW COMPUTING

The data flow paradigm of computation was popularized in the 1960s and 1970s and describes non-Von Neumann architectures with the ability of fine grain parallelism in the computation process. In a data flow architecture the computation is not instruction-flow driven, and there is no concept of a program counter. Control of the computation is realized by the flow of data: an instruction is executed as soon as all of its operands are present. When executed, an instruction produces output operands, which are input operands for other instructions. The data flow paradigm of computing uses a directed graph G = (V, E), called a Data Flow Graph (DFG), to describe the behavior of the data driven computer. A vertex v ∈ V is an actor; a directed edge e ∈ E describes the precedence relationship of a source actor to a target actor and guarantees proper execution of the data flow program. This assures the proper order of instruction execution with simultaneous parallel execution of instructions. Tokens are used to indicate the presence of data in the DFG. An actor in a data flow program can be executed only when the requisite number of data values (tokens) is present on the input edges of the actor. When an actor execution is fired, the defined number of tokens from the input edges is consumed and the defined number of tokens is produced on the output edges. An important characteristic of a data flow program is its ability to expose parallelism of computation, detected at the lowermost basis, the machine instruction level. There are static, dynamic and also hybrid data flow computing models. In the static model, only one token may be placed on an edge at the same time, and when an actor fires, no token is allowed on its output edges. The disadvantage of the static model is the impossibility of using dynamic forms of parallelism, such as loops and recursive parallelism.
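The static firing rule described above can be illustrated with a small sketch. This is a simplified model in Python, not part of the proposed architecture; the actor and edge names are illustrative:

```python
# Minimal static dataflow interpreter: an actor fires only when a token is
# present on every input edge and its output edge is empty; firing consumes
# the input tokens and produces one output token.

class Actor:
    def __init__(self, name, op, inputs, output):
        self.name, self.op = name, op
        self.inputs, self.output = inputs, output   # edge names

def run(actors, tokens):
    """Repeatedly fire any enabled actor until no actor can fire."""
    fired = True
    while fired:
        fired = False
        for a in actors:
            # Static model: all input tokens present, output edge empty.
            if all(e in tokens for e in a.inputs) and a.output not in tokens:
                args = [tokens.pop(e) for e in a.inputs]   # consume tokens
                tokens[a.output] = a.op(*args)             # produce a token
                fired = True
    return tokens

# DFG for (a + b) * (a - b), expressed as three actors; the add and sub
# actors have no mutual dependency, so they expose instruction-level
# parallelism that a dataflow machine could exploit.
actors = [
    Actor("add", lambda x, y: x + y, ["a1", "b1"], "s"),
    Actor("sub", lambda x, y: x - y, ["a2", "b2"], "d"),
    Actor("mul", lambda x, y: x * y, ["s", "d"], "p"),
]
tokens = {"a1": 5, "b1": 3, "a2": 5, "b2": 3}
print(run(actors, tokens)["p"])   # (5+3)*(5-3) = 16
```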
A computer with static data flow architecture was first introduced by Dennis and Misunas in 1974 [8]. The dynamic model of data flow computer architecture allows placing more than one token on an edge at the same time. To implement this feature, the tagging of tokens was established: each token is tagged, and the tag identifies the conceptual position of the token in the token flow. To fire an actor, a token with the same tag must be identified on each input edge of the actor. After the actor fires, those tokens are consumed and the predefined number of tokens is produced on the output edges of the actor. There is no condition that no tokens may be placed on the output edges of an actor. The architecture of the dynamic data flow computer was first introduced at the Massachusetts Institute of Technology (MIT) as the Tagged Token Dataflow Architecture [9]. Hybrid data flow architecture is a combination of control flow and data flow computation control mechanisms. Data flow computing is predominantly the domain of research laboratories and scientific institutions, and has had limited impact on commercial computing because of difficulties in the cost of communication, the organization of computation, the manipulation of structured data and the cost of matching [10][11]. The paradigm of tile computing in combination with dataflow computing brings new possibilities to overcome some of the deficiencies of dataflow architectures.

IV. PROPOSED ARCHITECTURE
The architecture of the data flow computer proposed at the Department of Computers and Informatics, Faculty of Electrical Engineering and Informatics, Technical University of Košice is a representative of the computer-on-a-chip approach, which combines the data driven computation model with the principles of tile computing. The architecture consists of elements with simple design, represented by processing elements (PE) and input/output elements (IO). The architecture also comprises a local interconnection network spread across the chip for local communication and a global interconnection network for data flow graph mapping (Fig. 1). Processing elements are arranged, in accordance with the tile computing paradigm, into a bi-directional mesh of 8 × 8 processing elements across the whole chip, forming the processing array (PA). Each PE integrates an activation frames store as the storage of activation frames, an arithmetic and logic unit for instruction execution, and a control unit. All PEs are unified general purpose computing units with simple design, as are the I/O elements, which are also simple and unified units. Input/output (IO) elements connected to the pins of the chip are localized at the edges of the processing array. I/Os are used not only for communication with the surrounding equipment of the computer, but also allow the creation of multi-chip computer architectures, where I/Os on different chips create bridges between processing elements of different processing arrays (Fig. 2).
Figure 1. Computer architecture with bi-directional mesh of processing elements in processing array, with Input/Output elements on the edges of processing array and with local and global communication networks.
SAMI 2011 9th IEEE International Symposium on Applied Machine Intelligence and Informatics January 27-29, 2011 Smolenice, Slovakia
Figure 2. Multi-chip configuration of the computer, with PEs on the edges of the processing arrays connected through the I/O elements functioning as bridges between computer chips.

The local interconnecting network allows concurrent short range high-bandwidth communication between pairs of neighboring processing elements, and also between processing elements and input/output elements. A processing element communicates directly with one of its eight neighbors, which are other processing elements or I/Os, allowing fast communication over short distances. The advantages of the proposed tile organization lie in the simple design of the elements, which are arranged across the chip in a uniform manner that secures high scalability of the architecture. Decentralization of data storing, execution and control of computation results in shorter wire lengths on the chip and small communication latency.

V. DATA FLOW GRAPH MAPPING

The global interconnecting network, spread across the whole chip, allows data flow graph (DFG) mapping onto each activation frame of each processing element in the processing array. Each activation frame is addressable, and all activation frames form a virtual cube of the addressable space of the computer, with addresses consisting of three components: X(2:0), Y(2:0) and Z(2:0). Two of them form the address of the processing element in the processing array (the X and Y axes), and the third, the Z axis, is the address of the activation frame in the activation frames store of the respective processing element. There are three modes of dataflow graph mapping, allowing not only sequential mapping of instructions into activation frames, but also concurrent mapping of instructions into the activation frames stores of all processing elements of the array, which is called global multi-mapping mode. The third mode allows concurrent mapping of instructions into the activation frames stores of selected processing elements of the processing array, where the selection of processing elements is performed by a mask; this is called mask multi-mapping mode. It is possible to switch between modes of data flow graph mapping during this process, with the aim of optimizing the time of data flow graph mapping.

A. Sequential data flow graph mapping

In sequential mapping it is possible to map the dataflow graph into activation frames only sequentially, using the X, Y and Z axis address in the address space (Fig. 3).
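The X/Y/Z address space and the mask-based selection of processing elements described in this section can be modeled with a short sketch. This is an illustrative Python model, not the hardware description; the frame store depth of 8 is an assumption implied by the 3-bit Z(2:0) component:

```python
# Model of the computer's addressable space: an 8 x 8 processing array
# addressed by X(2:0) and Y(2:0), with the activation frames store of each
# PE addressed on the Z axis by Z(2:0).

MESH = 8      # 8 x 8 bi-directional mesh of processing elements
FRAMES = 8    # assumed activation frames store depth, from Z(2:0)

def frame_address(x, y, z):
    """Pack X(2:0), Y(2:0), Z(2:0) into one 9-bit activation frame address."""
    assert 0 <= x < MESH and 0 <= y < MESH and 0 <= z < FRAMES
    return (x << 6) | (y << 3) | z

def selected_pes(mask_x, mask_y):
    """Mask multi-mapping: PE(x:y) is active iff MASKX(x)=1 and MASKY(y)=1."""
    return [(x, y) for x in range(MESH) for y in range(MESH)
            if (mask_x >> x) & 1 and (mask_y >> y) & 1]

# MASKX(7:0) = 00001111 and MASKY(7:0) = 00001111 select a 4 x 4 sub-array.
print(len(selected_pes(0b00001111, 0b00001111)))   # 16 processing elements
```

A mapping in global multi-mapping mode corresponds to selecting all 64 PEs (both masks all ones), while sequential mapping visits one (x, y, z) address per machine cycle.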
T_PAmap is the total time, in machine cycles (MC), for mapping the data flow graph onto the processing elements of the processing array. T_PEmap is the total time in MC for mapping a sub-graph of the dataflow graph onto the respective processing element of the processing array. T_PAmap can be expressed by

T_PAmap = Σ(i=1..X) T_PEmap_i [MC]    (1)

where X is the number of processing elements onto which the DFG is mapped. T_PEmap can be expressed by

T_PEmap = N × 1 MC [MC]    (2)

where N is the number of activation frames which are mapped into the activation frames store of the respective processing element. The maximal total time of data flow graph mapping into the processing array, Max T_PAmap, in MC, can be expressed by

Max T_PAmap = X × Max T_PEmap [MC]    (3)

where X is the number of processing elements which form the processing array into which the data flow graph is mapped.
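The sequential mapping cost model above can be checked with a small calculation, assuming one machine cycle per mapped activation frame as the equations state; the example frame counts are illustrative:

```python
# Mapping-time model for sequential DFG mapping: one machine cycle (MC) per
# activation frame, summed over the processing elements the DFG uses.

def t_pe_map(n_frames):
    """Time in MC to map N activation frames into one PE (Eq. 2)."""
    return n_frames * 1

def t_pa_map(frames_per_pe):
    """Total sequential mapping time: sum over all used PEs (Eq. 1)."""
    return sum(t_pe_map(n) for n in frames_per_pe)

def max_t_pa_map(x_pes, c_par):
    """Worst case (Eq. 3): every PE's store filled to capacity C_PAR."""
    return x_pes * t_pe_map(c_par)

# Example: a DFG spread over 3 PEs with 4, 2 and 5 activation frames.
print(t_pa_map([4, 2, 5]))    # 11 MC
# Upper bound for a full 8 x 8 array with an assumed C_PAR of 8 frames.
print(max_t_pa_map(64, 8))    # 512 MC
```

The contrast with the multi-mapping modes described next is that there the whole array (or a masked subset) is written concurrently, so the total time collapses to N machine cycles regardless of how many PEs are selected.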
The maximal total time of data flow graph mapping into one processing element, Max T_PEmap, can be expressed by

Max T_PEmap = C_PAR × 1 MC [MC]    (4)

where C_PAR is the capacity of the activation frames store of a processing element, expressed in the number of activation frames.

B. Global data flow graph multi-mapping

Global data flow graph multi-mapping allows mapping of an activation frame into the respective address on the Z axis in all processing elements of the processing array concurrently, in one machine cycle (Fig. 4). T_PAmap can be expressed by

T_PAmap = T_PEmap = N × 1 MC [MC]    (5)

where N is the number of activation frames which are mapped into the activation frames store of each processing element.

Figure 4. Global data flow graph multi-mapping.

The maximal total times of data flow graph mapping into the processing array and into a processing element, Max T_PAmap and Max T_PEmap, can be expressed by

Max T_PAmap = Max T_PEmap = C_PAR × 1 MC [MC]    (6)

where C_PAR is the capacity of the activation frames store of a processing element, expressed in activation frames.

C. Mask data flow graph multi-mapping

Mask data flow graph multi-mapping allows mapping of an activation frame into the respective address in the activation frames stores of selected processing elements of the processing array concurrently, in one machine cycle (Fig. 5). The processing elements which are activated for mask multi-mapping are defined by a mask. It is possible to define the multi-mapping masks MASKX(m:0) for the X axis and MASKY(m:0) for the Y axis of processing elements, where m is the dimension of the m × m bi-directional mesh of processing elements in the processing array. PE(x:y) is activated for mapping when MASKX(x) = 1 and MASKY(y) = 1. T_PAmap can be expressed by

T_PAmap = T_PEmap = N × 1 MC [MC]    (7)

where N is the number of activation frames which are mapped into the activation frames store of each selected processing element.

Figure 5. Mask multi-mapping of the DFG into the PA with masks MASKX(7:0) = 00001111 and MASKY(7:0) = 00001111.

The maximal total times of DFG mapping, Max T_PAmap and Max T_PEmap, can be expressed by

Max T_PAmap = Max T_PEmap = C_PAR × 1 MC [MC]    (8)

where C_PAR is the capacity of the activation frames store of a processing element, expressed in the number of activation frames.

VI. PROTOTYPE

The prototype of the data flow computer with the proposed architecture was realized with the ISE WebPack software development tool, which was used for the architecture development. The ModelSim software tool was used for simulation and verification of the function of the developed design. The design targets the Spartan 3 AN FPGA chip, and the hardware prototype of the computer is built with the Xilinx Spartan 3 PCIe Starter board as its hardware platform (Fig. 6). In the centre of the development board there is a Xilinx Spartan 3 XC3S1000-4FG676 FPGA chip with 676 pins in an FBGA package, working at a 50 MHz clock frequency.
Figure 6. Spartan 3 PCIe Starter Board with the Spartan 3 AN FPGA chip as the hardware prototype platform of the proposed computer architecture.

There are 391 free-for-use pins on this chip and 17 000 logic cells. The development board also involves 4 Mb of serial flash memory. There are 8 light emitting diodes (LEDs) for diagnostic and development purposes on the board, a 9-pin RS 232 serial interface, and 168-pin expansion ports whose functions are configurable in accordance with the developed design.

VII. CONCLUSION

This work introduced the paradigms of tile based computing and dataflow computing, and presented details of the dataflow computer architecture designed at the Department of Computers and Informatics, Faculty of Electrical Engineering and Informatics at the Technical University of Košice. The article also describes the mapping techniques of the data flow graph and the proposed multi-mapping approach. The contribution of this work lies in the design of the architecture and the realization of a software simulation of the data flow computer with the tile based architecture, which allows further research oriented on the possibilities of utilizing data flow computing in practice and brings possibilities to overcome some disadvantages of the dataflow paradigm.

ACKNOWLEDGMENT

This work was supported by the Slovak Research and Development Agency under contract No. APVV-0073-07. The work is one of the results reached within project VEGA 1/4071/07, being solved at the Department of Computers and Informatics, Faculty of Electrical Engineering and Informatics, Technical University of Košice.

REFERENCES

[1] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal, "Baring it all to software: RAW machines," IEEE Computer, 30(9):86-93, September 1997.
[2] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, Ch. Miao, J. F. Brown III, and A. Agarwal, "On-Chip Interconnection Architecture of the Tile Processor," IEEE Micro, vol. 27, no. 5, pp. 15-31, Sep./Oct. 2007, doi:10.1109/MM.2007.89.
[3] T. G. Mattson, R. van der Wijngaart, and M. Frumkin, "Programming the Intel 80-core network-on-a-chip Terascale Processor," SC2008, November 2008, Austin, Texas, USA, 978-1-4244-2835-9/08.
[4] M. Mercaldi, S. Swanson, A. Petersen, A. Putnam, A. Schwerin, M. Oskin, and S. J. Eggers, "Modeling Instruction Placement on a Spatial Architecture," SPAA'06, July 30 - August 2, 2006, Cambridge, Massachusetts, USA, 1-59593-452-9/06/0007, ACM.
[5] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz, "Smart Memories: A Modular Reconfigurable Architecture," ISCA'00, Vancouver, British Columbia, Canada, ACM 1-58113-287-5/00/06-161.
[6] K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. S. Govindan, P. Gratz, D. Gulati, H. Hanson, Ch. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, S. W. Keckler, and D. Burger, "Distributed Microarchitectural Protocols in the TRIPS Prototype Processor," The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06), 0-7695-2732-9/06.
[7] M. Budiu, G. Venkataramani, T. Chelcea, and S. C. Goldstein, "Spatial Computation," In International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 14-26, October 2004.
[8] J. B. Dennis and D. P. Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," Proceedings of the 2nd Annual Symposium on Computer Architecture, 1974.
[9] I. Watson and J. R. Gurd, "A prototype data flow computer with token labelling," In Proceedings of the National Computer Conference, pp. 623-628, 1979.
[10] N. Ádám, "Single Input Operators of the DF KPI System," Acta Polytechnica Hungarica, 7, 1, 2010, pp. 73-86, ISSN 1785-8860.
[11] L. Vokorokos, N. Ádám and J. Trelová, "Pipeline system for processing of single operand data flow operators," International Conference on Applied Electrical Engineering and Informatics, Genoa, September 7-11, FEI TU, 2009, pp. 49-54, ISBN 978-80-553-0280-5.