IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 8, AUGUST 2011

A Low-Power FPGA Based on Autonomous Fine-Grain Power Gating


Shota Ishihara, Student Member, IEEE, Masanori Hariyama, Member, IEEE, and Michitaka Kameyama, Fellow, IEEE
Abstract: This paper presents a field-programmable gate array (FPGA) based on lookup table level fine-grain power gating with small overheads. The power gating technique implemented in the proposed architecture can directly detect the activity of each lookup table easily by exploiting features of asynchronous architectures. Moreover, detecting the data arrival in advance prevents both the delay increase caused by waking up and the power consumption of unnecessary power switching. Since the power gating technique has small overheads, the granularity size of a power-gated domain is as fine as a single two-input and one-output lookup table. The proposed FPGA is fabricated using the ASPLA 90-nm CMOS process with dual threshold voltages. We use an image processing application called template matching for evaluation. Since the proposed FPGA is suitable for processing where the workload changes dynamically, an adaptive algorithm with a small computational kernel is employed. Compared to a synchronous FPGA and an asynchronous FPGA without power gating, the power consumption is reduced by 38% and 15%, respectively, at 85 °C.

Index Terms: Asynchronous architecture, asynchronous field-programmable gate array (FPGA), level-encoded dual-rail (LEDR) encoding, reconfigurable VLSI, self-timed architecture.

I. INTRODUCTION

FIELD-PROGRAMMABLE gate arrays (FPGAs) are widely used to implement special-purpose processors. FPGAs are cost-effective for small-lot production because functions and interconnections of logic resources can be directly programmed by end users. Despite their design cost advantage, FPGAs impose large dynamic and standby power consumption overheads compared to custom silicon alternatives [1]. These overheads increase packaging costs and limit the integration of FPGAs into portable devices. In FPGAs, the clock network accounts for a large proportion of the dynamic power because FPGAs have significantly more registers than custom VLSIs. The most well-known technique to reduce the clock network power is clock gating. In FPGAs, the customized clock network can be implemented using the programmable interconnects. However, the worst case of clock

Manuscript received October 05, 2009; revised January 23, 2010 and March 17, 2010; accepted April 24, 2010. Date of publication June 10, 2010; date of current version July 27, 2011. This work was supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with STARC, Fujitsu Limited, Matsushita Electric Industrial Company Limited, NEC Electronics Corporation, Renesas Technology Corporation, Toshiba Corporation, Cadence Design Systems Inc., Synopsys Inc., and Mentor Graphics, Inc. The authors are with Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 980-8579, Japan (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TVLSI.2010.2050500

skew cannot be estimated since FPGA vendors do not guarantee the worst case of the minimum delay of components. As a result, it is impossible to guarantee that no hold-time violations occur [2]. The Xilinx ISE reference manual [3] recommends the use of clock gating without the customized global clock. In FPGAs, to implement clock gating, circulation is employed. The idea of circulation is to retain the contents of the flip-flop in the sleep state [4]. Circulation can reduce the dynamic power consumption of registers and the gates in the fan-out of the registers. However, the dynamic power consumption of the clock network cannot be reduced.

As the transistor feature sizes and threshold voltages decrease, the standby power due to leakage current becomes comparable to dynamic power. In FPGAs especially, the standby power is a serious problem because they have an enormously large number of transistors to achieve their programmability. Current high-end FPGAs consume over 1 W of standby power, while low-cost FPGAs consume up to hundreds of milliwatts [5]–[7]. Power gating has emerged as the most effective design technique to achieve low standby power [8]. Power gating techniques are based on selectively setting the functional units into a low leakage mode when they are inactive. The power overhead of the power gating circuitry is consumed by the sleep controller, the sleep signal distribution network, and the sleep transistors. The fundamental challenge for any power gating technique is to ensure that the saved standby power outweighs the power overhead of the power gating.

Power gating techniques are classified into two types: coarse-grain power gating and fine-grain power gating. In coarse-grain power gating, a large number of lookup tables (LUTs) share a single sleep controller so the area and power overheads of the sleep controller are relatively small.
However, if any LUT within a coarse-grain power-gated domain is active, none of the LUTs which share the same sleep transistor can be set to the sleep mode. Coarse-grain power gating in FPGAs also causes large dynamic power and area overheads in the sleep signal distribution network since the sleep signal is distributed to many LUTs through programmable interconnection resources. On the other hand, in fine-grain power gating, each LUT has its own sleep transistor and related sleep controller, so when any LUTs are inactive, they can be set to the sleep mode immediately. This results in much lower standby power compared to coarse-grain power gating. For FPGAs in particular, no programmable interconnection resources for distributing the sleep signal are required. However, since each LUT has its own sleep controller in fine-grain power gating, the number of sleep controllers is much larger than that of coarse-grain power gating. In synchronous architectures, the sleep controller consists of some memory bits

1063-8210/$26.00 © 2010 IEEE

ISHIHARA et al.: LOW-POWER FPGA BASED ON AUTONOMOUS FINE-GRAIN POWER GATING


to store the sleep time and a sequencer for control. Note that the sleep controller is always running. This results in large area and dynamic power overheads. Due to these overheads, fine-grain power gating is commonly assumed to be less efficient than coarse-grain power gating, although it has the potential to cut most of the standby power [5], [9]. In spite of the importance of efficient sleep controllers, most studies on power gating have focused on power-gated circuits or power-gated domain partitioning, and little work has been carried out on sleep controllers.

In this paper, we present a low-power FPGA. To reduce dynamic power consumption, we introduce a level-encoded dual-rail (LEDR)-based architecture, which achieves the lowest dynamic power consumption among all dual-rail asynchronous architectures we have considered. To reduce the standby power, a LUT-level power gating technique called autonomous fine-grain power gating is proposed. Our asynchronous architecture detects the activity of a power-gated domain, and uses this activity to determine when to shut down and wake up the power-gated domain. The activity of a power-gated domain can be easily detected by comparing the phase of the input data with that of the output data. In this technique, since the activity of each LUT can be detected easily, the area and the power overheads of the sleep controller are small.

Using the ASPLA 90-nm CMOS process, we have fabricated a test chip which is used for our experimental evaluation. In the test chip, the granularity size of the power-gated domain is as fine as a single two-input and one-output LUT. Thanks to the small power overhead of the sleep controller, the energy penalty for entering the sleep state from the active state and returning to the active state from the sleep state is small. As a result, leakage energy savings compensate for the energy penalty even if the sleep time is as short as tens of nanoseconds.
In other words, it is worth entering the sleep state even if the sleep time is as short as tens of nanoseconds. We use an image processing application called template matching for evaluation. Since the proposed FPGA is suitable for processing where the workload changes dynamically, an adaptive algorithm with a small computational kernel is employed. Compared to the synchronous FPGA and the asynchronous FPGA without power gating, the power consumption of the proposed FPGA is reduced by 38% and 15%, respectively, at 85 °C. This paper is the long version of the conference papers [10] and [11] with detailed implementation and new evaluations.

II. RELATED WORK

A. Asynchronous FPGAs

In synchronous FPGAs, the clock and clock distribution network create several difficult challenges, namely dynamic power consumption and clock skew. References [12] and [13] proposed asynchronous FPGAs based on bundled-data encoding, the most common asynchronous encoding. In this encoding, delay elements are used for the control path. The worst-case minimum delay of the delay elements must be larger than the worst-case maximum delay of the data path. Thus, the use of delay elements limits the throughput. For FPGAs in particular, since the data path is programmable, complex programmable delay elements are required. References

Fig. 1. Example of four-phase dual-rail encoding.

TABLE I CODE TABLE OF LEDR ENCODING

Fig. 2. Example of LEDR encoding.

[14] and [15] proposed asynchronous FPGAs based on dual-rail encoding, which requires no delay insertion. They use four-phase dual-rail encoding because of its relatively small hardware cost. However, as shown in Fig. 1, in four-phase dual-rail encoding, a spacer must be inserted between two consecutive valid data values. This results in low throughput and high dynamic power consumption because of the large number of signal transitions. References [16] and [17] proposed asynchronous FPGAs based on LEDR encoding. LEDR is one of several two-phase dual-rail encodings [18]. In LEDR encoding, no spacer is required. Table I shows the code table of LEDR encoding. In LEDR encoding, each data value has two types of code words with different phases. Fig. 2 shows the example where data values 0, 0, and 1 are transferred. The main feature is that the sender sends data values alternately in phase 0 and phase 1. Because no spacer is required, the number of signal transitions is half that of four-phase dual-rail encoding. As a result, the throughput is high and the power consumption is small. Based on this observation, in the proposed FPGA, LEDR encoding is employed for implementing the asynchronous architecture to reduce the dynamic power.

B. Wave-Pipelining for Bit-Serial FPGAs

In FPGAs, area for routing is dominant. To reduce the routing area without performance degradation, wave-pipelining is combined with a bit-serial architecture. Reducing the complexity of the interconnect using a bit-serial architecture is an efficient way to reduce the area in FPGAs. However, bit-serial data transmission decreases the throughput. To achieve both a simple interconnect and high throughput, bit-serial wave-pipelining FPGA architectures have been proposed [19]–[21]. The idea of wave-pipelining is to allow the circuit to process a new data set before the previous data set has reached the registers, so that few circuits sit idle.
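Returning to the LEDR encoding of Section II-A, the transition-count argument can be illustrated with a short Python sketch. Table I is not reproduced in this text, so the code word assignment used here (the data value is carried on rail V, and the phase equals V XOR R) is an assumption based on the usual definition of LEDR; the function names are hypothetical and the idle state is assumed to be (0, 0), that is, phase 0.

```python
def ledr_encode(bits, first_phase=1):
    """Encode bits as (V, R) words whose phase alternates, starting at first_phase.

    Assumption: V carries the data value and phase = V XOR R.
    """
    words = []
    phase = first_phase
    for b in bits:
        words.append((b, b ^ phase))  # R is chosen so that V XOR R equals the phase
        phase ^= 1                    # the next word uses the opposite phase
    return words


def ledr_transitions(words, idle=(0, 0)):
    """Count rail transitions when sending the words, starting from the idle state."""
    count = 0
    prev = idle
    for w in words:
        count += (prev[0] != w[0]) + (prev[1] != w[1])
        prev = w
    return count


def four_phase_transitions(bits):
    """Four-phase dual-rail: one rail rises for the value, then falls for the spacer."""
    return 2 * len(bits)


# Data values 0, 0, 1 as in the Fig. 2 example; the idle state is phase 0,
# so the first word is sent in phase 1.
words = ledr_encode([0, 0, 1])
ledr_count = ledr_transitions(words)
four_phase_count = four_phase_transitions([0, 0, 1])
```

Under these assumptions, consecutive LEDR words always differ in exactly one rail, so `ledr_count` is one transition per data value, while the four-phase count is two per data value.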
References [19]–[21] use bit-serial data transfers to reduce the complexity of the interconnect in conjunction with wave-pipelining to increase throughput. Instead, the


minimum data pulse width that can be sustained on the wave-pipelined interconnects, which is smaller than the interconnect latency, determines the maximum interconnect throughput. As a result, there can be a significant enhancement in the interconnect throughput through the simultaneous presence of multiple bits on the interconnect. Although wave-pipelining is used to achieve small area without performance degradation, it cannot reduce standby power.

C. Sleep Signal Generation Techniques for Power Gating

Sleep signal generation techniques for power gating are roughly classified into two categories: software-based ones [22], [23] and hardware-based ones [24], [25]. Software-based techniques are based on offline analysis of application code to identify periods of inactivity. Hardware and software approaches for software-based techniques are proposed in [22] and [23]. Although the software-based technique can dynamically track the utilization of the functional units and in turn assist the task of sleep signal generation, it suffers from large power and delay overheads [25], [26]. These overheads make the software-based technique unsuitable for fine-grain power gating.

Hardware-based techniques for relatively fine-grain power gating have recently appeared [24], [25]. They were originally proposed for microprocessors. In the techniques proposed in [24] and [25], the activities of the power-gated domain are extracted from the instruction flow. A power-gated domain is shut down after it stays idle for a given threshold. Reference [24] proposed a static sleep signal generator (SSSG) technique where the threshold time is predefined. Reference [25] proposed a dynamic sleep signal generator (DSSG) technique where the threshold time dynamically changes according to the requirements of the running application. The DSSG technique enhances the accuracy of the prediction of the standby period length by adding history-like information to the threshold time decision making process.
Since the SSSG and the DSSG techniques both use an instruction-level analysis of the activity of the power-gated domains, they are applicable only to block-level power gating [25].

This paper proposes a fine-granularity sleep signal generation technique, where the power-gated domain is as fine as a single two-input and one-output LUT. Table II summarizes the features of the previous work and this work. In terms of the activity detection, only this work directly detects the activity of a power-gated domain by exploiting asynchronous architectural features. In the asynchronous architecture, the activity of the power-gated domain is easily detected by comparing the phase of the input data with that of the output data. Therefore, LUT-level power gating is possible and the area and power overheads of activity detection are small. In the conventional coarse-grain approaches, the power-gated domain is controlled every clock cycle. In the proposed fine-grain approach, each logic block has its own sleep controller. The use of the asynchronous architecture keeps the area and delay overheads small. As a result, each logic block can be turned OFF after operation completion with a small delay comparable to that of a few small

TABLE II COMPARISON BETWEEN THIS WORK AND PREVIOUS WORK

gates. In terms of the threshold time decision, the threshold time in this work is predefined as in the SSSG, since the hardware of the DSSG requires large area and dynamic power overheads. In the DSSG, state machines for each power-gated domain and a global counter are required for the threshold time decision. The global counter always runs and its output is distributed to each power-gated domain through programmable resources. As a result, the dynamic power overhead consumed by the global counter and the distribution network of its output is large. Moreover, dynamic power is consumed even in the sleep state. Due to these area and dynamic power overheads, the DSSG technique is not suitable for LUT-level power gating. Therefore, to achieve small overheads of the sleep controller for LUT-level power gating, the threshold time of the proposed technique is predefined. Because of LUT-level activity detection and small overheads, the proposed autonomous fine-grain power gating is suitable for FPGAs.

III. ARCHITECTURE

A. Overview

Fig. 3 shows the overall architecture of the proposed FPGA, which has a mesh-connected cellular array based on a bit-serial architecture. Each logic block (LB) has a sleep controller which controls the sleep transistor of the LUT. As the asynchronous protocol, we employ LEDR encoding, which is suitable for FPGAs [16], [17]. The proposed architecture requires four wires: two for data, one for acknowledge (ACK), and one for wake-up. The wake-up signal is used to wake up the next LB in advance. Since the next LB has already been woken up before the data arrives, there is no wake-up time penalty. Fig. 4 shows the structure of the cell. The switch block consists of pass-switch blocks. In a switch block, a wire-set consists of four wires: two for data (V and R), one for the acknowledge


Fig. 5. Direct allocation: (a) CDFG; (b) data-path; (c) mapping result.

Fig. 3. Overall architecture.

Fig. 6. Activity detection using the asynchronous architecture.

Fig. 4. Structure of a cell.

signal and one for the wake-up signal. A pass-switch block consists of four pass switches and a single memory bit. The four pass switches are used for the four wires of the wire-set, respectively. The pass switches are controlled by the same memory bit. Since a LUT is small, the cost of switch blocks is a primary concern in mapping. A mapping technique called direct allocation of a Control/Data Flow Graph (CDFG) is efficient for reducing the complexity of the interconnection network of the resulting mapping [27]–[29]. As shown in Fig. 5(a), in the direct allocation, a behavior is given by a CDFG that represents data dependencies between operations. Each node of the CDFG represents an operation, and each edge of the CDFG represents a data dependency between the operations. As shown in Fig. 5(b), to execute the behavior represented by the CDFG, its operations are mapped onto the logic blocks. Each node of the CDFG is directly mapped to a logic block. Therefore, as shown in Fig. 5(c), a logic block executes only one operation and the connection between logic blocks is fixed. In the direct allocation, the input of a logic block is directly connected to the output of another logic block. Therefore, the complexity of interconnection networks between logic blocks is reduced.

B. Fundamental Principle of Autonomous Fine-Grain Power Gating

In an asynchronous architecture, it is easy to detect whether a LUT is in use or not. Fig. 6 demonstrates the principle of the activity detection using the asynchronous architecture. For simplicity of explanation, each LB is assumed to compute a one-input and one-output

function. In the initial state (t0), the phases of the input and the output data of the LB are both phase 0. When new data arrives at the LB (t1), the phase of the input data changes to 1, and then the operation starts. When the operation is complete (t2), the phase of the output data changes to 1, matching the phase of the input data. After that, when new data arrives at the LB (t3), the phase of the input data changes to 0, and then the operation starts. When the operation is complete (t4), the phase of the output data changes to 0, again matching the phase of the input data. In summary, when new data arrives at the LB, the phase of the input data is different from that of the output data. When the operation is complete, the phase of the input data is the same as that of the output data. Based on this consideration, the activity of the LB is detected just by comparing the phases of the input data and the output data. The activity information can be exploited to power OFF unused LBs and to wake them up. Therefore, the proposed sleep controller just extracts and compares the phases of the input and output data. As a result, the area and power overheads of the proposed sleep controller are much smaller than those in a synchronous architecture.

Fig. 7 shows the simplest implementation of the autonomous fine-grain power gating. In this scheme, the sleep controller consists of XOR gates and a comparator. The XOR gates are used to extract the phases of the input and the output data. The comparator is used to detect whether the phases of the input and output are the same or not. If the LB is busy, the phases of the input and the output are different, and the output of the comparator is 1. If the LB is idle, the phases of the input and the output are the same, and the output of the comparator is 0. The output of the comparator which represents the activity of


Fig. 7. Simplest implementation and control strategy of autonomous power gating: (a) circuit; (b) control strategy.

Fig. 10. Example of the proposed power gating method.

Fig. 8. Problems of the simplest implementation of autonomous power gating.

Fig. 9. Control strategy of the proposed power gating method.

the LB is directly used as the control signal of the sleep transistor. In this implementation, the LB has two states: sleep and active. When new input data arrives at the LB, the LB turns to the active state, and the sleep transistor turns ON to execute the operation. When the operation is complete, the LB turns to the sleep state, and the sleep transistor turns OFF to reduce the leakage current. However, this scheme has two problems, as shown in Fig. 8. The first one is that the wake-up time adds to the delay time since the sleep transistor of the LB turns ON only after the input data arrives. The second one is that the switching power may become larger than the saved power. This is because the sleep transistor turns ON and OFF frequently when the input data arrives frequently. To solve these problems, we propose an efficient control strategy for the autonomous fine-grain power gating. As shown in Fig. 9, the standby state is used to do the following: 1) wake up the LB before the data arrives; 2) power OFF the LB only when the data does not come for quite a while. The use of the standby state has two major advantages. First, the wake-up time can be hidden since the LB has already been woken up when the data arrives. Second, the dynamic power can be saved since the number of unnecessary switchings of the sleep transistor is reduced.
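The simplest scheme and its switching problem can be traced with a small Python sketch. It follows the t0 to t4 sequence described for Fig. 6 for a one-input, one-output LB; the LB is assumed here to compute the identity function, and the LEDR phase is assumed to be V XOR R, since Table I is not reproduced in this text. The names are hypothetical.

```python
def phase(word):
    """Phase of an LEDR word (V, R); assumption: phase = V XOR R."""
    v, r = word
    return v ^ r


def lb_busy(input_word, output_word):
    """Comparator of the simplest scheme: 1 while input and output phases differ."""
    return int(phase(input_word) != phase(output_word))


# (input word, output word) at t0..t4 for an LB computing the identity function.
timeline = [
    ((0, 0), (0, 0)),  # t0: input and output both in phase 0, LB idle
    ((0, 1), (0, 0)),  # t1: new data (value 0, phase 1) arrives, operation starts
    ((0, 1), (0, 1)),  # t2: operation complete, output phase becomes 1
    ((1, 1), (0, 1)),  # t3: new data (value 1, phase 0) arrives, operation starts
    ((1, 1), (1, 1)),  # t4: operation complete, output phase becomes 0
]

# In the simplest scheme the comparator output drives the sleep transistor directly.
sleep_transistor_on = [lb_busy(i, o) for i, o in timeline]

# Number of ON/OFF switchings of the sleep transistor over the timeline.
switchings = sum(a != b for a, b in zip(sleep_transistor_on, sleep_transistor_on[1:]))
```

In this trace the sleep transistor switches twice for every data item, which is the behavior the standby state is meant to avoid when data arrives frequently.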

Fig. 10 shows an example to explain the proposed power gating method using two LBs, LB1 and LB2, where LB1 is the LB preceding LB2. As shown in Fig. 10(a), LB1 and LB2 are initially in the standby state and the sleep state, respectively. To avoid the wake-up time penalty, LB1, which is the first LB of the pipeline chain, is not powered OFF. In other words, LB1 is in either the standby state or the active state. As shown in Fig. 10(b), when new data arrives at the previous LB (LB1), a wake-up signal is sent from LB1 to LB2 to wake it up. Then, LB2 turns to the standby state. As shown in Fig. 10(c), when the data arrives at LB2, LB2 turns to the active state. In this state, the operation is executed immediately because the sleep transistor has already been turned ON in the standby state. As shown in Fig. 10(d), LB2 returns to the standby state once its operation is complete. As shown in Fig. 10(e), if no data arrives at LB2 during the threshold time, LB2 predicts that the data will not arrive for quite a while. Then, LB2 turns to the sleep state and is powered OFF. The threshold time is determined such that the LB is not powered OFF in a busy condition where data arrives frequently. The method to decide the threshold time is explained in [24]. The waveform of the sleep signal is shown in Fig. 11. The LB is woken up before the data arrival and powered OFF only while the LB is idle.

C. Circuit Implementation

Fig. 12 shows the block diagram of an LB. Each LB mainly consists of a LUT, an output register, a sleep controller, and a C-element. The LUT computes arbitrary two-input and one-output logic functions. The C-element is a state-holding element for the handshake protocol [30]. The gray region is the sleep controller. The Wake-up signals from previous LBs are used to wake up the LB before the new input data arrives. The Data-arrive signal is used to wake up the next LB when the data arrives. The phase comparator is used to detect the data arrivals.
Two latches retain the Wake-up signals from previous LBs until all the input data arrive at the LB.
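The three-state control strategy of Section III-B (Figs. 9 and 10) can be summarized as a small state machine. The Python sketch below is only a behavioral illustration: it assumes discrete time steps and an integer threshold, whereas the actual circuit is asynchronous and realizes the threshold with a programmable delay. The names `State`, `step`, and `run` are hypothetical.

```python
from enum import Enum


class State(Enum):
    SLEEP = "sleep"      # sleep transistor OFF
    STANDBY = "standby"  # sleep transistor ON, no operation in progress
    ACTIVE = "active"    # sleep transistor ON, operation in progress


def step(state, idle_steps, wake_up, busy, threshold):
    """Return (next_state, next_idle_steps) for one assumed time step.

    wake_up: 1 when data has arrived at a previous LB.
    busy:    1 while the phase comparator reports unprocessed input data.
    """
    if busy:
        return State.ACTIVE, 0
    if state is State.ACTIVE:
        return State.STANDBY, 0          # operation complete
    if wake_up:
        return State.STANDBY, 0          # woken up in advance
    if state is State.STANDBY:
        idle_steps += 1
        if idle_steps >= threshold:
            return State.SLEEP, 0        # idle for the threshold time: power OFF
        return State.STANDBY, idle_steps
    return State.SLEEP, 0


def run(trace, threshold, state=State.SLEEP):
    """Apply a sequence of (wake_up, busy) pairs and return the visited states."""
    states = []
    idle_steps = 0
    for wake_up, busy in trace:
        state, idle_steps = step(state, idle_steps, wake_up, busy, threshold)
        states.append(state)
    return states


# LB2 in the Fig. 10 example: wake-up from LB1, data arrival, completion, then idle.
fig10_trace = [(1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
fig10_states = [s.value for s in run(fig10_trace, threshold=2)]

# Data arriving frequently: the LB alternates between standby and active only.
busy_trace = [(1, 0), (0, 1), (0, 0), (0, 1), (0, 0), (0, 1)]
busy_states = run(busy_trace, threshold=2)
```

With the assumed threshold of two steps, the first trace passes through standby, active, standby, and finally sleep, while the second trace never reaches the sleep state.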


TABLE III TRUTH TABLE OF THE LATCH FOR THE Wake-Up SIGNAL

Fig. 11. Waveform of the sleep signal of proposed power gating method.

Fig. 14. Block diagram of the programmable delay.

TABLE IV RELATIONSHIP BETWEEN THE MEMORY CONFIGURATION AND THE THRESHOLD TIME OF THE PROGRAMMABLE DELAY

Fig. 12. Block diagram of an LB.

Fig. 13. Block diagram of a phase comparator.

The programmable delay delays the sleep signal by the predetermined threshold time when powering OFF the LB. There is no wake-up time penalty even though the whole sleep controller is composed of small, high-threshold-voltage transistors. This is because the LB gets ready to wake up when the data arrives at the previous LBs. As a result, the area and power overheads are small. Fig. 13 shows the block diagram of a phase comparator for a two-input and one-output LB. The phase comparator is used to detect the data arrival. The phase of each data item is extracted by XOR gates. If Phase-a and Phase-b are both different from Phase-out, all new data has arrived. In that case, the LB is active, and the output is 1. Otherwise, some data has not yet arrived and the LB cannot start the operation. In that case, the LB is inactive, and the output is 0. Table III shows the truth table of the latch for the Wake-up signal. Once the Wake-up signal goes to 1, the latch retains the signal until all data arrive at the LB. When all data arrive at the LB and

no data arrives at the previous LBs, the output of the latch is reset to 0. Fig. 14 shows the block diagram of the programmable delay. As described in the last paragraph of Section III-B, if no data arrives at the LB during the predetermined threshold time, the LB predicts that the data will not arrive for quite a while. Then, the LB turns to the sleep state and is powered OFF. The programmable delay is used to power OFF an LB after it stays idle for the predetermined threshold time. Therefore, the function of the programmable delay is to delay the sleep signal by the predetermined threshold time when powering OFF. Note that the programmable delay does not delay the sleep signal when powering ON. The programmable delay consists of a series of OR gates and several memory bits. The memory bits are used to program the delay time. Table IV shows the relationship between the memory configuration and the threshold time. Let us consider a case when power gating is used. When the LB is powered ON, the input of the programmable delay turns from 0 to 1. Since this input is also used as an input of the last OR gate, a value of 1 drives the output of the last OR gate to 1 immediately. When the LB is powered OFF, the input turns from 1 to 0. The value 0 propagates through the series of OR gates so that the sleep signal is delayed. The use of more OR gates and more memory bits makes it possible to increase the number of choices of the delay time.

In the proposed FPGA, LEDR encoding is used. For the LEDR-based FPGA, the major consideration is designing a compact LUT. Fig. 15 shows the conventional multiplexer-based LUT for LEDR, where only the V output is illustrated, and another LUT is necessary to obtain the R output. The V output is determined based on the two 2-bit inputs. If the combination of the inputs is invalid (i.e., if the combination of


Fig. 17. Detailed structure and the behavior of the sub-module of the LUT for invalid inputs.

Fig. 15. Multiplexer-based LUT for LEDR encoding (only the V output is illustrated).

Fig. 18. Behavior of the sub-module of the LUT for valid inputs.

Fig. 16. Block diagram of the proposed LUT.

the inputs have different phases), the previous output is kept by the feedback loop. To produce a correct output for such invalid combinations of inputs, the number of multiplexers becomes large. To solve this problem, a LUT based on a hybrid of decoders and multiplexers was proposed in [17]. In this work, the hybrid LUT is extended to support power gating. Fig. 16 shows the block diagram of the proposed LUT, which consists of four sub-modules. Each sub-module consists of a decoder, a multiplexer, and a memory bit. The decoders exclude invalid input patterns with different phases. Then, only valid data are fed to the multiplexer. As a result, the number of multiplexers is reduced, and the transistor count is reduced by 36% compared to the multiplexer-type LUT. Fig. 17 shows the detailed structure of a sub-module. For simplicity of the figure, the circuitry of the power gating is omitted. If the combination of inputs is invalid (i.e., if the two inputs have different phases), all pass-transistors turn OFF according to the output of the decoder; the outputs of the multiplexer are in a high-impedance condition; and the previous outputs stored in the latches are kept. On the other hand, if the input pattern is valid (i.e., if the two inputs have the same phase), two pass-transistors turn ON according to the output of the decoder; the value of the memory bit is selected as the output; and the outputs are stored in the latches as shown in Fig. 18.
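The hold-or-update behavior of the LUT for invalid and valid inputs can be captured in a behavioral Python sketch. It is not a model of the transistor-level circuit: the indexing of the four memory bits, the assumption that the data value is carried on rail V with phase equal to V XOR R, and the function name `ledr_lut` are all hypothetical. The `sleep` flag reflects the paper's description that, in the sleep mode, the pass-transistors turn OFF and the latches keep the previous result.

```python
def ledr_lut(memory, a, b, prev_out, sleep=False):
    """Behavioral model of a two-input, one-output LEDR LUT.

    memory:   four configuration bits, indexed by (value_a << 1) | value_b (assumed layout)
    a, b:     input words (V, R); assumption: value = V, phase = V XOR R
    prev_out: output word currently held in the latches
    """
    phase_a = a[0] ^ a[1]
    phase_b = b[0] ^ b[1]
    if sleep or phase_a != phase_b:
        # Invalid input pattern or sleep mode: pass-transistors are OFF,
        # so the latches keep the previous output.
        return prev_out
    value = memory[(a[0] << 1) | b[0]]
    # Valid pattern: the selected memory bit is output in the phase of the inputs.
    return (value, value ^ phase_a)


AND_LUT = [0, 0, 0, 1]  # truth table of a two-input AND, for illustration

out1 = ledr_lut(AND_LUT, (1, 0), (1, 0), prev_out=(0, 0))    # both inputs in phase 1
out2 = ledr_lut(AND_LUT, (1, 0), (1, 1), prev_out=out1)      # phases differ: hold
out3 = ledr_lut(AND_LUT, (1, 1), (0, 0), prev_out=out2)      # both inputs in phase 0
out4 = ledr_lut(AND_LUT, (1, 0), (1, 0), prev_out=out3, sleep=True)  # sleep: hold
```

The output phase follows the input phase, which is what allows the sleep controller to detect completion by comparing input and output phases.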

The major difference between the LUT in [17] and the proposed LUT is that the proposed LUT has a sleep transistor. To reduce the standby power of the latches, the buffer of the latch is composed of small transistors with high threshold voltages. In designing the LUT, the major consideration is to guarantee that no indefinite value occurs due to the power gating. Such indefinite values cause malfunctions in asynchronous architectures since the data and control signals are integrated. In order to prevent indefinite values from occurring, the sleep signal is input to the decoders. In the sleep mode, all pass-transistors turn OFF according to the outputs of the decoders; the outputs of the multiplexers are in the high-impedance condition; and the latches keep the previous operation result.

IV. EVALUATION

A. Evaluation of a Single Cell

The proposed asynchronous FPGA is fabricated using the ASPLA 90-nm CMOS process. Fig. 19 and Table V show the microphotograph and the features of the proposed FPGA, respectively. The chip includes 200 cells in a 0.46 mm × 0.63 mm area, where a cell consists of an LB and a switch block. The sleep controller occupies 11% of the cell area. Correct operation of the proposed FPGA on the test chip is confirmed. The proposed FPGA is compared with the conventional LEDR-based FPGA without power gating [17] and the synchronous FPGA. The synchronous FPGA basically has the same architecture as the proposed asynchronous FPGA. It has a mesh-connected cellular array. Each cell consists of a two-input LUT, an output register, and a switch block. The difference between the proposed FPGA and the synchronous FPGA is

ISHIHARA et al.: LOW-POWER FPGA BASED ON AUTONOMOUS FINE-GRAIN POWER GATING

1401

TABLE VI ADVANTAGES AND OVERHEADS OF THE PROPOSED POWER GATING METHOD

Fig. 19. Chip microphotograph of the proposed FPGA.

TABLE V FEATURES OF THE PROPOSED FPGA

that the synchronous FPGA has a clock tree and does not have power gating circuitry. All the evaluation results come from HSPICE simulation. Table VI shows the advantages and overheads of the proposed FPGA running at 85 C. The evaluation circuit is a cell which consists of a logic block and a switch block as shown in Fig. 4. The number of power transistor switchings means the number of transitions from the active state to the sleep state. In this paragraph, the proposed FPGA is compared to the conventional LEDR-based FPGA [17]. The standby power of a cell is reduced by 69% in the sleep state, and the area and the dynamic power of a cell are increased by 13% and 8%, respectively. In the proposed FPGA, because the LB is powered ON before the input data arrives, there is no delay overhead. In terms of area of a cell, since the area is increased by 13%, the available logic is reduced by 12% under the same area constraint. The area and dynamic power of the logic block are increased by 14% and 6%, respectively. These overheads are mainly caused by the sleep controller, including its phase comparator and programmable delay. The area and the dynamic power of a switch block are increased by 13% and 22%, respectively. These overheads are caused by routing the wake-up signal. As shown in Fig. 4, the wake-up signal is propagated together with the data and the acknowledge signal. No extra conguration memory bits are required to route the wake-up signals; the extra routing resources, therefore, are only pass-switches and wires. In this paragraph, the proposed FPGA is compared to the synchronous FPGA. The standby power of a cell is reduced by 38% in the sleep state, and the delay of a cell is increased by 34%. In terms of area of a cell, since the area is increased by 170%, the available logic is reduced by 63% under the same area constraint. The total energy consumption per data set of the synchronous FPGA includes the power consumption by clock distribution. 
The clock distribution network of a synchronous FPGA consumes power even when the FPGA itself is not used; clock network customization to address this issue appears to be impractical [2], [3].
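The "available logic" figures quoted above follow from the cell-area increases under a fixed area budget. The short check below makes that arithmetic explicit; the function name `logic_reduction` is illustrative, and the only inputs are the 13% and 170% area increases stated in the text.

```python
def logic_reduction(area_increase: float) -> float:
    """Fraction of cells lost under a fixed total area when each cell grows by area_increase."""
    # With a fixed budget, the number of cells scales as 1 / (1 + area_increase).
    return 1.0 - 1.0 / (1.0 + area_increase)


for label, increase in [("vs. LEDR-based FPGA", 0.13), ("vs. synchronous FPGA", 1.70)]:
    reduction = logic_reduction(increase)
    print(f"{label}: cell area +{increase:.0%} -> available logic -{reduction:.0%}")
```

Rounded to whole percent, the two cases give 12% and 63%, in agreement with the values reported for Table VI.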

Based on this consideration, we evaluate the relationship between the power consumption and the workload for the synchronous FPGA, the conventional LEDR-based FPGA and the proposed FPGA. The workload refers to the ratio of the number of active-state cycles to the total number of cycles. Fig. 20 explains the workload. The vertical axis denotes the state of a circuit (active or inactive), and the horizontal axis denotes cycles. The gray cycles indicate that the circuit is active. The number of active-state cycles is counted from Fig. 20 as in (1). Then, the workload is given by

$$\text{workload} = \frac{\text{number of active-state cycles}}{\text{total number of cycles}}. \tag{2}$$

In our evaluation model, we specify three parameters instead of the workload distribution: the number of cycles per second (Ncs), the workload and the number of power transistor switchings (Nps). Note that Ncs corresponds to the data rate in asynchronous architectures or the clock frequency in synchronous architectures. Therefore, no strong assumption about the workload distribution is required. For example, we do not assume that the workload is distributed evenly over time.

Fig. 20. Workload.

Fig. 21 shows the evaluation results of a single cell at 85 °C. Ncs is assumed to be 200 M/s (corresponding to 200 MHz); Nps of the proposed FPGA is assumed to be 100, 10 000, and 1 000 000 times/s. The proposed FPGA provides lower power consumption than the conventional LEDR-based FPGA and the synchronous FPGA if the workload of the implemented application is less than 44%–45% and 55%, respectively. From this figure, the number of power switchings has a small impact on the total power consumption since the power overhead of the proposed sleep controller is small. Fig. 21 also indicates that the proposed FPGA is suitable for low-workload applications.

Fig. 21. Evaluation in terms of power consumption at 85 °C.

Fig. 22 shows the relationship between the power consumptions and Ncs. From Fig. 22, we obtain the relationship between the power reductions and Ncs as shown in Fig. 23. The workload and the temperature are assumed to be 20% and 85 °C, respectively. Let $P_{\mathrm{LEDR}}$, $P_{\mathrm{sync}}$, and $P_{\mathrm{prop}}$ be the power consumptions of the conventional LEDR-based FPGA, the synchronous FPGA and the proposed FPGA, respectively. The power reduction relative to the conventional LEDR-based FPGA is given by

$$\frac{P_{\mathrm{LEDR}} - P_{\mathrm{prop}}}{P_{\mathrm{LEDR}}} \tag{3}$$

and the power reduction relative to the synchronous FPGA is given by

$$\frac{P_{\mathrm{sync}} - P_{\mathrm{prop}}}{P_{\mathrm{sync}}}. \tag{4}$$

In the proposed FPGA, if Ncs is larger than 40 M/s, Nps is assumed to be 10 000. Otherwise, the cell turns to the sleep state immediately after each operation completion to save the standby power, considering the energy breakeven time described later.

Fig. 22. Relationship between the power consumptions and Ncs.

Fig. 23. Relationship between power reductions and Ncs.
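The switching policy above can be illustrated with a short sketch. It assumes, as a simplification that the text does not state in this form, that a cell sleeps after every operation whenever the interval between data (1/Ncs) is at least the energy breakeven time. The breakeven values are the ones reported later in this section (25, 16, and 10 ns at 85, 105, and 125 °C); all function and variable names are illustrative.

```python
# Energy breakeven times reported later in this section, in nanoseconds, keyed by deg C.
BREAKEVEN_NS = {85: 25.0, 105: 16.0, 125: 10.0}


def data_interval_ns(ncs_per_second: float) -> float:
    """Average time between successive data for a given number of cycles per second."""
    return 1e9 / ncs_per_second


def sleep_after_each_operation(ncs_per_second: float, temp_c: int) -> bool:
    """Simplified rule (an assumption): sleep between operations only if the
    data interval is at least as long as the energy breakeven time."""
    return data_interval_ns(ncs_per_second) >= BREAKEVEN_NS[temp_c]


for ncs in (200e6, 40e6, 10e6):
    decision = sleep_after_each_operation(ncs, 85)
    print(f"Ncs = {ncs / 1e6:.0f} M/s: interval = {data_interval_ns(ncs):.0f} ns, "
          f"sleep after each operation at 85 C: {decision}")
```

Under this simplified rule the crossover at 85 °C falls at 40 M/s, where the data interval equals the 25-ns breakeven time. This is an observation from the sketch rather than a derivation given in the paper, and it ignores the time the cell spends computing within each interval.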

When Ncs is small, the interval between successive data is long even in the active-state cycles, and it is worth switching the power switch to save the standby power after each operation completion. Compared to the conventional LEDR-based FPGA, as Ncs decreases, the power reduction becomes larger. This is because the ratio of the standby power to the total power becomes larger and the power gating becomes more efficient as Ncs decreases. Compared to the synchronous FPGA, as Ncs decreases to 40 M/s, the power reduction decreases. This is because reducing the dynamic power due to the clock tree is more effective than reducing the standby power of the proposed FPGA. As Ncs decreases below 40 M/s, the power reduction becomes larger. This result shows that the power gating within a cycle is effective when the interval between successive data is long.

Standby power due to leakage current increases exponentially with temperature [31]. Temperature dependence of the subthreshold leakage current is important, since digital VLSI circuits usually operate at elevated temperatures due to the power dissipation (heat generation) of the circuit [32]. Therefore, we evaluated our circuits at three different temperatures. Our results indicate that the proposed power gating has the greatest impact at higher temperatures. Based on this consideration, the comparison in terms of power consumption is also evaluated at 105 °C and 125 °C. Figs. 24 and 25 show the evaluation results of a single cell at 105 °C and 125 °C, respectively. Ncs is assumed to be 200 M/s; Nps of the proposed FPGA is assumed to be 100, 10 000, and 1 000 000 times/s. As mentioned above, the number of power switchings has a small impact on the total power consumption in the proposed architecture. Therefore, we omit the enlarged graphs around the crosspoints.

Fig. 24. Evaluation in terms of power consumption at 105 °C.

Fig. 25. Evaluation in terms of power consumption at 125 °C.

Although the power consumption of the clock network can be eliminated by the asynchronous architecture, the standby power of the asynchronous architecture is larger than that of the synchronous architecture due to its hardware complexity. The ratio of the standby power to the total power becomes higher as the temperature increases. As a result, the conventional LEDR-based architecture is less efficient in a high-leakage environment, such as at high temperature, than at low temperature. At 125 °C, the conventional LEDR-based FPGA is more efficient than the synchronous FPGA only if the workload is lower than 18%. Accordingly, for the asynchronous architecture, an efficient power gating technique is necessary to achieve low power. Compared with the conventional LEDR-based FPGA, the proposed FPGA becomes increasingly power-efficient as the temperature increases. The proposed FPGA provides lower power consumption than the conventional LEDR-based FPGA if the workload of the implemented application is less than 59% and 73% at 105 °C and 125 °C, respectively. Compared with the synchronous FPGA, the proposed FPGA provides lower power consumption if the workload of the implemented application is less than 46% and 42% at 105 °C and 125 °C, respectively.

When power gating is used, the sleep controller and power switch consume energy when transitioning between the sleep and active states. The energy breakeven time is defined to be the point at which the leakage energy savings become equal to the energy penalty incurred by entering and exiting the sleep state. If a circuit sleeps for more than the energy breakeven time, it is worth switching the power switch [33]. Table VII shows the energy breakeven time of the proposed power gating method. Due to the small power overhead of generating the sleep signal, the energy breakeven times at 85 °C, 105 °C, and 125 °C are as short as 25, 16, and 10 ns, respectively.

TABLE VII ENERGY BREAKEVEN TIME OF THE AUTONOMOUS FINE-GRAIN POWER GATING

In a real application, many cells are used. In this case, the power consumption is estimated by using the measurements of the power of a single cell shown in Figs. 21, 24, and 25 as follows. Let us consider the case where four cells are used as shown in Fig. 26(a). The workloads of cell0, cell1, cell2 and cell3 are 30%, 15%, 25%, and 10%, respectively. The temperature, Ncs and Nps of each cell are assumed to be 85 °C, 200 M/s, and 10 000, respectively. The power consumption of each cell is estimated by using the workload and the measurements of the power of a single cell shown in Fig. 21. Table VIII shows the power consumptions of the cells and the total power consumption estimated by adding up the power consumptions of all cells. As shown in Table IX, the total power consumption can also be estimated from the average workload per cell. From the workloads of the cells, the average workload per cell is

$$\frac{30\% + 15\% + 25\% + 10\%}{4} = 20\%. \tag{5}$$

From the average workload, the average power consumption per cell is estimated from Fig. 21. The total power consumption is estimated by multiplying the average power consumption per cell by the number of cells. From Tables VIII and IX, the total power consumption of the case of Fig. 26(b) is the same as that of Fig. 26(a).

Fig. 26. Workloads of four cells.

TABLE VIII POWER CONSUMPTION MEASUREMENTS FOR FIG. 26(a)

TABLE IX POWER CONSUMPTION MEASUREMENTS FOR FIG. 26(b)

The most accurate way to evaluate the power consumption is to use circuit simulators such as HSPICE throughout the evaluation at the transistor level. However, FPGAs are too large to evaluate using only circuit simulators. Therefore, our evaluation is based on the workload of cells as follows. The relationship between the workload of a cell and the power consumption of the cell is measured using HSPICE simulation in advance as shown in Figs. 21, 24, and 25. The whole circuit for a target application is divided into blocks such as multipliers, and the workloads of the blocks are estimated at the cycle level. Each block is mapped onto cells of an FPGA, and the workloads of the cells in the active state are estimated. Using the workloads of the blocks and the workloads of cells in the active state, we obtain the average workloads of cells throughout the target application. Using the resulting average workload of cells and the premeasured power consumption of a cell, we estimate the total power consumption for the target application by adding up the power consumptions of all the cells as mentioned above.

B. Evaluation Using Benchmarks

In the following, the FPGAs are evaluated using four benchmarks: template matching, median filter, elliptic filter [34], and 64-point FFT. Tables X and XI summarize the results.

TABLE X POWER REDUCTION COMPARED TO THE CONVENTIONAL LEDR-BASED FPGA

TABLE XI POWER REDUCTION COMPARED TO THE SYNCHRONOUS FPGA

Template matching is used in a variety of applications, such as image processing, MPEG encoding, and object recognition, among others. A template is a square sub-image. The objective of template matching is to find a window that is most similar to the template within an image. The similarity measure is used to estimate how similar the template and the candidate window are. Given a template, a similarity measure is computed for all the possible candidate windows. Then, the most similar candidate window is obtained. The sum of absolute differences (SAD) is one commonly used similarity measure. A smaller SAD means that the template is more similar to the candidate window. Template matching is computation-intensive since SADs are computed for many candidate windows and even a single SAD includes many operations. To reduce the amount of computation, we use the adaptive algorithm [35]. Given a template, the similarity measure is calculated for many candidate windows. In the adaptive algorithm, an SAD is calculated in a pixel-serial manner. If the intermediate value of the SAD exceeds the current minimum SAD for previous candidate windows, the SAD calculation is stopped since the window is found not to be the most similar window to the template. When this adaptive algorithm is executed by processing elements in parallel for different windows, the workloads of the processing elements vary depending on the windows. Such a variance of workloads is suitable for the proposed FPGA since the processing elements can turn off automatically depending on their workloads. Moreover, the adaptive algorithm is suitable for the proposed FPGA since the average workload per cell is as low as 18.5%.

We consider images having dimension 512 × 512 pixels and templates having size 16 × 16. In order to achieve high throughput and area efficiency, a bit-serial pipeline architecture is fully employed. Basically, a processing element for SAD computation is designed based on the mapping in [27], except that a full adder is constructed from several cells in the mapping of the proposed FPGA. This is because the cell of the proposed FPGA does not have carry logic, unlike [27].
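The early-termination idea behind the adaptive algorithm can be sketched in software as follows. This is a serial illustration only, not the bit-serial hardware mapping used in the paper; the function names, the tiny test image, and the use of "fraction of pixel operations actually performed" as a rough stand-in for workload are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

Image = Sequence[Sequence[int]]


def sad_early_exit(template: Image, image: Image, top: int, left: int,
                   best_so_far: Optional[int]) -> Tuple[Optional[int], int]:
    """Accumulate |template - window| pixel by pixel and stop as soon as the partial
    sum exceeds best_so_far. Returns (SAD, or None if abandoned; pixels processed)."""
    total, processed = 0, 0
    for r, row in enumerate(template):
        for c, t in enumerate(row):
            total += abs(t - image[top + r][left + c])
            processed += 1
            if best_so_far is not None and total > best_so_far:
                return None, processed   # this window cannot be the best match
    return total, processed


def match_template(template: Image, image: Image) -> Tuple[Tuple[int, int], int, float]:
    """Return the best window position, its SAD, and the fraction of pixel
    operations performed relative to an exhaustive SAD computation."""
    th, tw = len(template), len(template[0])
    best_pos, best_sad, done, possible = (0, 0), None, 0, 0
    for top in range(len(image) - th + 1):
        for left in range(len(image[0]) - tw + 1):
            sad, processed = sad_early_exit(template, image, top, left, best_sad)
            done += processed
            possible += th * tw
            if sad is not None and (best_sad is None or sad < best_sad):
                best_pos, best_sad = (top, left), sad
    return best_pos, best_sad, done / possible


# Small hypothetical example: the template appears exactly at row 1, column 1.
image = [[0, 0, 0, 0],
         [0, 9, 8, 0],
         [0, 7, 6, 0],
         [0, 0, 0, 0]]
template = [[9, 8],
            [7, 6]]
pos, sad, fraction = match_template(template, image)
```

Once a good match has been found, most later windows are abandoned after a few pixels, so the amount of work per window varies widely. That variation is what lets idle processing elements be powered off in the proposed FPGA.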
In the evaluation, the number of cycles per second is assumed to be 200 M/s. The maximum throughput is determined by the critical path delay of a full adder, which is estimated to be 2 ns. Hence, the maximum number of cycles per second is estimated to be 500 M/s, which is high enough for image processing at the video rate (30 frames/s). Compared to the conventional LEDR-based FPGA, the power consumption is reduced by 15%, 22%, and 30% at 85 °C, 105 °C, and 125 °C, respectively. Compared to the synchronous FPGA, the power consumption is reduced by 38%, 33%, and 30% at 85 °C, 105 °C, and 125 °C, respectively.

For the median filter, the bit-serial pipeline architecture is fully employed, as in the template matching example. Quick sort is employed to reduce the amount of computation, and the average workload per cell is as low as 16.4%. Compared to the conventional LEDR-based FPGA, the power consumption of the proposed FPGA is reduced by 17%, 24%, and 33% at 85 °C, 105 °C, and 125 °C, respectively. Compared to the synchronous FPGA, the power consumption of the proposed FPGA is reduced by 41%, 36%, and 33% at 85 °C, 105 °C, and 125 °C, respectively.

For the elliptic filter, the bit-serial pipeline architecture is fully employed, too. Pipelined parallel-serial multipliers [36], [37] are used to enhance the throughput. Although multipliers occupy a larger area than adders, their workloads are low because of data dependency. As a result, the average workload per cell is as low as 18.4%. Compared to the conventional LEDR-based FPGA, the power consumption of the proposed FPGA is reduced by 15%, 22%, and 30% at 85 °C, 105 °C, and 125 °C, respectively. Compared to the synchronous FPGA, the power consumption of the proposed FPGA is reduced by 38%, 32%, and 30% at 85 °C, 105 °C, and 125 °C, respectively.

For the 64-point FFT, the bit-serial pipeline architecture is fully employed, too. In order to enhance the total throughput, we use as many as five adders and one multiplier. As a result, the average workload per cell is relatively high, at 26.7%. Compared to the conventional LEDR-based FPGA, the power consumption of the proposed FPGA is reduced by 8%, 14%, and 21% at 85 °C, 105 °C, and 125 °C, respectively. Compared to the synchronous FPGA, the power consumption of the proposed FPGA is reduced by 25%, 20%, and 17% at 85 °C, 105 °C, and 125 °C, respectively.

V. DISCUSSION

The main purpose of this paper is to demonstrate the small overheads of the proposed fine-grain power gating method in FPGAs. Therefore, in this prototype chip, the simplest datapath architecture, with two-input LUTs and neighbor-switch-block connections, is used in the proposed FPGA. In the following, we discuss the extension to the more complex datapath architectures of modern FPGAs such as those from Xilinx and Altera. Such a modern FPGA has a complex datapath architecture as follows: a LUT has four to six inputs to implement complex functions efficiently; carry logic and a fast carry chain implement arithmetic functions efficiently; diamond switches achieve high connectivity; long wires that span more than one logic block allow signals to travel greater distances more efficiently. These modern FPGAs are much more efficient for arithmetic functions than the simple datapath architectures. The proposed fine-grain power gating method can be applied to the complex datapath architectures of the modern FPGAs.
Even in the simple datapath architecture of the proposed FPGA, the power and area of the sleep controller are much smaller than those of the datapath components such as a logic block and a switch block. In a more complex datapath architecture, the relative overheads of the sleep controller are even smaller. When applying the proposed fine-grain power gating method to such complex datapath architectures, the major problem is that the area of an LEDR-based LUT increases rapidly as the number of inputs increases, although the LEDR encoding is useful to enhance the throughput and reduce the power consumption of switch blocks. To solve this problem, four-phase dual-rail encoding can be efficiently combined with the LEDR encoding. Four-phase dual-rail encoding is suitable for a LUT because of its small area, while LEDR encoding is suitable for a switch block because of its high throughput and low power [38].

VI. CONCLUSION

This paper proposed an asynchronous FPGA based on autonomous fine-grain power gating with small overheads. In an asynchronous architecture, the activity of an LB is easily detected only by comparing the phases of the input and the output data. To implement the autonomous fine-grain power gating efficiently, the standby state is used to wake up the LB before the data arrives and to power OFF the LB only when the data does not come for quite a while. As a result, the wake-up time can be hidden and the dynamic power of unnecessary switching of the sleep transistor can be saved. By fully exploiting the adaptivity of the asynchronous architecture, control parameters such as supply voltages, threshold voltages and the degree of parallelism will be self-adaptive to the workload, data path and temperature. Hence, the autonomous technique is also suitable for dynamically reconfigurable processors, which work toward self-adaptation for ambient intelligence. Since their data paths change dynamically and frequently, it is more difficult than in FPGAs to determine the control parameters for each LB using offline analysis.

REFERENCES
[1] H. Z. V. George and J. Rabaey, "The design of a low energy FPGA," in Proc. Int. Symp. Low Power Electron. Des., CA, Aug. 1999, pp. 188–193.
[2] Synplicity Inc., Sunnyvale, CA, "Gated clock conversion with Synplicity's synthesis products," Jul. 2003.
[3] Xilinx Inc., San Jose, CA, "Synthesis and simulation design guide," 2008. [Online]. Available: http://www.xilinx.com/itp/xilinx10/books/docs/sim/sim.pdf
[4] Y. Zhang, J. Roivainen, and A. Mammela, "Clock-gating in FPGAs: A novel and comparative evaluation," in Proc. EUROMICRO Conf. Digit. Syst. Des., 2006, pp. 584–590.
[5] T. Tuan, S. Kao, A. Rahman, S. Das, and S. Trimberger, "A 90 nm low-power FPGA for battery-powered applications," in Proc. FPGA, Feb. 2006, pp. 22–24.
[6] Xilinx Inc., San Jose, CA, "Spartan-3 FPGA family datasheet," 2009. [Online]. Available: http://www.xilinx.com
[7] Xilinx Inc., San Jose, CA, "Virtex-4 FPGA family datasheet," 2009. [Online]. Available: http://www.xilinx.com
[8] M. Keating, D. Flynn, R. Aitken, A. Gibbons, and K. Shi, Low Power Methodology Manual: For System-on-Chip Design. New York: Springer, 2007.
[9] A. Rahman, S. Das, T. Tuan, and S. Trimberger, "Determination of power gating granularity for FPGA fabric," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), 2006, pp. 9–12.
[10] M. Hariyama, S. Ishihara, and M. Kameyama, "A low-power field-programmable VLSI based on a fine-grained power-gating scheme," in Proc. IEEE Int. Midw. Symp. Circuits Syst. (MWSCAS), Knoxville, Aug. 2008, pp. 430–433.
[11] S. Ishihara, M. Hariyama, and M. Kameyama, "A low-power FPGA based on autonomous fine-grain power-gating," in Proc. Asia South Pacific Des. Autom. Conf. (ASP-DAC), Yokohama, Japan, Jan. 2009, pp. 119–120.
[12] K. Maheswaran and V. Akella, "PGA-STC: Programmable gate array for implementing self-timed circuits," Int. J. Electron., vol. 84, no. 3, pp. 255–267, 1998.
[13] R. Payne, "Self-timed FPGA systems," in Proc. Int. Workshop Field Program. Logic Appl., 1995, pp. 21–35.
[14] J. Teifel and R. Manohar, "An asynchronous dataflow FPGA architecture," IEEE Trans. Computers, vol. 53, no. 11, pp. 1376–1392, Nov. 2004.
[15] R. Manohar, "Reconfigurable asynchronous logic," in Proc. IEEE Custom Integr. Circuits Conf., Sep. 2006, pp. 13–20.
[16] M. Hariyama, S. Ishihara, C. C. Wei, and M. Kameyama, "A field-programmable VLSI based on an asynchronous bit-serial architecture," in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Jeju, Korea, Nov. 2007, pp. 380–383.
[17] M. Hariyama, S. Ishihara, and M. Kameyama, "Evaluation of a field-programmable VLSI based on an asynchronous bit-serial architecture," IEICE Trans. Electron, vol. E91-C, no. 9, pp. 1419–1426, 2008.
[18] M. E. Dean, T. E. Williams, and D. L. Dill, "Efficient self-timing with level-encoded 2-phase dual-rail (LEDR)," in Proc. Univ. California/Santa Cruz Conf. Adv. Res. VLSI, 1991, pp. 55–70.


[19] P. Teehan, G. G. Lemieux, and M. R. Greenstreet, "Towards reliable 5 Gbps wave-pipelined and 3 Gbps surfing interconnect in 65 nm FPGAs," in Proc. Int. Symp. Field-Program. Gate Array (FPGA), 2009, pp. 43–52.
[20] T. Mak, C. D'Alessandro, P. Sedcole, P. Y. Cheung, A. Yakovlev, and W. Luk, "Implementation of wave-pipelined interconnects in FPGAs," in Proc. ACM/IEEE Int. Symp. Networks-on-Chip, 2008, pp. 213–214.
[21] T. Mak, P. Sedcole, P. Y. Cheung, and W. Luk, "Wave-pipelined signaling for on-FPGA communication," in Proc. IEEE Int. Conf. Field-Program. Technol. (FPT), 2008, pp. 9–16.
[22] S. Narayanasamy, T. Sherwood, S. Sair, B. Calder, and G. Varghese, "Catching accurate profiles in hardware," in Proc. 9th Int. Symp. High Perform. Comput. Arch. (HPCA-9), 2003, pp. 269–280.
[23] J. Lau, S. Schoenmackers, and B. Calder, "Transition phase classification and prediction," in Proc. Int. Symp. High-Perform. Comput. Arch., Washington, DC, 2005, pp. 278–289.
[24] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in Proc. Int. Symp. Low Power Electron. Des., 2004, pp. 32–37.
[25] A. Youssef, M. Anis, and M. Elmasry, "Dynamic standby prediction for leakage tolerant microprocessor functional units," in Proc. Ann. IEEE/ACM Int. Symp. Microarch., Washington, DC, 2006, pp. 371–384.
[26] Y. Ahmed, A. Mohab, and E. Mohamed, "A comparative study between static and dynamic sleep signal generation techniques for leakage tolerant designs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1114–1126, Sep. 2008.
[27] M. Hariyama, W. Chong, and M. Kameyama, "Field-programmable VLSI based on a bit-serial fine-grain architecture," IEICE Trans. Electron, vol. E87-C, no. 11, pp. 1897–1902, 2004.
[28] N. Ohsawa, M. Hariyama, and M. Kameyama, "High-performance field programmable VLSI processor based on a direct allocation of a control/data flow graph," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2002, pp. 95–100.
[29] H. M. Waidyasooriya, W. Chong, M. Hariyama, and M. Kameyama, "Multi-context FPGA using fine-grained interconnection blocks and its CAD environment," IEICE Trans. Electron., vol. E91-C, no. 4, pp. 517–525, 2008.
[30] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective. Norwell, MA: Kluwer, 2001.
[31] M. Mui, K. Banerjee, and A. Mehrotra, "Supply and power optimization in leakage-dominant technologies," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 24, no. 9, pp. 1362–1371, Sep. 2005.
[32] K. Roy, S. Mukhopadhay, and H. Mahmoodi-Meimand, "Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits," Proc. IEEE, vol. 91, no. 2, pp. 305–327, Feb. 2003.
[33] H. Cheng and S. Goddard, "Online energy-aware I/O device scheduling for hard real-time systems," in Proc. Conf. Des., Autom. Test Eur. (DATE), 2006, pp. 1055–1060.
[34] E. D. Lagnese and D. E. Thomas, "Architectural partitioning for system level design," in Proc. ACM/IEEE Des. Autom. Conf. (DAC), 1989, pp. 62–67.
[35] T. Enomoto, Y. Sasajima, A. Hirobe, and T. Ohsawa, "Fast motion estimation algorithm and low power CMOS motion estimation array LSI for MPEG-2 encoding," in Proc. Int. Symp. Circuits Syst., 1999, vol. 4, pp. 203–206.

[36] D. Gajski, A. Wu, N. Dutt, and S. Lin, Principles of CMOS VLSI Design: A Systems Perspective. Boston, MA: Addison Wesley, 1985.
[37] S. G. Smith and P. B. Denyer, Serial-Data Computation. Norwell, MA: Kluwer, 1988.
[38] S. Ishihara, Y. Komatsu, M. Hariyama, and M. Kameyama, "An asynchronous field-programmable VLSI using LEDR/4-phase-dual-rail protocol converters," in Proc. Int. Conf. Eng. Reconfig. Syst. Algorithms (ERSA), 2009, pp. 145–150.

Shota Ishihara (S'09) received the B.E. degree in information engineering and the M.S. degree in information sciences from Tohoku University, Sendai, Japan, in 2007 and 2009, respectively, where he is currently pursuing the Ph.D. degree in the Graduate School of Information Sciences. His research interests include reconfigurable computing and asynchronous architecture.

Masanori Hariyama (M'02) received the B.E. degree in electronic engineering, the M.S. degree in information sciences, and the Ph.D. degree in information sciences from Tohoku University, Sendai, Japan, in 1992, 1994, and 1997, respectively. He is currently an Associate Professor with the Graduate School of Information Sciences, Tohoku University. His research interests include VLSI computing for real-world applications such as robots, high-level design methodology for VLSIs, and reconfigurable computing.

Michitaka Kameyama (M'79–F'97) received the B.E., M.E., and D.E. degrees in electronic engineering from Tohoku University, Sendai, Japan, in 1973, 1975, and 1978, respectively. He is currently Dean and Professor with the Graduate School of Information Sciences, Tohoku University. His general research interests include intelligent integrated systems for real-world applications and robotics, advanced VLSI architecture, and new-concept VLSI including multiple-valued VLSI computing. Dr. Kameyama was a recipient of the Outstanding Paper Awards at the 1984, 1985, 1987, and 1989 IEEE International Symposiums on Multiple-Valued Logic, the Technically Excellent Award from the Society of Instrument and Control Engineers of Japan in 1986, the Outstanding Transactions Paper Award from the IEICE in 1989, the Technically Excellent Award from the Robotics Society of Japan in 1990, and the Special Award at the 9th LSI Design of the Year in 2002.
