
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

An FPGA Architecture and CAD Flow Supporting Dynamically Controlled Power Gating
Assem A. M. Bsoul, Student Member, IEEE, Steven J. E. Wilton, Senior Member, IEEE,
Kuen Hung Tsoi, and Wayne Luk, Fellow, IEEE

Abstract: Leakage power is an important component of the total power consumption in field-programmable gate arrays (FPGAs) built using 90-nm and smaller technology nodes. Power gating was shown to be effective at reducing the leakage power. Previous techniques focus on turning OFF unused FPGA resources at configuration time; the benefit of this approach depends on resource utilization. In this paper, we present an FPGA architecture that enables dynamically controlled power gating, in which FPGA resources can be selectively powered down at run-time. This could lead to significant overall energy savings for applications having modules with long idle times. We also present a CAD flow that can be used to map applications to the proposed architecture. We study the area and power tradeoffs by varying the different FPGA architecture parameters and power gating granularity. The proposed CAD flow is used to map a set of benchmark circuits that have multiple power-gated modules to the proposed architecture. Power savings of up to 83% are achievable for these circuits. Finally, we study a control system of a robot that is used in endoscopy. Using the proposed architecture combined with clock gating results in up to 19% energy savings in this application.

Index Terms: Computer aided design, field-programmable gate arrays, power gating, static/leakage power.

Manuscript received June 11, 2014; revised September 27, 2014; accepted December 28, 2014. The work of A. A. M. Bsoul and S. J. E. Wilton was supported in part by Altera Toronto Technology Center, Toronto, ON, Canada, and in part by the Natural Science and Research Council of Canada. The work of K. H. Tsoi and W. Luk was supported in part by the U.K. Engineering and Physical Sciences Research Council and in part by the European Union Seventh Framework Programme under Grant 257906, Grant 287804, and Grant 318521.
A. A. M. Bsoul and S. J. E. Wilton are with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: [email protected]; [email protected]).
K. H. Tsoi and W. Luk are with the Department of Computing, Imperial College London, London SW7 2AZ, U.K.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2015.2393914
1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Field-programmable gate arrays (FPGAs) have become ubiquitous in applications, such as telecommunications, digital signal processing, and scientific computing. In the mobile devices market, however, FPGAs have had limited penetration, partially due to their high power consumption. Compared with application-specific integrated circuit (ASIC) implementations, FPGA implementations consume 12× more power on average [1]. To bring reconfigurable technology to these hand-held applications, new programmable devices that consume significantly less power are required.

Many researchers have proposed techniques for reducing the power dissipation of FPGAs based on the methods that have originally been applied to ASICs, including guarded evaluation, clock gating, power gating, dual supply voltages, and power-aware CAD optimization [2]–[5]. Even after applying all these techniques, the power consumption of FPGAs remains prohibitive for some applications.

Previous techniques to reduce the power dissipation of FPGAs have focused on reducing both the dynamic and the static (leakage) power of these devices. Dynamic power is dissipated due to charging and discharging of the circuit's capacitance, while leakage power is dissipated when the circuit is idle. Static power dissipation is a major component of the total power consumption in reconfigurable devices based on the sub-90-nm CMOS technology nodes. Recent reports from FPGA vendors indicate that FPGAs built on a 28-nm technology have roughly equal amounts of dynamic and static power [6], [7]. In handheld devices, it is conceivable that the leakage power will be even more significant, since these devices are often used in an always-ON state, remaining idle except for short bursts of activity. Thus, low-leakage FPGAs are essential if they are to be used for these kinds of applications.

An effective way to reduce leakage power is to employ power gating [8]. As shown in Fig. 1(a), by connecting the supply voltage or the ground of a circuit component through a power gating transistor, also called a sleep transistor or a power switch, the circuit component can be turned ON or OFF by turning the corresponding power switch ON or OFF. When the power switch is turned OFF, the leakage current is limited by that of the power switch. A performance loss may result because of the extra resistance in the current path. By sizing the power switch appropriately, an acceptable tradeoff between the performance, power savings, and area can be found.

Fig. 1. Basic idea of power gating. (a) Basic power gating. (b) Configurable power gating.

Previous proposals for power gating in FPGAs use configuration bits to control the power switches [as in Fig. 1(b)] [4], [9]–[11]. We refer to them as statically

controlled power gating, since once configured, the state of each part of the chip (ON or OFF) does not change. Statically controlled power gating is effective for FPGAs, since if the design does not fill an entire FPGA, the remainder of the FPGA can be safely turned OFF, saving leakage power. However, if only a small number of resources in an FPGA are not used, the savings from this technique may be limited.

In this paper, we propose dynamically controlled power gating (DCPG) in an FPGA. In our architecture, the power switches can be turned ON and OFF at run-time under the control of other circuitry either running on the FPGA itself, or external to the FPGA. The signals to control the power switches are connected to the general-purpose routing fabric of the FPGA.

This paper is based on [12] and [13]. The work in [12] focuses on power gating for logic resources in FPGAs, and the work in [13] focuses on power gating for coarse-grained routing resources. Our main additional contributions in this paper are as follows.
1) We propose fine-grained power gating for routing resources. This allows powering down a larger number of routing resources at configuration time, and enables dynamic power state control for a larger number of routing resources at run-time.
2) We present a CAD flow that can be used to map the application circuits that contain power-gated modules to the proposed architecture. In this flow, power control signals are connected to the different power-gated resources to control their power state at run-time using the existing general-purpose routing fabric of an FPGA.
3) We propose enhancements to an FPGA routing algorithm that try to minimize the number of routing resources that cannot be powered down at run-time.
4) The presented CAD flow is used to evaluate the best granularity of routing resources power gating.
5) We evaluate a robot control system used in medical applications using the power gating architecture proposed in this paper, and we study its power savings for different operation activities.

We evaluate the proposed architecture in terms of its area overhead and the amount of leakage power reduction that it can achieve by varying the basic FPGA architecture parameters, and by studying different architecture granularity levels. We also use the proposed CAD flow to evaluate the potential power savings in a set of synthetic benchmark circuits, in addition to the robot control system mentioned above.

This paper is organized as follows. Section II provides an overview of related work, and describes the FPGA architecture model adopted in this paper. Section III describes the proposed DCPG FPGA architecture. Section IV describes the proposed CAD flow and the enhancements to the routing algorithm to maximize the number of resources that can be turned OFF. In Section V, we describe the different benchmark circuits used to evaluate the proposed architecture. Finally, in Section VI, we experimentally evaluate the proposed architecture.

II. BACKGROUND

A. Related Work

Lin et al. [9] studied fine-grained power gating for FPGAs to turn OFF unused resources at configuration time; their study showed that the area overhead could be >100%, which is undesirable because of the associated degradation in power and timing, and the increase in cost.

Gayasen et al. [10] proposed coarse-grained power gating using a power switch for a region of logic blocks. The use of dynamic reconfiguration was suggested to change the power state for the different regions in an FPGA based on their activity. However, this incurs power overhead and can only be applied at a very coarse granularity.

Tuan et al. [4] proposed power gating for an architecture similar to the Xilinx Spartan-3. Their architecture supports sleep mode using a sleep signal from an off-chip controller that is connected to all power switches in the FPGA; this scheme allows creating one controllable power domain only.

Bharadwaj et al. [11] proposed synthesizing a power state controller from the data flow graph of an application; this controller could exploit the idleness periods of the application to reduce the dissipated leakage energy in an FPGA. They used the same architecture as in [10].

Li et al. [14] proposed using a power control hard macro that is associated with each tile in an FPGA to control its power state (clock and power gating). They assume a power gating architecture similar to that in [12].

Hoo et al. [15] proposed fine-grained power gating for switch blocks (SBs) and a routing algorithm to optimize the power savings. The proposed architecture, however, only supports powering down unused switches at configuration time.

Dynamic partial reconfiguration is also reported to reduce the static power at run-time by enabling time sharing of FPGA resources [6]. However, swapping reconfigurable modules happens at the scale of milliseconds, which may result in high power overhead. In contrast, the proposed architecture enables changing the power state at the scale of nanoseconds.

B. Architecture Framework

In this paper, we assume a tile-based FPGA architecture [16]. An FPGA is composed of an array of tiles; each tile is composed of a logic cluster (LC) and the associated routing resources [two routing channels (RCs) and an SB], as shown in Fig. 2. An LC is composed of a number of basic logic elements (BLEs); each BLE is composed of a lookup table (LUT), a flip-flop (FF), and a multiplexer to select between the combinational or the registered output. A local switch matrix in the LC is typically included to support routing intracluster connections. Fig. 2 shows an LC composed of N BLEs.

Each LC is surrounded by RCs from its four sides. The intersection of two RCs forms an SB that can be configured to route the signals to the different directions. Fig. 3 shows examples of the connections for switches in an SB.

A connection from an RC that borders an LC to one of its input pins can be made through configurable switches,

called connection boxes (CBs). Buffers are typically inserted to isolate the load capacitance of the wires in the RC from the inputs of the CBs for performance issues, and they are shared among all CBs bounded by that specific RC. Finally, the outputs of an LC are connected directly to multiplexers in the SBs through isolation buffers. This is similar to the architectural assumptions made in the VPR 5.0 tool [17].

Fig. 2. Tile-based FPGA architecture.

Fig. 3. Example incoming wire tracks and switches in an SB.

III. PROPOSED DCPG FPGA ARCHITECTURE

Fig. 4 shows an example system of three modules that are mapped to an FPGA that supports DCPG. Each module occupies a number of power gating regions (PGRs) in the target FPGA architecture. Each PGR is composed of a number of FPGA tiles; the number of tiles in a region dictates the architecture granularity. The power state of each PGR can be configured as always-ON, always-OFF, or dynamically-controlled (DC). As will be explained later in this section, the power state for some of the internal components of a PGR can be configured to a different power state than that for the encapsulating PGR.

Fig. 4. Example application mapped to an FPGA supporting DCPG.

In this example, two of the functional modules, M1 and M2, experience long idle periods, thus it is desired to power them down during these times to reduce the leakage power consumption. The power state of the PGRs of M1 and M2 is configured to DC, which allows controlling their power state at run-time. Power control signals are routed from a power controller module to control the power states of modules M1 and M2. The third module, M3, does not experience idle periods, thus its power state is configured to be always-ON. Similarly, the power state for the power controller is configured to be always-ON. The power state for routing resources that are used to route the power control signals is configured to always-ON.

Power gating a module is beneficial if the energy consumed during its idle periods is larger than the overhead of applying power gating. This overhead results from the energy consumed by the power controller and during power state transitions.

The proposed architecture enables realizing power domains with different temporal (idle/active periods) and spatial characteristics (sizes and locations), thus it is suitable for a wide range of applications. There is no need to have fixed tracks in the FPGA fabric to work as power control signals; rather, power control signals can be routed on the preexisting FPGA routing fabric similar to any other user signal. The following sections describe the details of this architecture.

A. Basic Power Gating Architecture

In this section, we describe a fine-grained version of the proposed power gating architecture. Fig. 5 shows two tiles of an FPGA; some details are not shown for the sake of clarity. The basic power gating architecture supports power gating at the granularity of one tile, thus a PGR is one tile. The power state can be set by configuring the SRAM cells that control the select lines of the 3:1 multiplexers that drive the power switches. The novelty of the proposed architecture lies in its support for controlling the power state of individual LCs and the routing resources (input pin CBs, track isolation buffers, and SBs) dynamically at run-time. This makes the proposed power gating scheme suitable for various tile-based FPGA architectures.

Fig. 5. Basic fine-grained DCPG for two neighboring tiles. Shared bordering RCs have their own power gating circuit.

The supported power modes are always-ON, always-OFF, and DC. The always-ON mode sets the resources in a powered state. This is useful for resources that need to be available all

the time, such as power control signals or application modules that do not experience idle times. The always-OFF power state puts the resources in sleep and in low-leakage mode. This is useful for resources that are not utilized by the application that is mapped to the device. DC means that the power state of a resource can be controlled at run-time, and can be changed by changing the value on the power control signal.

As shown in Fig. 5, one of the bordering input pins of each LC can be used to route the power control signal to the power switch (control signals are labeled PG_CNTL1 and PG_CNTL2). If an LC's power state is configured as DC, its power gating multiplexer (the 4:1 multiplexer in the figure) is used to route the power control signal by configuring the SRAM cells that control the select lines of the 4:1 multiplexer. If an LC's input pin is used to route the power control signal, then it cannot be used by the logic implemented in the LC. Variations to this organization where a subset of the input pins are used as inputs to the power gating multiplexer can be realized. However, this makes it harder to route the power control signals, since a smaller set of routing tracks can be used to route the control signals.

For correct operation of the DC mode, the power state of the RC that is used to route the power control signal must be configured as always-ON. To support this, separate power gating circuitry is used for the bordering RCs of an LC. The lower part of Fig. 5 shows the details. When configured to the DC mode, the AND gate ensures that the shared RC is turned OFF only when both neighboring LCs are turned OFF (when PG_CNTL1 = 1 and PG_CNTL2 = 1). This ensures that any of the bordering RCs of an LC can be used as the entry point for the power control signal. Therefore, such a signal could be routed from an on-chip power controller to the target LCs in the same way that any other user circuit signal is routed.

The same power control signal can be routed to any number of LCs that belong to the same power-gated module, which forms one power domain. The SBs' power state can be configured in the same manner discussed above. More details about SB power gating will be discussed later in this section.

In the proposed architecture, the configuration memory cells and the FFs in an LC are not power gated. The configuration SRAM cells are typically implemented using a low-leakage, high-Vth process, such as the medium oxide thickness transistors used in configuration SRAM cells in some commercial FPGAs [18]. The area of FFs within an LC is relatively small, and thus they consume only a small amount of leakage power. Therefore, these components are kept ON all the time instead of using other state-saving mechanisms that would increase the architecture complexity and power consumption.

Fig. 6 shows the details of an LC. Pull-down nMOS transistors are used to isolate the outputs of the LC when the LC is in sleep mode. This prevents large short-circuit current in SB buffers that are driven by the LC outputs. Similarly, the inputs of the FFs inside an LC are also isolated to prevent large short-circuit current in the FFs. Notice that we assume that clock gating is used in association with DCPG mode during the idle times; this guarantees that the values stored in the FFs do not get corrupted during sleep mode. The pull-down nMOS transistors are controlled using the output of the 3:1 multiplexers that drive the power switch.

Fig. 6. Internals of an LC with DCPG showing pull-down nMOS devices at FF input and LC outputs. A status signal to indicate completion of power transition can be routed through one of the LC's outputs (the top).

The architecture also provides a feedback signal to indicate that a power domain has completed a power transition. Fig. 6 shows that an inverted version of PLDN_CNTL, which is the output of the related 3:1 multiplexer shown in Fig. 5, can be routed through one of the LC's outputs. This is done using a 2:1 multiplexer at the output of BLE #1 inside the LC. The SRAM cell that controls the select line of the 2:1 multiplexer is configured to choose the feedback signal (PLDN_CNTL) or the normal output of BLE #1. If the feedback signal is selected to be routed to the output of the LC, the output of BLE #1 can only be used internally in the LC. Note that since only one feedback signal might be needed for a power-gated module, at most one BLE in a power-gated module might be unusable. Timing analysis can be used to determine which LC is the last one to be turned ON/OFF in a module, and that LC can be used to send the feedback signal.

B. Coarse-Grained Power Gating

The area and power overheads associated with the architecture in Section III-A are due to the sleep transistors of the LC and the RCs, the power gating multiplexer of the LC, the 3:1 multiplexers that drive the gate of the sleep transistors, the AND gates required to implement proper power gating for the RCs, and the additional SRAM configuration memory cells.

Typically, when an application is mapped to an FPGA, blocks that are part of the same functional module are placed close to each other in order to minimize delay and wiring costs [19]. Thus, it is likely that a group of LCs and RCs that are spatially close to each other share the same power state. It is, therefore, feasible to support power gating at a coarser granularity level than what is described in Section III-A in order to reduce the area and power overheads of the power gating circuitry.

The concept of coarse-grained PGRs is presented here. Unlike the fine-grained architecture in Section III-A where each PGR is composed of only one tile, we propose a coarse-grained architecture, in which a PGR is composed of a number of tiles. Similar to the tile-level architecture in Section III-A, the SRAM configuration memory cells and FFs are powered on all the time.

Fig. 7 shows an example DCPG region (PGR) of size 2 × 2 tiles. Some details are omitted for clarity. The region's LCs and

internal RCs, within the large, dark box in the figure, share the same power switch; thus, their power state can be configured as one unit. The region's SBs and bordering RCs have their own power switches and their power states can be configured separately; however, their power switches can still be controlled using the region's control signal (labeled PG_CNTL in the figure) when they are configured as DC. The bordering RCs in the coarse-grained PGRs have the same structure and functionality described for the RCs in Fig. 5; they can be used to route the power control signal to a PGR.

Fig. 7. Example PGR of 2 × 2 tiles. Internal region's LCs and CBs share the power switch. PGR's SBs and bordering RCs have individual power switches.

Different PGR sizes can be realized in the same manner. For example, a 3 × 3 PGR consists of 3 × 3 tiles (this PGR has 12 bordering RCs). Larger PGRs make it more challenging for the CAD tools to group related blocks in the same PGR, resulting in smaller power savings. On the other hand, the area and power overheads in smaller PGRs are larger. In terms of application mapping, a small PGR size means that an application occupies a larger number of PGRs; more routing resources would be needed to route the power control signals, which may negatively impact routability and requires more always-ON routing resources.

C. Coarse-Grained Power-Gated Switch Blocks

In the previous sections, we described the proposed power gating architecture for LCs and RCs (track isolation buffers and CBs). This section focuses on describing the power gating circuitry for SBs in a PGR.

The example PGR in Fig. 7 has a size of 2 × 2 tiles. The power control signal that is used to control the power state of the LCs region (PG_CNTL) is also used to selectively control the power state of the individual SBs that belong to the same region. For each LC, the SB that belongs to the same region as the LC lies in the right-bottom corner of that LC.

Fig. 8 shows the power gating circuitry for an SB. This circuitry is similar to that for the other components, as described in the previous sections. Minimum-sized pull-down nMOS transistors are placed at the outputs of the SB to pull them to ground during the sleep mode to ensure proper output isolation. The gate input of the pull-down transistors is the same as the gate input to the power switch.

Fig. 8. Power gating circuit for an SB. SB outputs are pulled down to GND when the SB's power is OFF (in sleep mode).

This scheme enables different power modes for different components in a PGR. Table I shows the supported power modes. For example, if the internal part of a PGR (LCs and internal RCs) is configured as DC, there is flexibility in configuring the power state for the individual SBs and bordering RCs. This flexibility allows some SBs to be always-ON to route important signals, such as power control signals or signals that connect between different modules.

TABLE I. Configurable Power Modes Supported by the Different Components in a PGR

D. Fine-Grained Power-Gated Switch Blocks

The power-gated SB architecture in Section III-C enables configuring an SB's power state as one unit. However, our experiments for many application circuits showed that >50% of the SBs' switches are not utilized. Supporting finer granularity power gating for SBs, therefore, may result in a larger number of switches that can be turned OFF either statically or dynamically at run-time compared with the coarse-grained SB power gating. This would result in a significant reduction of the total leakage power consumption, since an SB consumes ∼70% of a tile's leakage power.

Fig. 9 shows how all switches in a specific SB side are grouped into one power gating partition to implement a finer granularity power gating. Partitions per side (PPS) is used as an architecture parameter to describe the power gating granularity of an SB. For example, PPS = 1 for the SB in Fig. 9. Increasing PPS results in finer granularity power gating. PPS = 0 represents an architecture where the power state for an SB is configured as one unit (coarse-grained power gating), while PPS = $N_{\mathrm{switch}}$ indicates the finest power gating granularity where the power state for each switch can

Fig. 9. One power gating partition per SB side (details for two sides Fig. 10. CAD flow for the DCPG FPGA architecture.
are shown). The power state for each partition can be configured separately.

CAD steps, i.e., placement and routing. We assume that higher


be controlled individually. Nswitch is the number of switches level tools will pass information in the netlist about the
that exist in an SB’s side. The cost of increasing PPS is the
blocks that belong to a power-gated module and the power
additional area and leakage power due to the additional power controller. We leave the discussion of higher level tools and
gating circuit components and the increase in the total effective
their optimization to enable power gating for future work.
sleep transistor size. In order to ensure correct operation for
SBs when PPS > 0, the incoming tracks buffers (Fig. 3) must
be always-ON, i.e., not power gated. A. Placement and Routing
Fig. 10 shows the proposed CAD flow. The inputs to
E. Inrush Current During Wakeup Phase VPR include the circuit netlist, the names of the power-gated
modules in the circuit, the power controller netlist, and the
When a power-gated module is turned ON, a large current
architecture parameters (PGR’s width and height, and PPS).
is drawn from the power grid lines in the chip to recharge the
The input netlists to the flow are generated using a CAD
internal nodes of the FPGA circuitry. This current is known as
flow that is typically used with VPR. This includes Odin II
inrush or wakeup current. If not handled appropriately, a large
for Verilog synthesis [23], ABC for technology mapping [24],
inrush current may cause malfunction of the design [20].
and T-VPACK [25] for packing LUTs and FFs into LCs.
The work in [21], [22] describes a configurable
In Step 1, the power controller is placed and its internal
architecture to solve the inrush current problem in FPGAs that
connections are routed. The FPGA resources that are used
support DCPG by staggering the turn ON phase of the PGRs
by the power controller are locked, and their power state is
in a power-gated module. The architecture in [21] and [22] can
set to always-ON. In Step 2, placement is performed for the
be used to solve the inrush current problem in the proposed
application circuit. In Step 3, the power state for each PGR is
architecture in this paper with small area and power overheads.
determined based on the blocks that occupy the different LCs
The architecture provides short turn ON times. For example,
turning ON a 1000 tiles module takes ∼10 clock cycles on a 300-MHz clock frequency, assuming 25 PGRs can be turned ON simultaneously and each PGR has a size of 4 × 4 tiles [22].

The inrush current handling architecture in [21] and [22] enables delaying the wakeup signal for each PGR using configurable and fixed delay elements. The timing for activating the isolation mechanism in our architecture (pull-down nMOS transistors) must be handled appropriately. When a PGR is turned OFF, isolation must be done before the rest of the PGR is powered down. On the other hand, when a PGR is turned ON, isolation must be deactivated after the PGR is powered up. To enable this, a 2:1 multiplexer can be used to drive the isolation activation signal (PLDN_CNTL). This multiplexer selects between the delayed or nondelayed power control signal. The select line can also be the power control signal.

IV. CAD FLOW FOR DCPG FPGA ARCHITECTURE

In this section, we present the CAD flow that is used to map applications to the proposed DCPG FPGA architecture. Our CAD flow is based on the VPR FPGA tool (version 5.0 [17]). The proposed flow focuses on low-level

in the PGR. The power state for a PGR is set as follows.
1) DC: If only one power-gated module is mapped to the PGR’s LCs.
2) Always-OFF: If all of the PGR’s LCs are empty.
3) Always-ON: All other cases.

1) Routing Power Control Signals: The net for each power control signal is built in this step. The net’s source is one of the outputs of the power controller that has already been placed. The sinks are found as follows. For each PGR that belongs to the power-gated module under consideration and is set to DC, a free input pin from its bordering RCs is selected (if one is available) to act as a sink for the control signal. Note that we cannot use predetermined sinks since the placement phase determines the number and locations of the PGRs in a power-gated module. In Step 4, the nets of the power control signals are routed. The SB partitions that are used to route these signals are set as always-ON to ensure that the power control signals are available all the time. Note that when selecting the sinks of the power control signals, we try to build a trunk-branch routing topology that minimizes the number of always-ON SB partitions as in [12].

2) Routing Circuit’s Signals: In Step 5, the connections in the circuit netlist are routed on the available FPGA resources.
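The PGR power-state rule listed above (DC, always-OFF, always-ON) can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the names `PowerState` and `pgr_power_state`, and the representation of a PGR as a list holding one module name per LC (`None` for an empty LC), are assumptions introduced here. The sketch also assumes that a PGR qualifies as DC only when every occupied LC belongs to the same power-gated module.

```python
from enum import Enum
from typing import Iterable, Optional, Set


class PowerState(Enum):
    DC = "DC"
    ALWAYS_OFF = "always-OFF"
    ALWAYS_ON = "always-ON"


def pgr_power_state(
    lc_modules: Iterable[Optional[str]],
    power_gated_modules: Set[str],
) -> PowerState:
    """Classify one PGR from the modules mapped to its LCs.

    lc_modules holds one entry per LC in the PGR: the name of the module
    placed there, or None if the LC is empty (hypothetical encoding).
    """
    used = {m for m in lc_modules if m is not None}
    if not used:
        # Rule 2: all of the PGR's LCs are empty.
        return PowerState.ALWAYS_OFF
    if len(used) == 1 and next(iter(used)) in power_gated_modules:
        # Rule 1: only one power-gated module is mapped to the PGR's LCs.
        return PowerState.DC
    # Rule 3: all other cases.
    return PowerState.ALWAYS_ON
```

For example, `pgr_power_state(["M1", None, "M1"], {"M1"})` yields the DC state, whereas a PGR shared by two modules is classified as always-ON.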
Although the power control signals are routed before the circuit’s nets, this has negligible effect on routability and performance of the circuit because only a small fraction of the routing resources are used to route the control signals. In order to verify this, we mapped the circuits described in Section V-A to the proposed architecture with a PGR size of 4 × 4 tiles and PPS = 0. We found that the minimum channel width has increased by 2.3% on average, with a maximum increase of 16% for one circuit.

Finally, in Step 6, we determine the power state for each of the SB partitions as follows.
1) DC: If PGR is DC and only one power-gated module is routed through the SB partition.
2) Always-OFF: SB partition is not used to route signals.
3) Always-ON: All other cases.

B. PG-Aware Routing

Due to the complexity of the routing topology of multiple-module power-gated circuits, some SB partitions are required to be always-ON. Fig. 11 shows an example of three power-gated modules mapped to three PGRs. Two possible ways to route the net from M1 to its sinks in M1 and M3 are shown (Route 1 and Route 2, denoted R1 and R2). In both ways, SBs in M2 and M3 are required to be always-ON to ensure proper operation. For example, when M2 is powered down, the SBs in M2’s PGR that route the net need to be powered.

Fig. 11. Three power-gated modules placed on three PGRs with two ways shown to route the same net. The SBs used to route the net in M2 and M3 must be always-ON (ON), other resources can be DC.

In Fig. 11, R2 has a smaller number of always-ON SBs (larger number of always-OFF and DC SBs) compared with R1, which improves the power savings during the idle periods. In this section, we present the enhancements made to the router in order to increase the number of SBs that can be powered down. We modified the timing-driven router in VPR to implement these enhancements.

VPR uses the pathfinder negotiated congestion-delay router [16]. The routing resources are represented by a routing-resource graph. In this graph, nodes represent wire segments and logic block pins, and edges represent switches. In the inner loop of the algorithm, when searching for a route from a source node to a sink node, nodes of the graph are visited and added with a cost value (path cost) to a priority queue. These nodes are used later to iteratively investigate other nodes connected to them until the sink is reached. The path cost to reach a node from the source of the net is the sum of the costs of nodes in that path. The function that is used to measure the cost of using a node has timing and congestion terms as in (1)

$$\mathrm{Cost}(n) = \mathrm{Crit}_{ij} \times T(n) + (1 - \mathrm{Crit}_{ij}) \times \mathrm{Congs}(n) \tag{1}$$

where $\mathrm{Crit}_{ij}$ is the timing criticality of the connection $(i, j)$, $T(n)$ is the timing cost of the route to reach node $n$ from the source, and $\mathrm{Congs}(n)$ is the congestion cost of using node $n$.

For the PG-aware router, we modified the congestion term of the cost function as

$$\mathrm{Congs}(n) = \mathrm{Congs}(n)_{\mathrm{old}} \times (1 + \mathrm{Cost}_{\mathrm{partition}}) \tag{2}$$

where $\mathrm{Congs}(n)_{\mathrm{old}}$ is the original congestion cost, and $\mathrm{Cost}_{\mathrm{partition}}$ is used to modify the cost of using a specific power-gated SB partition. The following function is used to calculate $\mathrm{Cost}_{\mathrm{partition}}$:

$$\mathrm{Cost}_{\mathrm{partition}} = \begin{cases} K, & \text{if } \delta_{\mathrm{PGR,net}} = 0 \\ -L, & \text{if } \delta_{\mathrm{PGR,net}} = 1 \end{cases} \tag{3}$$

where $K$ and $L$ are weighting parameters determined empirically, and $\delta_{\mathrm{PGR,net}}$ is a binary function that has the value of 1 if the connection being routed belongs to the same module as that of the node’s PGR, and 0 otherwise. Each wire segment (node) in FPGA architecture is driven by an SB switch; we consider a node to belong to a PGR if the switch driving it belongs to an SB in that PGR.

$\mathrm{Cost}_{\mathrm{partition}}$ is used to change the weight given to the congestion cost of the node being investigated. If the module of the node’s PGR is the same as that of the net being routed, then the congestion cost is decreased (scaled by $1 - L$) to encourage routing through the node. Routing nets that belong to the same module as that of the PGR through an SB partition in the PGR enables configuring the partition as DC. On the other hand, if the net does not belong to the same module as that of the node’s PGR, then the cost is increased (scaled by $1 + K$) to discourage routing the net through that node; if the net is routed through that node, then the SB must be set as always-ON.

We found that L = 0.2 and K = 3 give good results with <1% increase in the critical path delay on average (up to 4% for a circuit). Larger L may result in circuits that cannot be routed because the router will not be able to resolve congestion. Larger K may result in a large congestion cost, which may negatively impact the critical path delay.

V. BENCHMARK CIRCUITS

In this section, we describe the benchmark circuits used to evaluate the proposed architecture.

A. Synthetic Benchmarks Generation

We used the largest 20 Microelectronics Center of North Carolina benchmark circuits available with the VPR download [17] as subcircuits (or modules) in the generated synthetic circuits. Each of the generated circuits is composed of two or more modules (up to nine), connected to each other using the primary I/Os of the subcircuits. Table II shows the details of the generated circuits.

The modules in each circuit are connected together after performing the packing phase using T-VPack [25], i.e., after each circuit’s LUTs and FFs are grouped in LCs.
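As an illustration of the PG-aware node cost in (1)–(3) of Section IV-B, the following minimal sketch evaluates the cost for a single node. It is not taken from VPR: the function names and argument layout are assumptions made here, and only the formulas and the reported values L = 0.2 and K = 3 come from the text.

```python
def partition_cost(same_module: bool, k: float = 3.0, l: float = 0.2) -> float:
    """Eq. (3): -L if the net and the node's PGR share a module, else K."""
    return -l if same_module else k


def pg_aware_congestion(
    congestion_old: float, same_module: bool, k: float = 3.0, l: float = 0.2
) -> float:
    """Eq. (2): scale the original congestion cost by (1 + Cost_partition)."""
    return congestion_old * (1.0 + partition_cost(same_module, k, l))


def node_cost(
    crit: float,
    timing_cost: float,
    congestion_old: float,
    same_module: bool,
    k: float = 3.0,
    l: float = 0.2,
) -> float:
    """Eq. (1) with the PG-aware congestion term of Eq. (2) substituted in."""
    congestion = pg_aware_congestion(congestion_old, same_module, k, l)
    return crit * timing_cost + (1.0 - crit) * congestion
```

With these defaults, a node in the net’s own module has its congestion term scaled by 0.8, while a node in another module’s PGR has it scaled by 4, so a noncritical connection (small criticality) is steered most strongly toward its own module’s SB partitions.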
This guarantees that the subcircuits used in stitching closely represent independent functional modules in an application.

TABLE II
GENERATED SYNTHETIC CIRCUITS

We assume that the power state for each module in the circuits of Table II can be DC. Thus, a power controller that has an output power control signal for each module is generated for each circuit. Timers are used to generate the power control signals in a power controller. A timer is sized assuming the sleep signal of a module is asserted after 50 ms. This amount has been chosen arbitrarily. Longer timer periods may increase the number of resources required to implement the controller circuit on an FPGA. Notice that more sophisticated power controllers can be implemented. However, the goal of this paper is to use the power controller circuits to evaluate the number of resources (especially routing resources) that are occupied by power control signals. This provides an estimate of the FPGA resources that will be in the different power states, and hence the potential power savings using the DCPG FPGA architecture.

B. Robot Control System

The application presented in this section is used to evaluate the proposed architecture. This application represents a control system for a snake-shaped robot, called iSnake, that is used in endoscopy [26]. The left side of Fig. 12 shows the robot inside an organ, in two different states.

Fig. 12. Snake robot example application.

The robot’s control system provides haptic feedback to the surgeon to prevent harming the patient’s organs during an operation. A proximity query (PQ) algorithm is used to approximate the distance between the iSnake and the surface of the patient’s organ [26], which is computationally intensive and requires a high-performance implementation.

We developed an FPGA-based implementation of the control system. The right side of Fig. 12 shows the main modules in the system. The datapath performs stream processing for input data. The Delta module is only activated when the robot touches the organ’s surface. This module can be put in sleep mode when its output is not required. The select output from the Condition module can be used as a power control signal for the Delta module.

The FPlibrary [27] was used to implement the required floating point operators in the PQ algorithm. Quartus II was used to generate a technology-mapped netlist of the circuit [28]. The netlist was then annotated with information about the modules of each circuit component. A modified version of T-VPack was then used to pack the circuit; this version ensures that each module’s LUTs and FFs are clustered in the same LCs, thus generating a netlist that contains three interconnected modules.

Table III shows information about the robot control system. The size of the Delta module is ∼25% of the system, indicating that properly managing its power state may result in large energy reduction. This is investigated in Section VI.

TABLE III
INFORMATION ABOUT FPGA-BASED iSnake ROBOT CONTROL SYSTEM

VI. EXPERIMENTAL RESULTS AND DISCUSSION

A. Experimental Setup

Unless otherwise indicated, the following FPGA architecture parameters are used: 1) LUT size K = 4; 2) LC size N = 6; 3) inputs per cluster I = 16; 4) RC width W = 90; 5) routing segment length L = 4; 6) switch box flexibility Fs = 3; 7) input pins CB flexibility Fc,in = 0.2; and 8) output pins CB flexibility Fc,out = 0.1.

We used HSPICE simulations to obtain the leakage power of the PGRs. We used the number of minimum-width transistors as in [16] to estimate the area. We used the 45-nm HP technology from the predictive technology models website [29], with supply voltage VDD = 1 V and temperature T = 85 °C to measure the worst case power and timing.

For the power-gated architectures, the threshold voltage of sleep transistors has been increased by 100 mV by changing the Vth0 parameter in the technology files. Sleep transistors have been iteratively sized to constrain the performance degradation to 10% compared with an architecture that does not support power gating. We assume 20% activity in doing this.
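The iterative sleep transistor sizing described above can be pictured as a simple search loop. The sketch below is only an illustration under stated assumptions: the text does not give the sizing procedure or a delay model, so the linear search, the function name `size_sleep_transistor`, and the caller-supplied `gated_delay` callback (which in practice would wrap a circuit simulation at the assumed 20% activity) are all hypothetical.

```python
from typing import Callable


def size_sleep_transistor(
    gated_delay: Callable[[float], float],
    ungated_delay: float,
    max_degradation: float = 0.10,
    start_width: float = 1.0,
    step: float = 1.0,
    max_width: float = 10_000.0,
) -> float:
    """Return the first width whose delay meets the degradation target.

    gated_delay(width) is a hypothetical callback returning the delay of
    the power-gated circuit for a given sleep transistor width.
    """
    limit = ungated_delay * (1.0 + max_degradation)
    width = start_width
    while width <= max_width:
        if gated_delay(width) <= limit:
            return width
        width += step  # widen the sleep transistor and try again
    raise ValueError("no width up to max_width meets the delay target")
```

A toy usage, with a made-up delay model in which the penalty shrinks as the transistor widens, would be `size_sleep_transistor(lambda w: 1.0 + 2.0 / w, 1.0)`; the model is for demonstration only and carries no data from the paper.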
We assume that SRAM cells are built using six minimum-sized transistors. All multiplexers used in our architecture are based on pass transistors; each multiplexer is followed by a level restorer [30] and a buffer.

The SPICE netlists for LCs have been generated as follows. The size of the last inverter in a buffer is found by dividing the number of equivalent min-sized load inverters by four, and internal stages are sized by a stage ratio of four. Roughly, this sizing results in minimum delay [31]. LUTs are built using transmission gates as in [32], with inverters inserted after the second and last stages to reduce the delay of series-connected transmission gates.

The SPICE netlists for SBs have been generated as follows. Unless otherwise indicated, we assume an RC width (W) that is 20% larger than the minimum channel width required to route a circuit [16]. We used a unidirectional, single-driver routing architecture [33]. The output buffers of SBs are built using multiple stages of inverters. The stages are sized using a stage ratio of four. The capacitance of wire segments was obtained using the model in [34]. The outputs of the LCs connect directly to the SBs through isolation buffers without the need for output pins connection blocks; this is similar to the architecture assumptions made in VPR 5.0 [17].

B. Architecture Parameters Sweep

In this section, we study the area and power of the proposed architecture for different architecture parameters. Note that this is done without mapping applications to the architecture. The following list defines three architectures that are evaluated in our experiments.
1) Ungated: This is the baseline FPGA architecture that does not support power gating.
2) Static Gating: This is an architecture that supports statically controlled power gating, such as the one presented in [4]. The power state for this architecture can only be set at configuration time.
3) Dynamic Gating: This is the DCPG architecture that we proposed in Section III.

1) Power-Gated Tiles: We first study a basic architecture that has only one tile (without the SB); we vary two parameters, the cluster size (N) and the width of the RCs (W). When varying N, we also vary the number of input pins (I) of the LC. When varying N, we set W = 90, and when varying W we set N = 6 and I = 16.

Figs. 13 and 14 show the effect of the cluster size (N) and RC width (W) on power gating. The results shown in the figures are for a tile that supports power gating, not including the SB, compared with a tile that does not support power gating (ungated).

Fig. 13. Results for sweeping LC’s cluster size (N). Switch blocks are not included in the results. (a) Area overhead. (b) Leakage power. (c) Leakage power reduction.

Fig. 14. Results for sweeping RC width (W). Switch blocks are not included in the results. (a) Area overhead. (b) Leakage power. (c) Leakage power reduction.

The area overhead decreases as the cluster size and the channel width increase [Figs. 13(a) and 14(a)]; this is because a larger number of circuit components are powered through a single sleep transistor. However, there is no high correlation between the area overhead and the channel width. The area overhead for the static-gated architecture is lower than that for the proposed dynamic-gated architecture. This is because of the additional circuit components that are required to support dynamic power state control.

The leakage power reduction increases as the cluster size and the channel width increase [Figs. 13(c) and 14(c)]. This is similar to the area overhead trend. Smaller area overhead results in lower leakage power overhead due to the power gating circuitry. The results show that the leakage power reduction in the OFF-state (compared with an ungated architecture)
can be up to ∼91% for a cluster size of 10 (channel width of 160). Figs. 13(b) and 14(b) show that the proposed architecture has slightly larger leakage power in the OFF-state than the static-gated architecture. This is because the dynamic-gated architecture requires more circuit components to support controlling its power state dynamically.

2) Power-Gated Regions: The results of sweeping the granularity of the proposed architecture are shown in Fig. 15. As the region size increases, the area overhead decreases. The area overhead includes that of the sleep transistors and the circuit components required to support configuring the different power states of a PGR. The area overhead decreases as the region size increases because more circuit components are powered through a single sleep transistor, and the circuit components required to support the different power states of a region are shared among a larger number of circuit components. The area overhead is as small as 1% for a PGR of 4 × 4 tiles.

Fig. 15. Results for sweeping granularity of PGRs. Switch blocks are not included in the results. (a) Area overhead. (b) Leakage power. (c) Leakage power reduction.

The leakage power reduction increases as the region size increases [Fig. 15(c)]. This is because larger regions have smaller area overhead, which results in smaller leakage power overhead due to the power gating circuitry. The OFF-state leakage power of the power-gated architectures is much lower than that for the ungated architecture, leading to a leakage power reduction of >90% (∼95% for a PGR of 4 × 4 tiles). Increasing the region size by more than 4 × 4 tiles does not significantly increase the leakage power savings. As can be observed in Fig. 15(c), the leakage power reduction in a static-gated architecture is slightly larger than that in the proposed dynamic-gated architecture. This is because of the additional circuit components that are required in the proposed architecture to support changing its power state at runtime.

3) Power-Gated Switch Blocks: In this section, we vary the architecture parameters that describe the SBs. Results are shown for SB architectures that have different segment lengths, L = 2 in Fig. 16 and L = 4 in Fig. 17, compared with an architecture that does not support power gating. In addition to varying the RC width (W), we also vary the power gating granularity of SBs by varying the number of PPS (larger PPS means finer granularity). In a fine-grained SB power gating, each output buffer in an SB has a power gating circuit, whereas in a coarse-grained power gating a single power gating circuit is used for all circuit components in an SB.

Fig. 16. Results for SBs power gating granularity by sweeping RC width (W). Segment length (L) = 2. (a) Area overhead. (b) ON-leakage power. (c) OFF-leakage power. (d) Leakage power reduction.

Figs. 16 and 17 show that the area overhead and leakage power when L = 4 are lower than those for L = 2. This is because an FPGA routing architecture that has shorter segments contains more circuit components in SBs [33].

The area overhead [Figs. 16(a) and 17(a)] of the power gating circuitry decreases as the channel width increases. This is because as we increase W the sleep transistor size increases at a lower rate than the increase in the number of circuit components that are powered through it.

The power gating granularity (PPS) also affects the area overhead. Fine-grained power gating results in large area overhead (>60% for L = 4 and ∼100% for L = 2). For large values of W, the area overhead for granularities down to PPS = 4 ranges between 10% and 16%. This area overhead, however, is only for SBs. The overall area overhead for the power gating architecture is lower. Recall from Section VI-A that the area overhead (without SBs) for a PGR of size 4 × 4 tiles is ∼1%. The overall area overhead for the same PGR with SBs ranges between 4.7% and 10.3% for PPS from 0 to 4 and W = 90.

Figs. 16(b) and 17(b) show the ON-leakage power for the gated architecture (for different PPS) and the ungated architecture. The ON-leakage increases as PPS increases; this is because finer granularity power gating requires more circuit components and larger sleep transistors. The ON-leakage overhead for W = 100 and PPS between 0 and 4, for example, is ∼6%–10% for L = 2 and 3.5%–7.7% for L = 4. As we
will see later, finer granularity SBs power gating increases the number of always-OFF resources, resulting in lower total ON-leakage power for an application circuit.

Fig. 17. Results for SBs power gating granularity by sweeping RC width (W). Segment length (L) = 4. (a) Area overhead. (b) ON-leakage power. (c) OFF-leakage power. (d) Leakage power reduction.

Figs. 16(c) and 17(c) show the leakage power for the power-gated SBs in the OFF-state, i.e., static power when SBs are powered down, and Figs. 16(d) and 17(d) show the leakage power reduction compared with SBs with no power gating. The OFF-leakage power for SBs with PPS = 0 is the smallest because all of the SB circuit components are included in the power-gated circuit. SBs with larger PPS incur larger leakage power overhead in the OFF-state because of the additional power gating circuit components, and because all the buffers for incoming wire tracks are designed to be powered during the OFF-state (Section III-D). The leakage power reduction for the proposed architecture is >95% for the coarse-grained power-gated SBs, and could reach >90% for PPS between 1 and 4. For fine-grained power gating, the leakage power reduction is ∼70%.

4) Total Area Overhead: Table IV shows the total area overhead for different architecture granularities of the power gating architecture. The area overhead ranges between 3.9% and 34.8%. The area overhead results in longer routing wire segments. This leads to larger interconnect capacitance, and potentially larger dynamic power. For example, using PPS = 3 would result in area overhead between 9.2% and 10.7%. This translates to 4.5%–5.2% increase in each dimension of a tile, assuming square tiles [35]. This represents a loose upper bound on the increase in interconnect capacitance [35]. In this paper, we do not evaluate the effect of the area overhead on the dynamic power and how this could affect the savings achieved by the proposed architecture; we leave this for future work.

TABLE IV
TOTAL AREA OVERHEAD FOR THE POWER GATING ARCHITECTURE FOR DIFFERENT ARCHITECTURE GRANULARITIES (W = 80 AND L = 4)

C. Benchmark Circuits Results

In this section, we use the CAD flow in Section IV to place and route the synthetic benchmark circuits presented in Section V. This is done to study the granularity of SBs power gating in the proposed architecture. For each circuit, a power controller has been generated as described in Section V-A to provide a power control signal for each module in the circuit. Each circuit has been placed and routed on an architecture with a PGR’s size of 4 × 4 tiles, segment length of 4, and RC width that is 20% larger than the minimum channel width required to route the circuit. Multiple architectures with different SB power gating granularities (different PPS) have been used. The results are shown for the original VPR’s routing algorithm, and the enhanced power gating-aware (PG-aware) algorithm that is described in Section IV-B.

1) Breakdown of SBs’ Switches Power States: In this section, we report the percentages of SB switches that can be configured in the different power states.

Fig. 18. SBs switches power state averaged over all synthetic benchmarks using the unmodified and power gating-aware routing. (a) Always-ON. (b) Always-OFF. (c) DC. (d) Always-OFF + DC.

Fig. 18(a) shows the percentage of always-ON switches for different SB power gating granularities. For finer granularity power gating, the number of always-ON SB
switches decreases. This is expected since more SB switches can statically be turned OFF because the SB partitions they belong to are not used to route signals. Furthermore, with finer granularity, there is a better chance that an SB partition is only used to route signals that belong to a single module, which increases the number of DC SB switches.

Using the PG-aware routing algorithm results in slightly fewer always-ON switches, but this improvement diminishes as PPS increases; finer granularity power gating (larger PPS) results in a larger number of components that can be statically turned OFF. Therefore, the improvement space available for the PG-aware router becomes tighter. The results show that the PG-aware router reduces the number of always-ON switches by 3% for coarse-grained SB power gating (PPS = 0), and ∼1% for PPS = 4 of the total number of switches. This corresponds to a reduction of 8% and 14% of the always-ON switches for PPS = 0 and PPS = 4, respectively.

Fig. 18(b) shows the percentage of always-OFF switches for different SB power gating granularities. As expected, finer granularity power gating increases the always-OFF switches. The PG-aware routing has a negligible effect on the number of always-OFF switches.

Fig. 18(c) shows the percentage of DC switches for different SB power gating granularities. These are switches that can be powered OFF at run time when the module they belong to becomes idle. The largest number of DC switches is when PPS = 1. Finer SB power gating granularity reduces the number of DC SB switches; this is because more switches can be set as always-OFF at configuration time, which reduces the total number of remaining switches. The PG-aware routing helps in slightly improving the percentage of DC switches. This improvement is roughly the same as the amount of reduction in the always-ON switches shown in Fig. 18(a).

Finally, Fig. 18(d) shows the sum of the always-OFF and DC switches. As expected, SBs with finer granularity power gating result in a larger number of switches that can be powered down. The PG-aware routing shows slight improvements compared with the original VPR router; these improvements diminish as the power gating granularity becomes finer (larger PPS), as explained above.

2) Leakage Power: In this section, we report the leakage power averaged over all benchmark circuits. We report the ON-leakage power, which is the leakage power assuming all modules in a circuit are idle but not powered OFF, and the OFF-leakage power, which is the leakage power for the circuits assuming that all modules in a circuit are idle and their DC components are turned OFF. In both cases, the always-OFF components are assumed to be turned OFF at configuration time.

Fig. 19. Leakage power results for (a) SBs only and (b) and (c) SBs and PGRs averaged over all synthetic benchmark circuits.

Fig. 19(a) shows the leakage power for the SBs for different SB power gating granularities, and the leakage power for the SBs when an ungated architecture is used. As can be seen, the ON-leakage power for both the PG-aware routing and the original routing are roughly equal because the algorithm enhancements do not significantly improve the number of always-OFF switches as explained earlier. For finer granularity SBs power gating, both the ON and OFF-leakage powers decrease. The ON-leakage power decreases since finer granularity results in more unused SB partitions that can be turned OFF at configuration time, thus the overall leakage power in the ON-state for an application circuit goes down. The OFF-leakage power also decreases with finer granularity power gating because more SB switches can be placed in the DC state. The minimum OFF-leakage power is when PPS = 3. Larger PPS values result in more leakage power consumption in the OFF state because of the large overhead of the power gating circuitry.

Fig. 19(a) also shows that the PG-aware routing slightly improves the OFF-leakage power compared with the unmodified routing algorithm. This is because PG-aware routing results in more DC switches that can be turned OFF when an application circuit is idle, as shown in Fig. 18(c).

Fig. 19(a) shows that ON-leakage power with the gated architecture is larger than that with the ungated architecture for the coarse-grained SBs power gating. However, for finer granularity power gating, the ON-leakage power for the gated architecture is lower than that for the ungated architecture. Although a single SB with finer granularity power gating has larger ON-leakage power than an ungated SB, finer granularity power gating enables turning OFF unused SB switches, which results in lower overall ON-leakage power.

The total leakage power is shown in Fig. 19(b). This includes the leakage power for both SBs and PGRs. The same trends discussed above apply here because a significant portion of the leakage power comes from SBs. Comparing Fig. 19(a) and (b) shows that SBs contribute to roughly 72% of leakage power in the ungated architecture. However, in the gated architecture, SBs contribute to roughly 67%–72% of the total leakage power.
Fig. 21. Energy reduction for iSnake’s control system in DCPG FPGA.
Fig. 20. Total leakage power reduction for individual synthetic circuits.

Fig. 19(c) shows the upper limit of reduction in the CAD flow that can be used to map applications to the proposed
OFF -leakage power compared with the ungated architecture. architecture, and enhancements to a routing algorithm in order
As expected, the leakage power reduction increases with finer to optimize the power savings of the architecture.
granularity SBs power gating, and the largest reduction is The area overhead of the architecture is ∼10.3% when the
achieved when PPS = 3. For coarse-grained SB power gating, power gating region size is 4×4 tiles and the number of power-
the OFF-leakage power reduction is ∼68%. For PPS = 3, the gated SB PPS is four (W = 90). The potential leakage power
reduction is ∼77%. savings for the studied benchmark circuits are up to 83%.
Fig. 20 shows the individual circuits’ potential power reduc- We also studied the energy savings in a control system for
tion when all modules are turned OFF (PPS = 3). The power a robot that is used in medical applications. Assuming only
reduction ranges between 67% and 81% using the original 25% of the system can be powered down when idle, and it is
routing algorithm, and ranges between 68% and 83% using idle for 95% of the time, we found that ∼8% energy saving
the PG-aware routing. can be achieved by the proposed architecture when compared
with an implementation with only clock gating.
This research provides the basis for a new generation of
D. Example Application Results FPGAs, which are capable of self-optimization. Future work
In this section, we show the energy saving results of using includes automating the process of identifying application
the proposed power gating architecture for the iSnake robot control system described in Section V-B.

Fig. 21 shows the energy savings for the circuit for different active times of the Delta module. We compared mapping the application on the proposed power gating architecture with two baseline implementations. The first is a system that has no power optimizations at all. Normally, one would implement clock gating for modules that experience inactivity periods. We assume that the first baseline does not include this. The second baseline is an implementation that supports clock gating for the Delta module.

At 5% activity level, the results show that compared with the first baseline (no clock gating), the proposed architecture coupled with clock gating could achieve ∼19% energy savings. Compared with the second baseline (includes clock gating), the proposed architecture could achieve ∼8% additional energy saving. Given that only a small portion of the application benefited from the power gating architecture (25% of the circuit), the results are promising.

VII. CONCLUSION

We present an FPGA architecture that supports dynamic power gating. This architecture enables powering down modules in an FPGA when they are idle to reduce their static power dissipation. The architecture’s flexibility enables the user to implement an arbitrary number and structure of power-gated modules, and enables routing power control signals on the general-purpose routing fabric of an FPGA. We also present a

modules that can benefit from the proposed architecture. This is suitable for designs that use accelerators with components that operate for only a small fraction of time. Furthermore, enhancements to the CAD tools are required in order to better guide the different stages in the flow to increase idle times and increase the number of resources that can be powered down, while reducing the impact on performance and area.
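The comparison against the two baselines reported for the Delta module can be illustrated with a minimal sketch. It assumes a simple model in which a module burns dynamic plus leakage power while active, optional clock power plus leakage while idle, and only a residual fraction of its leakage when power gated. Every name and number below (`ModulePower`, `compare_baselines`, the milliwatt values) is hypothetical and is not taken from the paper, and wake-up energy and control overhead are ignored, so the output is not expected to reproduce the reported ∼19% and ∼8% figures.

```python
from dataclasses import dataclass


@dataclass
class ModulePower:
    """Hypothetical per-module power figures in milliwatts (not from the paper)."""
    active_dynamic_mw: float       # dynamic power while the module is computing
    idle_clock_mw: float           # clock power burned while idle if the clock is not gated
    leakage_mw: float              # static (leakage) power while the module is powered on
    gated_leakage_fraction: float  # fraction of leakage that remains when power gated


def module_energy(p, activity, period_s, clock_gated, power_gated):
    """Energy in millijoules used by one module over period_s seconds."""
    active_t = activity * period_s
    idle_t = period_s - active_t
    # While active, the module always pays dynamic power plus full leakage.
    energy = active_t * (p.active_dynamic_mw + p.leakage_mw)
    # While idle, clock gating removes clock power; power gating cuts leakage.
    idle_dynamic = 0.0 if clock_gated else p.idle_clock_mw
    idle_leak = p.leakage_mw * (p.gated_leakage_fraction if power_gated else 1.0)
    return energy + idle_t * (idle_dynamic + idle_leak)


def compare_baselines(activity, period_s=1.0):
    """Return (savings vs. no optimization, savings vs. clock gating only)."""
    # Assumed values: one gateable module plus an always-on remainder of the design.
    gated = ModulePower(active_dynamic_mw=30.0, idle_clock_mw=10.0,
                        leakage_mw=20.0, gated_leakage_fraction=0.1)
    always_on_mj = 150.0 * period_s  # remainder of the circuit, never gated

    no_opt = always_on_mj + module_energy(gated, activity, period_s, False, False)
    clk_only = always_on_mj + module_energy(gated, activity, period_s, True, False)
    clk_and_pg = always_on_mj + module_energy(gated, activity, period_s, True, True)

    return 1.0 - clk_and_pg / no_opt, 1.0 - clk_and_pg / clk_only


if __name__ == "__main__":
    for a in (0.05, 0.25, 0.5, 0.9):
        vs_none, vs_clk = compare_baselines(a)
        print(f"activity {a:4.0%}: {vs_none:6.1%} vs. no optimization, "
              f"{vs_clk:6.1%} vs. clock gating only")
```

Under these assumptions the savings shrink as the activity level rises, and the always-on remainder of the design bounds the achievable total. This matches the qualitative observation that the gains are limited when only part of the circuit can be power gated.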

Assem A. M. Bsoul (S’07) received the B.Sc. degree in computer engineering from the Jordan University of Science and Technology, Irbid, Jordan, in 2006, the M.Sc. degree in electrical engineering from Queen’s University, Kingston, ON, Canada, in 2009, and the Ph.D. degree in electrical and computer engineering from the University of British Columbia, Vancouver, BC, Canada, in 2014.
He is currently a Post-Doctoral Fellow with the University of British Columbia. His current research interests include low-power reconfigurable architectures and computer-aided design algorithms.

Steven J. E. Wilton (S’86–M’97–SM’03) received the M.A.Sc. and Ph.D. degrees in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada, in 1992 and 1997, respectively.
He was a co-founder of Veridae Systems, Inc., Vancouver, BC, Canada, acquired by Tektronix, Beaverton, OR, USA, in 2011, which developed debug solutions for application-specific integrated circuits, field-programmable gate arrays (FPGAs), and FPGA-based systems. He joined the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, in 1997, where he is currently a Professor and an Associate Head. His current research interests include the architectures of next-generation FPGAs and their associated computer-aided design tools.
Dr. Wilton served as the Program and General Chair of the ACM International Symposium on FPGAs, from 2005 to 2006, respectively, and the Program Co-Chair of the 2005 International Conference on Field Programmable Logic and Applications and the 2008 IEEE International Conference on Application-Specific Systems, Architectures and Processors. He was a recipient of best paper awards at the International Conference on Field-Programmable Technology in 2003, 2005, 2007, and 2013, respectively, and the International Conference on Field-Programmable Logic and Applications in 2001, 2004, 2007, and 2008, respectively. He is currently the Editor-in-Chief of the ACM Transactions on Reconfigurable Technology and Systems.

Kuen Hung Tsoi received the Ph.D. degree from the Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, in 2007.
He has been a Post-Doctoral Research Associate with the Custom Computing Group, Department of Computing, Imperial College London, London, U.K., since 2008. He is currently with Imagination Technologies Ltd., Kings Langley, U.K.

Wayne Luk (F’09) received the M.A., M.Sc., and D.Phil. degrees in engineering and computing science from the University of Oxford, Oxford, U.K.
He was a Visiting Professor with Stanford University, Stanford, CA, USA. He is currently a Professor of Computer Engineering with Imperial College London, London, U.K. His current research interests include reconfigurable computing, field programmable technology, and design automation.
Prof. Luk is a fellow of the Royal Academy of Engineering. He received the Research Excellence Award from Imperial College London for reconfigurable supercomputing, and over 15 awards for his work from international conferences, such as the Applied Reconfigurable Computing Conference, the Application-Specific Systems, Architectures and Processors Conference, the Field-Programmable Custom Computing Machines Conference, the Field Programmable Logic and Applications Conference, and the Field-Programmable Technology Conference. He was the Founding Editor-in-Chief of the ACM Transactions on Reconfigurable Technology and Systems.
