Imp - Powergating Fpga 2
controlled power gating, since once configured, the state of each part of the chip (ON or OFF) does not change. Statically controlled power gating is effective for FPGAs, since if the design does not fill an entire FPGA, the remainder of the FPGA can be safely turned OFF, saving leakage power. However, if only a small number of resources in an FPGA are not used, the savings from this technique may be limited.

In this paper, we propose dynamically controlled power gating (DCPG) in an FPGA. In our architecture, the power switches can be turned ON and OFF at run-time under the control of other circuitry either running on the FPGA itself, or external to the FPGA. The signals to control the power switches are connected to the general-purpose routing fabric of the FPGA.

This paper is based on [12] and [13]. The work in [12] focuses on power gating for logic resources in FPGAs, and the work in [13] focuses on power gating for coarse-grained routing resources. Our main additional contributions in this paper are as follows.

1) We propose fine-grained power gating for routing resources. This allows powering down a larger number of routing resources at configuration time, and enables dynamic power state control for a larger number of routing resources at run-time.
2) We present a CAD flow that can be used to map the application circuits that contain power-gated modules to the proposed architecture. In this flow, power control signals are connected to the different power-gated resources to control their power state at run-time using the existing general-purpose routing fabric of an FPGA.
3) We propose enhancements to an FPGA routing algorithm that try to minimize the number of routing resources that cannot be powered down at run-time.
4) The presented CAD flow is used to evaluate the best granularity of routing resources power gating.
5) We evaluate a robot control system used in medical applications using the power gating architecture proposed in this paper, and we study its power savings for different operation activities.

We evaluate the proposed architecture in terms of its area overhead and the amount of leakage power reduction that it can achieve by varying the basic FPGA architecture parameters, and by studying different architecture granularity levels. We also use the proposed CAD flow to evaluate the potential power savings in a set of synthetic benchmark circuits, in addition to the robot control system mentioned above.

This paper is organized as follows. Section II provides an overview of related work, and describes the FPGA architecture model adopted in this paper. Section III describes the proposed DCPG FPGA architecture. Section IV describes the proposed CAD flow and the enhancements to the routing algorithm to maximize the number of resources that can be turned OFF. In Section V, we describe the different benchmark circuits used to evaluate the proposed architecture. Finally, in Section VI, we experimentally evaluate the proposed architecture.

II. BACKGROUND

A. Related Work

Lin et al. [9] studied fine-grained power gating for FPGAs to turn OFF unused resources at configuration time; their study showed that the area overhead could be >100%, which is undesirable because of the associated degradation in power and timing, and the increase in cost.

Gayasen et al. [10] proposed coarse-grained power gating using a power switch for a region of logic blocks. The use of dynamic reconfiguration was suggested to change the power state for the different regions in an FPGA based on their activity. However, this incurs power overhead and can only be applied at a very coarse granularity.

Tuan et al. [4] proposed power gating for an architecture similar to the Xilinx Spartan-3. Their architecture supports sleep mode using a sleep signal from an off-chip controller that is connected to all power switches in the FPGA; this scheme allows creating one controllable power domain only.

Bharadwaj et al. [11] proposed synthesizing a power state controller from the data flow graph of an application; this controller could exploit the idleness periods of the application to reduce the dissipated leakage energy in an FPGA. They used the same architecture as in [10].

Li et al. [14] proposed using a power control hard macro that is associated with each tile in an FPGA to control its power state (clock and power gating). They assume a power gating architecture similar to that in [12].

Hoo et al. [15] proposed fine-grained power gating for switch blocks (SBs) and a routing algorithm to optimize the power savings. The proposed architecture, however, only supports powering down unused switches at configuration time.

Dynamic partial reconfiguration is also reported to reduce the static power at run-time by enabling time sharing of FPGA resources [6]. However, swapping reconfigurable modules happens at the scale of milliseconds, which may result in high power overhead. In contrast, the proposed architecture enables changing the power state at the scale of nanoseconds.

B. Architecture Framework

In this paper, we assume a tile-based FPGA architecture [16]. An FPGA is composed of an array of tiles; each tile is composed of a logic cluster (LC) and the associated routing resources [two routing channels (RCs) and an SB], as shown in Fig. 2. An LC is composed of a number of basic logic elements (BLEs); each BLE is composed of a lookup table (LUT), a flip-flop (FF), and a multiplexer to select between the combinational and the registered output. A local switch matrix in the LC is typically included to support routing intracluster connections. Fig. 2 shows an LC composed of N BLEs.

Each LC is surrounded by RCs on its four sides. The intersection of two RCs forms an SB that can be configured to route the signals to the different directions. Fig. 3 shows examples of the connections for switches in an SB.

A connection from an RC that borders an LC to one of its input pins can be made through configurable switches,
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
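The tile hierarchy described in the architecture framework (an FPGA as an array of tiles, each tile holding one LC, two RCs, and one SB, and each LC holding N BLEs) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class and function names (`BLE`, `LogicCluster`, `Tile`, `build_fpga`, `count_luts`) are hypothetical and do not come from the paper or from any CAD tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BLE:
    """Basic logic element: one K-input LUT, one FF, and an output mux."""
    lut_inputs: int           # K, the LUT size
    registered: bool = False  # True: the mux selects the FF output


@dataclass
class LogicCluster:
    """Logic cluster (LC) holding N BLEs."""
    bles: List[BLE]


@dataclass
class Tile:
    """One tile: an LC plus its routing (two RCs and one SB)."""
    lc: LogicCluster
    routing_channels: int = 2
    switch_blocks: int = 1


def build_fpga(rows: int, cols: int, n: int, k: int) -> List[List[Tile]]:
    """Build a rows x cols array of tiles, each with n BLEs of k-input LUTs."""
    return [
        [Tile(LogicCluster([BLE(k) for _ in range(n)])) for _ in range(cols)]
        for _ in range(rows)
    ]


def count_luts(fpga: List[List[Tile]]) -> int:
    """Total number of LUTs (one per BLE) in the array."""
    return sum(len(tile.lc.bles) for row in fpga for tile in row)
```

For example, `build_fpga(rows=2, cols=3, n=6, k=4)` yields six tiles with six 4-input LUTs each.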
Fig. 7. Example PGR of 2 × 2 tiles. Internal region’s LCs and CBs share the power switch. PGR’s SBs and bordering RCs have individual power switches.
Fig. 8. Power gating circuit for an SB. SB outputs are pulled down to GND when the SB’s power is OFF (in sleep mode).
TABLE I
CONFIGURABLE POWER MODES SUPPORTED BY THE DIFFERENT COMPONENTS IN A PGR

internal RCs, within the large, dark box in the figure, share the same power switch; thus, their power state can be configured as one unit. The region’s SBs and bordering RCs have their own power switches and their power states can be configured separately; however, their power switches can still be controlled using the region’s control signal (labeled PG_CNTL in the figure) when they are configured as DC. The bordering RCs in the coarse-grained PGRs have the same structure and functionality described for the RCs in Fig. 5; they can be used to route the power control signal to a PGR.

Different PGR sizes can be realized in the same manner. For example, a 3 × 3 PGR consists of 3 × 3 tiles (this PGR has 12 bordering RCs). Larger PGRs make it more challenging for the CAD tools to group related blocks in the same PGR, resulting in smaller power savings. On the other hand, the area and power overheads in smaller PGRs are larger. In terms of application mapping, a small PGR size means that an application occupies a larger number of PGRs; more routing resources would be needed to route the power control signals, which may negatively impact routability and requires more always-ON routing resources.

C. Coarse-Grained Power-Gated Switch Blocks

In the previous sections, we described the proposed power gating architecture for LCs and RCs (track isolation buffers and CBs). This section focuses on describing the power gating circuitry for SBs in a PGR.

The example PGR in Fig. 7 has a size of 2 × 2 tiles. The power control signal that is used to control the power state of the LCs region (PG_CNTL) is also used to selectively control the power state of the individual SBs that belong to the same region. For each LC, the SB that belongs to the same region as the LC lies in the bottom-right corner of that LC.

Fig. 8 shows the power gating circuitry for an SB. This circuitry is similar to that for the other components, as described in the previous sections. Minimum-sized pull-down nMOS transistors are placed at the outputs of the SB to pull them to ground during the sleep mode to ensure proper output isolation. The gate input of the pull-down transistors is the same as the gate input to the power switch.

This scheme enables different power modes for different components in a PGR. Table I shows the supported power modes. For example, if the internal part of a PGR (LCs and internal RCs) is configured as DC, there is flexibility in configuring the power state for the individual SBs and bordering RCs. This flexibility allows some SBs to be always-ON to route important signals, such as power control signals or signals that connect between different modules.

D. Fine-Grained Power-Gated Switch Blocks

The power-gated SB architecture in Section III-C enables configuring an SB’s power state as one unit. However, our experiments for many application circuits showed that >50% of the SBs’ switches are not utilized. Supporting finer granularity power gating for SBs, therefore, may result in a larger number of switches that can be turned OFF either statically or dynamically at run-time compared with the coarse-grained SB power gating. This would result in a significant reduction of the total leakage power consumption, since an SB consumes ∼70% of a tile’s leakage power.

Fig. 9 shows how all switches in a specific SB side are grouped into one power gating partition to implement a finer granularity power gating. Partitions per side (PPS) is used as an architecture parameter to describe the power gating granularity of an SB. For example, PPS = 1 for the SB in Fig. 9. Increasing PPS results in finer granularity power gating. PPS = 0 represents an architecture where the power state for an SB is configured as one unit (coarse-grained power gating), while PPS = Nswitch indicates the finest power gating granularity where the power state for each switch can
Fig. 9. One power gating partition per SB side (details for two sides are shown). The power state for each partition can be configured separately.
Fig. 10. CAD flow for the DCPG FPGA architecture.
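The partitions-per-side (PPS) parameter can be illustrated with a short sketch that maps a switch to its power gating partition. The paper states only that PPS = 0 treats the whole SB as one unit and that larger PPS values split each SB side into more partitions; how switches are assigned to partitions within a side is not specified here, so the contiguous, near-equal grouping below is an assumption, as are the four-sided SB and the function names.

```python
def num_partitions(pps: int, sides: int = 4) -> int:
    """Number of separately configurable partitions in one SB."""
    # PPS = 0 is the coarse-grained case: the whole SB is one partition.
    return 1 if pps == 0 else sides * pps


def partition_of(side: int, switch: int, switches_per_side: int, pps: int) -> int:
    """Partition index of switch number `switch` on SB side `side`.

    Assumption: the switches of a side are split into `pps` contiguous,
    near-equal groups (the paper does not specify the grouping).
    """
    if pps == 0:
        return 0
    if not 1 <= pps <= switches_per_side:
        raise ValueError("PPS must be between 0 and the switches per side")
    local = switch * pps // switches_per_side  # group index within the side
    return side * pps + local
```

With eight switches per side, PPS = 1 places every switch of a side in the same partition, while PPS = 8 gives each switch its own partition, which corresponds to the finest granularity described in the text.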
TABLE II
GENERATED SYNTHETIC CIRCUITS
TABLE III
INFORMATION ABOUT FPGA-BASED iSnake ROBOT CONTROL SYSTEM
This guarantees that the subcircuits used in stitching closely represent independent functional modules in an application.

We assume that the power state for each module in the circuits of Table II can be DC. Thus, a power controller that has an output power control signal for each module is generated for each circuit. Timers are used to generate the power control signals in a power controller. A timer is sized assuming the sleep signal of a module is asserted after 50 ms. This amount has been chosen arbitrarily. Longer timer periods may increase the number of resources required to implement the controller circuit on an FPGA. Notice that more sophisticated power controllers can be implemented. However, the goal of this paper is to use the power controller circuits to evaluate the number of resources (especially routing resources) that are occupied by power control signals. This provides an estimate of the FPGA resources that will be in the different power states, and hence the potential power savings using the DCPG FPGA architecture.

B. Robot Control System

The application presented in this section is used to evaluate the proposed architecture. This application represents a control system for a snake-shaped robot, called iSnake, that is used in endoscopy [26]. The left side of Fig. 12 shows the robot inside an organ, in two different states.

The robot’s control system provides haptic feedback to the surgeon to prevent harming the patient’s organs during an operation. A proximity query (PQ) algorithm is used to approximate the distance between the iSnake and the surface of the patient’s organ [26]; this algorithm is computationally intensive and requires a high-performance implementation.

We developed an FPGA-based implementation of the control system. The right side of Fig. 12 shows the main modules in the system. The datapath performs stream processing for input data. The Delta module is only activated when the robot touches the organ’s surface. This module can be put in sleep mode when its output is not required. The select output from the Condition module can be used as a power control signal for the Delta module.

The FPlibrary [27] was used to implement the required floating point operators in the PQ algorithm. Quartus II was used to generate a technology-mapped netlist of the circuit [28]. The netlist was then annotated with information about the modules of each circuit component. A modified version of T-VPack was then used to pack the circuit; this version ensures that each module’s LUTs and FFs are clustered in the same LCs, thus generating a netlist that contains three interconnected modules.

Table III shows information about the robot control system. The size of the Delta module is ∼25% of the system, indicating that properly managing its power state may result in large energy reduction. This is investigated in Section VI.

VI. EXPERIMENTAL RESULTS AND DISCUSSION

A. Experimental Setup

Unless otherwise indicated, the following FPGA architecture parameters are used: 1) LUT size K = 4; 2) LC size N = 6; 3) inputs per cluster I = 16; 4) RC width W = 90; 5) routing segment length L = 4; 6) switch box flexibility Fs = 3; 7) input pins CB flexibility Fc,in = 0.2; and 8) output pins CB flexibility Fc,out = 0.1.

We used HSPICE simulations to obtain the leakage power of the PGRs. We used the number of minimum-width transistors as in [16] to estimate the area. We used the 45-nm HP technology from the predictive technology models website [29], with supply voltage VDD = 1 V and temperature T = 85 °C to measure the worst case power and timing.

For the power-gated architectures, the threshold voltage of sleep transistors has been increased by 100 mV by changing the Vth0 parameter in the technology files. Sleep transistors have been iteratively sized to constrain the performance degradation to 10% compared with an architecture that does not support power gating. We assume 20% activity in doing this.
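The default architecture parameters listed in the experimental setup can be collected in a small configuration sketch. The field values are taken from the text; the class name `ArchParams` and the two helper methods are hypothetical, and the helpers assume the usual convention in which a fractional connection-block flexibility Fc is interpreted as a fraction of the channel width W.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchParams:
    """Default FPGA architecture parameters from the experimental setup."""
    K: int = 4           # LUT size
    N: int = 6           # LC (cluster) size
    I: int = 16          # inputs per cluster
    W: int = 90          # routing channel width
    L: int = 4           # routing segment length
    Fs: int = 3          # switch box flexibility
    Fc_in: float = 0.2   # input-pin connection block flexibility
    Fc_out: float = 0.1  # output-pin connection block flexibility

    def tracks_per_input_pin(self) -> int:
        # Assumption: Fc is a fraction of W, so each input pin can
        # connect to about Fc_in * W tracks of the adjacent channel.
        return round(self.Fc_in * self.W)

    def tracks_per_output_pin(self) -> int:
        # Same assumption for output pins.
        return round(self.Fc_out * self.W)


defaults = ArchParams()
```

Under that assumption, the defaults give 18 tracks per input pin and 9 tracks per output pin; a parameter sweep can be expressed by constructing variants such as `ArchParams(N=10, W=160)`.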
Fig. 13. Results for sweeping LC’s cluster size (N). Switch blocks are not included in the results. (a) Area overhead. (b) Leakage power. (c) Leakage power reduction.
Fig. 14. Results for sweeping RC width (W). Switch blocks are not included in the results. (a) Area overhead. (b) Leakage power. (c) Leakage power reduction.

We assume that SRAM cells are built using six minimum-sized transistors. All multiplexers used in our architecture are based on pass transistors; each multiplexer is followed by a level restorer [30] and a buffer.

The SPICE netlists for LCs have been generated as follows. The size of the last inverter in a buffer is found by dividing the number of equivalent min-sized load inverters by four, and internal stages are sized by a stage ratio of four. Roughly, this sizing results in minimum delay [31]. LUTs are built using transmission gates as in [32], with inverters inserted after the second and last stages to reduce the delay of series-connected transmission gates.

The SPICE netlists for SBs have been generated as follows. Unless otherwise indicated, we assume an RC width (W) that is 20% larger than the minimum channel width required to route a circuit [16]. We used a unidirectional, single-driver routing architecture [33]. The output buffers of SBs are built using multiple stages of inverters. The stages are sized using a stage ratio of four. The capacitance of wire segments was obtained using the model in [34]. The outputs of the LCs connect directly to the SBs through isolation buffers without the need for output pins connection blocks; this is similar to the architecture assumptions made in VPR 5.0 [17].

B. Architecture Parameters Sweep

In this section, we study the area and power of the proposed architecture for different architecture parameters. Note that this is done without mapping applications to the architecture. The following list defines three architectures that are evaluated in our experiments.

1) Ungated: This is the baseline FPGA architecture that does not support power gating.
2) Static Gating: This is an architecture that supports statically controlled power gating, such as the one presented in [4]. The power state for this architecture can only be set at configuration time.
3) Dynamic Gating: This is the DCPG architecture that we proposed in Section III.

1) Power-Gated Tiles: We first study a basic architecture that has only one tile (without the SB); we vary two parameters, the cluster size (N) and the width of the RCs (W). When varying N, we also vary the number of input pins (I) of the LC. When varying N, we set W = 90, and when varying W we set N = 6 and I = 16.

Figs. 13 and 14 show the effect of the cluster size (N) and RC width (W) on power gating. The results shown in the figures are for a tile that supports power gating, not including the SB, compared with a tile that does not support power gating (ungated).

The area overhead decreases as the cluster size and the channel width increase [Figs. 13(a) and 14(a)]; this is because a larger number of circuit components are powered through a single sleep transistor. However, the area overhead is not strongly correlated with the channel width. The area overhead for the static-gated architecture is lower than that for the proposed dynamic-gated architecture. This is because of the additional circuit components that are required to support dynamic power state control.

The leakage power reduction increases as the cluster size and the channel width increase [Figs. 13(c) and 14(c)]. This follows from the area overhead trend: a smaller area overhead means a lower leakage power overhead due to the power gating circuitry. The results show that the leakage power reduction in the OFF-state (compared with an ungated architecture)
Fig. 15. Results for sweeping granularity of PGRs. Switch blocks are not included in the results. (a) Area overhead. (b) Leakage power. (c) Leakage power reduction.
Fig. 16. Results for SBs power gating granularity by sweeping RC width (W). Segment length (L) = 2. (a) Area overhead. (b) ON-leakage power. (c) OFF-leakage power. (d) Leakage power reduction.

can be up to ∼91% for a cluster size of 10 (channel width of 160). Figs. 13(b) and 14(b) show that the proposed architecture has slightly larger leakage power in the OFF-state than the static-gated architecture. This is because the dynamic-gated architecture requires more circuit components to support controlling its power state dynamically.

2) Power-Gated Regions: The results of sweeping the granularity of the proposed architecture are shown in Fig. 15. As the region size increases, the area overhead decreases. The area overhead includes that of the sleep transistors and the circuit components required to support configuring the different power states of a PGR. The area overhead decreases as the region size increases because more circuit components are powered through a single sleep transistor, and the circuit components required to support the different power states of a region are shared among a larger number of circuit components. The area overhead is as small as 1% for a PGR of 4 × 4 tiles.

The leakage power reduction increases as the region size increases [Fig. 15(c)]. This is because larger regions have smaller area overhead, which results in smaller leakage power overhead due to the power gating circuitry. The OFF-state leakage power of the power-gated architectures is much lower than that for the ungated architecture, leading to a leakage power reduction of >90% (∼95% for a PGR of 4 × 4 tiles). Increasing the region size beyond 4 × 4 tiles does not significantly increase the leakage power savings. As can be observed in Fig. 15(c), the leakage power reduction in a static-gated architecture is slightly larger than that in the proposed dynamic-gated architecture. This is because of the additional circuit components that are required in the proposed architecture to support changing its power state at run-time.

3) Power-Gated Switch Blocks: In this section, we vary the architecture parameters that describe the SBs. Results are shown for SB architectures that have different segment lengths, L = 2 in Fig. 16 and L = 4 in Fig. 17, compared with an architecture that does not support power gating. In addition to varying the RC width (W), we also vary the power gating granularity of SBs by varying the number of PPS (larger PPS means finer granularity). In fine-grained SB power gating, each output buffer in an SB has a power gating circuit, whereas in coarse-grained power gating a single power gating circuit is used for all circuit components in an SB.

Figs. 16 and 17 show that the area overhead and leakage power when L = 4 are lower than those for L = 2. This is because an FPGA routing architecture that has shorter segments contains more circuit components in SBs [33].

The area overhead [Figs. 16(a) and 17(a)] of the power gating circuitry decreases as the channel width increases. This is because as we increase W the sleep transistor size increases at a lower rate than the increase in the number of circuit components that are powered through it.

The power gating granularity (PPS) also affects the area overhead. Fine-grained power gating results in large area overhead (>60% for L = 4 and ∼100% for L = 2). For large values of W, the area overhead for granularities down to PPS = 4 ranges between 10% and 16%. This area overhead, however, is only for SBs. The overall area overhead for the power gating architecture is lower. Recall from Section VI-A that the area overhead (without SBs) for a PGR of size 4 × 4 tiles is ∼1%. The overall area overhead for the same PGR with SBs ranges between 4.7% and 10.3% for PPS from 0 to 4 and W = 90.

Figs. 16(b) and 17(b) show the ON-leakage power for the gated architecture (for different PPS) and the ungated architecture. The ON-leakage increases as PPS increases; this is because finer granularity power gating requires more circuit components and larger sleep transistors. The ON-leakage overhead for W = 100 and PPS between 0 and 4, for example, is ∼6%–10% for L = 2 and 3.5%–7.7% for L = 4. As we
Fig. 17. Results for SBs power gating granularity by sweeping RC width (W). Segment length (L) = 4. (a) Area overhead. (b) ON-leakage power. (c) OFF-leakage power. (d) Leakage power reduction.
Fig. 18. SBs switches power state averaged over all synthetic benchmarks using the unmodified and power gating-aware routing. (a) Always-ON. (b) Always-OFF. (c) DC. (d) Always-OFF + DC.
TABLE IV
TOTAL AREA OVERHEAD FOR THE POWER GATING ARCHITECTURE FOR DIFFERENT ARCHITECTURE GRANULARITIES (W = 80 AND L = 4)

will see later, finer granularity SBs power gating increases the number of always-OFF resources, resulting in lower total ON-leakage power for an application circuit.

Figs. 16(c) and 17(c) show the leakage power for the power-gated SBs in the OFF-state, i.e., static power when SBs are powered down, and Figs. 16(d) and 17(d) show the leakage power reduction compared with SBs with no power gating. The OFF-leakage power for SBs with PPS = 0 is the smallest because all of the SB circuit components are included in the power-gated circuit. SBs with larger PPS incur larger leakage power overhead in the OFF-state because of the additional power gating circuit components, and because all the buffers for incoming wire tracks are designed to be powered during the OFF-state (Section III-D). The leakage power reduction for the proposed architecture is >95% for the coarse-grained power-gated SBs, and could reach >90% for PPS between 1 and 4. For fine-grained power gating, the leakage power reduction is ∼70%.

4) Total Area Overhead: Table IV shows the total area overhead for different architecture granularities of the power gating architecture. The area overhead ranges between 3.9% and 34.8%. The area overhead results in longer routing wire segments. This leads to larger interconnect capacitance, and potentially larger dynamic power. For example, using PPS = 3 would result in area overhead between 9.2% and 10.7%. This translates to a 4.5%–5.2% increase in each dimension of a tile, assuming square tiles [35], since the side length of a square tile scales with the square root of its area. This represents a loose upper bound on the increase in interconnect capacitance [35]. In this paper, we do not evaluate the effect of the area overhead on the dynamic power and how this could affect the savings achieved by the proposed architecture; we leave this for future work.

C. Benchmark Circuits Results

In this section, we use the CAD flow in Section IV to place and route the synthetic benchmark circuits presented in Section V. This is done to study the granularity of SBs power gating in the proposed architecture. For each circuit, a power controller has been generated as described in Section V-A to provide a power control signal for each module in the circuit. Each circuit has been placed and routed on an architecture with a PGR size of 4 × 4 tiles, segment length of 4, and RC width that is 20% larger than the minimum channel width required to route the circuit. Multiple architectures with different SB power gating granularities (different PPS) have been used. The results are shown for the original VPR routing algorithm, and the enhanced power gating-aware (PG-aware) algorithm that is described in Section IV-B.

1) Breakdown of SBs’ Switches Power States: In this section, we report the percentages of SB switches that can be configured in the different power states.

Fig. 18(a) shows the percentage of always-ON switches for different SB power gating granularities. For finer granularity power gating, the number of always-ON SB
Fig. 19. Leakage power results for (a) SBs only and (b) and (c) SBs and PGRs averaged over all synthetic benchmark circuits.
switches decreases. This is expected since more SB switches OFF -leakage power, which is the leakage power for the circuits
can statically be turned OFF because the SB partitions they assuming that all modules in a circuit are idle and their
belong to are not used to route signals. Furthermore, with DC components are turned OFF. In both cases, the always-OFF
finer granularity, there is a better chance that an SB partition components are assumed to be turned OFF at configuration
is only used to route signals that belong to a single module, time.
which increases the number of DC SB switches. Fig. 19(a) shows the leakage power for the SBs for different
Using the PG-aware routing algorithm results in slightly SB power gating granularities, and the leakage power for the
fewer always-ON switches, but this improvement diminishes SBs when an ungated architecture is used. As can be seen, the
as PPS increases; finer granularity power gating (larger PPS) ON -leakage power for both the PG-aware routing and the orig-
results in larger number of components that can be statically inal routing are roughly equal because the algorithm enhance-
turned OFF. Therefore, the improvement space available for ments do not significantly improve the number of always-OFF
the PG-aware router becomes tighter. The results show that the switches as explained earlier. For finer granularity SBs
PG-aware router reduces the number of always-ON switches power gating, both the ON and OFF-leakage powers decrease.
by 3% for coarse-grained SB power gating (PPS = 0), and The ON-leakage power decreases since finer granularity
∼1% for PPS = 4 of the total number of switches. This results in more unused SB partitions that can be turned OFF
corresponds to reduction of 8% and 14% of the always-ON at configuration time, thus the overall leakage power in
switches for PPS = 0 and PPS = 4, respectively. the ON-state for an application circuit goes down. The
Fig. 18(b) shows the percentage of always-OFF switches OFF -leakage power also decreases with finer granularity
for different SB power gating granularities. As expected, finer power gating because more SB switches can be placed in
granularity power gating increases the always- OFF switches. the DC state. The minimum OFF-leakage power is when
The PG-aware routing has a negligible effect on the number PPS = 3. Larger PPS values result in more leakage power
of always-OFF switches. consumption in the OFF state because of the large overhead
Fig. 18(c) shows the percentage of DC switches for differ- of the power gating circuitry.
ent SB power gating granularities. These are switches that Fig. 19(a) also shows that the PG-aware routing slightly
can be powered OFF at run time when the module they improves the OFF-leakage power compared with the unmod-
belong to becomes idle. The largest number of DC switches is when PPS = 1.
Finer SB power gating granularity reduces the number of DC SB switches; this
is because more switches can be set as always-OFF at configuration time,
which reduces the total number of remaining switches. The PG-aware routing
helps in slightly improving the percentage of DC switches. This improvement
is roughly the same as the amount of reduction in the always-ON switches
shown in Fig. 18(a).
Finally, Fig. 18(d) shows the sum of the always-OFF and DC switches. As
expected, SBs with finer granularity power gating result in larger number of
switches that can be powered down. The PG-aware routing shows slight
improvements compared with the original VPR router; these improvements
diminish as the granularity of power gating decreases as explained above.
2) Leakage Power: In this section, we report the leakage power averaged over
all benchmark circuits. We report the ON-leakage power, which is the leakage
power assuming all modules in a circuit are idle but not powered OFF, the

ified routing algorithm. This is because PG-aware routing results in more DC
switches that can be turned OFF when an application circuit is idle, as shown
in Fig. 18(c).
Fig. 19(a) shows that ON-leakage power with the gated architecture is larger
than that with the ungated architecture for the coarse-grained SBs power
gating. However, for finer granularity power gating, the ON-leakage power for
the gated architecture is lower than that for the ungated architecture.
Although a single SB with finer granularity power gating has larger
ON-leakage power than an ungated SB, finer granularity power gating enables
turning OFF unused SB switches, which results in lower overall ON-leakage
power.
The total leakage power is shown in Fig. 19(b). This includes the leakage
power for both SBs and PGRs. The same trends discussed above apply here
because a significant portion of the leakage power comes from SBs. Comparing
Fig. 19(a) and (b) shows that SBs contribute to roughly 72% of leakage power
in the ungated architecture. However, in the gated architecture, SBs
contribute to roughly 67%–72% of the total leakage power.
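The ON-leakage trend described above can be illustrated with a small accounting sketch. It is not the authors' power model: the class name `SwitchCounts`, the function names, the switch counts, and the per-switch leakage values are all hypothetical placeholders in arbitrary units, and the sketch assumes that a gated switch leaks slightly more than an ungated one when powered and much less when gated OFF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchCounts:
    """Hypothetical number of SB switches in each power-state class."""
    always_on: int   # switches that stay powered at all times
    always_off: int  # unused switches, gated OFF at configuration time
    dc: int          # dynamically controlled switches


def gated_sb_leakage(counts: SwitchCounts, leak_on: float, leak_off: float,
                     modules_off: bool) -> float:
    """SB leakage for a gated architecture (arbitrary units).

    leak_on / leak_off are assumed per-switch leakage values for a powered
    and a gated-OFF switch. DC switches leak `leak_on` while their module is
    idle but powered, and `leak_off` once the module is powered down.
    """
    dc_leak = leak_off if modules_off else leak_on
    return (counts.always_on * leak_on
            + counts.always_off * leak_off
            + counts.dc * dc_leak)


def ungated_sb_leakage(counts: SwitchCounts, leak_ungated: float) -> float:
    """SB leakage when no switch can be gated."""
    total = counts.always_on + counts.always_off + counts.dc
    return total * leak_ungated


# Placeholder numbers: the same 100 switches under two gating granularities.
coarse = SwitchCounts(always_on=20, always_off=0, dc=80)
fine = SwitchCounts(always_on=20, always_off=50, dc=30)
LEAK_UNGATED, LEAK_ON, LEAK_OFF = 1.0, 1.1, 0.1

for name, counts in (("coarse", coarse), ("fine", fine)):
    on_leak = gated_sb_leakage(counts, LEAK_ON, LEAK_OFF, modules_off=False)
    off_leak = gated_sb_leakage(counts, LEAK_ON, LEAK_OFF, modules_off=True)
    base = ungated_sb_leakage(counts, LEAK_UNGATED)
    print(f"{name}: ON={on_leak:.1f} OFF={off_leak:.1f} ungated={base:.1f}")
```

With these placeholder values the coarse-grained case has a higher ON-leakage than the ungated case, while the fine-grained case has a lower one because the unused switches are already OFF. The sketch only mirrors the qualitative trend in Fig. 19(a); it does not reproduce the reported percentages.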
Fig. 21. Energy reduction for iSnake’s control system in DCPG FPGA.
Fig. 20. Total leakage power reduction for individual synthetic circuits.
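The energy comparison summarized in Fig. 21 (the proposed architecture against baselines with and without clock gating) can be approximated with a simple duty-cycle model. The following is a minimal sketch under stated assumptions, not the evaluation flow used in the paper: `PowerModel`, `average_power`, `energy_saving`, and every numeric value are hypothetical. It assumes that an idle module without clock gating still draws its full dynamic power, and it ignores wake-up energy and the leakage overhead of the power switches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerModel:
    """Hypothetical average-power figures in arbitrary units."""
    dynamic_total: float      # dynamic power with every module active
    static_total: float       # leakage power with every module powered ON
    gated_fraction: float     # share of the design inside the gated module
    leakage_reduction: float  # share of module leakage removed when OFF


def average_power(m: PowerModel, activity: float,
                  clock_gating: bool, power_gating: bool) -> float:
    """Average power when the gated module is active `activity` of the time."""
    idle = 1.0 - activity
    mod_dyn = m.dynamic_total * m.gated_fraction
    mod_stat = m.static_total * m.gated_fraction
    rest = (m.dynamic_total - mod_dyn) + (m.static_total - mod_stat)
    # Assumption: clock gating removes all dynamic power of the idle module.
    idle_dyn = 0.0 if clock_gating else mod_dyn
    # Assumption: power gating removes a fixed share of the module's leakage.
    idle_stat = mod_stat * (1.0 - m.leakage_reduction) if power_gating else mod_stat
    return rest + activity * (mod_dyn + mod_stat) + idle * (idle_dyn + idle_stat)


def energy_saving(m: PowerModel, activity: float,
                  baseline_clock_gating: bool) -> float:
    """Fractional energy saving of clock gating plus power gating vs. a baseline."""
    base = average_power(m, activity, baseline_clock_gating, power_gating=False)
    proposed = average_power(m, activity, clock_gating=True, power_gating=True)
    return 1.0 - proposed / base


# Placeholder values, not taken from the paper's measurements.
model = PowerModel(dynamic_total=60.0, static_total=40.0,
                   gated_fraction=0.25, leakage_reduction=0.8)
for cg in (False, True):
    saving = energy_saving(model, activity=0.05, baseline_clock_gating=cg)
    print(f"baseline clock gating={cg}: saving={saving:.1%}")
```

Because the time window is the same for both implementations, the ratio of average powers equals the ratio of energies. The saving against the clock-gated baseline is always the smaller of the two, since that baseline has already removed the idle module's dynamic power.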
Fig. 19(c) shows the upper limit of reduction in the OFF-leakage power
compared with the ungated architecture. As expected, the leakage power
reduction increases with finer granularity SBs power gating, and the largest
reduction is achieved when PPS = 3. For coarse-grained SB power gating, the
OFF-leakage power reduction is ∼68%. For PPS = 3, the reduction is ∼77%.
Fig. 20 shows the individual circuits’ potential power reduction when all
modules are turned OFF (PPS = 3). The power reduction ranges between 67% and
81% using the original routing algorithm, and ranges between 68% and 83%
using the PG-aware routing.

D. Example Application Results
In this section, we show the energy saving results of using the proposed
power gating architecture for the iSnake robot control system described in
Section V-B.
Fig. 21 shows the energy savings for the circuit for different active times
of the Delta module. We compared mapping the application on the proposed
power gating architecture with two baseline implementations. The first is a
system that has no power optimizations at all. Normally, one would implement
clock gating for modules that experience inactivity periods. We assume that
the first baseline does not include this. The second baseline is an
implementation that supports clock gating for the Delta module.
At 5% activity level, the results show that compared with the first baseline
(no clock gating), the proposed architecture coupled with clock gating could
achieve ∼19% energy savings. Compared with the second baseline (includes
clock gating), the proposed architecture could achieve ∼8% additional energy
saving. Given that only a small portion of the application benefited from the
power gating architecture (25% of the circuit), the results are promising.

VII. CONCLUSION
We present an FPGA architecture that supports dynamic power gating. This
architecture enables powering down modules in an FPGA when they are idle to
reduce their static power dissipation. The architecture’s flexibility enables
the user to implement an arbitrary number and structure of power-gated
modules, and enables routing power control signals on the general-purpose
routing fabric of an FPGA. We also present a CAD flow that can be used to map
applications to the proposed architecture, and enhancements to a routing
algorithm in order to optimize the power savings of the architecture.
The area overhead of the architecture is ∼10.3% when the power gating region
size is 4×4 tiles and the number of power-gated SB PPS is four (W = 90). The
potential leakage power savings for the studied benchmark circuits are up to
83%. We also studied the energy savings in a control system for a robot that
is used in medical applications. Assuming only 25% of the system can be
powered down when idle, and it is idle for 95% of the time, we found that
∼8% energy saving can be achieved by the proposed architecture when compared
with an implementation with only clock gating.
This research provides the basis for a new generation of FPGAs, which are
capable of self-optimization. Future work includes automating the process of
identifying application modules that can benefit from the proposed
architecture. This is suitable for designs that use accelerators with
components that operate for only a small fraction of time. Furthermore,
enhancements to the CAD tools are required in order to better guide the
different stages in the flow to increase idle times and increase the number
of resources that can be powered down, while reducing the impact on
performance and area.

REFERENCES
[1] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” in Proc. 14th ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2006, pp. 21–30.
[2] M. Münch, B. Wurth, R. Mehra, J. Sproch, and N. Wehn, “Automating RT-level operand isolation to minimize power consumption in datapaths,” in Proc. Conf. Design, Autom. Test Eur., 2000, pp. 624–633.
[3] Q. Wang, S. Gupta, and J. H. Anderson, “Clock power reduction for virtex-5 FPGAs,” in Proc. 17th ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2009, pp. 13–22.
[4] T. Tuan, S. Kao, A. Rahman, S. Das, and S. Trimberger, “A 90 nm low-power FPGA for battery-powered applications,” in Proc. 14th Int. Symp. Field-Program. Gate Arrays, 2006, pp. 3–11.
[5] F. Li, Y. Lin, L. He, and J. Cong, “Low-power FPGA using pre-defined dual-Vdd/dual-Vt fabrics,” in Proc. 12th ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2004, pp. 42–50.
[6] J. Hussein, M. Klein, and M. Hart, “Lowering power at 28 nm with Xilinx 7 series FPGAs,” Xilinx, Inc., San Jose, CA, USA, Tech. Rep. WP389, Jun. 2011.
[7] Meeting the Low Power Imperative at 28 nm, Altera Corp., San Jose, CA, USA, Nov. 2011.
[8] S. Henzler, Power Management of Digital Circuits in Deep Sub-Micron CMOS Technologies (Advanced Microelectronics). Secaucus, NJ, USA: Springer-Verlag, 2007.
[9] Y. Lin, F. Li, and L. He, “Routing track duplication with fine-grained power-gating for FPGA interconnect power reduction,” in Proc. Asia South Pacific Design Autom. Conf., Jan. 2005, pp. 645–650.
[10] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan, “Reducing leakage energy in FPGAs using region-constrained placement,” in Proc. 12th Int. Symp. Field-Program. Gate Arrays, 2004, pp. 51–58.
[11] R. P. Bharadwaj, R. Konar, P. T. Balsara, and D. Bhatia, “Exploiting temporal idleness to reduce leakage power in programmable architectures,” in Proc. Asia South Pacific Design Autom. Conf., 2005, pp. 651–656.
[12] A. A. M. Bsoul and S. J. E. Wilton, “An FPGA architecture supporting dynamically controlled power gating,” in Proc. IEEE Int. Conf. Field-Program. Technol. (FPT), Dec. 2010, pp. 1–8.
[13] A. A. M. Bsoul and S. J. E. Wilton, “An FPGA with power-gated switch blocks,” in Proc. IEEE Int. Conf. Field-Program. Technol. (FPT), Dec. 2012, pp. 87–94.
[14] C. Li, Y. Dong, and T. Watanabe, “New power-aware placement for region-based FPGA architecture combined with dynamic power gating by PCHM,” in Proc. 17th IEEE/ACM Int. Symp. Low-Power Electron. Design (ISLPED), Aug. 2011, pp. 223–228.
[15] C. H. Hoo, Y. Ha, and A. Kumar, “A directional coarse-grained power gated FPGA switch box and power gating aware routing algorithm,” in Proc. 23rd Int. Conf. Field Program. Logic Appl. (FPL), Sep. 2013, pp. 1–4.
[16] V. Betz, J. Rose, and A. Marquardt, Eds., Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA, USA: Kluwer, 1999.
[17] J. Luu et al., “VPR 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process scaling,” in Proc. 17th ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2009, pp. 133–142.
[18] M. Klein, “Power consumption at 40 and 45 nm,” Xilinx, Inc., San Jose, CA, USA, Tech. Rep. WP298, Apr. 2009.
[19] A. Marquardt, V. Betz, and J. Rose, “Timing-driven placement for FPGAs,” in Proc. 8th ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2000, pp. 203–213.
[20] M. Keating, D. Flynn, R. Aitken, A. Gibbons, and K. Shi, Low Power Methodology Manual: For System-on-Chip Design. New York, NY, USA: Springer-Verlag, 2007.
[21] A. A. M. Bsoul and S. J. E. Wilton, “A configurable architecture to limit wakeup current in dynamically-controlled power-gated FPGAs,” in Proc. Int. Symp. Field-Program. Gate Arrays (FPGA), 2012, pp. 245–254.
[22] A. A. M. Bsoul and S. J. E. Wilton, “A configurable architecture to limit inrush current in power-gated reconfigurable devices,” J. Low Power Electron., vol. 10, no. 1, pp. 1–15, 2014.
[23] P. Jamieson, K. B. Kent, F. Gharibian, and L. Shannon, “Odin II - An open-source Verilog HDL synthesis tool for CAD research,” in Proc. 18th IEEE FCCM, May 2010, pp. 149–156.
[24] ABC: A System for Sequential Synthesis and Verification, Univ. California, Berkeley, CA, USA, 2012.
[25] A. R. Marquardt, “Cluster-based architecture, timing-driven packing and timing-driven placement for FPGAs,” M.S. thesis, Dept. Elect. Comput. Eng., Univ. Toronto, Toronto, ON, Canada, 1999.
[26] K.-W. Kwok et al., “Dimensionality reduction in controlling articulated snake robot for endoscopy under dynamic active constraints,” IEEE Trans. Robot., vol. 29, no. 1, pp. 15–31, Feb. 2013.
[27] J. Detrey and F. de Dinechin. (2004). FPLibrary, a VHDL Library of Parametrisable Floating-Point and LNS Operators for FPGA. [Online]. Available: http://www.ens-lyon.fr/LIP/Arenaire/Ware/FPLibrary/
[28] Quartus II University Interface Program. [Online]. Available: http://www.altera.com/education/univ/research/quip/unv-quip.html, accessed Jan. 28, 2015.
[29] Predictive Technology Model (PTM). [Online]. Available: http://ptm.asu.edu/, accessed Jan. 28, 2015.
[30] E. Hung, S. J. E. Wilton, H. Yu, T. C. P. Chau, and P. H. W. Leong, “A detailed delay path model for FPGAs,” in Proc. Int. Conf. Field-Program. Technol. (FPT), 2009, pp. 96–103.
[31] N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed. Reading, MA, USA: Addison-Wesley, 2011.
[32] T. Pi and P. J. Crotty, “FPGA lookup table with transmission gate structure for reliable low-voltage operation,” U.S. Patent 6 667 635, Dec. 23, 2003.
[33] G. Lemieux, E. Lee, M. Tom, and A. Yu, “Directional and single-driver wires in FPGA interconnect,” in Proc. IEEE Int. Conf. Field-Program. Technol., Dec. 2004, pp. 41–48.
[34] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, “Modeling of interconnect capacitance, delay, and crosstalk in VLSI,” IEEE Trans. Semicond. Manuf., vol. 13, no. 1, pp. 108–111, Feb. 2000.
[35] S. Huda, J. Anderson, and H. Tamura, “Charge recycling for power reduction in FPGA interconnect,” in Proc. 23rd Int. Conf. Field Program. Logic Appl. (FPL), Sep. 2013, pp. 1–8.

Assem A. M. Bsoul (S’07) received the B.Sc. degree in computer engineering
from the Jordan University of Science and Technology, Irbid, Jordan, in 2006,
the M.Sc. degree in electrical engineering from Queen’s University, Kingston,
ON, Canada, in 2009, and the Ph.D. degree in electrical and computer
engineering from the University of British Columbia, Vancouver, BC, Canada,
in 2014.
He is currently a Post-Doctoral Fellow with the University of British
Columbia. His current research interests include low-power reconfigurable
architectures and computer-aided design algorithms.

Steven J. E. Wilton (S’86–M’97–SM’03) received the M.A.Sc. and Ph.D. degrees
in electrical and computer engineering from the University of Toronto,
Toronto, ON, Canada, in 1992 and 1997, respectively.
He was a co-founder of Veridae Systems, Inc., Vancouver, BC, Canada, acquired
by Tektronix, Beaverton, OR, USA, in 2011, which developed debug solutions
for application-specific integrated circuits, field-programmable gate arrays
(FPGAs), and FPGA-based systems. He joined the Department of Electrical and
Computer Engineering, University of British Columbia, Vancouver, in 1997,
where he is currently a Professor and an Associate Head. His current research
interests include the architectures of next-generation FPGAs and their
associated computer-aided design tools.
Dr. Wilton served as the Program and General Chair of the ACM International
Symposium on FPGAs, from 2005 to 2006, respectively, and the Program Co-Chair
of the 2005 International Conference on Field Programmable Logic and
Applications and the 2008 IEEE International Conference on
Application-Specific Systems, Architectures and Processors. He was a
recipient of best paper awards at the International Conference on
Field-Programmable Technology in 2003, 2005, 2007, and 2013, respectively,
and the International Conference on Field-Programmable Logic and Applications
in 2001, 2004, 2007, and 2008, respectively. He is currently the
Editor-in-Chief of the ACM Transactions on Reconfigurable Technology and
Systems.

Kuen Hung Tsoi received the Ph.D. degree from the Department of Computer
Science and Engineering, Chinese University of Hong Kong, Hong Kong, in 2007.
He has been a Post-Doctoral Research Associate with the Custom Computing
Group, Department of Computing, Imperial College London, London, U.K., since
2008. He is currently with Imagination Technologies Ltd., Kings Langley, U.K.

Wayne Luk (F’09) received the M.A., M.Sc., and D.Phil. degrees in engineering
and computing science from the University of Oxford, Oxford, U.K.
He was a Visiting Professor with Stanford University, Stanford, CA, USA. He
is currently a Professor of Computer Engineering with Imperial College
London, London, U.K. His current research interests include reconfigurable
computing, field programmable technology, and design automation.
Prof. Luk is a fellow of the Royal Academy of Engineering. He received the
Research Excellence Award from Imperial College London for reconfigurable
supercomputing, and over 15 awards for his work from international
conferences, such as the Applied Reconfigurable Computing Conference, the
Application-Specific Systems, Architectures and Processors Conference, the
Field-Programmable Custom Computing Machines Conference, the Field
Programmable Logic and Applications Conference, and the Field-Programmable
Technology Conference. He was the Founding Editor-in-Chief of the ACM
Transactions on Reconfigurable Technology and Systems.