
Probability-Driven Multi-Bit Flip-Flop Design Optimization With Clock Gating


ABSTRACT

Data-Driven Clock Gating (DDCG) and Multi-Bit Flip-Flops (MBFFs), in which several
FFs are grouped and share a common clock driver, are two effective low-power design techniques.
Though commonly used by VLSI designers, they are usually treated separately. Past works
focused on MBFF usage at the RTL, the gate level and the layout. Though collectively covering the
common design stages, studying each aspect individually led to conflicts and contradictions
with the others. The MBFF internal circuit design, its multiplicity and its synergy with the FFs' data
toggling probabilities have not been studied so far. This work attempts to maximize the energy
savings by proposing a combined DDCG and MBFF algorithm, based on the Flip-Flops' (FFs)
data-to-clock toggling ratio. It is shown that to maximize the power savings, the FFs should be
grouped in MBFFs in increasing order of their activities. A power savings model utilizing MBFF
multiplicities and FF toggling probabilities is developed, which is then used by the algorithm
in a practical design flow. Using the Xilinx ISE tool, we achieved 17% to 23% power savings
compared to designs with ordinary FFs, and around 39% when compared with a conventional
flip-flop implementation.
CHAPTER 1
INTRODUCTION
A recently published paper has emphasized the usage of Multi-Bit Flip-Flops (MBFFs) as
a design technique delivering considerable power reduction in digital systems. The data of digital
systems is usually stored in Flip-Flops (FFs), each having its own internal clock driver. As shown in
Fig. 1.1, an edge-triggered 1-bit FF contains two cascaded master and slave latches, driven by
the clock CLK and its complement. It has been shown that most of the FF's energy is consumed by its
internal clock drivers, which are significant contributors to the total power consumption.

The data of digital systems are usually stored in flip-flops (FFs), each of which has its
own internal clock driver. In an attempt to reduce the clock power, several FFs can be grouped
into a module called a multi-bit FF (MBFF) that houses the clock drivers of all the underlying
FFs. We denote the grouping of k FFs into an MBFF by a k-MBFF. Kapoor et al. reported a 15%
reduction of the total dynamic power in a 90-nm processor design by using MBFFs. Electronic
design automation tools, such as Cadence Liberate, support MBFF characterization.

Traditionally, digital control of SMPS was accomplished by applying a general-purpose
Digital Signal Processor (DSP). Attempts were made to use DSPs to carry out the digital control
algorithm, housekeeping, supervisory tasks and communication. Apart from some limited
applications, this approach is unsuitable in most industrial instances due to its many drawbacks
and limitations. These include: a single arithmetic unit that limits the speed of computation,
resulting in a limited control bandwidth; excessive delays in a multi-converter case; limited
capability to generate non-sequential pulses as might be needed in non-linear control; limited
capability to achieve high resolution of the output driving signal, which degrades as the number
of control channels increases; as well as other shortcomings.

Another approach to modern digital power management is a closed, dedicated controller
for a specific application such as a Voltage Regulator Module (VRM). The drawback of this
approach is that it is limited to the specific application for which it was developed. Hence,
applying the unit to other power management problems is impractical, since a new Application
Specific Integrated Circuit (ASIC) design cycle needs to be initiated for every case.
The benefits of MBFFs do not come for free. By sharing common drivers, the clock slew rate is
degraded, thus causing a larger short-circuit current and a longer clock-to-Q propagation delay
tpCQ. To remedy this, the MBFF internal drivers can be strengthened at the cost of some extra
power. It is therefore recommended to apply the MBFF at the RTL design level to avoid the
timing closure hurdles caused by the introduction of the MBFF at the backend design stage. Since
the average data-to-clock toggling ratio of FFs is very small, usually ranging from 0.01 to 0.1,
the clock power savings always outweigh the short-circuit power penalty of the data toggling.
Clock gating does not come for free. Extra logic and interconnects are required
to generate the clock enabling signals and the resulting area and power overheads must be
considered. In the extreme case, each clock input of a FF can be disabled individually, yielding
maximum clock suppression. This, however, results in a high overhead; thus suggesting the
grouping of several FFs to share a common clock disabling circuit in an attempt to reduce the
overhead. On the other hand, such grouping may lower the disabling effectiveness since the
clock will be disabled only during time periods when the inputs to all the FFs in a group do not
change. In the worst case, when the FFs’ inputs are statistically independent, the clock disabling
probability equals the product of the individual probabilities, which rapidly approaches zero
when the number of involved FFs increases. It is therefore beneficial to group FFs whose
switching activities are highly correlated and derive a joint enabling signal.
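To make the above concrete, the following short sketch (Python, with illustrative activity values that are not taken from this work) computes the probability that the shared clock of a group of statistically independent FFs can be disabled in a given cycle; the disabling opportunity shrinks quickly as the group size grows.

# Probability that a group's shared clock can be disabled in a cycle,
# assuming the FFs toggle independently (the worst case described above).
# The activity values are illustrative, not measured data.
def disable_probability(activities):
    prob = 1.0
    for p in activities:
        prob *= (1.0 - p)   # every FF in the group must be idle
    return prob

for k in (1, 2, 4, 8):
    print(k, round(disable_probability([0.05] * k), 3))
# 1 0.95, 2 0.903, 4 0.815, 8 0.663 -> the gating opportunity drops as the group grows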

The state transitions of FFs in digital systems like microprocessors and controllers
depend on the data they process. Assessing the effectiveness of clock gating therefore requires
extensive simulations and statistical analysis of FF activity, as presented in this paper. Disabling
the clock input to a group of FFs (e.g., a register) in data-path circuits is very effective since
many bits behave similarly. Registers enabled by the same clock signal yield a high ratio of the
saved power to circuit overhead. Furthermore, the design effort to create the disabling signal is
low. Unlike data-path, control logic requires far greater design effort for successful clock gating.
This stems from the “random” nature of the control logic. The effectiveness of the proposed
gating methodology is demonstrated in this paper through the examples of a 3-D graphics
accelerator and a 16-bit microcontroller. These units were designed with full awareness of the
internal data dependencies and appropriate clock enabling signals were defined within the RTL
code. When the RTL code was then compiled and simulated at gate level, considerable “hidden”
disabling opportunities were discovered.

The clock power savings always outweigh the short-circuit power penalty of the data
toggling. An MBFF grouping should be driven by logical, structural, and FF activity
considerations. While FF grouping at the layout level has been studied thoroughly, the front-end
implications of MBFF group size and how it affects clock gating (CG) have attracted little
attention. This brief responds to two questions. The first is what the optimal bit multiplicity k of
data-driven clock-gated (DDCG) MBFFs should be. The second is how to maximize the power
savings based on the data-to-clock toggling ratio (also termed activity or data toggling probability).

An MBFF usage at the RTL logic synthesis design stage can be found in the literature.
Optimization for power is always one of the most important design objectives in modern
nanometer IC design. Recent studies have shown the effectiveness of applying multi-bit flip-flops
to save the power consumption of the clock network. However, previous works applied multi-bit
flip-flops at earlier design stages, where it can be very difficult to carry out the trade-off among
power, timing, and other design objectives. One such work presents a power optimization method
that incrementally applies more multi-bit flip-flops at the post-placement stage to gain additional
clock power savings while considering placement density and timing slack constraints, and
simultaneously minimizing the interconnecting wire length. Experimental results based on
industry benchmark circuits show that the approach is effective and efficient and can be
seamlessly integrated into a modern design flow.

In an attempt to reduce the clock power, several FFs can be grouped in a module such
that common clock drivers are shared by all the FFs. Two 1-bit FFs grouped into a 2-bit MBFF,
also called a dual-bit FF, are shown in Fig. 1.1. In a similar manner, grouping FFs into 4-bit and
8-bit MBFFs is possible too. We subsequently denote a k-bit MBFF by k-MBFF. An MBFF not
only reduces the gate capacitance driven by the clock tree; the wiring capacitive load is also
reduced, because only a single clock wire is required for multiple FFs. It also reduces the depth
and the buffer sizes of the clock tree, as well as the number of sub-trees. Beyond clock power
savings, these features also reduce the silicon area.
Fig. 1.1. 1-bit FF and 2-MBFF.
An example of MBFF usage at the RTL logic synthesis design stage can be found in a 55-nm,
230-MHz system-on-a-chip design, in which Santos et al. restricted the MBFF grouping to FFs
belonging to the same bus. Both 2-MBFFs and 4-MBFFs were used, with a 20% increase in tpCQ.
A dynamic power reduction of 13% was achieved, with some degradation in timing convergence.
This was remedied by applying low-voltage-threshold cells on critical paths, which somewhat
increased the leakage power. The total area increased by 2.3% because of the timing fixes.
The benefits of MBFFs do not come for free. By sharing common drivers, the slopes of the clock
signals become slower, causing a larger short-circuit current and a degradation of the clock-to-Q
propagation delay tpCQ. For a design implemented in a 90-nanometer, low-power, high-voltage-threshold
(HVT) CMOS technology, the 4-MBFFs exhibit a per-bit 30% reduction of dynamic
clock power and a per-bit 10% area reduction. That came at the expense of a per-bit 20% data
power increase and a 20% degradation of tpCQ. However, since the average data-to-clock
toggling ratio of a FF is very small, varying from 0.01 to 0.1 in most designs, the clock power
savings always outweigh the short-circuit power penalty of the data toggling.
This work answers two questions: what should be the optimal bit multiplicity of MBFFs,
and how to leverage the knowledge of the average data-to-clock toggling ratio (also called
activity or data toggling probability) of the FFs in the underlying design. To remedy the short-circuit
power penalty and the tpCQ degradation due to the increase of the loads, the MBFF internal
drivers can be somewhat strengthened. This is shown pictorially in Fig. 1.1 by the larger 2-MBFF
drivers compared to the 1-bit FF. The MBFF multiplicity k depends on the data toggling probability p.
Section 2 studies that dependency in an attempt to optimize the MBFF design flow and
maximize the power savings. To the best of our knowledge, that has not been studied so far. Electronic
Design Automation (EDA) tools, such as Cadence Liberate, support MBFF characterization.
MBFF gate-level design is possible with the latest Cadence and Synopsys HDL compilers, but their
internal logic-level considerations and the algorithms for grouping FFs into MBFFs have not been
published. In spite of its importance, very little attention has been paid in the literature to MBFF
multiplicity and grouping at the front-end design stage. MBFF grouping should be driven by
logical, structural and FF activity considerations.

Fig 1.2: Power breakdown of MBFF compared to ordinary 1-bit FFs.


In a design report, 92% of the FFs were grouped into MBFFs, the majority of which
were 4-MBFFs, while the rest were 2-MBFFs. Fig. 1.2 shows the power breakdown of the MBFF
design compared to the 1-bit FF design. The power is normalized to the total power consumed by the
1-bit FF core design (memories and IOs excluded). A 15% reduction of the total dynamic power is
shown. Expectedly, the power of the sequential logic and the power of the clock tree decreased,
because the total number of clock drivers and the wire load connected to the MBFF internal
drivers were reduced. The combinational logic power increased, because some of the logic
was up-sized to recover from the tpCQ increase. To avoid the timing degradation caused by the
tpCQ increase, we propose to introduce MBFFs at the RTL design level. This allows the
backend and layout design stages to take tpCQ into account and avoid timing problems upfront.
A work introducing the MBFF at the logic synthesis design stage was presented,
attempting to weigh the pros and cons compared to synthesis using ordinary FFs. The
mapping of FFs to MBFFs took place on the gate-level design produced by the RTL compiler, and a
55-nm 230-MHz System on a Chip (SoC) design was used for the experiments. The authors restricted the
MBFF mapping to FFs belonging to the same bus, where both 2-MBFFs and 4-MBFFs were used
with a 20% increase of their tpCQ. The usage of MBFFs reduced the number of clock sinks by
60%, leading to a simpler clock tree with 35% fewer clock buffers. That further reduced the clock
skew by 30%. Table 1 summarizes the power savings. A dynamic power reduction of 13% is
shown. Not surprisingly, the power savings came at the expense of timing degradation, which was
remedied by introducing low-voltage-threshold (LVT) cells on critical paths, as indicated by the
increase of the leakage power.
Table 1. Power reduction obtained by MBFF design
While the design flows supported by EDA tools handle MBFFs in the RTL synthesis into a
gate-level implementation, they take very limited physical layout details into account. Most
importantly, the data toggling probabilities that should drive the MBFF grouping are completely
ignored by those tools. The literature overviewed below is mainly focused on MBFF
physical implementation.
Those works also ignore FFs’ activities, which this paper considers. One of the earliest
works on MBFF grouping at the physical layout stage associated each FF with time margins
obtained from the layout comprising 1-bit FFs. The wires connected to the data input and output
of a FF were anchored on their opposite side to the rest of the logic, while the position of the FF
was allowed to move around, thus defining the region in the layout where the FF can be displaced
without violating timing. The merging of FF pairs into 2-MBFFs was formulated as an
optimization problem aiming at maximizing the count of merged FFs, such that the resulting
MBFF locations do not violate timing. There were also congestion constraints, which were
handled by dividing the silicon area into bins with limited MBFF occupancy. The problem was
solved with the aid of area proximity analysis. Following the above ad-hoc approach, a later work
presented an algorithm with better computational efficiency to solve the same problem, handling
the same timing and area congestion constraints.
The FF clustering approach presented there was later used for replacing 1-bit FFs
with Multi-Bit Pulsed Latches (MBPLs) in the physical layout. MBPL clock power was minimized
by taking advantage of the pulsed latch's timing behavior, which is similar to that of a FF, and of its
time-borrowing capability, which is similar to that of a latch. That offered more flexibility in meeting
the timing constraints by expanding the allowable region where the original FFs could be displaced and
merged into MBPLs. For further power savings the authors combined clock gating (CG) into the
MBPL structure. A few CG strategies were mentioned, but no details were provided, and the
relation between the CG strategy, the FFs’ activities and their grouping in MBPLs was not
conclusive. A recent work has addressed the combination of MBFFs with CG. CG cloning was
proposed to decide the MBFF grouping, combined with layout proximity analysis of the 1-bit FFs.
Layout proximity considerations also decided whether a 2-MBFF or a 4-MBFF grouping was in
order. The requirement of having a timing-converged layout as a starting point for the MBFF
design flow is a burden, and in practice very restrictive. Timing constraints may be very tight, thus
limiting the potential FF merging. Moreover, having a timing-converged layout in hand reduces
the incentive to change the design.
None of these works considered FF activity as a factor to drive the MBFF grouping. Our work
proposes a systematic, toggling probability-driven MBFF grouping algorithm, provably
maximizing the expected energy savings. In our view, MBFFs should be introduced at the RTL
and logic design level, based on architectural, structural and, most importantly, FF activity
considerations. The rest of the paper is organized as follows. Section 2 studies the effect of data
toggling probabilities on the potential energy savings, Section 3 shows how to combine Data-Driven
Clock Gating (DDCG) with MBFFs, Section 4 addresses the question of which FFs should be
grouped in a DDCG MBFF, and Section 5 captures everything together in a design flow.
2. The effect of data toggling probabilities on energy savings
The dependency of the potential MBFF energy savings on the toggling probability is
demonstrated in Fig. 1.3, obtained by SPICE simulations. It shows the energy consumed by a 1-bit
FF, a 2-MBFF and a 4-MBFF. Notice the “base” dynamic energy paid for the clock, regardless of the
input activity. The base energy growth of the 2-MBFF shown in Figs. 1.3(b) and 1.3(c) compared
to the 1-bit FF in Fig. 1.3(a) stems from its larger internal load. Expectedly, the energy consumption
grows linearly with the data toggling probability, and it is twice as large when both inputs toggle
simultaneously compared to a single input toggling. Similar behavior is shown for the 4-MBFF in
Figs. 1.3(d) and 1.3(e).

Fig. 1.3: The dependency of the MBFF energy savings on the toggling probability.
Let p be the data-to-clock toggling probability. Denote by E_1 the expected energy
consumed by a 1-bit FF. We conclude from Fig. 1.3(a) that

E_1 = E_clk,1 + p * E_d,1        (1)

where E_clk,1 is the energy of the FF's internal clock driver, and E_d,1 is the energy of a data
toggle. For a 2-MBFF there are three possible scenarios: none of the FFs toggle, a single FF
toggles, and both FFs toggle. Assuming data toggling independence, the expected energy
consumption E_2 is

E_2 = E_clk,2 + [2p(1 - p) + 2p^2] * E_d,2 = E_clk,2 + 2p * E_d,2        (2)

where E_clk,2 is the energy of the internal clock driver, and E_d,2 is the per-bit data toggling
energy. For the general case of a k-MBFF, let E_clk,k be the energy of the MBFF's internal clock
driver and E_d,k be the per-bit data toggling energy. Considering all the combinations of toggling
FFs, the expected energy is

E_k = E_clk,k + sum_{j=0..k} C(k, j) p^j (1 - p)^(k-j) * j * E_d,k = E_clk,k + k p * E_d,k        (3)

The equality in (3) is obtained by applying some rearrangements [8].
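As a sanity check of the model, the following minimal sketch (Python, with illustrative rather than library-characterized energy values) evaluates the binomial expectation in (3) and confirms that it collapses to E_k = E_clk,k + k*p*E_d,k.

from math import comb

def expected_energy(k, p, e_clk, e_data):
    # Expected per-cycle energy of a k-MBFF per Eq. (3): the clock-driver
    # energy plus the expected number of toggling bits times the per-bit
    # data toggling energy. The binomial sum equals k * p analytically.
    exp_toggles = sum(j * comb(k, j) * p**j * (1 - p)**(k - j) for j in range(k + 1))
    return e_clk + exp_toggles * e_data

# Illustrative energies in arbitrary units (not taken from the SPICE data).
E1 = expected_energy(1, 0.05, e_clk=1.0, e_data=0.4)
E4 = expected_energy(4, 0.05, e_clk=2.2, e_data=0.45)
print(E1, E4, 4 * E1 / E4)   # a ratio above 1 means the 4-MBFF saves energy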

Fig. 1.4. Energy savings dependency on data toggling probabilities [8].


To assess the potential MBFF energy savings, Fig. 1.4 shows the energy ratio of two and
four 1-bit FFs to that of a 2-MBFF and a 4-MBFF, respectively: we divide the energy difference
between k individual FFs and a k-MBFF by the energy of the k individual FFs. For small p the figure
shows savings of about (1.6 − 1)/1.6 ≈ 35% for k = 2 and (2.2 − 1)/2.2 ≈ 55% for k = 4. For large p the
savings are (1.18 − 1)/1.18 ≈ 15% for k = 2 and (1.3 − 1)/1.3 ≈ 23% for k = 4.
In typical VLSI systems the average p does not exceed 0.05, so high savings are realizable.
Section 4, which considers the MBFF energy savings obtained by introducing DDCG, generalizes the
energy consumption model to the case of distinct data toggling probabilities. It is also
important to note that toggling independence is a worst-case assumption; in reality the
correlation of FF toggling can be exploited to yield higher energy savings.
CHAPTER 2
LITERATURE SURVEY

[1] Digital Systems Power Management for High Performance Mixed Signal Platforms
High performance mixed signal (HPMS) platforms require stringent overall system and
subsystem performance. The ability to design ultra-low power systems is used in a wide range of
platforms including consumer, mobile, identification, healthcare products and microcontrollers.
In this paper we present an overview of low power design techniques, challenges and
opportunities faced in an industrial research environment. The paper presents strategies on the
deployment of low power techniques that span from power-performance optimization scenarios
accounting for active and standby operation modes to the development of multi-core
architectures suitable for low voltage operation.
 
[2] The Optimal Fan-Out of Clock Network for Power Minimization by Adaptive Gating
Gating of the clock signal in VLSI chips is nowadays a mainstream design methodology
for reducing switching power consumption. In this paper we develop a probabilistic model of the
clock gating network that allows us to quantify the expected power savings and the implied
overhead. Expressions for the power savings in a gated clock tree are presented and the optimal
gater fan-out is derived, based on flip-flops toggling probabilities and process technology
parameters. The resulting clock gating methodology achieves 10% savings of the total clock tree
switching power. The timing implications of the proposed gating scheme are discussed. The
grouping of FFs for joint clock gating is also discussed. The analysis and the results match
the experimental data obtained for a 3-D graphics processor and a 16-bit microcontroller, both
designed at 65-nanometer technology.

[3] Multi-bit flip-flop usage impact on physical synthesis


Reducing clock network power is an efficient way to reduce power consumption of the
high-frequency ASICs since it accounts for a considerable amount of the dynamic chip power.
Recently, the use of multi-bit flip-flops (MBFFs) has been shown to be an effective design
technique to improve clock tree synthesis and can be used either as an alternative or in
conjunction with the well-known clock gating approach targeting clock power reduction. The
idea behind this technique is that clock tree power savings can be achieved by using flip-flop
cells with optimized design and also through a reduced clock tree once the number of clock sinks
is smaller in a design with MBFF cells. Some recent works have been proposing methods to take
advantage of using MBFFs in standard cell based designs, where single-bit flip-flops are
replaced by MBFF cells during logic and/or physical syntheses. However, a more complete
analysis is still needed for different steps of a design flow to help understand the impact of
MBFFs on the physical design. We present in this work a comprehensive comparison between
traditional flip-flop and MBFF implementations of an industrial 55nm design. Our results
consider area, power and timing as well as some side effects like clock skew, routing congestion
and voltage drop distribution. Finally, this study points to some potential drawbacks of using
MBFFs which may be helpful for designers to make trade-off decisions in high performance SoC
designs.

[4] Construction of constrained multi-bit flip-flops for clock power reduction


Based on the elimination feature of redundant inverters in merging 1-bit flip-flops into
multi-bit flip-flops, given the congestion constraint of unallocated bins and the length constraints
of the input and output signals of all the 1-bit flip-flops, an efficient two-phase approach is
proposed to obtain the final multi-bit flip-flops. Compared with the original design in the
numbers of inverters for two tested examples, the experimental results show that our proposed
approach eliminates 68% of inverters to maintain the synchronous designs and saves 19.75% of
the clock power on the average for two tested examples in reasonable CPU time.

[5]INTEGRA: Fast Multibit Flip-Flop Clustering for Clock Power Saving


Clock power is the major contributor to dynamic power for modern integrated circuit
design. A conventional single-bit flip-flop cell uses an inverter chain with high drive strength to
drive the clock signal. Clustering several such cells and forming a multibit flip-flop can share the
drive strength, dynamic power, and area of the inverter chain, and can even save the clock
network power and facilitate the skew control. Hence, in this paper, we focus on postplacement
multibit flip-flop clustering to gain these benefits. Utilizing the properties of Manhattan distance
and coordinate transformation, we model the problem instance by two interval graphs and use a
pair of linear-sized sequences as our representation. Without enumerating all possible
combinations, we identify only partial sequences that are necessary to cluster flip-flops, thus
leading to an efficient clustering scheme. Moreover, our fast coordinate transformation also
makes the execution of our algorithm very efficient. The experiments are conducted on industrial
circuits. Our results show that concise representation delivers superior efficiency and
effectiveness. Even under timing and placement density constraints, clock power saving via
multibit flip-flop clustering can still be substantial at postplacement.

[6] Pulsed-Latch Replacement Using Concurrent Time Borrowing and Clock Gating
Flip-flops are the most common form of sequencing elements; however, they have a
significantly higher sequencing overhead than latches in terms of delay, power, and area. Hence,
pulsed latches are a promising option to reduce power for high-performance circuits. In this
paper, to save power and compensate for timing violations, we fully utilize the intrinsic time
borrowing property of pulsed latches and consider clock gating during pulsed-latch replacement.
Experimental results show that our approach can generate very power efficient results.

[7] Design Flow for Flip-Flop Grouping in Data-Driven Clock Gating


Clock gating is a predominant technique used for power saving. It is observed that the
commonly used synthesis-based gating still leaves a large amount of redundant clock pulses.
Data-driven gating aims to disable these. To reduce the hardware overhead involved, flip-flops
(FFs) are grouped so that they share a common clock enabling signal. The question of what group
size maximizes the power savings was answered in a previous paper. Here we answer the
question of which FFs should be placed in a group to maximize the power reduction. We propose
a practical solution based on the toggling activity correlations of the FFs and their physical position
proximity constraints in the layout. Our data-driven clock gating is integrated into a commercial
Electronic Design Automation (EDA) backend design flow, achieving total power reductions of
15%-20% for various types of large-scale, state-of-the-art industrial and academic designs in 40-
and 65-nanometer process technologies. These savings are achieved on top of the savings obtained
by clock gating synthesis performed by commercial EDA tools and by gating manually inserted into
the register transfer level design.

[8] Effective and Efficient Approach for Power Reduction by Using Multi-Bit Flip-Flops
Power has become a burning issue in modern VLSI design. In modern integrated
circuits, the power consumed by clocking gradually takes a dominant part. Given a design, we
can reduce its power consumption by replacing some flip-flops with fewer multi-bit flip-flops.
However, this procedure may affect the performance of the original circuit. Hence, the flip-flop
replacement without timing and placement capacity constraints violation becomes a quite
complex problem. To deal with the difficulty efficiently, we have proposed several techniques.
First, we perform a co-ordinate transformation to identify those flip-flops that can be merged and
their legal regions. Besides, we show how to build a combination table to enumerate possible
combinations of flip-flops provided by a library. Finally, we use a hierarchical way to merge
flip-flops. Besides power reduction, the objective of minimizing the total wirelength is also
considered. The time complexity of our algorithm is Θ(n^1.12), less than the empirical complexity
of Θ(n^2). According to the experimental results, our algorithm significantly reduces clock power
by 20-30%, and the running time is very short. In the largest test case, which contains 1,700,000
flip-flops, our algorithm takes only about 5 minutes to replace the flip-flops, and the power reduction
reaches 21%.
CHAPTER 3
CLOCK GATING FLIP-FLOPS

Flip-Flops: Flip-flops are an application of logic gates. With the help of Boolean
logic, memory can be created with them. Flip-flops can also be considered the most basic building
block of a Random Access Memory (RAM). When a certain input value is given to them, it is
remembered and acted upon, provided the logic gates are designed correctly. Higher-level
applications of flip-flops are helpful in designing better electronic circuits.

The most commonly used application of flip-flops is in the implementation of feedback
circuits. As a memory relies on the feedback concept, flip-flops can be used to design it.

MULTI-BIT FLIP-FLOPS

Multi-bit flip-flops are capable of reducing the power consumption because they share the
inverters inside the flip-flop cell. Clock skew is also minimized at the same time. Single-bit and
multi-bit flip-flops have the same clock conditions, and the set and reset conditions are also the same.
An example of a multi-bit flip-flop is shown in Fig. 1.1: a 2-bit flip-flop is formed by merging two
single 1-bit flip-flops. They share the clock buffer, and power reduction is thereby achieved.

Advantages of multi-bit flip-flops

1. Duplicate inverters are avoided.

2. The total area contributed by flip-flops is reduced.

3. Power is optimized through shared inverters.

ALGORITHM.

The algorithm is split into three steps. The first is to identify the mergeable flip-flops. In the
second, we build the combinational table according to the overlapping regions found in the first step;
the table is built in a binary-tree representation for ease of use. In the third step, the flip-flops are
merged based on the combinational table, as sketched below.
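A runnable toy version of this three-step flow is sketched below (Python); the data structures are deliberate simplifications, since real tools operate on placement regions and timing constraints rather than plain lists of bit widths.

def build_combination_table(bit_widths):
    # Step 2: record which pairs of available widths combine into a wider
    # flip-flop that also exists in the library.
    table = []
    for a in sorted(bit_widths):
        for b in sorted(bit_widths):
            if a + b in bit_widths:
                table.append((a, b, a + b))
    return table

def merge_pass(ff_widths, table):
    # Step 3: greedily merge the first pair of flip-flops matching a table entry.
    for a, b, merged in table:
        need_b = 2 if a == b else 1
        if ff_widths.count(a) >= 1 and ff_widths.count(b) >= need_b:
            ff_widths.remove(a)
            ff_widths.remove(b)
            ff_widths.append(merged)
            return True
    return False

library = {1, 2, 4}            # MBFF widths available in the cell library (illustrative)
ffs = [1, 1, 1, 1, 1, 1]       # step 1 output: six mergeable 1-bit flip-flops
table = build_combination_table(library)
while merge_pass(ffs, table):
    pass
print(ffs)                     # e.g. [2, 4]: the 1-bit FFs folded into a 2-bit and a 4-bit cell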
IDENTIFICATION OF MERGEABLE FLIP-FLOPS.

Based on the flip-flops used in the digital circuit, the flip-flops suitable for merging are
identified. During the identification, each flip-flop still has its own separate clock.

COMBINATIONAL TABLE.

To perform the process efficiently, we build the combinational table. If we merged the flip-flops
without building the combinational table, the result would not be efficient, because mergeable
flip-flops need not lie in the intersection region. The combinational table is made on the basis of the
library initialization values: based on the library bit widths we build the possible combinations of
flip-flops. The initializations in the algorithm are as follows: the library is denoted as L, the
combinational table is denoted as T, b(ni) denotes the bit width, and ni denotes one combination
in T. The minimum size is 1 bit, and the minimum library size is initialized from the library, because
we merge a number of 1-bit flip-flops. Figure 3.1 shows an example of a dual-bit flip-flop cell. It has
two data input pins, two data output pins, one clock pin and one reset pin. Using a dual-bit flip-flop
obtains the benefit of lower power consumption than single-bit flip-flops, with almost no additional
costs to pay. Figure 3.2 shows the truth table of the dual-bit flip-flop cell: at the active clock edge,
the values of D1 and D2 pass to Q1 and Q2; otherwise Q1 and Q2 retain their values.

Figure 3.1: A dual-bit flip-flop cell


Fig. 3.2: The truth table of the dual-bit flip-flop cell

Clock Gating to Flip-Flops.


The clock enabling signals are first defined at the system level, where the functional blocks and
modules that need not be clocked can be effectively identified. These signals are later translated
into clock enabling signals at the gate level. In other devices the clock gates are added automatically
as a design consideration, yet the circuit may still contain clock activity that serves no purpose.
For this situation we need to calculate the dynamic power consumed by the circuit when the clock
signals are enabled; assessing the clock gating therefore requires analysis of the FFs' states and
activity requirements.
The clock of a FF can be disabled in the next cycle by XOR-ing its present data input with its
present output, since a mismatch means the new value will appear at the output in the next cycle.
The outputs of the XOR gates are then OR-ed to generate the joint gating signal for the FFs, which
is latched to avoid glitches. The Integrated Clock Gate (ICG) cell provided by the design
environment combines such a latch with an AND gate. These latches can be used in ultra-low-power
applications such as a digital filter, where the data-driven clock gating signals serve as the
enabling signals. The trade-off for the ICG is between the number of clock pulses that can be
disabled and the hardware overhead: as the number of flip-flops sharing a gater increases, the
per-FF hardware overhead obtained by OR-ing the enable signals decreases, at the cost of some
disabling effectiveness.
Clock gating signals do not come for free. Extra logic and interconnect are needed to
generate those signals, and the resulting area and power overheads must be considered. If an
individual gated clock input is given to each FF, more power is consumed and the extra clock
circuitry also costs more area, resulting in a high overhead. The clock load is therefore reduced by
using gating circuits shared among several flip-flops, which consume a small amount of power.
Registers are attached to the clocks through the enable conditions used by clock gating.
Deriving the clock gating from the enable conditions of the imperative design style also saves
power as well as a large number of multiplexers in the logic circuit; those circuits can be replaced
by the clock gating signals of the clock distribution network (CDN). In its general form, the ICG
distributes these signals to the clocks at the various levels of the CDN. Since the clock gating logic
changes the clock tree structure, it must remain within the same tree.
The strategies for inserting clock gating logic are as follows:
1) The RTL code already contains the enable conditions, which can be exploited by logic-level
synthesis.
2) The designer can specify modules or registers to be driven through an ICG taken as a
library function.
3) Clock gating can be inserted semi-automatically, generating ICG cells that are either
enabled at the RTL level or inserted at the ICG level during optimization.

Data Driven Clock Gating for single Flip-flop.


Data-driven gating causes area and power overheads that must be considered. In an
attempt to reduce the overhead, it is proposed to group several FFs to be driven by the same
gated clock signal, generated by OR-ing the enabling signals of the individual FFs. This may,
however, lower the disabling effectiveness. It is therefore beneficial to group FFs whose switching
activities are highly correlated and to derive a joint enabling signal. In a recent paper, a model for
data-driven gating was developed based on the toggling activity of the constituent FFs. The optimal
fan-out of a clock gater yielding maximal power savings is derived based on the average toggling
statistics of the individual FFs, the process technology, and the cell library in use. In general, the state
transitions of FFs in digital systems depend on the data they process. Assessing the effectiveness
of data-driven clock gating therefore requires extensive simulations and statistical analysis of
the FFs’ activity.
The dynamic power consumption is reduced by the clock gating technique as follows. The
data-driven clock gating signals use the toggling activity to enable the clock, so the flip-flops and
latches are clocked only when the gating signals allow it. The outputs of the XOR gates are OR-ed
to give the joint gating signal for the flip-flops, which is then latched to avoid the glitches present
in the gating path.

Fig 3.3: Clock gating for a single flip-flop.


The schematic of a gated latch is shown in Figure 3.3. The latch is positive level-sensitive
(it is transparent when ckg=1 and in hold for ckg=0). The comparison between D and Q is
performed by an XOR gate, while the gating logic is a simple AND gate. The operation of the
circuit is as follows. If ck is 0, then ckg is also 0 and the latch is correctly in hold state. On the
other hand, when ck is high and D is different from Q, the gating logic enables the ckg signal so
that the latch can correctly switch. Note that if D is equal to Q the gating logic inhibits the
propagation of switching activity from ck to ckg.
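A cycle-level sketch of that behavior is given below (Python rather than RTL, with an assumed input sequence); it shows how the XOR comparison suppresses every clock pulse for which D already equals Q.

def gated_ff_trace(d_sequence):
    # Toy cycle-accurate model of the gated element in Fig. 3.3:
    # the clock is passed through (ckg = 1) only when D != Q.
    q, pulses, gated = 0, 0, 0
    for d in d_sequence:
        enable = d ^ q            # XOR comparison of D and Q
        if enable:                # AND gating: ckg follows ck only when enable is 1
            q = d
            pulses += 1
        else:
            gated += 1            # a redundant clock pulse is suppressed
    return q, pulses, gated

print(gated_ff_trace([0, 1, 1, 1, 0, 0, 1, 1]))   # assumed data stream -> (1, 3, 5)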
In the existing system, power reduction is achieved by using clock gating. With clock gating,
the clock signal is ANDed with an explicitly predefined enabling signal. However, this clock gating
still leaves a large number of redundant clock pulses. Although substantially
increasing design productivity, synthesis tools require the employment of a long chain of automatic
synthesis algorithms, from register transfer level (RTL) down to gate level and netlist.
Unfortunately, such automation leaves a large number of unnecessary clock toggles, thus
increasing the number of wasted clock pulses at flip-flops (FFs), as shown in this paper through
several industrial examples. Consequently, the development of automatic and effective methods to
reduce this inefficiency is desirable. In the sequel, we use the terms toggling, switching, and
activity interchangeably.
To remedy this, the MBFF internal drivers can be strengthened at the cost of some extra
power. It is therefore recommended to apply the MBFF at the RTL design level to avoid the
timing closure hurdles caused by the introduction of the MBFF at the backend design stage. Because
the average data-to-clock toggling ratio of FFs is very small, usually ranging
from 0.01 to 0.1, the clock power savings always outweigh the short-circuit power penalty of the
data toggling. An MBFF grouping should be driven by logical, structural, and FF activity
considerations. While FF grouping at the layout level has been studied thoroughly, the front-end
implications of MBFF group size and how it affects clock gating (CG) have attracted little
attention. This work responds to two questions. The first is what the optimal bit multiplicity k of
data-driven clock-gated (DDCG) MBFFs should be. The second is how to maximize the power
savings based on the data-to-clock toggling ratio (also termed activity or data toggling probability).

DISADVANTAGES:
• Power consumption is high.
CHAPTER 4
PROPOSED SYSTEM
The multi-bit flip-flop method reduces the total inverter count by sharing the
inverters among the flip-flops. Data-driven clock gating removes redundant clock pulses. Combining
multi-bit flip-flops with data-driven clock gating increases the power savings further.
The Xilinx software tool is used for implementing the proposed system. This paper studies data-driven
clock gating, employed for FFs at the gate level, which is the most aggressive gating possible:
the clock signal driving a FF is disabled (gated) when the FF's state is not subject to change in
the next clock cycle. Data-driven gating causes area and power overheads that must be
considered. In an attempt to reduce the overhead, it is proposed to group several FFs to be driven
by the same gated clock signal, generated by OR-ing the enabling signals of the individual FFs. This may,
however, lower the disabling effectiveness. It is therefore beneficial to group FFs whose
switching activities are highly correlated and to derive a joint enabling signal. In a recent paper, a
model for data-driven gating was developed based on the toggling activity of the constituent FFs.
With the limited power/thermal budgets of modern systems on chips (SoCs), which integrate an
increasing number of transistors, power minimization has become one of the most important
objectives in designing SoCs for various applications. High power dissipation of an SoC will
not only increase its system cost but also affect the product lifetime and reliability. To optimize
power consumption in electrical and physical design, many design methodologies have been
introduced, such as creating multi-supply-voltage (MSV) designs and replacing non-timing-critical
cells with their high-𝑉𝑡 counterparts.

An electrical and physical design power optimization methodology and design techniques
developed to create an IC with an ARM 1136JF-S microprocessor in 90-nm standard CMOS have been
presented. Design technology and methodology enhancements to enable multiple-supply-voltage
operation, leakage current and clock rate optimization, single-pass RTL synthesis, VDD selection,
power optimization, and timing and electrical closure in a multi-VDD domain design were described.
A 40% reduction in dynamic and a 46% reduction in leakage power dissipation were achieved while
maintaining a 355-MHz operating clock rate under typical conditions, and the functional and
electrical design requirements were achieved with the first silicon.

Power dissipation is quickly becoming one of the most important limiters in nanometer IC design,
since leakage increases exponentially as the technology scales down. However, power and timing are
often conflicting objectives during optimization. A novel total power optimization flow under a
performance constraint has therefore been proposed: instead of using placement, gate sizing, and
multiple-Vt assignment techniques independently, they are combined through the concept of slack
distribution management to maximize the potential for power reduction. Linear programming (LP)
based placement and geometric programming (GP) based gate sizing formulations are used to improve
the slack distribution, which helps to maximize the total power reduction during the Vt-assignment
stage. The formulations include important practical design constraints, such as slew, noise and short-circuit
power, which were often ignored previously. The algorithm was tested on a set of
industrial-strength, manually optimized circuits from a multi-GHz 65-nm microprocessor, and very
promising results were obtained. To the best of our knowledge, this is the first work that combines
placement, gate sizing and Vt swapping systematically for total power (and in particular leakage)
management.

Workload placement on servers has traditionally been driven mainly by performance
objectives. In this work, we investigate the design, implementation, and evaluation of a power-aware
application placement controller in the context of an environment with heterogeneous virtualized
server clusters.
The placement component of the application management middleware takes into account
the power and migration costs in addition to the performance benefit while placing the
application containers on the physical servers. The contribution of this work is two-fold: first, we
present multiple ways to capture the cost-aware application placement problem that may be
applied to various settings. For each formulation, we provide details on the kind of information
required to solve the problems, the model assumptions, and the practicality of the assumptions on
real servers. In the second part of our study, we present the pMapper architecture and placement
algorithms to solve one practical formulation of the problem: minimizing power subject to a
fixed performance requirement. We also present an automatic register placement technique that
enables the synthesis of low-power clock trees for low-power ICs. On 7 industrial designs,
compared with (1) a commercial base flow and (2) a power-aware placement technique, the proposed
technique reduced clock-tree power by 19.0% and 14.9%, total power by 15.3% and
5.2%, and WNS under on-chip variation (±10%) by 1.8% and 1.5%, respectively, on average.
The progress of VLSI technology is facing two limiting factors: power and
variation. Minimizing clock network size can lead to reduced power consumption, less power
supply noise, fewer clock buffers and therefore less vulnerability to variations. Previous
works on clock network minimization are mostly focused on clock routing, and the improvements
are often limited by the input register placement. In this work, we propose to navigate registers in
cell placement for further clock network size reduction. To resolve the conflict between clock
network minimization and traditional placement goals, we suggest the following techniques in a
quadratic placement framework: (1) Manhattan-ring-based register guidance; (2) center-of-gravity
constraints for registers; (3) pseudo pins and nets; (4) register cluster contraction. These techniques
work for both zero-skew and prescribed-skew designs in both wire-length-driven and timing-driven
placement. Experimental results show that our method can reduce clock net wire length
by 16%-33%, with no more than a 0.5% increase in signal net wire length, compared with
conventional approaches and applying multi-bit registers.
We present an automatic register placement technique that enables the synthesis of low-power
clock trees for low-power ICs. Merging 1-bit flip-flops into multi-bit flip-flops in the
post-placement stage is one of the most effective techniques for minimizing clock power. The
obstacles that hinder the merging process for multi-bit flip-flops are (1) the input and output
timing constraint on every flip-flop and (2) the area constraint on every partitioned bin in the
placement plane. Among these methodologies, applying multi-bit flip-flops, also called multi-bit
registers [6] or register banks [4], is one of the most effective methodologies for saving both chip
area and power consumption.
The optimal fan-out of a clock gater yielding maximal power savings is derived based on
the average toggling statistics of the individual FFs, process technology, and cell library in use.
In general, the state transitions of FFs in digital systems depend on the data they process.
Assessing the effectiveness of data-driven clock gating requires, therefore, extensive simulations
and statistical analysis of the FFs’ activity. Another grouping of FFs for clock switching power
reduction is the multi-bit FF (MBFF).
An MBFF physically merges FFs into a single cell such that the inverters driving
the clock pulse into its master and slave latches are shared among all FFs in the group. MBFF
grouping is mainly driven by the physical position proximity of the individual FFs, while grouping
for data-driven clock gating should combine toggling similarity with physical position
considerations. Beyond the group size that maximizes power savings, this paper studies the questions of:
1) which FFs should be placed in a group to maximize the power reduction and 2) how to
algorithmically derive those groups. We also describe a backend design flow implementation.

3. Introducing clock gating into MBFF


The MBFFs discussed so far were driven by a free-running, ungated clock signal. Fig.
4.1 illustrates a DDCG integrated into a k-MBFF. All the shaded circuits reside within a library
cell. It was shown in [2] that given an activity p, the group size k which maximizes the energy
savings solves the equation

…………………(4)

where C_FF and C_latch are the clock input loads of a FF and a latch, respectively. The solution of (4)
for various activities is shown in Table 2 for typical values of C_FF and C_latch.

Fig 4.1: DDCG integrated into a k-MBFF.

Table 2: Dependency of the optimal MBFF multiplicity on toggling probability.


Unless otherwise stated, the MBFFs discussed in the sequel are DDCG. To grasp the
power savings achievable by DDCG of a k-MBFF, Fig. 4.1 has been simulated with SPICE for
various activities p and multiplicities k = 2, 4, 8. Fig. 4.2 shows the power consumption of a 2-MBFF.
Line (a) represents the power consumed by two 1-bit FFs driven independently of each
other. The 3.8 µW power consumed at zero activity is due to the toggling of the clock driver at
each FF, and it is always consumed regardless of the activity. Line (b) corresponds to the
ideal case where the two FFs toggle simultaneously. In that case the clock driver shared by the
two FFs either toggles for the sake of both, or is disabled by the internal gater shown in Fig.
4.1. Expectedly, the power consumed at zero activity is nearly half that of two 1-bit FFs.
As the activity increases, the power of (b) grows faster than that of (a), since the gating circuit
consumes power proportionally to the activity.
There is no point in using a 2-MBFF beyond the 0.17 activity crossing point, beyond
which power starts being lost.

Fig 4.2: Power consumption of 2 FFs vs. 2-MBFF.


Line (c) shows the case where the FFs toggle disjointly. This is obviously the worst
case, since the clock driver works for both FFs while only one needs it. As for (b), in the case of
disjoint toggling there is no point in using a 2-MBFF if the FF activities are higher than 0.11.
Given an activity, the power savings of the 2-MBFF is the distance between line (b) or (c) and line (a).
Notice that for zero activity the per-bit power savings is (3.8 − 1.8)/2 = 1.0 µW.

Fig 4.3: Power consumption of 4 FFs vs. 4-MBFF.


Fig. 4.3 shows the power consumed by a 4-MBFF, where line (a) corresponds to four 1-bit
FFs driven independently of each other, line (b) represents the best case of simultaneous toggling
of the 4-MBFF's FFs, and line (c) represents the worst case of disjoint toggling. For zero activity
the per-bit power savings is (7.4 − 2.2)/4 = 1.3 µW, larger than the 1.0 µW obtained for the 2-MBFF.
Notice, however, that in the worst case of disjoint toggling the 4-MBFF stops saving at 0.08
activity, earlier than the 0.09 of the 2-MBFF. In the best case of simultaneous toggling, however, the
4-MBFF is always favored over the 2-MBFF. Similar conclusions hold for the 8-MBFF shown in Fig. 4.4.
Its per-bit power savings for zero activity is (15.3 − 2.5)/8 = 1.6 µW.
The savings of the 8-MBFF stop at 0.06 activity in the worst case of disjoint toggling, and at
0.40 in the best case of simultaneous toggling.
Fig 4.4: Power consumption of 8 FFs vs. 8-MBFF.
4. Which FFs should be grouped in a DDCG MBFF
Section 2 quantified the k-MBFF expected energy E_k(p) under the assumption
of toggling independence and a free-running, ungated clock. Section 3 showed how toggling
correlation affects the breakeven probability at which an MBFF stops saving energy. Clearly, the
best grouping of FFs could be achieved for FFs whose toggling is almost completely correlated.
The problem of FF grouping yielding maximal toggling correlation, and hence maximal power
savings, has been shown to be NP-hard, and a practical solution yielding nearly maximal power
savings was presented in [10]. Its drawback is the requirement of early knowledge of Value
Change Dump (VCD) vectors, derived from many power simulations representing the typical
operation and applications of the design at hand. Such data may not exist at an early design
stage. More common information is the average bulk toggling probability of each FF in the
design, which the following discussion takes advantage of in deriving an optimal toggling
probability-driven FF grouping. The analysis so far assumed that all the FFs grouped in an MBFF
have the same data toggling probability p. FFs' toggling probabilities are usually different from each
other, and an important question is therefore how the probability variations affect the FF
grouping. Past works considered either structural FF grouping (e.g., successive bits in registers),
or post-layout grouping driven by physical proximity. We subsequently show that data toggling
probabilities matter and should be considered for maximizing energy savings.
Given n FFs FF_1, FF_2, ..., FF_n, consider their grouping in 2-MBFFs. Let a 2-MBFF, denoted
FF_{i,j}, comprise FF_i and FF_j, toggling independently with probabilities p_i and p_j, respectively.
When neither is toggling, the clock of FF_{i,j} is disabled and its internal clock driver does not
consume dynamic energy. When both FF_i and FF_j are toggling, the clock of FF_{i,j} is enabled,
the clock driver energy is fully useful, and there is no waste. A waste happens when one FF is
toggling while its counterpart does not: the clock pulse is enabled, driving both FFs, whereas only
one needs it. A waste W_{i,j} of half of the internal clock driver energy E_clk,2 thus occurs (see (2)),
given by

W_{i,j} = (E_clk,2 / 2) [ p_i (1 - p_j) + p_j (1 - p_i) ]        (5)

Given FF_i, FF_j, FF_k and FF_l, their pairing in two 2-MBFFs yields the energy waste

W_{i,j} + W_{k,l} = (E_clk,2 / 2) [ (p_i + p_j + p_k + p_l) - 2 (p_i p_j + p_k p_l) ]        (9)

While the first term (a) = p_i + p_j + p_k + p_l of (9) is independent of the pairing, the second term
(b) = p_i p_j + p_k p_l does depend on it. The expression W_{i,j} + W_{k,l} is therefore minimized
when (b) is maximized. If p_i ≤ p_j ≤ p_k ≤ p_l, then p_i p_j + p_k p_l ≥ p_i p_k + p_j p_l and
p_i p_j + p_k p_l ≥ p_i p_l + p_j p_k, so pairing the two low-activity FFs together and the two
high-activity FFs together maximizes (b).

The generalization to the pairing of n FFs is straightforward. Let n be even and let
P : { FF_{s_i, t_i} , 1 ≤ i ≤ n/2 } be a pairing of FF_1, FF_2, ..., FF_n into n/2 2-MBFFs. The
following energy waste W(P) results:

W(P) = (E_clk,2 / 2) [ Σ_{j=1..n} p_j - 2 Σ_{i=1..n/2} p_{s_i} p_{t_i} ]        (10)

Since Σ_{j=1..n} p_j is independent of the pairing, W(P) is minimized when Σ_{i=1..n/2} p_{s_i} p_{t_i}
is maximized. The optimal pairing minimizing W(P) is defined by the following theorem [8].

Theorem 1. Let n be even and let FF_1, FF_2, ..., FF_n be ordered such that their toggling
probabilities satisfy p_1 ≤ p_2 ≤ ... ≤ p_n. The pairing P : { FF_{2i-1, 2i} , 1 ≤ i ≤ n/2 } of successive
FFs minimizes W(P) given in (10).

The above result for grouping in 2-MBFFs is generalized to grouping in k-MBFFs as follows.

Theorem 2. Let n be divisible by k, and let FF_1, FF_2, ..., FF_n be ordered such that their toggling
probabilities satisfy p_1 ≤ p_2 ≤ ... ≤ p_n. The grouping of successive FFs

{ FF_{(j-1)k+1}, FF_{(j-1)k+2}, ..., FF_{jk} },   1 ≤ j ≤ n/k,

minimizes the energy waste incurred by the n/k k-MBFFs. The case where n is not divisible
by k has also been addressed in [8].
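The statement of Theorem 1 can be checked numerically with a small brute-force sketch (Python, random activities, all pairings of a small n); it is only a sanity check of Eq. (10), not part of the proposed design flow.

import random

def waste(pairing, p, e_clk2=1.0):
    # Total energy waste of a set of 2-MBFF pairs, summing Eq. (5) over the
    # pairs; this equals Eq. (10) up to rearrangement.
    return (e_clk2 / 2) * sum(p[i] * (1 - p[j]) + p[j] * (1 - p[i]) for i, j in pairing)

def all_pairings(indices):
    # Enumerate every way of splitting the index tuple into disjoint pairs.
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for k, partner in enumerate(rest):
        for tail in all_pairings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + tail

random.seed(0)
p = sorted(random.random() * 0.1 for _ in range(6))   # six FFs, activities in [0, 0.1]
best = min(all_pairings(tuple(range(6))), key=lambda pairing: waste(pairing, p))
print(best)   # (0, 1), (2, 3), (4, 5): successive pairs, as Theorem 1 predicts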
5. Capturing everything together in a design flow.
It was shown that knowledge of the toggling vectors (VCDs) of every FF, derived from
extensive simulations, may yield the best FF grouping. Such data are not always available, and
we therefore assume the model used above, in which the FFs toggle independently of each other.
The relation between the power savings, the FF's activity p, and the MBFF multiplicity k implies
that grouping in monotonic order of p maximizes the power savings. The activity p and the
multiplicity k must therefore be jointly considered in a design flow to maximize the power savings.
To this end we consider Figs. 4.2, 4.3 and 4.4, illustrating the power savings of 2-MBFF, 4-MBFF
and 8-MBFF, respectively. The interim line (d) shown between the extreme cases of simultaneous
and disjoint FF toggling represents a more realistic operation, where FFs may toggle independently
of each other. Knowing the activity of a FF, the decision of which MBFF size it should be grouped
into follows the interim lines. Fig. 4.5 puts Figs. 4.2, 4.3 and 4.4 on a common scale of per-bit
power consumption, obtained by dividing each by its respective multiplicity.
Fig. 4.5. Division of the activity into ranges of maximal savings.
Fig. 4.5 illustrates how the range of FF activity is divided into regions to obtain maximal
power savings. The black line follows the power consumed by a 1-bit ungated FF. The triangular
areas bounded between the black line and the green, blue and red per-bit power consumption
lines indicate the amount of power savings resulting from grouping a FF in a 2-MBFF, 4-MBFF and 8-MBFF, respectively. It shows that for very low activity it pays to group FFs in 8-MBFFs. As the activity increases, there is a crossover point beyond which 4-MBFF grouping pays more. At some higher activity 2-MBFF grouping pays better, up to the activity where the power savings vanish. We take
advantage of that behavior in the following MBFF grouping algorithm.
1. Sort the n FFs such that $p_1 \le p_2 \le \dots \le p_n$.
2. Set $i \leftarrow 1$.
3. Decide on the optimal k for $p_i$, based on Fig. 4.5.
4. Group $\mathrm{FF}_i, \mathrm{FF}_{i+1}, \dots, \mathrm{FF}_{i+k-1}$ in a k-MBFF.
5. Set $i \leftarrow i + k$.
6. If $i > n$ stop. Else go to 3.
Placement of Flip-Flop Groups
Once the IS of TSFGs is obtained, we should determine a proper location for the MBFF corresponding to each TSFG with consideration of both placement density and interconnecting wire length.
1) Consideration of Placement Density: Before finding a legal placement for an MBFF corresponding to a TSFG within the tilted rectangular placement region, the placement bins covered by the tilted rectangular placement region should be collected. The bins intersected by each boundary of the tilted rectangular placement region are first identified. The bins surrounded by these intersected bins can then be found and collected accordingly. For density-driven placement, the bin with the lowest placement density is chosen to accommodate an MBFF corresponding to a TSFG.
If there is no valid placement grid in the bin, the bin with the second lowest placement
density is then chosen. The grid-searching process is repeated until a valid placement grid for the
MBFF is found. 2) Consideration of Interconnecting Wire Length: In addition to the consideration of placement density, reducing the interconnecting wire length is also very important when placing an MBFF corresponding to a TSFG. To find a position for the MBFF with shorter wire length, the area bounded by the median coordinates of all pins connected to the MBFF is first considered. The median coordinates of the eight pins are $x_{p4}$, $x_{p5}$, $y_{p4}$, and $y_{p8}$ in the two directions. If there is no valid placement grid in a bin intersected by both the area bounded by the coordinates of the pins and the tilted rectangular placement region, the area bounded by the coordinates of the pins is enlarged to the next pitch, which is the closest one to the current pitches. $y_{p1}$ is the closest pitch to $y_{p8}$ compared with all the other neighboring pitches. The enlarged area is then bounded by $x_{p4}$, $x_{p5}$, $y_{p4}$, and $y_{p1}$. The process is continued until a
valid placement grid for the MBFF is found.
The placements of all flip-flops in each circuit have also been optimized. Table II lists the
names of the benchmark circuits (“Circuit”), the numbers of 1-bit flip-flops (“# of 1-bit FFs”),
the numbers of 2-bit flip-flops (“# of 2-bit FFs”), and the numbers of 4-bit flip-flops (“# of 4-bit
FFs”). A cell library containing 1-bit, 2-bit, and 4-bit flip-flops is also provided with the
specifications of their power consumption and areas. Table III lists the bit numbers of each flip-flop (“Bit # of Flip-Flop”), and the corresponding power consumption (“Power”) and areas
(“Area”). We compared the numbers of flip-flops with 1, 2, and 4 bits, the power reduction,
HPWL ratio, and runtime for three different approaches: (1) the proposed approach without
applying the progressive window-based optimization, (2) the proposed approach based on the
progressive window-based optimization with the consideration of placement density only, and
(3) the proposed approach based on the progressive window-based optimization with the
considerations of both placement density and interconnecting wire length. The comparison lists the names of the benchmark circuits (“Circuit”), the numbers of flip-flops with 1, 2, and 4 bits (“# of FFs (1, 2, 4 bits)”), the power reduction (“Power Red.”), the HPWL ratio between the resulting and input circuits (“HPWL Ratio”), and the runtimes (“Time”) for the three approaches. The results show that Approaches (2) and (3) outperform Approach (1) by at least 37222X, which is a significant improvement based on the progressive window-based optimization. Even for the largest circuit, containing hundreds of thousands of flip-flops, the runtime based on Approach (3) is only 79 seconds. Although Approach (2) is 7% faster in runtime, its HPWL ratio is 21% worse than that of
Approach (3). Therefore, the proposed approach based on the progressive window-based
optimization with the considerations of both placement density and interconnecting wire length
is very effective and efficient, which is capable of incrementally merging existing MBFFs in the
design to gain more power saving.
A few practical comments are in order. In addition to toggling probabilities, MBFF grouping should also consider logical relations and physical place and route constraints. An example is the pipeline registers of a microprocessor. It makes no sense to mix bits of different pipeline stages. It is obvious and natural that the place and route tool will put bits belonging to the same register close to each other, while FF clusters of registers belonging to distinct pipeline stages will be placed apart from each other. FFs of different pipeline registers should therefore not be mixed in an MBFF, although from a toggling probability standpoint their grouping may be preferred. Similar arguments hold for other system buses and registers, such as those storing data, addresses, and counters. Another example is the FFs of Finite State Machines (FSMs) in control units, whose MBFF grouping should not cross control logic borders.
Though the proposed algorithm is aimed at the RTL or gate design levels, it can also be combined with layout-driven grouping methods. There, an initial placement takes place as a “dry run” to obtain initial FF layout proximity directives. The toggling probability-driven algorithm can then consider those to guide the MBFF grouping. The later, real place and route will use MBFF library cells, unlike post-layout approaches that rip up the old FFs and insert MBFF replacements, a non-trivial and tedious layout task which is saved by our design flow.
The proposed MBFF design flow has been used for a 32-bit pipelined MIPS processor, implemented in TSMC 65 nm process technology. A workload of two programs has been used, as shown in Table 3. For each test the average activity of a FF in the pipeline register is shown in blue under the name of the pipeline stage. Notice that the activity decreases as the pipeline progresses from instruction fetch (IF) to write-back (WB).
Two MBFF grouping methods are examined. In the first, FFs have been grouped sequentially according to their bit number in their register. The second method grouped FFs in increasing order of their activities, shown in Section 4 to be optimal when FFs are assumed to toggle independently of each other. Both grouping methods adhered to the constraints of not crossing clock domain boundaries and not mixing FFs of unrelated logic entities. Table 3 shows the average activity for each k-MBFF, $k \in \{2, 4, 8\}$. In most cases grouping by monotonic activity is preferred (colored in green), though in a few cases it worsened (colored in red). That can happen since the grouping is blind to toggling correlation.

Table 3. Average FF activity of pipeline registers in 32-bit MIPS.


The pipeline registers were then implemented with MBFFs grouped by monotonic order of their activity. As shown in Fig. 4.5, the grouping starts with 8-MBFFs for the low activities, and then progresses to 4-MBFFs and 2-MBFFs as the FF activities increase, up to the zero-gain point where grouping stops and the remaining FFs stay alone and un-gated. Those could of course be grouped in un-gated MBFFs, just to reduce the number of internal clock drivers. Table 4 shows the power savings achieved at each of the pipeline registers for the sort and matrix multiplication weighted workload. The results were measured with SpyGlass [11] simulation, where the MIPS was operated at 1.1 V and 200 MHz. A savings of 34.6% was achieved. The pipeline registers consumed 65% of the entire MIPS power (memory not included), so the total power reduction of the entire design (CG HW overhead included) was 23%.

Table 4. Power savings in the pipeline registers of a 32-bit MIPS.


We finally show the power savings achieved by the grouping algorithm for a complete industrial network processor designed in 28 nm TSMC process technology, operating at 800 MHz. The processor is divided into seven units, named A to G, shown in Table 6. It consumes a total of 6.2 Watts, of which 45% is charged to the clock network with its underlying FFs. The original design comprises un-gated MBFFs, so the power savings is purely due to the addition of the clock gating in Fig. 5, on top of the savings obtained by fewer drivers in the un-gated MBFFs that existed in the original design. Furthermore, the original design includes extensive clock enable logic signals, defined by both the RTL compiler and manual insertions. The activities of the FFs were profiled first and then sorted. Table 5 shows a total of 8% net power savings, where the power measurements include both dynamic and static components and all the CG HW overheads. The 8% power savings was obtained on top of the 9% savings that had been achieved by changing from 1-bit FFs to un-gated MBFFs, yielding a 17% combined savings. Such savings is highly appreciated by the industrial VLSI design community. The area penalty due to the introduction of the clock gating circuitry was 2.3%.

Table 5. Power savings in a 40 nm network processor.


The latch and gater (AND gate) overheads are amortized over k FFs.
Let the average toggling probability of a FF (also called its activity factor) be denoted by p (0 < p < 1). Under the worst-case assumption of independent FF toggling, and assuming a uniform physical clock tree structure, it is shown in [9] that the number k of jointly gated FFs for which the power savings are maximized is the solution of equation (1), where $c_{\mathrm{FF}}$ is the FF's clock input capacitance, $c_{\mathrm{w}}$ is the unit-size wire capacitance, and $c_{\mathrm{latch}}$ is the latch capacitance including the wire capacitance of its clk input. Table I shows how the optimal k depends on p. Such a gating scheme has considerable timing implications, which are discussed in [9]. We will return to those when discussing the implementation of data-driven gating as a part of a complete design flow.
4.2 Implementation and Integration in a Design Flow.
In the following, we describe the implementation of data-driven clock gating as a part of
a standard backend design flow. It consists of the following steps.
1) Estimating the FFs' toggling probabilities. This involves running an extensive test bench representing typical operation modes of the system, and then determining the size k of a gated FF group by solving (1).
2) Running the placement tool in hand to get preliminary preferred locations of FFs in the
layout.
3) Employing an FF grouping tool to implement the model and algorithms presented in Sections III and IV, using the toggling correlation data obtained in Step 1 and the FF location data obtained in Step 2. The outcome of this step is k-size FF sets (with manual overrides if required), where the FFs in each set will be jointly clocked by a common gater.
4) Introducing the data-driven clock gating logic into the hardware description (we use Verilog HDL). This is done automatically by a software tool, adding appropriate Verilog code to implement the logic described in Fig. 2 (a sketch of such gating logic is given after this list). The FFs are connected according to the grouping obtained in Step 3. A delicate practical question is whether to introduce the gating logic into the RTL or the gate-level description. This depends on the design methodology in use, and its discussion is beyond the scope of this paper. We have introduced the gating logic into the RTL description.
5) Re-running the test bench of Step 1 to verify the full identity of FFs’ outputs before
and after the introduction of gating logic. Although data driven gating, by its very definition,
should not change the logic of signals, and hence FFs toggling should stay identical, a robust
design flow must implement this step.
6) Ordinary backend flow completion. From this point, the backend design flow proceeds
by applying ordinary place and route tools.
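For illustration, a minimal Verilog sketch of the gating logic added in Step 4 is given below. It assumes the usual data-driven gating structure of XOR comparators, an OR reduction and a latch-based clock gate driving a 4-bit MBFF; the module and signal names are hypothetical, and in a real flow the latch-and-AND gater would be an integrated clock-gating (ICG) library cell.

// Hypothetical sketch: a data-driven clock-gated 4-bit MBFF.
// The clock is enabled only when at least one grouped FF is about to toggle.
module gated_mbff4 (
    input  wire       clk,
    input  wire       rst_n,   // async reset so simulation starts from a known state
    input  wire [3:0] d,
    output reg  [3:0] q
);
  wire toggle_any = |(d ^ q);          // OR-reduction of per-bit XOR comparators

  // Transparent-low latch: the enable may change only while clk is low,
  // so the gated clock is glitch-free.
  reg en_latched;
  always @(clk or toggle_any)
    if (!clk)
      en_latched <= toggle_any;

  wire gclk = clk & en_latched;        // one gated clock shared by the group

  // The four FFs share the single (gated) internal clock driver.
  always @(posedge gclk or negedge rst_n)
    if (!rst_n)
      q <= 4'b0;
    else
      q <= d;
endmodule

Grouping k FFs behind one such gater amortizes the latch and AND-gate overhead over the k FFs, which is exactly the trade-off captured by (1).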

Double Edge Triggered Flip-Flop.


An edge-triggered D flip-flop (D-FF) [1] has a clock input (C) and a data input (D). Immediately after the C-signal changes from 0 to 1, the output Q assumes the value of D and holds that value until the next positive-going C-signal. Such a FF is said to trigger on the leading, or positive, edge of the clock pulse. Some FFs are designed to trigger on the trailing, or negative, edges of the C-signals. There are also edge-triggered JK-FFs that respond to the J and K signals following a C-transition (again they may be of either the positive or negative edge sensing type). The JK-FF output changes from 0 to 1 if J = 1 (independent of K) and changes from 1 to 0 if K = 1 (independent of J).
The advantage of edge triggering is that the control inputs (D, or J and K) may be changed at any time that is not in the neighborhood of a triggering edge of the C-signal. It also reduces sensitivity to noise pulses.
If the minimum interval between consecutive changes in the state of an edge-triggered FF is L in a synchronous system, then the clock pulse frequency must be at least 1/L. During each clock pulse period, one of the two transitions of the C-signal accomplishes nothing, although it will produce changes in the outputs of some of the logic elements internal to the FFs. Such activity is undesirable, since it results in increased power dissipation for virtually every technology now in use for implementing logic circuits. (In the case of CMOS logic, there is essentially no power dissipated except when a transition is occurring.) If FFs trigger on both edges of C-pulses, then the clock pulse generator operates at half the frequency for the same data
rate. This in itself would reduce the cost and power dissipation of the clock pulse generator and
of the clock pulse distribution system, and would also eliminate meaningless state changes at the
outputs of various gates. One would also expect to be able to increase data rates to some extent.
Several designs are presented here for double-edge-triggered (DET) D-FFs and for DET JK-FFs. The simplest designs in terms of logic complexity require delay elements, which reduce the allowable operating speeds. With the other designs, roughly 50-100 percent more complex than the corresponding single-edge-triggered circuits, no delay elements are necessary, so that maximum operating speeds are attainable. Only the basic operations are implemented; no set or clear operations are built in, and the complements of the outputs are not produced. These features would not be difficult to design in. Practical implementations would, in most cases, also utilize such elements as NAND and NOR gates or networks of pass transistors, rather than the AND-OR-INVERTER logic shown here.
The design of DET FFs is a good application of the theory of asynchronous sequential switching circuits; of particular interest, perhaps, is the use of decomposition techniques.
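As a behavioral illustration only (not one of the gate-level AND-OR designs discussed above), a DET D-FF can be sketched in Verilog with two single-edge sampling paths and an output multiplexer that selects the most recently captured value:

// Hypothetical behavioral sketch of a double-edge-triggered D flip-flop.
module det_dff (
    input  wire clk,
    input  wire d,
    output wire q
);
  reg q_pos, q_neg;

  always @(posedge clk) q_pos <= d;   // sample taken on the rising edge
  always @(negedge clk) q_neg <= d;   // sample taken on the falling edge

  // While clk is high the rising-edge sample is the newest, and vice versa.
  assign q = clk ? q_pos : q_neg;
endmodule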
FLOW TABLE DESCRIPTIONS OF DET-D-FFs:
Table 6 is a primitive flow table for a DET-D-FF. Note that simultaneous changes of D
and C are treated as though one of these variables changed first, but that it does not matter which
changed first. This is a realistic assumption, since we can never rely on exact simultaneity for
any pair of events. The option of treating a simultaneous change in either of two ways is left
open for exploitation later in the design process.
Using well-known methods, it is not difficult to show that there are precisely two minimal-row covers of this table, as shown in Table 7 (A) and (B). The parenthesized sets of numbers to the right of each row of (A) and (B) indicate the rows of Table 6 that are covered by the rows of the reduced tables.

TABLE 6 PRIMITIVE FLOW TABLE FOR DET-D-FF


TABLE 7. MINIMAL ROW COVERS OF TABLE 6
Table 7 (A) and (B) are equivalent to one another with respect to single-input-change (SIC) operation. However, they describe different responses to multiple input changes, a difference without practical significance, but one which leads to quite different implementations. Neither has any essential hazards, but (A) has d-transitions (e.g., from row 1 with CD = 01, when C changes), while (B) has none.

CHAPTER 5
HARDWARE REQUIREMENTS
GENERAL
Integrated circuit (IC) technology is the enabling technology for a whole host of innovative devices and systems that have changed the way we live. Jack Kilby and Robert Noyce invented the integrated circuit, and Kilby received the 2000 Nobel Prize in Physics for the invention; without the integrated circuit, neither transistors nor computers would be as important as they are today. VLSI systems are much smaller and consume less power than the discrete components used to build electronic systems before the 1960s.
Integration allows us to build systems with many more transistors, allowing much more
computing power to be applied to solving a problem. Integrated circuits are also much easier to
design and manufacture and are more reliable than discrete systems; that makes it possible to
develop special-purpose systems that are more efficient than general-purpose computers for the
task at hand.

5.1 APPLICATIONS OF VLSI


Electronic systems now perform a wide variety of tasks in daily life. Electronic systems
in some cases have replaced mechanisms that operated mechanically, hydraulically, or by other
means; electronics are usually smaller, more flexible, and easier to service. In other cases
electronic systems have created totally new applications. Electronic systems perform a variety of
tasks, some of them visible, some more hidden:
 Personal entertainment systems such as portable MP3 players and DVD players perform
sophisticated algorithms with remarkably little energy.
 Electronic systems in cars operate stereo systems and displays; they also control fuel injection
systems, adjust suspensions to varying terrain, and perform the control functions required for
anti-lock braking (ABS) systems.
 Digital electronics compress and decompress video, even at high definition data rates, on-the-fly
in consumer electronics.
 Low-cost terminals for Web browsing still require sophisticated electronics, despite their
dedicated function.
 Personal computers and workstations provide word-processing, financial analysis, and games.
Computers include both central processing units (CPUs) and special-purpose hardware for disk
access, faster screen display, etc.
Medical electronic systems measure bodily functions and perform complex processing
algorithms to warn about unusual conditions. The availability of these complex systems, far from
overwhelming consumers, only creates demand for even more complex systems. The growing
sophistication of applications continually pushes the design and manufacturing of integrated
circuits and electronic systems to new levels of complexity.
And perhaps the most amazing characteristic of this collection of systems is its variety. As systems become more complex, we build not a few general-purpose computers but an ever wider range of special-purpose systems. Our ability to do so is a testament to our growing mastery of
both integrated circuit manufacturing and design, but the increasing demands of customers
continue to test the limits of design and manufacturing.

5.2 ADVANTAGES OF VLSI:
While we will concentrate on integrated circuits here, the properties of integrated circuits (what we can and cannot efficiently put in an integrated circuit) largely determine the architecture of the entire system. Integrated circuits improve system characteristics in several critical ways. ICs have three key advantages over digital circuits built from discrete components:
 Size: Integrated circuits are much smaller—both transistors and wires are shrunk to micrometer
sizes, compared to the millimeter or centimeter scales of discrete components. Small size leads to
advantages in speed and power consumption, since smaller components have smaller parasitic
resistances, capacitances, and inductances.
 Speed: Signals can be switched between logic 0 and logic 1 much quicker within a chip than
they can between chips. Communication within a chip can occur hundreds of times faster than
communication between chips on a printed circuit board.
The high speed of circuits on-chip is due to their small size—smaller components and wires have
smaller parasitic capacitances to slow down the signal.
 Power consumption: Logic operations within a chip also take much less power. Once again,
lower power consumption is largely due to the small size of circuits on the chip smaller parasitic
capacitances and resistances require less power to drive them.
5.3 VLSI AND SYSTEMS
These advantages of integrated circuits translate into advantages at the system level:
 Smaller physical size: Smallness is often an advantage in itself—consider portable televisions
or handheld cellular telephones.
 Lower power consumption: Replacing a handful of standard parts with a single chip reduces
total power consumption. Reducing power consumption has a ripple effect on the rest of the
system: a smaller, cheaper power supply can be used; since less power consumption means less heat, a fan may no longer be necessary; and a simpler cabinet with less electromagnetic shielding may be feasible, too.
 Reduced cost: Reducing the number of components, the power supply requirements, cabinet
costs, and so on, will inevitably reduce system cost. The ripple effect of integration is such that
the cost of a system built from custom ICs can be less, even though the individual ICs cost more
than the standard parts they replace. Understanding why integrated circuit technology has such
profound influence on the design of digital systems requires understanding both the technology
of IC manufacturing and the economics of ICs and digital systems.

5.4 TYPES OF CHIPS


The preponderance of standard parts pushed the problems of building customized
systems back to the board-level designers who used the standard parts.
Since a function built from standard parts usually requires more components than if the
function were built with custom designed ICs, designers tended to build smaller, simpler
systems. The industrial trend, however, is to make available a wider variety of integrated circuits.
The greater diversity of chips includes:
More specialized standard parts:
In the 1960s, standard parts were logic gates; in the 1970s they were LSI components.
Today, standard parts include fairly specialized components: communication network interfaces,
graphics accelerators, floating point processors. All these parts are more specialized than
microprocessors but are used in enough volume that designing special-purpose chips is worth the
effort.
In fact, putting a complex, high-performance function on a single chip often makes other
applications possible—for example, single-chip floating point processors make high-speed
numeric computation available on even inexpensive personal computers.
• Application-specific integrated circuits (ASICs)
Rather than build a system out of standard parts, designers can now create a single chip
for their particular application. Because the chip is specialized, the functions of several standard
parts can often be squeezed into a single chip, reducing system size, power, heat, and cost.
Application-specific ICs are possible because of computer tools that help humans design chips
much more quickly.
• Systems-on-chips (SoCs).
Fabrication technology has advanced to the point that we can put a complete system on a
single chip. For example, a single-chip computer can include a CPU, bus, I/O devices, and
memory. SoCs allow systems to be made at much lower cost than the equivalent board-level
system. SoCs can also be higher performance and lower power than board-level equivalents
because on-chip connections are more efficient than chip-to chip connections.
A wider variety of chips is now available in part because fabrication methods are better
understood and more reliable. More importantly, as the number of transistors per chip grows, it
becomes easier and cheaper to design special-purpose ICs. When only a few transistors could be
put on a chip, careful design was required to ensure that even modest functions could be put on
a single chip. Today’s VLSI manufacturing processes, which can put millions of carefully designed transistors on a chip, can also be used to put tens of thousands of less carefully designed transistors on a chip.
Even though the chip could be made smaller or faster with more design effort, the
advantages of having a single-chip implementation of a function that can be quickly designed
often outweigh the lost potential performance.
The problem and the challenge of the ability to manufacture such large chips is design—
the ability to make effective use of the millions of transistors on a chip to perform a useful
function.

5.5 FIELD-PROGRAMMABLE GATE ARRAYS (FPGA):


A field-programmable gate array (FPGA) is a block of programmable logic that can
implement multi-level logic functions. FPGAs are most commonly used as separate commodity
chips that can be programmed to implement large functions.
However, small blocks of FPGA logic can be useful components on-chip to allow the user of the
chip to customize part of the chip’s logical function. An FPGA block must implement both
combinational logic functions and interconnect to be able to construct multi-level logic
functions. There are several different technologies for programming FPGAs, but most logic
processes are unlikely to implement anti-fuses or similar hard programming technologies, so we
will concentrate on SRAM-programmed FPGAs.

CHAPTER 6
TOOLS
6.1 Introduction:
The main tools required for this project can be classified into two broad categories.
 Hardware requirement
 Software requirement
6.2 Hardware Requirements:
 FPGA KIT
For the hardware part, a normal computer on which the Xilinx ISE 13.2 software can be run is required, i.e., one with a minimum system configuration of a Pentium III, 1 GB RAM, and a 20 GB hard disk.
6.3 Software Requirements:
 XILINX 13.2
It requires Xilinx ISE 13.2 version of software where Verilog source code can be used for
design implementation.
6.4 Introduction To XILINX ISE:
This tool can be used to create, implement, simulate, and synthesize Verilog designs for implementation on FPGA chips.
ISE: Integrated Software Environment
 Environment for the development and testing of digital system designs targeted to FPGAs or CPLDs
 Integrated collection of tools accessible through a GUI
 Based on an integrated synthesis engine (XST: Xilinx Synthesis Technology)
XST supports different languages:
 Verilog
 VHDL
 XST creates a netlist integrated with constraints
 Supports all of the steps required to complete the design:
 Translate, map, place and route
 Bit stream generation
In this environment, it is possible to use Verilog to write a test bench to verify the functionality of the design, using files on the host PC to define stimuli, to interact with the user, and to compare results with those expected.
A Verilog model is translated into the "gates and wires" that are mapped onto a programmable logic device, for example a CPLD or FPGA; it is then the actual hardware that is configured, rather than the Verilog code being "executed" as though on some type of processor chip.
6.4.1 Implementation:
– Synthesis (XST)
-Produce a netlist file starting from an HDL description
 Translate (NGDBuild)
– Converts all input design netlists and then writes the results into a single merged file that describes logic and constraints.
 Mapping (MAP)
– Maps the logic on device components.
– Takes a netlist and groups the logical elements into CLBs and IOBs (components of
FPGA).
 Place And Route (PAR)
– Places FPGA cells and connects them.
 Bit stream generation
XILINX Design Process
Step 1: Design entry
– HDL (Verilog or VHDL, ABEL x CPLD), Schematic Drawings, Bubble
Diagram
Step 2: Synthesis
– Translates .v, .vhd, .sch files into a netlist file (.ngc)
Step 3: Implementation
– FPGA: Translate/Map/Place & Route, CPLD: Fitter
Step 4: Configuration/Programming
– Download a BIT file into the FPGA
– Program JEDEC file into CPLD
– Program MCS file into Flash PROM
Simulation can occur after steps 1, 2, 3
The tools used in this thesis are XILINX ISE 13.2 for simulation and Synthesis. The
programs are written in verilog language.

6.5 Xilinx Software


Xilinx Tools is a suite of software tools used for the design of digital circuits
implemented using Xilinx Field Programmable Gate Array (FPGA) or Complex Programmable
Logic Device (CPLD). The design procedure consists of (a) design entry, (b) synthesis and
implementation of the design, (c) functional simulation and (d) testing and verification. Digital
designs can be entered in various ways using the above CAD tools: using a schematic entry tool,
using a hardware description language (HDL) – Verilog or VHDL or a combination of both. In
this thesis we will only use the design flow that involves the use of Verilog HDL.

6.5.1 Creating a New Project


Xilinx Tools can be started by clicking on the Project Navigator Icon on the Windows
desktop. This should open up the Project Navigator window on your screen. This window shows
(see Figure 1) the last accessed project.

6.5.2. Opening a project


Select File->New Project to create a new project. This will bring up a new project window
(Figure 2) on the desktop. Fill up the necessary entries as follows:
Project Name: Write the name of your new project
Project Location: The directory where you want to store the new project (Note: DO NOT
specify the project location as a folder on Desktop or a folder in the Xilinx\bin directory. Your
H: drive is the best place to put it. The project location path is NOT to have any spaces in it, e.g., H:\Full Adder\F A is NOT to be used). Leave the top-level module type as HDL.
Clicking on NEXT should bring up the following window:

For each of the properties given below, click on the ‘value’ area and select from the list of values
that appear.
Device Family: Family of the FPGA/CPLD used. In this thesis we will be using the Spartan-3E FPGAs.
Device: The number of the actual device. For this lab you may enter XC3S100E (this can be
found on the attached prototyping board)
Package: The type of package with the number of pins. The Spartan FPGA used in this lab is
packaged in VQ100 package.
Speed Grade: The Speed grade is “-5”.
Synthesis Tool: XST [VHDL/Verilog]
Simulator: The tool used to simulate and verify the functionality of the design. Modelsim
simulator is integrated in the Xilinx ISE. Hence choose “Modelsim-XE Verilog” as the simulator
or even Xilinx ISE Simulator can be used.
Then click on NEXT to save the entries.

A project summary window is opened click on finish.


In order to open an existing project in Xilinx Tools, select File->Open Project to show the list
of projects on the machine. Choose the project you want and click OK.
Clicking on NEXT on the above window brings up the following window:
If creating a new source file, Click on the NEW SOURCE.

A window pop up is opened.


Select Verilog Module and, in the "File Name:" field, enter the name of the project. Then click on Next to accept the entries. This pops up the following window.

In the Port Name column, enter the names of all input and output pins and specify the Direction
accordingly. A Vector/Bus can be defined by entering appropriate bit numbers in the MSB/LSB
columns. Then click on Next>to get a window showing all the new source information.
click on Finish to continue.
The source file will now be displayed in the Project Navigator window.
The source file window can be used as a text editor to make any necessary changes to the source
file. All the input/output pins will be displayed. Save your Verilog program periodically by
selecting the File->Save from the menu. You can also edit Verilog programs in any text editor
and add them to the project directory using “Add Copy Source”.
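For illustration, a minimal Verilog source of the kind entered at this step might look as follows (a hypothetical 4-bit enabled register; the module and signal names are examples only):

// Hypothetical example source file: a 4-bit register with a load enable.
module reg4 (
    input  wire       clk,
    input  wire       en,
    input  wire [3:0] d,
    output reg  [3:0] q
);
  always @(posedge clk)
    if (en)
      q <= d;    // load new data only when the enable is asserted
endmodule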

6.5.3. Simulating and Viewing the Output Waveforms


Click on Simulation, select the existing file, expand ISim Simulator, and click on Behavioral Check Syntax to check for errors.
If there are no errors, click on Simulate Behavioral Model. A pop-up window is opened.

Here we can give the inputs. Right-click on the selected input, click on Force Constant, enter the input value, and click on OK.
Click on the Run option in the toolbar to check the input and output waveforms.
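Alternatively, the stimuli can be written once as a small Verilog test bench instead of forcing constants interactively. A minimal hypothetical test bench for the reg4 example above might be:

`timescale 1ns / 1ps
// Hypothetical test bench applying a few vectors to the reg4 example module.
module reg4_tb;
  reg        clk = 0;
  reg        en  = 0;
  reg  [3:0] d   = 4'h0;
  wire [3:0] q;

  reg4 uut (.clk(clk), .en(en), .d(d), .q(q));   // unit under test

  always #5 clk = ~clk;            // 10 ns clock period for simulation

  initial begin
    #12 en = 1; d = 4'hA;          // load 0xA on the next rising edge
    #10 d = 4'h5;                  // load 0x5
    #10 en = 0; d = 4'hF;          // enable low: q should hold 0x5
    #20 $finish;
  end
endmodule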

6.5.4. Synthesis and Implementation of the Design


Click on Implementation, select the existing file, and double-click on Synthesize-XST. If there are errors, correct them. If there are no errors, click on Design Summary and Reports.

Open the Synthesis Report in the Detailed Reports to see the Device utilization Summary and
Timing Report of the current project.
6.5.5. View RTL Schematic:
Expand Synthesize-XST, click on View RTL Schematic, and click OK.

The window with the top module is opened; to view the internal modules, click on the top module.
6.6 FPGA DESIGN FLOW:
In this part of the tutorial we give a short introduction to the FPGA design flow. A simplified version of the design flow is given in the following diagram.

Figure 6.1 FPGA Design Flow


6.7 Design Entry:
There are different techniques for design entry. Schematic based, Hardware Description
Language and combination of both etc. Selection of a method depends on the design and
designer. If the designer wants to deal more with Hardware, then Schematic entry is the better
choice. When the design is complex or the designer thinks the design in an algorithmic way then
HDL is the better choice. Language based entry is faster but lag in performance and density.
HDLs represent a level of abstraction that can isolate the designers from the details of the
hardware implementation.  Schematic based entry gives designers much more visibility into the
hardware. It is the better choice for those who are hardware oriented. Another method but rarely
used is state-machines.

It is the better choice for designers who think of the design as a series of states. But the tools for state machine entry are limited. In this documentation we are going to deal with the HDL-based design entry.
6.8 Synthesis:
Synthesis is the process which translates VHDL or Verilog code into a device netlist format, i.e., a complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design contains more than one sub-design (for example, to implement a processor we need a CPU as one design element and a RAM as another, and so on), then the synthesis process generates a netlist for each design element. The synthesis process will check the code syntax and analyze the hierarchy of the design, which ensures that the design is optimized for the design architecture the designer has selected. The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx® Synthesis Technology (XST)).

Figure 6.2 FPGA Synthesis


6.9 Implementation:
In this work, the design is described using Verilog HDL and is synthesized on the Spartan-3E FPGA family through the XILINX ISE tool. This process includes the following:
 Translate
 Map
 Place and Route
6.9.1 Translate:
The translate process combines all the input netlists and constraints into a logic design file. This information is saved as an NGD (Native Generic Database) file. This can be done using the NGDBuild program. Here, defining constraints means assigning the ports in the design to the physical elements (e.g., pins, switches, buttons) of the targeted device and specifying the timing requirements of the design. This information is stored in a file named UCF (User Constraints File). Tools used to create or modify the UCF are PACE, Constraint Editor, etc.

Figure 6.3 FPGA Translate

6.9.2 Map:
The map process divides the whole circuit with logical elements into sub-blocks such that they can be fit into the FPGA logic blocks. That means the map process fits the logic defined by the NGD file into the targeted FPGA elements (Configurable Logic Blocks (CLBs) and Input/Output Blocks (IOBs)) and generates an NCD (Native Circuit Description) file which physically represents the design mapped to the components of the FPGA.
The MAP program is used for this purpose.

Figure 6.4 FPGA map


6.9.3 Place and Route:
The PAR program is used for this process. The place and route process places the sub-blocks from the map process into logic blocks according to the constraints and connects the logic blocks. For example, if a sub-block is placed in a logic block which is very near an I/O pin, it may save time, but it may affect some other constraint.
So the trade-off between all the constraints is taken into account by the place and route process. The PAR tool takes the mapped NCD file as input and produces a completely routed NCD file as output. The output NCD file contains the routing information.

Figure 6.5 FPGA Place and route


6.10 Device Programming:
Now the design must be loaded onto the FPGA. But the design must first be converted to a format that the FPGA can accept. The BITGEN program deals with this conversion. The routed NCD file is given to the BITGEN program to generate a bit stream (a .BIT file), which can be used to configure the target FPGA device. This can be done using a cable; the selection of the cable depends on the design.
6.10.1 Design Verification:
Verification can be done at different stages of the process steps. 
6.10.2 Behavioral Simulation (RTL Simulation):
This is the first of the simulation steps encountered throughout the hierarchy of the design flow. This simulation is performed before the synthesis process to verify the RTL (behavioral) code and to confirm that the design is functioning as intended.
Behavioral simulation can be performed on either VHDL or Verilog designs. In this process,
signals and variables are observed, procedures and functions are traced and breakpoints are set.
This is a very fast simulation and so allows the designer to change the HDL code within a short time if the required functionality is not met. Since the design is not yet synthesized to the gate level, timing and resource usage properties are still unknown.
6.10.3 Functional simulation (Post Translate Simulation):
Functional simulation gives information about the logic operation of the circuit. The designer can verify the functionality of the design using this process after the Translate process. If the functionality is not as expected, the designer has to make changes in the code and follow the design flow steps again.
6.10.4 Static Timing Analysis:
This can be done after the MAP or PAR processes. The post-MAP timing report lists signal path delays of the design derived from the design logic. The post-place-and-route timing report incorporates timing delay information to provide a comprehensive timing summary of the design.
CHAPTER 7
RESULTS
Simulation.

RTL Schematic.

Technology Schematic.

Design Summary.
CHAPTER 8
CONCLUSION
Clock gating is used in the FIFO to reduce the power consumption. For further power savings, data-driven clock gating and multi-bit flip-flops are used in sequential circuits. Common clock gating saves power, but it still leaves a large number of redundant clock pulses. The multi-bit flip-flop is also used to reduce power consumption; the idea of the multi-bit flip-flop method is to reduce the total inverter count by sharing the clock inverters among the flip-flops. Combining multi-bit flip-flops with data-driven clock gating increases the power savings further. The Xilinx software tool is used for implementing the proposed system, in which data-driven gating is combined with MBFFs to yield further power savings.
REFERENCES

1. Kapoor, Ajay, Cas Groot, Gerard Villar Pique, Hamed Fatemi, Juan Echeverri, Leo Sevat,
Maarten Vertregt et al. “Digital systems power management for high performance mixed signal
platforms.” Circuits and Systems I: Regular Papers, IEEE Transactions on 61, no. 4 (2014): 961-
975.
2. Wimer, Shmuel, and Israel Koren. “The optimal fan-out of clock network for power
minimization by adaptive gating.” Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on 20, no. 10 (2012): 1772-1780.
3. Santos, Cristiano, Ricardo Reis, Guilherme Godoi, Marcos Barros, and Fabio Duarte. “Multi-
bit flip-flop usage impact on physical synthesis.” In Integrated Circuits and Systems Design
(SBCCI), 2012 25th Symposium on, pp. 1-6. IEEE, 2012.
4. Yan, Jin-Tai, and Zhi-Wei Chen. “Construction of constrained multi-bit flip-flops for clock
power reduction.” In Green Circuits and Systems (ICGCS), 2010 International Conference on,
pp. 675-678. IEEE, 2010.
5. Jiang, IH-R., Chih-Long Chang, and Yu-Ming Yang. “INTEGRA: Fast multibit flip-flop
clustering for clock power saving.” Computer-Aided Design of Integrated Circuits and Systems,
IEEE Transactions on 31, no. 2 (2012): 192-204.
6. Chang, Chih-Long, and Iris Hui-Ru Jiang. “Pulsed-latch replacement using concurrent time
borrowing and clock gating.” IEEE Transactions on ComputerAided Design of Integrated
Circuits and Systems 32, no. 2 (2013): 242-246.
7. Lo, Shih-Chuan, Chih-Cheng Hsu, and Mark Po-Hung Lin. "Power optimization for clock
network with clock gate cloning and flip-flop merging." In Proceedings of the 2014 on
International symposium on physical design, pp. 77-84. ACM, 2014.
8. Wimer, Shmuel, Doron Gluzer and Uri Wimer. “Using well-solvable minimum cost exact
covering for VLSI clock energy minimization.” Operations Research Letters 42, no. 5 (2014):
332-336.
9. Wimer, Shmuel, and Israel Koren. “Design flow for flip-flop grouping in datadriven clock
gating.” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 22, no. 4 (2014):
771-778.
10. Wimer, Shmuel. “On optimal flip-flop grouping for VLSI power minimization.” Operations
Research Letters 41, no. 5 (2013): 486-489.
11. SpyGlass Power [Online]. Available: http://www.atrenta.com/solutions/spyglassfamily/spyglass-power.html
