
UNIT IV DESIGNING ARITHMETIC BUILDING BLOCKS

Data path circuits:

Fig. Basic DSP architecture


Building Blocks for Digital Architectures include
• Arithmetic unit - bit-sliced datapath (adder, multiplier, shifter, comparator, etc.)
• Memory - RAM, ROM, buffers, shift registers
• Control - finite state machine (PLA, random logic), counters
• Interconnect - switches, arbiters, bus

BIT – SLICED DATA PATH ORGANISATION:


Datapaths are often arranged in a bit-sliced organisation. Data processing in a processor is
word based. Typical microprocessor datapaths are 32 bits or 64 bits wide. Those in DSL modems,
magnetic disk drives and compact disc players are of arbitrary width, typically 5 to 24 bits.
The datapath consists of 32 bit slices, each operating on a single bit - hence the name bit-sliced
datapath organization.
Arithmetic building blocks of the bit-sliced datapath organization include
• Datapath elements - registers
• Adder design
– Static adder
– Dynamic adder
• Multiplier design
– Array multipliers
• Shifters, parity circuits

Fig.: Bit-sliced Datapath Organisation


ADDERS:
Addition forms the basis for many processing operations, from ALUs to address generation
to multiplication to filtering. As a result, adder circuits that add two binary numbers are of
great interest to digital system designers.
FULL ADDER:

For a full adder, it is sometimes useful to define Generate (G), Propagate (P), and Kill (K)
signals. The adder generates a carry when Cout is true independent of Cin, so G = A·B.
The adder kills a carry when Cout is false independent of Cin, so K = A'·B' = (A + B)'.
The adder propagates a carry, i.e., it produces a carry-out if and only if it receives a carry-in,
when exactly one input is true: P = A ⊕ B.
The sum and carry-out signals in terms of G and P can be given by:
Co(G,P) = G + P·Ci
S(G,P) = P ⊕ Ci
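
A minimal behavioral sketch of these relations in Python (the function and signal names are illustrative, not from the source):

def full_adder_gpk(a, b, ci):
    # One full-adder bit slice expressed via generate/propagate/kill.
    g = a & b              # generate:  carry-out regardless of Ci
    k = (1 - a) & (1 - b)  # kill:      no carry-out regardless of Ci
    p = a ^ b              # propagate: carry-out iff carry-in
    assert not (g and k)   # generate and kill are mutually exclusive
    co = g | (p & ci)      # Co = G + P.Ci
    s = p ^ ci             # S  = P xor Ci
    return s, co

print(full_adder_gpk(1, 1, 0))  # (0, 1): carry generated
print(full_adder_gpk(1, 0, 1))  # (0, 1): carry propagated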

RIPPLE CARRY ADDER:

The delay of an N-bit ripple-carry adder is given by


tadder = (N-1)·tcarry + tsum

There are two significant conclusions from the delay equation:

1. The propagation delay of the ripple-carry adder is linearly proportional to N. This property
becomes increasingly important when designing adders for wide datapaths (N = 16, …, 128).
2. When designing a fast RCA using full adders, it is important to optimize tcarry.

Inverting property of RCA: inverting all inputs to a full adder also inverts the outputs. With
S = A ⊕ B ⊕ Ci and Co = AB + BCi + ACi, this can be expressed as
S'(A,B,Ci) = S(A',B',Ci')
Co'(A,B,Ci) = Co(A',B',Ci')
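
The following Python sketch (assumed, not from the source) chains full adders into an N-bit ripple-carry adder and checks the inverting property numerically:

def ripple_carry_add(a_bits, b_bits, ci=0):
    # a_bits, b_bits: LSB-first lists of 0/1 bits; returns (sum_bits, carry_out).
    s_bits, c = [], ci
    for a, b in zip(a_bits, b_bits):
        s_bits.append(a ^ b ^ c)            # S  = A xor B xor Ci
        c = (a & b) | (b & c) | (a & c)     # Co = AB + BCi + ACi
    return s_bits, c

# Inverting property: S'(A,B,Ci) = S(A',B',Ci'), Co'(A,B,Ci) = Co(A',B',Ci')
a, b, ci = [1, 0, 1, 1], [0, 1, 1, 0], 1
s, co = ripple_carry_add(a, b, ci)
s_i, co_i = ripple_carry_add([1 - x for x in a], [1 - x for x in b], 1 - ci)
assert s_i == [1 - x for x in s] and co_i == 1 - co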

COMPLEMENTARY STATIC CMOS FULL ADDER USING 28 TRANSISTORS:

Fig: Static CMOS Full Adder Using 28 Transistors


• The complementary static full adder uses 28 transistors. Hence it consumes a large area
and the circuit is slow.
• Tall PMOS transistor stacks are present in both carry and sum generation circuits.
• The intrinsic load capacitance of the Co signal is large and consists of two diffusion and six
gate capacitances, plus the wiring capacitance.
• The signal propagates through the inverting stages in the carry-generation circuit.
Minimizing the carry-path delay is the prime goal of the designer of a high-speed adder
circuit.
• The sum generation requires one extra logic stage, but this is not as significant since the sum
delay factor appears only once in the propagation delay of the RCA.
MIRROR ADDER CIRCUIT DESIGN:

Fig: Mirror Adder Design of Full Adder


• The NMOS and PMOS chains are completely symmetrical. This guarantees identical
rising and falling transitions if the NMOS and PMOS devices are properly sized. A
maximum of two series transistors can be observed in the carry-generation circuitry.
• When laying out the cell, the most critical issue is the minimization of capacitance at
node Co. The reduction of diffusion capacitance is particularly important.
• The capacitance at node Co is composed of four diffusion capacitances, two internal
gate capacitances and six gate capacitances in the connecting adder cell.
• The transistors connected to Ci are placed closest to the output.
• Only the transistors in the carry stage have to be optimized for optimal speed. All
transistors in the sum stage can be minimal size.

MANCHESTER CARRY CHAIN ADDER:

Fig: Manchester carry chain Adder

• A Manchester carry chain adder uses a cascade of pass transistors to implement the
carry chain.
• During the precharge phase (Φ=0), all intermediate nodes of the pass-transistor carry
chain are precharged to Vdd.
• During evaluation, an intermediate node is discharged when its generate signal is high,
or when an incoming carry arrives and the propagate signal is high.
• The worst-case delay of the carry chain is modeled by a linearized RC network.
• Increasing the transistor widths reduces the time constant, but loads the gates in the
previous stage.
• Therefore, the transistor size is limited by the input loading capacitance.
• The distributed RC nature of the carry chain results in a propagation delay that is
quadratic in the number of bits N.
• To avoid this, it is necessary to insert signal-buffering inverters.
• Adding inverters makes the overall propagation delay a linear function of N, as is the
case with ripple-carry adders (see the delay sketch below).
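
A small numeric sketch of the Elmore delay of such a chain (the unit R, C and buffer-delay values are assumptions chosen for illustration), showing the quadratic growth and the effect of buffering every M stages:

def elmore_delay(n_stages, r=1.0, c=1.0):
    # Elmore delay of an n-stage distributed RC chain: sum over k of (k.R).C
    return sum(k * r * c for k in range(1, n_stages + 1))

def buffered_delay(n_stages, m=4, r=1.0, c=1.0, t_buf=2.0):
    # Insert a buffer (delay t_buf) after every m stages; each segment then
    # contributes only a small quadratic term, so the total grows linearly.
    segments, remainder = divmod(n_stages, m)
    return segments * (elmore_delay(m, r, c) + t_buf) + elmore_delay(remainder, r, c)

for n in (4, 8, 16, 32):
    print(n, elmore_delay(n), buffered_delay(n))
# Unbuffered delay grows as N^2 (10, 36, 136, 528); buffered is ~linear in N.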
LOOK AHEAD ADDER DESIGN
LOOK AHEAD - BASIC IDEA
• Carry look ahead logic uses the concepts of generating and propagating carries.
• A carry-lookahead adder improves speed by reducing the amount of time required to
determine carry bits.
• The carry-lookahead adder calculates one or more carry bits before the sum. This
reduces the wait time to calculate the result of larger-value bits. The Kogge-Stone
adder and the Brent-Kung adder are examples of this type of adder.
• Carry lookahead depends on two things:
-Calculating, for each digit position, whether that position is going to propagate a
carry if one comes in from the right.
-Combining these calculated values so as to deduce quickly, for each group of
digits, whether that group is going to propagate a carry that comes in
from the right.

Suppose that groups of 4 digits are chosen. Then the sequence of events goes something like
this:
-All 1-bit adders calculate their results. Simultaneously, the lookahead units perform their
calculations.
-Suppose that a carry arises in a particular group. Within at most 5 gate delays, that carry will
emerge at the left-hand end of the group and start propagating through the group to its left.
-If that carry is going to propagate all the way through the next group, the lookahead unit will
already have deduced this. Accordingly, before the carry emerges from the next group the
lookahead unit is immediately (within 1 gate delay) able to tell the next group to the left that it
is going to receive a carry –and, at the same time, to tell the next lookahead unit to the left that
a carry is on its way.
CARRY-LOOK-AHEAD ADDERS:
• Objective - generate all incoming carries in parallel.
• Feasible - carries depend only on xn-1, xn-2, ..., x0 and yn-1, yn-2, ..., y0: information
available to all stages for calculating the incoming carry and sum bit.
• Requires a large number of inputs to each stage of the adder - impractical.
• The number of inputs at each stage can be reduced - find out from the inputs whether new
carries will be generated and whether they will be propagated.
CARRY PROPAGATION
• If xi = yi = 1, a carry-out is generated regardless of the incoming carry - no additional
information needed.
• If xiyi = 10 or xiyi = 01, the incoming carry is propagated.
• If xi = yi = 0, there is no carry propagation.
• Gi = xiyi - generate signal; Pi = xi + yi - propagate signal.
• ci+1 = xiyi + ci(xi + yi) = Gi + ciPi
• Substituting ci = Gi-1 + ci-1Pi-1 gives ci+1 = Gi + Gi-1Pi + ci-1Pi-1Pi
• Further substitutions give
ci+1 = Gi + Gi-1Pi + Gi-2Pi-1Pi + ... + c0P0P1...Pi
• All carries can be calculated in parallel from xn-1, ..., x0, yn-1, ..., y0 and the forced
carry c0.
Mirror implementation of Look Ahead Carry Adder
Look-Ahead: Topology

Carry Output equations for 4-bit Look Ahead Adder


c1 = G0 + c0P0
c2 = G1 + G0P1 + c0P0P1
c3 = G2 + G1P2 + G0P1P2 + c0P0P1P2
c4 = G3 + G2P3 + G1P2P3 + G0P1P2P3 + c0P0P1P2P3
4-bit module design
Addition can be reduced to a three-step process (sketched in the code below):
1. Computing bitwise generate (G) and propagate (P) signals - Bitwise PG logic
2. Combining PG signals to determine group generate (G) and propagate (P) signals -
Group PG logic
3. Calculating the sums - Sum logic
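
A compact Python sketch (assumed, not the source's circuit) of this three-step process for one 4-bit module, using the carry equations above:

def cla4(a, b, c0=0):
    # a, b: LSB-first lists of 4 bits each; returns (sum_bits, carry_out).
    g = [ai & bi for ai, bi in zip(a, b)]   # step 1: bitwise generate Gi
    p = [ai | bi for ai, bi in zip(a, b)]   # step 1: bitwise propagate Pi
    c = [c0]
    for i in range(4):                      # step 2: ci+1 = Gi + Pi.ci,
        c.append(g[i] | (p[i] & c[i]))      # computed in parallel in hardware
    s = [ai ^ bi ^ ci for ai, bi, ci in zip(a, b, c)]  # step 3: sum logic
    return s, c[4]

s, cout = cla4([1, 1, 1, 0], [1, 0, 1, 0])  # 7 + 5
print(s, cout)                              # [0, 0, 1, 1], 0  ->  12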

Fig: 4-bit Carry Look Ahead Adder Module


16-bit Carry Look Ahead Adder design
In general, a CLA using k groups of n bits each has a delay of
tCLA = tpg + tpg(n) + [(n-1) + (k-1)]·tAO + tXOR
Manchester carry chain implementation of carry bypass adder (carry skip adder)

• Consider the four-bit adder of as in above fig. The values of Ak and Bk (k=0…3)
are such that all propagate signals Pk (k=0…3) are high.

• An incoming carry Ci,0 = 1 propagates under those conditions through the complete
adder chain and causes an outgoing carry Co,3 = 1. In other words, if (P0P1P2P3 = 1)
then Co,3 = Ci,0; else either a DELETE or a GENERATE occurred.

• This information can be used to speed up the operation of the adder, as in the figure. When
BP = P0P1P2P3 = 1, the incoming carry is forwarded immediately to the next block
through the bypass transistor Mb - hence the name carry-bypass adder or carry-skip
adder.

Fig: Manchester carry chain implementation of carry bypass adder


• Fig. shows the possible carry-propagation paths when the full-adder circuit is
implemented in Manchester carry style. This kind of arrangement speeds up the
addition.
• The carry propagates either through the bypass path, or a carry is generated somewhere
in the chain.
• In both cases, the delay is smaller than in the normal ripple configuration.

Fig.16-bit Carry Bypass adder


Propagation delay of carry bypass adder:

• The delay of N-bit carry skip adder is computed as


tp = tsetup + M·tcarry + (N/M - 2)·tbypass + (M-1)·tcarry + tsum

• tsetup: the fixed overhead time to create the generate and propagate signals.
• tcarry: the propagation delay through a single bit. The worst-case carry-propagation
delay through a single stage of M bits is approximately M times larger.
• tbypass: the propagation delay through the bypass multiplexer of a single stage.
• tsum: the time to generate the sum of the final stage.
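
A quick numeric comparison of this formula against the ripple-carry delay (all t values below are illustrative unit-delay assumptions, not data from the source):

def t_bypass(N, M, t_setup=1, t_carry=1, t_byp=1, t_sum=1):
    # Delay formula above for an N-bit adder built from M-bit bypass stages.
    return t_setup + M * t_carry + (N // M - 2) * t_byp + (M - 1) * t_carry + t_sum

def t_ripple(N, t_carry=1, t_sum=1):
    return (N - 1) * t_carry + t_sum

for n in (16, 32, 64):
    print(n, t_ripple(n), t_bypass(n, M=4))   # 16/11, 32/15, 64/23
# The bypass delay grows with N/M rather than N, so it wins for wide adders.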

Fig. ripple adder vs carry bypass adder


Carry-Select Adder:
• In RCA, every FA cell has to wait for the incoming carry before an outgoing
carry is generated.
• The results for both possible values (0 and 1) of the incoming carry are evaluated in advance.
• Once the real value of incoming carry is known, the correct result is easily
selected with a simple multiplexer stage.
• This implementation idea is called carry-select adder.
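
A minimal behavioral sketch (assumed, not the source's circuit) of one carry-select stage, where both results are precomputed and a multiplexer picks one:

def ripple(a, b, cin):
    # M-bit ripple addition on LSB-first bit lists; returns (sum_bits, cout).
    s, c = [], cin
    for ai, bi in zip(a, b):
        s.append(ai ^ bi ^ c)
        c = (ai & bi) | (c & (ai | bi))
    return s, c

def carry_select_stage(a, b, cin):
    s0, c0 = ripple(a, b, 0)    # result computed assuming carry-in = 0
    s1, c1 = ripple(a, b, 1)    # result computed assuming carry-in = 1
    return (s1, c1) if cin else (s0, c0)   # mux selects once cin arrives

print(carry_select_stage([1, 1, 1, 1], [0, 0, 0, 0], 1))  # ([0, 0, 0, 0], 1)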

Fig: carry select adder


16-Bit carry select adder:

Propagation delay of carry select adder

MULTIPLICATION
• Multiplication needs M cycles using an N-bit adder.
• In shift-and-add:
- M partial products are added.
- Each partial product is the AND of a multiplier bit with the multiplicand, followed by a 'shift'.

PARTIAL PRODUCT GENERATION:

• Logical AND of multiplicand X and the multiplier bit Yi


• Adding zeros has no impact on the result.
• The number of partial products can be reduced by half!
• E.g. 01111110 can be recoded as 1 0 0 0 0 0 (-1) 0, since 128 - 2 = 126;
so only two partial products need to be added!
• This transformation is Booth's Recoding.
- It leads to fewer additions, with area reduction and higher speed.
- An alternating pattern such as 10101010 is the worst case for eight bits!
- It multiplies with digits {-2, -1, 0, 1, 2} instead of {1, 0}, and hence needs encoding.
- Modified Booth's recoding is used for a consistent operation size.

Modified Booth’s Recoding


Partial product selection table
Multiplier bits    Recoded operation
000                0
001                +Multiplicand
010                +Multiplicand
011                +2 × Multiplicand
100                -2 × Multiplicand
101                -Multiplicand
110                -Multiplicand
111                0
• The number of partial products is halved.
E.g. 01111111 is bunched into overlapping 3-bit groups:
-> 01(1), 11(1), 11(1), 11(0)
-> Recoded digits: +2, 0, 0, -1 (see table)
-> Four partial products are developed instead of eight.
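
A small Python sketch (assumed, not from the source) of modified radix-4 Booth recoding using the selection table above; each overlapping 3-bit group maps to a digit in {-2, -1, 0, 1, 2}:

BOOTH_TABLE = {
    (0, 0, 0): 0,  (0, 0, 1): 1,  (0, 1, 0): 1,  (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_recode(y, n_bits=8):
    # Returns radix-4 digits, LSB group first; bits[0] is the appended y(-1)=0.
    bits = [0] + [(y >> i) & 1 for i in range(n_bits)]
    digits = []
    for i in range(0, n_bits, 2):                   # overlapping bit triples
        digits.append(BOOTH_TABLE[(bits[i + 2], bits[i + 1], bits[i])])
    return digits

digits = booth_recode(0b01111111)                   # -> [-1, 0, 0, 2]
print(digits, sum(d * 4**i for i, d in enumerate(digits)))  # recovers 127:
# four partial products (+2, 0, 0, -1 from MSB group down) instead of eight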

THE ARRAY MULTIPLIER:


An array multiplier is a digital combinational circuit that is used for the multiplication of two
binary numbers by employing an array of full adders and half adders. This array is used for the
nearly simultaneous addition of the various product terms involved.

To form the various product terms, an array of AND gates is used before the Adder array. An
array multiplier is a vast improvement in speed over the traditional bit serial multipliers in
which only one full adder along with a storage memory was used to carry out all the bit
additions involved and also over the row serial multipliers in which product rows (also known
as the partial products) were sequentially added one by one via the use of only one multi-bit
adder.

The tradeoff for this extra speed is the extra hardware required to lay down the adder array. But
with the much-decreased costs of these adders, this extra hardware has become quite affordable
to a designer. In spite of the vast improvement in speed, there is still a level of delay that is
involved in an array multiplier before the final product is achieved. Before committing
hardware resources to the circuit, it is important for the designer to calculate the aforementioned
delay in order to make sure that the circuit is compatible with the timing requirements of the
user.
Fig: Array Multiplier
• N partial products of M-bit size each.
• N×M two-input AND gates; N-1 M-bit adders.
• The layout need not be staggered; the routing takes care of the shift.
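
A behavioral sketch (assumed, not the source's circuit) of what the AND array and adder rows compute:

def array_multiply(x, y, n_bits=4):
    # Multiply two n-bit unsigned integers the way the adder array does:
    # one AND row per multiplier bit, shifted by its position, then summed.
    result = 0
    for i in range(n_bits):
        yi = (y >> i) & 1
        result += (x if yi else 0) << i   # AND row + shift + adder row
    return result

print(array_multiply(13, 11))  # 143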

Carry save multiplier:

Fig: Carry save Multiplier


• A large number of nearly identical critical paths are present in the array multiplier.
• Increasing the performance of the structure through transistor sizing yields
marginal benefits
• A more efficient realization can be obtained by noticing that the multiplication
result does not change when the output carry bits are passed diagonally
downwards instead of only to the right
• An extra adder called a vector merging adder to generate the final result is
included
• The resulting multiplier is called a carry-save multiplier, because the carry bits
are not immediately added, but are rather saved for the next adder stage.
• In the final stage, carry and sums are merged in a fast carry propagate adder
stage
• It has the advantage that its worst-case critical path is shorter.
The delay of the carry-save multiplier is given by the expression below:
tmult = (N-1)·tcarry + tand + tmerge
Wallace tree multiplier:
A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two
integers, devised by the Australian computer scientist Chris Wallace in 1964.
The Wallace tree has three steps:

1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other,
yielding n² results. Depending on the positions of the multiplied bits, the wires carry
different weights; for example, the wire carrying the product of bits at positions 3 and 4 has weight 2⁷ = 128.
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires in two numbers, and add them with a conventional adder.
The second step works as follows (a reduction sketch follows this list). As long as there are
three or more wires with the same weight, add a following layer:

• Take any three wires with the same weights and input them into a full adder. The result
will be an output wire of the same weight and an output wire with a higher weight for
each three input wires.
• If there are two wires of the same weight left, input them into a half adder.
• If there is just one wire left, connect it to the next layer.
• These computations only consider gate delays and don't deal with wire delays, which can
also be very substantial.
• The Wallace tree can be also represented by a tree of 3/2 or 4/2 adders.
• It is sometimes combined with Booth encoding
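
The reduction step can be sketched in Python as follows (an assumed simplification: wires that cannot fill a full adder are simply passed on to the next layer, so explicit half adders are omitted):

from collections import Counter

def wallace_reduce(weights):
    # weights: Counter mapping wire weight -> number of wires of that weight.
    # Apply layers of full adders until every weight has at most two wires.
    while any(n > 2 for n in weights.values()):
        nxt = Counter()
        for w, n in weights.items():
            fulls, rest = divmod(n, 3)   # each full adder takes 3 wires of
            nxt[w] += fulls + rest       # weight w -> 1 sum wire at w ...
            nxt[2 * w] += fulls          # ... and 1 carry wire at weight 2w
        weights = nxt
    return weights                       # final two rows go to a normal adder

pp = Counter()                           # 4x4 multiply: the partial product
for i in range(4):                       # of bits (i, j) has weight 2**(i+j)
    for j in range(4):
        pp[2 ** (i + j)] += 1
print(wallace_reduce(pp))                # at most two wires per weight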

The advantages of Wallace Tree multipliers are


1. Substantial hardware saving
2. It offers greater speed
The disadvantage is irregular and inefficient layout.
The characteristics of Wallace tree multiplier include
• Final adder choice is critical; depends on structure of accumulator array
• Carry look ahead might be good if data arrives simultaneously
• Place pipeline stage before final addition
• In non-pipelined, other adders can be used with similar performance and less
hardware requirement

DIVIDER:
Unsigned non-restoring division:
Input: an n-bit dividend and an m-bit divisor
Output: the quotient and remainder
Begin:
1. Load the divisor and dividend into registers M and D respectively, clear the partial-remainder
register R, and set the loop count cnt equal to n-1.
2. Left-shift the register pair R:D by one bit.
3. Compute R = R - M.
4. Repeat
if (R < 0) begin
D[0] = 0; left-shift R:D one bit; R = R + M; end
else begin
D[0] = 1; left-shift R:D one bit; R = R - M; end
cnt = cnt - 1; until (cnt == 0)
5. If (R < 0) begin D[0] = 0; R = R + M; end else begin D[0] = 1; end
Fig: Sequential Implementation of Non-Restoring Divider.
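
A runnable Python sketch of the algorithm (assumed, following the pseudocode above), with R:D kept as two integers and D[0] collecting the quotient bits:

def nonrestoring_divide(dividend, divisor, n):
    R, D, M = 0, dividend, divisor
    mask = (1 << n) - 1

    def shift():                        # left-shift the register pair R:D
        nonlocal R, D
        R = (R << 1) | (D >> (n - 1))   # MSB of D slides into R
        D = (D << 1) & mask

    shift(); R -= M                     # steps 2-3
    for _ in range(n - 1):              # step 4
        if R < 0:
            D &= ~1                     # quotient bit 0
            shift()
            R += M                      # add the divisor back
        else:
            D |= 1                      # quotient bit 1
            shift()
            R -= M                      # keep subtracting
    if R < 0:                           # step 5: final correction
        D &= ~1
        R += M
    else:
        D |= 1
    return D, R                         # (quotient, remainder)

print(nonrestoring_divide(7, 2, n=3))   # (3, 1)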
BARREL SHIFTER:
Any general-purpose n-bit shifter should be able to shift incoming data by up to n-1 places in the right-shift
or left-shift direction. If we now further specify that all shifts should be on an end-around basis, so that any bit
shifted out at one end of a data word will be shifted in at the other end of the word, then the problem of left
shift or right shift is greatly eased.
For a 4-bit word, a 1-bit right shift is equal to a 3-bit left shift, and a 2-bit right shift is equal to a 2-bit left shift,
etc. Thus, we can achieve the capability to shift left or right by zero, one, two or three places by designing a
circuit which will shift right only, by one, two or three places.
Barrel shifter is an adaptation of the crossbar switch which recognizes the fact that we can couple the switch
gates together in groups of four and also form four separate groups corresponding to shifts of zero, one,
two and three bits.
The arrangement is readily adapted so that the in-lines also run horizontally. The resulting arrangement is
known as a barrel shifter. The inter-bus switches have their gate inputs connected in a staircase fashion in
groups of four, and there are now four shift-control inputs which must be mutually exclusive in the active
state. The structure of the barrel shifter is of high regularity and generality.
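
A minimal sketch (assumed, not from the source) of the end-around behavior: only right rotations are implemented, and a left rotation by k places is performed as a right rotation by n - k:

def barrel_rotate_right(word, k, n=4):
    # Rotate an n-bit word right by k places, end-around.
    k %= n
    mask = (1 << n) - 1
    return ((word >> k) | (word << (n - k))) & mask

def barrel_rotate_left(word, k, n=4):
    return barrel_rotate_right(word, n - (k % n), n)  # reuse the right shifter

w = 0b1011
print(bin(barrel_rotate_right(w, 1)))  # 0b1101
print(bin(barrel_rotate_left(w, 3)))   # 0b1101: left by 3 == right by 1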
DESIGNING OF MEMORY AND ARRAY STRUCTURES
Memory Classification: Classification criteria
I. Size: Depending upon the level of abstraction, different means are used to express the size of
a memory unit. The circuit designer tends to define the size of a memory in terms of the
number of bits, which is equivalent to the number of individual cells (flip-flops) needed to
store the data. The chip designer expresses the memory size in bytes or its multiples. The
system designer likes to quote the storage requirement in words.
II. Timing Parameters:
The time it takes to retrieve data (read) from the memory is called the read-access time, which is
equal to the delay between the read request and the moment the data is available at the output.
This is different from the write-access time, which is the time elapsed between a write
request and the final writing of the input data into the memory. The read or write cycle time of the
memory is the minimum time required between successive reads or writes.

III. Function and Access patterns

IV. Input output architecture


Number of data at the input and output ports (multiport memories)
V. Application
Standalone ICs
Embedded
Secondary or tertiary memories (magnetic and optical disc)
Memory architecture and building blocks:
When implementing an N-word memory where each word is M bits wide, the most intuitive
approach is to stack the subsequent memory words in a linear fashion. One word at a time is
selected for reading or writing with the aid of a select bit (S0 to SN-1), if we assume that this
module is a single-port memory.
A decoder is inserted to reduce the number of select signals. A memory word is selected by
providing a binary encoded address word (A0 to AK-1). The decoder translates this address into
N = 2^K select lines, only one of which is active at a time. This approach reduces the number of
select lines from N to log2(N) = K.
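
A tiny sketch (assumed) of what the decoder does: a K-bit address activates exactly one of N = 2^K select lines:

def decode(address, k):
    # Return the one-hot select vector S0..S(N-1) for a k-bit address.
    return [1 if row == address else 0 for row in range(1 << k)]

print(decode(0b10, k=2))   # [0, 0, 1, 0]: only select line S2 is active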

This design does not address the issue of the memory aspect ratio (the height is very large compared
to the width). This results in a design which cannot be implemented. Besides the bizarre shape
factor, the resulting design is extremely slow. The vertical wires connecting the storage cells
to the input/output become excessively long. To address this problem, memory arrays are
organized so that vertical and horizontal dimensions are of the same order of magnitude, thus
the aspect ratio approaches unity. Multiple words are stored in a single row and are selected
simultaneously. To route the correct word to the input/output terminals, an extra piece of
circuitry called the column decoder is needed. The address word is partitioned into a column
address (A0 to AK-1) and a row address (AK to AL-1). The row address enables one row of the
memory for R/W while the column address picks one particular word from the selected row.
For larger memories, the memory is partitioned into P smaller blocks. The composition of each
of the individual blocks is identical to the above figure. A word is selected based on the row
and column addresses that are broadcast to all the blocks. An extra address word, called the block
address, selects one of the P blocks to be read or written. This approach has a dual advantage.
1. The length of the local word lines and bit lines, i.e. the length of the lines within the blocks,
is kept within bounds, which results in faster access times.
2. The block address can be used to activate only the addressed block. Non active blocks
are put in power saving mode with sense amplifiers and row and column decoders
disabled. This results in a substantial power saving that is desirable.

The Memory Core


Read only memories
Programs for processors with fixed applications such as washing machines, calculators and
game machines, once developed and debugged, need only reading.
ROM cells - An overview:
The cell should be designed so that a 1 or 0 is presented to the bit line upon activation of its
word line. Figure shows several ways to accomplish this.
Diode ROM
• Bit line (BL) is resistively clamped to ground i.e. BL is pulled low through the resistor
connected to ground lacking any other excitations or inputs.
• 0 cell : No physical connection between BL and word line.
• When a high voltage is applied to the WL of a 1 cell, the diode is enabled and the BL is pulled up
to VWL-VD(on), resulting in a 1 on the BL.
• Disadvantage: does not isolate BL from WL

A better approach is to use an active device in the cell. The diode is replaced by the gate source
connection of an NMOS transistor, whose drain is connected to the supply voltage.
Read Write memories (RAM)
Static RAM (SRAM)
A generic SRAM cell consists of 6 transistors (6T) per bit. Access to the cell is enabled by the
WL, which replaces the clock and controls two pass transistors M5 and M6, shared between
the read and write operations. In contrast to ROM cells, two bit lines transferring both the stored
signal and its inverse are required. Doing so improves the noise margin during both read and
write operations.

Operation of SRAM cell


Read operation:
Assume that a 1 is stored at Q. Both bit lines are precharged to 2.5 V before the read operation
is initiated. The read cycle is started by asserting the word line, enabling both pass transistors
M5 and M6 after the initial WL delay. During a correct read operation, the values stored at Q
and Q_BAR are transferred to the bit lines, by leaving BL at its precharged value and
discharging BL_BAR through M1 and M5. Careful sizing of the transistors is necessary to avoid
accidentally writing a 1 into the cell. This type of malfunction is frequently called a read upset.
Write operation:
Assume that a 1 is stored in the cell (or Q=1). A 0 is written into the cell by setting BL_BAR
to 1 and BL to 0, which is equivalent to applying a reset pulse to an SR latch. This causes the flip
flop to change its state if the devices are properly sized.
Dynamic RAM (DRAM)
3T Dynamic Memory cell:
The cell is written by placing the appropriate data value on BL1
and asserting the write word line (WWL). The data is retained as
a charge on the capacitance CS once WWL is lowered. When
reading the cell, the RWL is raised. The storage transistor
M2 is either On or Off depending on the stored value. The
Bitline BL2 is either clamped to VDD with the aid of a load
device or is precharged to either VDD or VDD-VT. The
series connection of M2 and M3 pulls BL2 low when a 1 is
stored. BL2 remains high in the opposite case. Notice that
the cell is inverting i.e. the inverse value of the stored signal
is sensed on the BL. The most common approach to
refreshing a cell is to read the stored data, put its inverse on
BL1 and assert WWL in consecutive order.
The properties of the 3T cell:
• In contrast to the SRAM cell, no constraints exist on the device ratios.
• Reading the 3T cell is non destructive i.e. the data value stored in the cell is not affected
by a read.
• No special process steps are needed. The storage capacitance is nothing more than the
gate capacitance of the readout device.
Memory Peripheral Circuitry (Control Circuitry)
Since the memory core trades performance and reliability for reduced area, memory design
relies extensively on the peripheral circuitry to recover both speed and electrical integrity.
The address decoders:
Whenever a memory allows for random address-based access, address decoders must be
present. There are two classes of decoders - the row decoder, whose task is to enable one memory row
out of 2^M, and the column and block decoders, which can be described as 2^K-input multiplexers,
where M and K are the widths of the respective fields in the address word.
Row decoders:
A 1-out-of-2^M decoder is nothing less than a collection of 2^M complex M-input logic gates.
Consider an 8-bit address decoder. Each of the outputs WLi is a logic function of the 8 input
address signals (A0 to A7). For example, address 0 is enabled by the following
logic function:
WL0 = A0'A1'A2'A3'A4'A5'A6'A7'
For a single-stage implementation it can be transformed into a wide NOR using De Morgan's
rules:
WL0 = (A0+A1+A2+A3+A4+A5+A6+A7)'

Static Decoder Design:


Implementing a wide NOR function in complementary CMOS is impractical. Splitting a
complex gate into two or more logic layers most often produces a faster and cheaper
implementation. Segments of the address are decoded in a first layer of logic called the
predecoder. A second layer of logic gates then produces the final word line signals.
WL0={(A0+A1)’+(A2+A3)’+(A4+A5)+(A6+A7)’}’

For this particular case, the address is partitioned into sections of 2 bits that are decoded in
advance. The resulting signals are combined using 4-input NAND gates to produce the fully
decoded array of WL signals.
Dynamic Decoders:
Since only one transition determines the decoder speed, it is interesting to evaluate other circuit
implementations.
Column and Block decoders:
The functionality of a column and block decoder is best described as a 2^K-input multiplexer,
where K stands for the size of the address word. One implementation is based on the CMOS
pass-transistor multiplexer. The control signals of the pass transistors are generated using a
K-to-2^K predecoder. The schematic of a 4-to-1 column decoder using only NMOS transistors is
shown. The main advantage of this implementation is its speed: only a single pass transistor is
inserted in the signal path, which introduces only a minimal extra resistance. The column
decoding is one of the last actions to be performed in the read sequence, so the predecoding
can be executed in parallel with other operations such as memory access and sensing, and can
be performed as soon as the column address is available. Consequently, the propagation delay
does not add to the overall memory access time.
A more efficient implementation is offered by a tree decoder that uses a binary reduction
scheme. Notice that no predecoder is required. The number of devices is drastically reduced as
shown.
Ntree = 2^K + 2^(K-1) + … + 4 + 2 = 2(2^K - 1)

A 4-to-1 tree based column decoder


Sense Amplifiers:
They perform the following functions:
• Amplification: In certain memory structures, such as the 1T DRAM, amplification is
required for proper functionality, since the typical signal swing is limited to 100 mV.
• Delay reduction: The amplifier compensates for the restricted fan out driving capability
of the memory cell by accelerating the BL transition, or by detecting and amplifying
small transitions on the BL to large output swings.
• Power reduction: Reducing the signal swing on the bitlines can eliminate a substantial
part of the power dissipation related to the charging and discharging of the bit lines.
• Signal restoration: Because the read and refresh functions are intrinsically linked in 1T
DRAMs, it is necessary to drive the BLs to the full signal range after sensing.
Differential Voltage Sensing Amplifiers:
Effectiveness of a differential amplifier is characterized by
1. Common mode rejection ratio CMRR: ability to amplify the true difference between
the signals and reject the common noise.
2. Power supply rejection ratio PSRR: spikes on the power supply are rejected by this
ratio
Figure shows the most basic sense amplifier. Amplification is achieved with a single stage,
based on the current-mirroring concept. The input signals are the heavily loaded bit lines,
driven by the SRAM memory cell. The swing on those lines is small, as the small memory cell
drives a large capacitive load. The inputs are fed to the differential input devices (M1 and M2),
while M3 and M4 act as an active current-mirror load. The amplifier is conditioned by the sense-
amplifier enable signal SE. Initially the inputs are precharged and equalized to a common value
while SE is low, disabling the circuit. Once the read operation is initiated, one of the bit lines
drops. SE is enabled when a sufficient differential signal has been established, and the
amplifier evaluates.

Power dissipation in memories:


Reduction of power dissipation in memories is becoming of premier importance. Technology
scaling with its reduction in supply and threshold voltages and its deterioration of the off current
of the transistor causes the standby power of the memory to rise.
Sources of power dissipation in memories:
The power consumption in a memory chip can be attributed to three major sources – the
memory cell array, the decoders (block, row, column) and the periphery. A unified active power
equation for a modern CMOS memory array of m columns and n rows is approximately given
by:
For a normal read cycle:
P = VDD·IDD
IDD = Iarray + Idecode + Iperiphery
    = [m·iact + m(n-1)·ihld] + [(n+m)·CDE·Vint·f] + [CPT·Vint·f + IDCP]
where iact: effective current of the selected (active) cells; ihld: the data-retention current of the
inactive cells; CDE: output capacitance of each decoder; CPT: the total capacitance of the CMOS
logic and peripheral circuits; Vint: internal supply voltage; IDCP: the static or quasi-static current
of the periphery (the major sources of this current are the sense amplifiers and the column
circuitry; other sources are the on-chip voltage generators); f: operating frequency.
The power dissipation is proportional to the size of the memory. Dividing the memory into
subarrays and keeping n and m small are essential to keep the power within bounds.
In general, the power dissipation of the memory is dominated by the array. The active power
dissipation of the peripheral circuits is small compared to other components. Its standby power
can be high however requiring that circuits such as sense amplifiers are turned off when not in
action. The decoder charging current is also negligibly small in modern RAMs especially if
care is taken that only one out of the n or m nodes is charged at every cycle.
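
A small numeric sketch of the active-power expression above (every parameter value here is an illustrative assumption, not data from the source):

def memory_power(m, n, vdd=1.2, vint=1.0, f=100e6, i_act=50e-6,
                 i_hld=1e-9, c_de=10e-15, c_pt=5e-12, i_dcp=100e-6):
    i_array = m * i_act + m * (n - 1) * i_hld       # active + retention cells
    i_decode = (n + m) * c_de * vint * f            # decoder charging current
    i_periph = c_pt * vint * f + i_dcp              # periphery + static current
    return vdd * (i_array + i_decode + i_periph)

# Halving both m and n (e.g. partitioning into four subarrays and activating
# only one) cuts the dominant array term: compare the two estimates below.
print(memory_power(1024, 1024), memory_power(512, 512))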
Power reduction Techniques:
1) Partitioning of the memory
A proper division of the memory into submodules goes a long way in confining active power
dissipation to the limited areas of the overall array. Memory units that are not in use should
consume only the power necessary for data retention. Memory partitioning is accomplished by
reducing m (the number of cells on a word line) and/or n (the number of cells on a bit line). By
dividing the word line into several sub word lines that are enabled only when addressed, the
overall switched capacitance per access is reduced.
Partitioning of the bit line reduces the capacitance switched at every read/write operation. An
approach that is often used in DRAM memories is the partially activated bit line. The bit line
is partitioned into multiple sections. All three sections share a common sense amplifier,
column decoder and I/O module.
2) Addressing the active power dissipation
Reducing the voltage levels is one of the most effective techniques to reduce power
dissipation in memories.
SRAM Active power dissipation:
To obtain a fast read operation, the voltage swing on the bit line is made as small as possible
typically between 0.1 and 0.3 V. The resulting signal is sent to the sense amplifier for
restoration. Since the signal is developed as a result of the ratio operation of the bit line load
and the cell transistor, a current flows through the bit line as long as the word line is activated
(Δt). Limiting Δt and the bit-line swing helps to keep the active dissipation of the SRAM low.
The situation is worse for the write operation, since BL and BL_BAR have to make a full
excursion. Reduction of the core voltage is the only remedy for this. Ultimately, the reduction
of the core voltage is limited by the mismatch between the paired MOS transistors in the SRAM
cell. Stringent control of the MOS transistor characteristics either at the process time or at the
run time using techniques such as body biasing is essential in low voltage operation mode.
DRAM Active power dissipation:
The destructive readout process of a DRAM necessitates successive operations of readout,
amplification and restoration of the selected cells. Consequently, the bit lines are charged and
discharged over the full swing (VBL) for every read operation. Care should thus be taken to
reduce bit line dissipation charge mCBLVBL, since it dominates the active power. Reducing
CBL (bit line capacitance) is advantageous from both a power and SNR perspective. Reducing
VBL, while very beneficial from a power perspective, negatively impacts the SNR.
Voltage reduction thus has to be accompanied by either an increase in the size of the storage
capacitor and/or a noise reduction. A number of techniques have proven to be quite effective.
a) Half-VDD precharge: Precharging the bit lines to VDD/2 helps to reduce active power in
DRAM memories by a factor of almost 2.
b) Boosted word line: Raising the value of the WL above VDD during a write operation
eliminates the threshold drop over the access transistor, yielding a substantial increase in
stored charge.
c) Increased capacitor area or value: Vertical capacitors such as those used in stacked and
trench cells are very effective in increasing the capacitance value. Keeping the ground plate
of the storage capacitor at VDD/2 reduces the maximum voltage over CS, making it possible
to use thinner oxides.
d) Increasing the cell size: Ultra-low voltage DRAM memory operation might require a
sacrifice of the area efficiency, especially for memories that are embedded in a system-on-
chip.
3) Data retention dissipation:
Data retention in SRAMs
In principle an SRAM array should not have any static power dissipation. Yet leakage current
of the cell transistors is becoming a major source of the retention current (due to subthreshold
leakage). Techniques to reduce the retention current of SRAM memories:
a) Turning off unused memory blocks:
Memory function such as caches do not fully use the available capacity for most of the time.
Disconnecting unused blocks from the supply rails using high threshold switches reduces their
leakage to very low values. Obviously, the data stored in the memory is lost in this approach.
b) Increasing the threshold by using body biasing:
Negative bias of the non active cells increases the thresholds of the devices and reduces the
leakage current.
c) Inserting extra resistance in the leakage path:
When data retention is necessary, the insertion of a low-threshold switch in
the leakage path provides a means to reduce the leakage current while keeping
the data intact. While the low-threshold device leaks on its own, that leakage
is sufficient to maintain the state of the memory. At the same time, the voltage
drop over the switch introduces a "stacking effect" in the memory cells
connected to it. A reduction of VGS combined with a negative VBS results in
a substantial drop in the leakage current.

d) Lowering supply voltage:

DRAM Retention power:


To combat leakage and loss of signal, DRAMs have to be refreshed
continuously when in data retention mode. The refresh operation is
performed by reading the m cells connected to a word line and restoring
them. This operation is performed for each of the n word lines in a
sequence. The standby power is thus proportional to the bit line
dissipation charge and the refresh frequency.
The secret to leakage minimization in DRAM memories is VT control.
This can be accomplished at the design time (the fixed VT approach) or
dynamically (the variable VT technique). One option to reduce leakage
through the access transistor in the DRAM cell is to turn off the device
hard by applying a negative voltage (-VWL) to the word lines of non-active cells.
