L15: Custom and ASIC VLSI Integration

Acknowledgements:
Materials in this lecture are courtesy of the following people and used with permission.
- Rabaey, J., A. Chandrakasan, B. Nikolic. Digital Integrated Circuits: A Design Perspective.
Prentice Hall, 2003.
- Curt Schurgers

L15: 6.111 Spring 2004 Introductory Digital Systems Laboratory 1


Layout 101

[Figure: circuit representation, cross-section, and layout of a CMOS circuit with input IN and output OUT between VDD and GND. Cross-section labels: p-type substrate, n-type well. Layout layers: metal, poly, n+ diff, p+ diff. Other labels: gate (G), source (S), drain (D); transistor dimensions Wp, Lp, Wn, Ln; metal/pdiff contact; contact from metal to ndiff.]

• Follow simple design rules (contract between process and circuit designers)

(Courtesy of Chris Terman. Used with permission.)

Custom Design/Layout
Itanium has 6 integer execution units like this

[Figure: adder schematic with inputs a and b. Blocks: 9-1 Mux, 5-1 Mux, 2-1 Mux, CARRYGEN, SUMGEN + LU (LU: Logical Unit), SUMSEL, REG (clocked by ck1). Signals: g64, node1, s0, s1, sum, sumb; the result goes to the cache.]

[Figure: die photograph of the Itanium integer datapath, 1000 µm wide, organized as bit slices 0 to 63. Labels: from register files / cache / bypass, multiplexers, shifter, adder stage 1, wiring, loopback bus, adder stage 2, wiring, adder stage 3, sum select, to register files / cache.]

Bit-slice Design Methodology

• Hand crafting the layout to achieve maximum clock rates (> 1 GHz)
• Exploits regularity in datapath structure to optimize interconnects

The ASIC Approach
[Figure: ASIC design flow, with design iteration looping back to earlier steps.]
Main flow: Design Capture (Verilog or VHDL), Logic Synthesis, Floorplanning, Placement, Routing, Tape-out
Verification steps: Pre-Layout Simulation, Post-Layout Simulation, Circuit Extraction
Abstraction levels: Behavioral, Structural, Physical

Most common design approach for designs with clock rates up to 500 MHz

Standard Cell Example

[Figure: layout of a 3-input NAND cell (from ST Microelectronics) between the power supply line (VDD) and the ground supply line (GND), shown with its delay table. Delay in ns; C = load capacitance, T = input rise/fall time.]

• Each library cell (FF, NAND, NOR, INV, etc.), in each of its size variations (gate strengths), is fully characterized across temperature, loading, etc.

Standard Cell Layout Methodology

[Figure: two standard-cell layouts. Left: 2-level metal technology. Right: current-day technology, with the cell structure hidden under the interconnect layers.]

• With limited interconnect layers, dedicated routing channels between rows of standard cells are needed
• Width of the cell is allowed to vary to accommodate complexity
• Interconnect plays a significant role in the speed of a digital circuit

Verilog to ASIC Layout
(the push button approach)

module adder64 (a, b, sum);
  input [63:0] a, b;
  output [63:0] sum;

  assign sum = a + b;
endmodule

[Figures: the adder64 design after synthesis, after placement, and after routing.]



The “Design Closure” Problem

[Figure: bus wires running alongside a VDD line, with wire-to-wire capacitance CI and load capacitance CL; λ = d1/d2 = l1/l2 = 5.]

Wire-to-wire capacitance causes inter-wire delay dependencies

[Figure: iterative removal of timing violations (white lines).]

Macro Modules

[Figure: 256×32 (or 8192 bit) SRAM generated by a hard-macro module generator.]

• Generate highly regular structures (entire memories, multipliers, etc.) with a few lines of code
• Verilog models for memories are automatically generated based on size



Clock Distribution

[Figure: two flip-flops (D, Q) receiving the clock along different paths.]

[Figure: IBM clock routing. (Image removed due to copyright considerations.)]

For a 1 GHz clock, the skew budget is 100 ps.

Variations along different paths arise from:
• Device: VT, W/L, etc.
• Environment: VDD, °C
• Interconnect: dielectric thickness variation

The Power Supply Wires are Not Ideal!

[Figure: a driver and a receiver connected to the VDD grid and the ground grid, which reach the outside through pads. Circuit elements: Ccoup, Cint, Rd, Cd.]

The IR-drop problem causes the internal power supply voltage to be less than the external source

(Courtesy of Prof. David Blaauw. Used with permission.)

Analog Circuits: Clock Frequency Multiplication (Phase Locked Loop)

[Figure: phase locked loop block diagram; the PFD drives "up" and "down" signals.]

• VCO produces high frequency square wave
• Divider divides down VCO frequency
• PFD compares phase of ref and div
• Loop filter extracts phase error information

Used widely in digital systems for clock synthesis (a standard IP block in most ASIC flows)

(Courtesy of Michael Perrott. Used with permission.)

Behavioral Transformations

• There are a large number of implementations of the same functionality
• Each implementation represents a different point in the area-time-power design space
• Behavioral transformations allow exploring the design space at a high level

Optimization metrics:
1. Area of the design
2. Throughput or sample time TS
3. Latency: clock cycles between the input and the associated output change
4. Power consumption
5. Energy of executing a task
6. …

[Figure: design space with axes area, time, and power.]



Fixed-Coefficient Multiplication

Conventional multiplication: Z = X · Y

                              X3     X2     X1     X0
                            × Y3     Y2     Y1     Y0
                              X3·Y0  X2·Y0  X1·Y0  X0·Y0
                       X3·Y1  X2·Y1  X1·Y1  X0·Y1
                X3·Y2  X2·Y2  X1·Y2  X0·Y2
         X3·Y3  X2·Y3  X1·Y3  X0·Y3
  Z7     Z6     Z5     Z4     Z3     Z2     Z1     Z0

Constant multiplication (becomes hardwired shifts and adds): Z = X · (1001)2

                              X3     X2     X1     X0
                            × 1      0      0      1
                              X3     X2     X1     X0
         X3     X2     X1     X0
  Z7     Z6     Z5     Z4     Z3     Z2     Z1     Z0

$Y = (1001)_2 = 2^3 + 2^0$

[Figure: X feeds an adder both directly and through a << 3 shift to produce Z; shifts are implemented using wiring.]

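
The shift-and-add structure for a fixed coefficient can be modeled in a few lines of Python. This is a behavioral sketch, not hardware: the function names `times_9` and `shift_add_multiply` are made up for illustration, and the inputs are assumed to be non-negative integers.

```python
def times_9(x: int) -> int:
    """Multiply by (1001)2 = 2^3 + 2^0 using one shift and one add."""
    return (x << 3) + x


def shift_add_multiply(x: int, constant: int) -> int:
    """Multiply x by a non-negative constant, one shifted copy of x per 1 bit."""
    total = 0
    shift = 0
    while constant:
        if constant & 1:
            # In hardware this shift is free: it is just wiring.
            total += x << shift
        constant >>= 1
        shift += 1
    return total


def adders_needed(constant: int) -> int:
    """Adders required: one fewer than the number of 1 bits in the constant."""
    return max(bin(constant).count("1") - 1, 0)
```

For the coefficient (1001)2 the model needs a single adder, which is why a constant multiplier is so much cheaper than the general 4×4 array of partial products.
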
Transform: Canonical Signed Digits (CSD)

Canonical signed digit representation is used to increase the number of zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.

Iterative encoding: replace each string of consecutive 1's

  0 1 1 … 1 1   with   1 0 0 … 0 -1

since $2^{N-2} + \dots + 2^1 + 2^0 = 2^{N-1} - 2^0$.

Worst case: CSD has 50% nonzero bits.

Example:

  0 1 1 0 1 1 1 1  =  0 1 1 1 0 0 0 -1  =  1 0 0 -1 0 0 0 -1

[Figure: Z is formed from X, X << 7, and X << 4. Shift translates to re-wiring.]

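
A minimal sketch of CSD encoding in Python follows. It does not perform the string replacement described on the slide; it uses the standard non-adjacent-form recurrence, which yields the same digits for the slide's example 01101111. The names `to_csd` and `csd_multiply` are hypothetical, and inputs are assumed to be non-negative integers.

```python
def to_csd(value: int) -> list[int]:
    """Return the signed digits {-1, 0, 1} of value, least significant first."""
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives +1, value mod 4 == 3 gives -1.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x: int, constant: int) -> int:
    """Multiply x by constant using one add or subtract per nonzero CSD digit."""
    total = 0
    for shift, digit in enumerate(to_csd(constant)):
        if digit:
            total += digit * (x << shift)
    return total
```

For 01101111 (decimal 111) the plain binary form has six 1 bits, while the CSD form has three nonzero digits, so the multiplier shrinks from five adders to two adder/subtractors.
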
Algebraic Transformations

Commutativity: A + B = B + A

Distributivity: (A + B) C = AC + BC

Associativity: (A + B) + C = A + (B + C)

Common sub-expressions: a sub-expression that appears more than once is computed once and shared.

[Figures: dataflow graphs before and after each transformation, with inputs A, B, C and X, Y.]

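
To see how an algebraic transformation changes hardware cost without changing the result, the sketch below counts the adders and multipliers used by both sides of the distributivity identity. The `Counted` wrapper and the `evaluate`, `expanded`, and `factored` helpers are hypothetical names introduced only for this illustration.

```python
class Counted:
    """Value wrapper that counts the adders and multipliers an expression uses."""

    ops = {"add": 0, "mul": 0}

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        Counted.ops["add"] += 1
        return Counted(self.value + other.value)

    def __mul__(self, other):
        Counted.ops["mul"] += 1
        return Counted(self.value * other.value)


def evaluate(expression, *inputs):
    """Return (result, operator counts) for one evaluation of expression."""
    Counted.ops = {"add": 0, "mul": 0}
    result = expression(*(Counted(v) for v in inputs))
    return result.value, dict(Counted.ops)


def expanded(a, b, c):
    return a * c + b * c  # AC + BC


def factored(a, b, c):
    return (a + b) * c  # (A + B) C
```

Evaluating `expanded` and `factored` on the same inputs gives the same value, but the factored form uses one multiplier instead of two.
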
Transforms for Efficient Resource Utilization

[Figure: dataflow graph with inputs A, B, C, D, E, F, G, H, I. Time multiplexing: mapped to 3 multipliers and 3 adders.]

distributivity

[Figure: transformed dataflow graph with inputs A, C, B, D, E, F, G, H, I. Reduce number of operators to 2 multipliers and 2 adders.]



A Very Useful Transform: Retiming

Retiming is the action of moving delays around in the system

• Delays have to be moved from ALL inputs to ALL outputs or vice versa

[Figure: a node with delays (D) moved from its inputs to its outputs.]

Cutset retiming: a cutset is a set of edges that, when cut, splits the graph into two disjoint partitions. To retime, delays are moved from the ingoing to the outgoing edges of the cutset or vice versa.

Benefits of retiming:
• Modify critical path delay
• Reduce total number of registers

(Courtesy of Prof. Charles E. Leiserson. Used with permission.)

Retiming Example: FIR Filter

Direct form:

$$ y(n) = h(n) \otimes x(n) = \sum_{i=0}^{K} x(n-i) \cdot h(i) $$

[Figure: direct form. x(n) passes through a chain of three delays (D); the taps are multiplied by h(0), h(1), h(2), h(3) and summed to give y(n). The legend shows the symbol for multiplication.]

associativity of the addition

[Figure: the same filter with the additions reordered. Operator delays are annotated (10) and (4); Tclk = 22 ns.]

retime

[Figure: transposed form. x(n) is multiplied by h(0), h(1), h(2), h(3) and the three delays (D) sit in the adder chain that produces y(n); Tclk = 14 ns.]

Note: here we use a first-cut analysis that assumes the delay of a chain of operators is the sum of their individual delays. This is not accurate.

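
The equivalence of the direct and transposed forms, and the clock periods quoted on the slide, can be checked with a small behavioral model. This is a sketch under stated assumptions: the annotations (10) and (4) are read as a 10 ns multiplier delay and a 4 ns adder delay, registers start at zero, and the function names are invented for the example.

```python
def fir_direct(x, h):
    """Direct form: delay line on the input, then multiply and sum all taps."""
    delay_line = [0] * len(h)
    y = []
    for sample in x:
        delay_line = [sample] + delay_line[:-1]
        y.append(sum(hi * xi for hi, xi in zip(h, delay_line)))
    return y


def fir_transposed(x, h):
    """Transposed form: the registers sit in the adder chain instead."""
    regs = [0] * (len(h) - 1)
    y = []
    for sample in x:
        products = [hi * sample for hi in h]
        y.append(products[0] + (regs[0] if regs else 0))
        # Each register captures its tap product plus the next register's value.
        regs = [
            products[i + 1] + (regs[i + 1] if i + 1 < len(regs) else 0)
            for i in range(len(regs))
        ]
    return y


def critical_path_ns(taps, t_mul=10, t_add=4, transposed=False):
    """First-cut estimate: sum of operator delays between registers."""
    adders_in_series = 1 if transposed else taps - 1
    return t_mul + adders_in_series * t_add
```

With four taps the direct form chains one multiplier and three adders (10 + 3 × 4 = 22 ns), while the transposed form has one multiplier and one adder between registers (10 + 4 = 14 ns). Both forms produce the same output sequence.
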
Pipelining, Just Another Transformation
(Pipelining = Adding Delays + Retiming)

Contrary to retiming, pipelining adds extra registers to the system

How to pipeline:
1. Add extra registers at all inputs
2. Retime

[Figure: registers (D) are added at the inputs ("add input registers") and then moved into the datapath ("retime").]

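
The two pipelining steps can be illustrated with a behavioral model of a two-stage computation f(g(x)). This is only a sketch: `g` and `f` stand for arbitrary combinational blocks, the register is assumed to be empty (modeled as `None`) before the first clock edge, and the delay figures passed to `clock_period` are hypothetical.

```python
def combinational(samples, g, f):
    """No registers: each output appears in the same cycle as its input."""
    return [f(g(x)) for x in samples]


def pipelined(samples, g, f):
    """Input register added, then retimed to sit between g and f."""
    stage_register = None  # empty until the first clock edge
    outputs = []
    for x in samples:
        # f sees the value that g produced in the previous cycle.
        outputs.append(None if stage_register is None else f(stage_register))
        stage_register = g(x)
    return outputs


def clock_period(t_g, t_f, is_pipelined):
    """First-cut estimate of the minimum clock period."""
    return max(t_g, t_f) if is_pipelined else t_g + t_f
```

The pipelined version produces the same values one cycle later, and its clock period is set by the slower stage rather than by the sum of both.
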


The Power of Transforms: Lookahead
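y(n) = x(n) + A y(n-1)

[Figure: recursive filter from x(n) to y(n) with a delay (D) and a multiplier A in the feedback loop. Caption: "Try pipelining this structure".]

loop unrolling

y(n) = x(n) + A[x(n-1) + A y(n-2)]

[Figure: unrolled filter with a delay D on the input path, a delay 2D in the feedback loop, and multipliers A.]

distributivity, associativity

[Figure: filter with the feedback loop reduced to a delay 2D and a single multiplier A^2, which is precomputed.]

retiming

[Figure: retimed filter with the 2D loop delay split into separate delays (D). Caption: "How about pipelining this structure!"]
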
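
Applying distributivity to the unrolled equation gives y(n) = x(n) + A x(n-1) + A^2 y(n-2), so the feedback loop now spans two delays and contains only the precomputed constant A^2. The sketch below checks that this lookahead form matches the original recursion; function names are hypothetical and all state is assumed to start at zero.

```python
def recursive(x, a):
    """Original filter: y(n) = x(n) + A y(n-1)."""
    y = []
    previous = 0
    for sample in x:
        previous = sample + a * previous
        y.append(previous)
    return y


def lookahead(x, a):
    """Lookahead filter: y(n) = x(n) + A x(n-1) + A^2 y(n-2)."""
    a_squared = a * a  # precomputed constant in the feedback loop
    y = []
    x_prev = 0  # x(n-1)
    y_prev = 0  # y(n-1)
    y_prev2 = 0  # y(n-2)
    for sample in x:
        current = sample + a * x_prev + a_squared * y_prev2
        y.append(current)
        x_prev, y_prev2, y_prev = sample, y_prev, current
    return y
```

Because the loop in `lookahead` reaches back two samples, one of its two delays can be retimed into the multiply-add path, which is what makes the structure pipelinable.
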
Scan Testing
Idea: have a mode in which all registers are chained into one giant shift register which can be loaded/read out bit-serially. Test the remaining (combinational) logic by:

(1) in “test” mode, shift in new values for all register bits, thus setting up the inputs to the combinational logic
(2) clock the circuit once in “normal” mode, latching the outputs of the combinational logic back into the registers
(3) in “test” mode, shift out the values of all register bits and compare against expected results.

[Figure: registers clocked by CLK, each fed through a 0/1 multiplexer controlled by ScanShift, chained from "shift in" to "shift out".]

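
The three scan steps can be sketched as a small Python model. Everything here is an assumption made for illustration: the `ScanChain` class, the `scan_test` helper, and the 3-bit `good_logic` and `faulty_logic` functions are hypothetical and do not come from the lecture.

```python
class ScanChain:
    """Registers that either capture the logic outputs or shift by one bit."""

    def __init__(self, n_bits, logic):
        self.regs = [0] * n_bits
        self.logic = logic  # combinational logic: list of bits -> list of bits

    def clock(self, scan_shift, shift_in=0):
        """Apply one clock edge; return the bit at the end of the chain before it."""
        shift_out = self.regs[-1]
        if scan_shift:
            self.regs = [shift_in] + self.regs[:-1]  # "test" mode
        else:
            self.regs = list(self.logic(self.regs))  # "normal" mode
        return shift_out


def scan_test(chain, pattern):
    """Run steps (1) to (3) and return the captured register values."""
    for bit in reversed(pattern):  # (1) shift in, last register's bit first
        chain.clock(True, bit)
    chain.clock(False)  # (2) one clock in normal mode
    captured = [chain.clock(True) for _ in pattern]  # (3) shift out
    return captured[::-1]  # bits emerge last register first


def good_logic(bits):
    a, b, c = bits
    return [a ^ b, a & b, 1 - c]


def faulty_logic(bits):
    a, b, c = bits
    return [a ^ b, 0, 1 - c]  # a & b output stuck at 0
```

Comparing the shifted-out values against the expected response exposes the stuck-at fault in `faulty_logic` for the pattern 1, 1, 0, even though no internal node is observed directly.
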


Trends: “Chip in a Day”
(Matlab/Simulink to Silicon…)

[Figure: chip layout with blocks S reg, X reg, Add/Sub/Shift, Mult1, Mult2, Mac1, Mac2.]

Map algorithms directly to silicon - bypass writing Verilog!

(Courtesy of R. Brodersen. Used with permission.)

Trends: Watermarking of Digital Designs

Fingerprinting is a technique to deter people from illegally redistributing legally obtained IP by enabling the author of the IP to uniquely identify the original buyer of the resold copy.

The essence of the watermarking approach is to encode the author's signature. The selection, encoding, and embedding of the signature must result in minimal performance and storage overhead.

(Images removed due to copyright considerations.)
