LECTURE 2. From Combinational Logic to Processor
Outline
Introduction
Combinational logic
Sequential logic
FSM design
Custom single-purpose processor design
RT-level custom single-purpose processor design
[Figure: (a) design abstraction levels, from idea through back-of-the-envelope, sequential program, register transfers, and logic down to implementation; (b) the design process proceeds to lower abstraction levels, narrowing in on a single implementation: modeling cost increases while opportunities decrease.]
What is synthesis?
Automatically converting a system's behavioral description into a structural implementation.
Structural implementation: a complex whole formed by parts; it must optimize design metrics.
Gajski's Y-chart
Each axis represents a type of description:
Behavioral: defines outputs as a function of inputs; algorithms but no implementation.
Structural: implements behavior by connecting components with known behavior.
Physical: gives the sizes and locations of components and wires on the chip or board.
[Figure: Y-chart with behavioral, structural, and physical axes; level labels include processors and memories (structural), sequential programs (behavior), register transfers, logic equations/FSM, transfer functions, cells, modules, and layout.]
Introduction
Processor
Digital circuit that performs a computation task; consists of a controller and a datapath.
General-purpose: variety of computation tasks.
Single-purpose: one particular computation task.
Custom single-purpose: non-standard task.
[Figure: digital camera chip containing an A2D, CCD preprocessor, pixel coprocessor, D2A, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, display controller, memory controller, UART, and LCD controller.]
[Figure: IC package and IC; nMOS and pMOS transistors; transistor-level CMOS gates F = (xy)' and F = (x+y)' with inputs x and y.]
Basic gates: inverter, NAND, NOR

One-input gates:
x | Driver (F = x) | Inverter (F = x')
0 | 0 | 1
1 | 1 | 0

Two-input gates:
x y | AND (F = xy) | OR (F = x+y) | XOR (F = x ⊕ y) | NAND (F = (xy)') | NOR (F = (x+y)') | XNOR (F = (x ⊕ y)')
0 0 | 0 | 0 | 0 | 1 | 1 | 1
0 1 | 0 | 1 | 1 | 1 | 0 | 0
1 0 | 0 | 1 | 1 | 1 | 0 | 0
1 1 | 1 | 1 | 0 | 0 | 0 | 1
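To double-check the tables, each gate can be modeled as an operation on single bits. This is a small Python sketch; the names `inverter`, `GATES`, and `truth_column` are illustrative and not part of the slides.

```python
# Gate functions on single bits (0 or 1), mirroring the truth tables above.
def inverter(x):
    return 1 - x  # F = x'


GATES = {
    "AND": lambda x, y: x & y,          # F = xy
    "OR": lambda x, y: x | y,           # F = x + y
    "XOR": lambda x, y: x ^ y,          # F = x XOR y
    "NAND": lambda x, y: 1 - (x & y),   # F = (xy)'
    "NOR": lambda x, y: 1 - (x | y),    # F = (x + y)'
    "XNOR": lambda x, y: 1 - (x ^ y),   # F = (x XOR y)'
}


def truth_column(name):
    """Return F for (x, y) = 00, 01, 10, 11, i.e. the table's row order."""
    gate = GATES[name]
    return [gate(x, y) for x in (0, 1) for y in (0, 1)]


if __name__ == "__main__":
    print("Inverter", [inverter(x) for x in (0, 1)])
    for name in GATES:
        print(name, truth_column(name))
```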
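E) Logic gates (inputs a, b, c) implementing the minimized equations
y = a + bc
z = ab + b'c + bc'
K-map for y:
a \ bc | 00 01 11 10
0 | 0 0 1 0
1 | 1 1 1 1
[Figure: gate-level circuit for y and z.]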
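A quick way to sanity-check minimized equations is to evaluate them on all eight input combinations. This Python sketch assumes the complemented literals lost in extraction are b' and c' in z, i.e. z = ab + b'c + bc'; the function names are illustrative.

```python
from itertools import product


def y(a, b, c):
    return a | (b & c)  # y = a + bc


def z(a, b, c):
    nb, nc = 1 - b, 1 - c  # complemented literals b', c'
    return (a & b) | (nb & c) | (b & nc)  # z = ab + b'c + bc'


def truth_table():
    """Rows (a, b, c, y, z) in binary counting order of abc."""
    return [(a, b, c, y(a, b, c), z(a, b, c))
            for a, b, c in product((0, 1), repeat=3)]


if __name__ == "__main__":
    for row in truth_table():
        print(row)
```

The y column agrees with the K-map: y is 1 whenever a = 1, and otherwise only when b = c = 1.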
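Combinational components
n-bit, m x 1 multiplexor: data inputs I(m-1) ... I1 I0 (n bits each), select inputs S(log m - 1) ... S0, output O (n bits); O is the data input selected by S.
log n x n decoder: inputs I(log n - 1) ... I0, outputs O(n-1) ... O1 O0; only the output whose index equals the input value is 1.
n-bit adder: inputs A, B (n bits each); outputs sum (n bits) and carry.
n-bit comparator: inputs A, B (n bits each); outputs less, equal, greater.
n-bit ALU: inputs A, B (n bits each) and S; O = A op B, where op is determined by S.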
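The behavior of these components can be captured in a few lines. This is a behavioral Python sketch on unsigned integers, not a gate-level design; the function names are ours, and the ALU's mapping from S to operations (0: add, 1: subtract, 2: AND, 3: OR) is a hypothetical choice, since the slide does not specify one.

```python
def mux(inputs, select):
    """m x 1 multiplexor: output the data input chosen by select."""
    return inputs[select]


def decoder(value, n):
    """log2(n) x n decoder: one-hot outputs O0..O(n-1), with O[value] = 1."""
    return [1 if i == value else 0 for i in range(n)]


def adder(a, b, n):
    """n-bit adder: return (carry, sum) for n-bit unsigned a and b."""
    total = a + b
    return total >> n, total & ((1 << n) - 1)


def comparator(a, b):
    """n-bit comparator: return (less, equal, greater) flags."""
    return int(a < b), int(a == b), int(a > b)


def alu(a, b, s, n):
    """n-bit ALU: O = A op B; this op encoding for S is hypothetical."""
    mask = (1 << n) - 1
    ops = {0: a + b, 1: a - b, 2: a & b, 3: a | b}
    return ops[s] & mask  # keep only n bits (subtraction wraps)
```

For example, a 4-bit `adder(15, 1, 4)` returns `(1, 0)`: the sum overflows into the carry.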
Logic synthesis
Logic-level behavior to structural implementation: logic equations and/or FSM to connected gates.
Minimize size.
Multilevel minimization: trade performance for size; Pareto-optimal solution; heuristics.
FSM synthesis: state minimization; state encoding.
Two-level minimization
Represent the logic function as a sum of products (or a product of sums): an AND gate for each product, an OR gate for each sum.
Sum of products: F = abc'd' + a'b'cd + a'bcd + ab'cd
Direct implementation (inputs a, b, c, d): four 4-input AND gates and one 4-input OR gate.
Minimum cover
Goal: minimum number of AND gates (sum of products).
Literal: a variable or its complement (a or a', b or b', etc.).
Cover: set of implicants that covers all minterms of the function.
Minimum cover: cover with the minimum number of implicants.
K-map of F (rows cd, columns ab):
cd \ ab | 00 01 11 10
00 | 0 0 1 0
01 | 0 0 0 0
11 | 1 1 0 1
10 | 0 0 0 0
Minimum cover
Covering all 1s with the minimum number of circles.
Example, direct vs. minimum cover:
Fewer gates: 4 vs. 5
Fewer transistors: 28 vs. 40
[Figure: the same K-map of F, covered with the implicants abc'd', a'cd, and b'cd.]
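Implementation (inputs a, b, c, d): F = abc'd' + a'cd + b'cd, built from one 4-input AND gate, two 3-input AND gates, and one 3-input OR gate: 26 transistors.
Fewer transistors: 26 vs. 28 (compared with the minimum cover above).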
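The cover claims can be checked mechanically: the direct sum of products and the cover abc'd' + a'cd + b'cd (read off the K-map, matching one 4-input and two 3-input AND gates) should agree on all 16 inputs. The transistor counts use a cost model of 2 transistors per gate input; that model is our assumption, not stated on the slides, but it reproduces the 40 and 26 figures.

```python
from itertools import product


def f_direct(a, b, c, d):
    """F = abc'd' + a'b'cd + a'bcd + ab'cd (one AND gate per minterm)."""
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    return ((a & b & nc & nd) | (na & nb & c & d)
            | (na & b & c & d) | (a & nb & c & d))


def f_cover(a, b, c, d):
    """F = abc'd' + a'cd + b'cd (one 4-input and two 3-input AND gates)."""
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    return (a & b & nc & nd) | (na & c & d) | (nb & c & d)


def transistors(gate_fan_ins):
    """Assumed cost model: 2 transistors per gate input."""
    return sum(2 * k for k in gate_fan_ins)


ALL_INPUTS = list(product((0, 1), repeat=4))
```

With this model, the direct form (four 4-input ANDs plus a 4-input OR) gives `transistors([4, 4, 4, 4, 4])`, i.e. 40, and the cover (ANDs of 4, 3, 3 inputs plus a 3-input OR) gives `transistors([4, 3, 3, 3])`, i.e. 26.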
Heuristic
Solution technique where an optimal solution is not guaranteed, but that hopefully comes close.
Reduce: opposite of expand.
Reshape: expands one implicant while reducing another; maintains the total number of implicants.
Irredundant: selects the minimum number of implicants that cover the function from the existing implicants.
Synthesis tools differ in the modifications they use and the order in which they use them.
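Of the modifications listed, irredundant is the easiest to sketch. In this minimal Python version, implicants are represented as sets of minterm indices (our own hypothetical encoding), and each implicant is dropped if the remaining ones still cover every minterm. Because the result depends on scan order, it is a heuristic and does not guarantee the true minimum. Expand, reduce, and reshape are not modeled.

```python
def irredundant(implicants, minterms):
    """Drop every implicant whose minterms are already covered by the others.

    implicants: list of frozensets of minterm indices (hypothetical encoding).
    minterms: set of all minterm indices of the function.
    """
    cover = list(implicants)
    i = 0
    while i < len(cover):
        others = cover[:i] + cover[i + 1:]
        if minterms <= set().union(*others):
            del cover[i]  # redundant: the rest still cover every minterm
        else:
            i += 1
    return cover


# Minterm index = binary value of abcd, e.g. abc'd' = 1100 = 12.
F_MINTERMS = {12, 3, 7, 11}
# Candidates: abc'd', a'b'cd, a'cd, b'cd.
CANDIDATES = [frozenset({12}), frozenset({3}),
              frozenset({3, 7}), frozenset({3, 11})]
```

For F from the K-map, the scan drops a'b'cd (index 3), because a'cd and b'cd already cover it, and keeps the other three implicants.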
[Figure: 2-level minimized versus multilevel minimized implementations compared by size; multilevel-minimized circuit with inputs a through h.]
FSM synthesis
FSM to gates.
State minimization: reduce the number of states.
Identify and merge equivalent states: outputs and next states are the same for all possible inputs.
The tabular method gives an exact solution: a table of all possible state pairs, so n states give on the order of n^2 table entries. Thus, heuristics are used with a large number of states.
State encoding: a unique bit sequence for each state. With n states, log2(n) bits (rounded up) and n! possible encodings. Thus, heuristics are common.
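The encoding count grows very quickly. Assuming codes use the minimum k = ⌈log2 n⌉ bits, the number of ways to give n states distinct codes is (2^k)! / (2^k - n)!, which equals n! exactly when n is a power of two and is larger otherwise. A small Python sketch with illustrative names:

```python
from math import factorial


def min_state_bits(n):
    """Smallest k with 2**k >= n (at least 1 bit)."""
    return max(1, (n - 1).bit_length())


def num_encodings(n):
    """Ways to give n states distinct codes using min_state_bits(n) bits."""
    codes = 1 << min_state_bits(n)
    return factorial(codes) // factorial(codes - n)
```

Four states have 24 encodings, but 13 states (4-bit codes) already have 3,487,131,648,000, far too many to try exhaustively.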
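Sequential components
n-bit register: inputs I (n bits), load, clear; output Q (n bits). Q = 0 if clear = 1; Q = I if load = 1 and clock = 1; Q = Q(previous) otherwise.
n-bit shift register: inputs I, shift; output Q. Q = lsb; on a shift, the content is shifted and I is stored in the msb.
n-bit counter: inputs count, clear; output Q (n bits). Q = 0 if clear = 1; Q = Q(previous) + 1 if count = 1 and clock = 1.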
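A behavioral Python model of these three components, as a sketch: each `clock()` call stands for one active clock edge, clear has priority over load, the shift register shifts toward the lsb, and the counter wraps to 0 after 2^n - 1 (the wrap-around is our assumption; the slide does not say).

```python
class Register:
    """n-bit register; each clock() call models one active clock edge."""

    def __init__(self, n):
        self.mask = (1 << n) - 1
        self.q = 0

    def clock(self, i, load=0, clear=0):
        if clear:
            self.q = 0              # clear has priority
        elif load:
            self.q = i & self.mask  # Q = I
        return self.q               # otherwise Q keeps its previous value


class ShiftRegister:
    """n-bit shift register: Q is the lsb, the serial input I enters at the msb."""

    def __init__(self, n):
        self.n = n
        self.q = 0

    @property
    def out(self):
        return self.q & 1  # Q = lsb

    def clock(self, i, shift=1):
        if shift:
            self.q = (self.q >> 1) | ((i & 1) << (self.n - 1))


class Counter:
    """n-bit counter; wraps to 0 after 2**n - 1 (an assumption)."""

    def __init__(self, n):
        self.mask = (1 << n) - 1
        self.q = 0

    def clock(self, count=0, clear=0):
        if clear:
            self.q = 0
        elif count:
            self.q = (self.q + 1) & self.mask  # Q = Q(previous) + 1
        return self.q
```

Shifting the bits 1, 0, 1, 1 into an empty 4-bit shift register leaves 1101 (13), with Q = 1.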
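Sequential logic design example: FSM with input a and output x.
[Figure: state diagram with states 0, 1, 2, 3: a=1 advances to the next state (3 returns to 0), a=0 stays in the current state; x=0 in states 0 to 2 and x=1 in state 3.]
State encoding Q1Q0 = state number in binary; I1I0 is the next state (state register input):

Q1Q0 | I1I0 (a=0) | I1I0 (a=1) | x
00 | 00 | 01 | 0
01 | 01 | 10 | 0
11 | 11 | 00 | 1
10 | 10 | 11 | 0

Minimized equations (from the K-maps):
I1 = Q1'Q0a + Q1a' + Q1Q0'
I0 = Q0a' + Q0'a
x = Q1Q0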
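The minimized equations can be checked by clocking them. This Python sketch assumes the equations I1 = Q1'Q0a + Q1a' + Q1Q0', I0 = Q0a' + Q0'a, and x = Q1Q0 (complements restored from the K-maps), starts in state 00, and records the output x before each clock edge; the helper names are illustrative.

```python
def inv(v):
    return 1 - v  # complement of a single bit


def next_state(q1, q0, a):
    """Next-state logic: returns (I1, I0)."""
    i1 = (inv(q1) & q0 & a) | (q1 & inv(a)) | (q1 & inv(q0))
    i0 = (q0 & inv(a)) | (inv(q0) & a)
    return i1, i0


def output(q1, q0):
    return q1 & q0  # x = Q1Q0


def run(inputs):
    """Clock the FSM from state 00; record x before each clock edge."""
    q1 = q0 = 0
    xs = []
    for a in inputs:
        xs.append(output(q1, q0))
        q1, q0 = next_state(q1, q0, a)
    return xs
```

With a held at 1, x is 1 once every four clocks; with a = 0 the FSM holds its state.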
[Figure: custom single-purpose processor model: a controller with a state register, and a datapath with registers and functional units.]
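[Figure: templates for converting program statements into FSMD states: an assignment (a = b) becomes a single state followed by the next statement; a loop (while (cond) { loop-body-statements }) becomes a condition state C with a cond transition into the loop body and a !cond transition to a join state J, which leads to the next statement.]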
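GCD FSMD (states are numbered after the statements of the GCD program; J states are join states):
1: 1 → 2; !1 → 1-J
2: !go_i → 2; !(!go_i) → 2-J
2-J: → 3
3: x = x_i → 4
4: y = y_i → 5
5: x!=y → 6; !(x!=y) → 5-J
6: x<y → 7; !(x<y) → 8
7: y = y - x → 6-J
8: x = x - y → 6-J
6-J: → 5
5-J: → 9
9: d_o = x → 1-J
1-J: → 1
Datapath: registers x and y, each fed by an n-bit 2x1 multiplexer (select x_sel or y_sel, load x_ld or y_ld). Select 0 passes the input x_i or y_i (states 3 and 4); select 1 passes a subtractor output, x - y (state 8) or y - x (state 7). A != comparator produces x_neq_y (state 5), a < comparator produces x_lt_y (state 6), and register d, loaded by d_ld, drives the output d_o (state 9).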
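To see the GCD FSMD execute, this Python sketch steps through its states (1, 2, 2-J, 3, ..., 9) one state per clock. It is a behavioral model, not the controller/datapath implementation; the `go_i` default, the `max_cycles` guard, and stopping at state 9 instead of looping back through 1-J are our assumptions.

```python
def gcd_fsmd(x_i, y_i, go_i=1, max_cycles=10_000):
    """Simulate the GCD FSMD one state per clock cycle.

    Returns (d_o, trace), where trace lists the states visited.
    Assumes x_i, y_i > 0; otherwise the x != y loop never ends.
    """
    state, x, y = "1", 0, 0
    trace = []
    for _ in range(max_cycles):
        trace.append(state)
        if state == "1":
            state = "2"                          # condition "1" is always true
        elif state == "2":
            state = "2" if not go_i else "2-J"   # wait while !go_i
        elif state == "2-J":
            state = "3"
        elif state == "3":
            x, state = x_i, "4"                  # x = x_i
        elif state == "4":
            y, state = y_i, "5"                  # y = y_i
        elif state == "5":
            state = "6" if x != y else "5-J"     # loop test x != y
        elif state == "6":
            state = "7" if x < y else "8"        # branch test x < y
        elif state == "7":
            y, state = y - x, "6-J"              # y = y - x
        elif state == "8":
            x, state = x - y, "6-J"              # x = x - y
        elif state == "6-J":
            state = "5"
        elif state == "5-J":
            state = "9"
        elif state == "9":
            return x, trace                      # d_o = x (real FSMD goes on to 1-J)
    raise RuntimeError("no result within max_cycles")
```

For GCD(42, 8) the trace visits state 5 nine times before reaching state 9 with d_o = 2.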
Controller implementation: each FSMD state gets a 4-bit code Q3Q2Q1Q0; control outputs per state (X = don't care):

State | Code | x_sel | y_sel | x_ld | y_ld | d_ld
1 | 0000 | X | X | 0 | 0 | 0
2 | 0001 | X | X | 0 | 0 | 0
2-J | 0010 | X | X | 0 | 0 | 0
3 | 0011 | 0 | X | 1 | 0 | 0
4 | 0100 | X | 0 | 0 | 1 | 0
5 | 0101 | X | X | 0 | 0 | 0
6 | 0110 | X | X | 0 | 0 | 0
7 | 0111 | X | 1 | 0 | 1 | 0
8 | 1000 | 1 | X | 1 | 0 | 0
6-J | 1001 | X | X | 0 | 0 | 0
5-J | 1010 | X | X | 0 | 0 | 0
9 | 1011 | X | X | 0 | 0 | 1
1-J | 1100 | X | X | 0 | 0 | 0

[Table: controller state table mapping the current state Q3..Q0 and inputs x_neq_y, x_lt_y, go_i to the next state I3..I0 and the outputs above.]
You may be asked in homework, exams, or projects to optimize the design with respect to a metric such as area, speed, power, or testability.
[Figure: Sender → Bridge → Receiver; the sender drives rdy_in and data_in(4), and the bridge drives rdy_out and data_out(8) to the receiver.]
Bridge: a single-purpose processor that converts two 4-bit inputs, arriving one at a time over data_in along with a rdy_in pulse, into one 8-bit output on data_out along with a rdy_out pulse.
Example
Bus bridge that converts a 4-bit bus to an 8-bit bus.
Start with the FSMD, known as the register-transfer (RT) level.
Exercise: complete the design.
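FSMD
Inputs: rdy_in: bit; data_in: bit[4]
Outputs: rdy_out: bit; data_out: bit[8]
Variables: data_lo, data_hi: bit[4]
WaitFirst4: rdy_in=0 → WaitFirst4; rdy_in=1 → RecFirst4Start
RecFirst4Start: data_lo = data_in → RecFirst4End
RecFirst4End: rdy_in=1 → RecFirst4End; rdy_in=0 → WaitSecond4
WaitSecond4: rdy_in=0 → WaitSecond4; rdy_in=1 → RecSecond4Start
RecSecond4Start: data_hi = data_in → RecSecond4End
RecSecond4End: rdy_in=1 → RecSecond4End; rdy_in=0 → Send8Start
Send8Start: data_out = data_hi & data_lo (concatenation); rdy_out = 1 → Send8End
Send8End: rdy_out = 0 → WaitFirst4
(a) Controller: the same states, with each register transfer replaced by a load signal: data_lo_ld = 1 in RecFirst4Start, data_hi_ld = 1 in RecSecond4Start, data_out_ld = 1 in Send8Start.
(b) Datapath: 4-bit registers data_lo and data_hi, loaded from data_in(4) by data_lo_ld and data_hi_ld; an 8-bit register data_out, loaded from data_hi and data_lo by data_out_ld; clk goes to all registers.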
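A behavioral Python sketch of the bus-bridge FSMD, one state per clock. It assumes a Send8Start state that sets data_out = data_hi & data_lo and rdy_out = 1 (implied by the datapath's data_out_ld signal and Send8End's rdy_out = 0), assumes the sender holds data_in stable while rdy_in is high, and models concatenation as `(data_hi << 4) | data_lo`.

```python
def bus_bridge(samples):
    """Simulate the bus-bridge FSMD, one state per clock.

    samples: list of (rdy_in, data_in) pairs, one per clock cycle.
    Returns the (rdy_out, data_out) pair seen after each cycle.
    """
    state = "WaitFirst4"
    data_lo = data_hi = 0
    rdy_out = data_out = 0
    outputs = []
    for rdy_in, data_in in samples:
        if state == "WaitFirst4":
            if rdy_in:
                state = "RecFirst4Start"
        elif state == "RecFirst4Start":
            data_lo = data_in & 0xF           # data_lo = data_in
            state = "RecFirst4End"
        elif state == "RecFirst4End":
            if not rdy_in:
                state = "WaitSecond4"
        elif state == "WaitSecond4":
            if rdy_in:
                state = "RecSecond4Start"
        elif state == "RecSecond4Start":
            data_hi = data_in & 0xF           # data_hi = data_in
            state = "RecSecond4End"
        elif state == "RecSecond4End":
            if not rdy_in:
                state = "Send8Start"
        elif state == "Send8Start":
            data_out = (data_hi << 4) | data_lo  # data_hi & data_lo
            rdy_out = 1
            state = "Send8End"
        elif state == "Send8End":
            rdy_out = 0
            state = "WaitFirst4"
        outputs.append((rdy_out, data_out))
    return outputs
```

Sending the nibble 0x3 and then 0xA produces a one-cycle rdy_out pulse with data_out = 0xA3.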
Replace the subtraction operation(s) with a modulo operation to speed up the program.
With subtraction, GCD(42, 8) needs 9 evaluations of the loop test (8 subtractions); x and y take the values (42, 8), (34, 8), (26, 8), (18, 8), (10, 8), (2, 8), (2, 6), (2, 4), (2, 2).
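The effect of the modulo change can be seen by tracing (x, y) in both versions. The slides do not give the exact modulo program, so this Python sketch assumes the standard remainder form of Euclid's algorithm, `x, y = y, x % y` repeated while y != 0.

```python
def gcd_subtract_trace(x, y):
    """(x, y) at each loop test of the subtraction-based GCD."""
    pairs = [(x, y)]
    while x != y:
        if x < y:
            y -= x
        else:
            x -= y
        pairs.append((x, y))
    return pairs


def gcd_modulo_trace(x, y):
    """(x, y) at each loop test of the assumed remainder-based GCD."""
    pairs = [(x, y)]
    while y != 0:
        x, y = y, x % y
        pairs.append((x, y))
    return pairs
```

For (42, 8) the subtraction version goes through 9 pairs, while the modulo version goes through only 3: (42, 8), (8, 2), (2, 0).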
Separating states
States that require complex operations (e.g., a*b*c*d) can be broken into smaller states to reduce hardware size.
Scheduling
Merge states 3 and 4: the assignment operations are independent of one another.
Merge states 5 and 6: the transitions from state 6 can be made in state 5.
Eliminate states 5-J and 6-J: the transitions from each state can be made from states 7 and 8, respectively.
Eliminate state 1-J: its transition can be made directly from state 9.
[Figure: optimized FSMD: merged state 3 (x = x_i and y = y_i); merged state 5 with transitions x<y → 7: y = y - x, x>y → 8: x = x - y, and x = y → 9: d_o = x.]
Multi-functional units
ALUs support a variety of operations, so one ALU can be shared among operations occurring in different states.
State minimization
The task of merging equivalent states into a single state.
Two states are equivalent if, for all possible input combinations, they generate the same outputs and transition to the same next state.
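The equivalence definition can be applied mechanically. Instead of the tabular (state-pair) method, this Python sketch uses partition refinement (Moore's algorithm), a different technique that also finds exact equivalence classes: start by grouping states with identical outputs, then keep splitting groups whose members move to different groups on some input. The four-state FSM below is a made-up example.

```python
def _renumber(labels):
    """Map arbitrary hashable block labels to small integers."""
    ids = {}
    return {s: ids.setdefault(lbl, len(ids)) for s, lbl in labels.items()}


def minimize_states(states, inputs, delta, out):
    """Return the blocks (sets) of equivalent states.

    delta[(s, a)] is the next state and out[(s, a)] the output
    of state s under input a.
    """
    # Initial partition: states with identical outputs for every input.
    block = _renumber({s: tuple(out[(s, a)] for a in inputs) for s in states})
    while True:
        # Refine: next states must also lie in the same blocks.
        refined = _renumber({
            s: (block[s],) + tuple(block[delta[(s, a)]] for a in inputs)
            for s in states
        })
        if len(set(refined.values())) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = refined
    groups = {}
    for s in states:
        groups.setdefault(block[s], set()).add(s)
    return sorted(groups.values(), key=sorted)


# Hypothetical FSM: B and C have the same outputs and next states.
STATES = ["A", "B", "C", "D"]
INPUTS = [0, 1]
DELTA = {("A", 0): "B", ("A", 1): "C", ("B", 0): "D", ("B", 1): "A",
         ("C", 0): "D", ("C", 1): "A", ("D", 0): "D", ("D", 1): "A"}
OUT = {("A", 0): 0, ("A", 1): 0, ("B", 0): 0, ("B", 1): 1,
       ("C", 0): 0, ("C", 1): 1, ("D", 0): 1, ("D", 1): 0}
```

Here B and C are merged, leaving three states: {A}, {B, C}, {D}.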
Technology mapping
Library of gates available for implementation:
Simple: only 2-input AND and OR gates.
Complex: AND, OR, NAND, NOR, etc. gates with various numbers of inputs, plus efficiently implemented meta-gates (e.g., AND-OR-INVERT, MUX).
The final structure consists only of the specified library's components.
If technology mapping is integrated with logic synthesis: a more efficient circuit, but a more complex problem; fast heuristics are required.
Today
Gate delay is shrinking as feature size shrinks, while wire delay is increasing.
Performance evaluation therefore needs wire length.
[Figure: wire delay versus transistor delay.]
Transistor placement (needed for wire length) is the domain of physical design.
Thus, simultaneous logic synthesis and physical design are required for efficient circuits.
Elevator system
CRC cards are a well-known method for analyzing a system and developing an architecture.
CRC:
Classes: logical groupings of data and functionality.
Responsibilities: describe what the classes do.
Collaborators: other classes with which a given class works.
Architectural classes: Car state, Floor control reader, Car control reader, Car control sender, Scheduler.
[Figure: elevator system with F floors and N hoistways.]
Physical interfaces
Elevator control classes: Elevator car, Passenger, Floor control, Car control, Car sensors, etc.
Architecture
Computation and I/O occur at:
Floor control panels/displays
Elevator cars
System controller
System controller
Must take inputs from many sources.
Must control cars to hard real-time deadlines; the user interface and scheduling have soft deadlines.
Testing: build an elevator simulator using SystemC, Verilog, VHDL, and FPGA.
Simulate multiple elevators.
Simulate real-time control demands.
Homework 2
The simplest possible custom single-purpose processor
Design a processor to multiply two numbers. The initial data are in registers/counters A and B; the result should be in register/counter C. Only reversible counters (with reading) may be used in the datapath. The counters perform the following operations:
Add one
Subtract one
Read new value
Tasks:
Invent the algorithm for multiplication, using the minimum number of counters.
Design the reversible counter by hand using logic gates and D flip-flops.
Design the control unit.
Design the datapath.
Draw the timing diagram of the whole system.
You can use VHDL or Verilog to help you, but I need your design done by hand.
Summary
Custom single-purpose processors
Straightforward design techniques
Can be built to execute algorithms
Typically start with an FSMD
CAD tools can be of great assistance
5. Draw the schematic of the FSMD.
6. Explain Euclid's GCD algorithm on examples.
7. Without looking at the slides, convert the GCD algorithm to an FSMD.
8. How can we optimize GCD? Apply these ideas to the Least Common Multiple algorithm and an FSMD for two numbers.
EECE 353-1 Real-Time Systems, T. John Koo, Embedded Computing Systems Laboratory, Institute for Software Integrated Systems, Department of Electrical Engineering and Computer Science, Vanderbilt University, 5306 Stevenson Center, January 16, 2006