RTL Design Slides PDF
EEE344
M. Hassan Aslam
https://sites.google.com/ciitlahore.edu.pk/mianhassanaslam/
Digital Design
Chapter 5:
Register-Transfer Level
(RTL) Design
Slides to accompany the textbook Digital Design, First Edition,
by Frank Vahid, John Wiley and Sons Publishers, 2007.
http://www.ddvahid.com
[Figure: design levels, showing the register-transfer level (RTL) with higher levels above it]
❑ Capture comb. behavior: Equations, truth tables
❑ Convert to circuit: AND + OR + NOT → Comb. logic
3
Note: Slides with animation are denoted with a small red "a" near the animated items
RTL Design: Capture Behavior, Convert to Circuit
◼ Recall
❑ Chapter 2: Combinational Logic Design
◼ First step: Capture behavior (using equation or truth table)
◼ Remaining steps: Convert to circuit
❑ Chapter 3: Sequential Logic Design
◼ First step: Capture behavior (using FSM)
◼ Remaining steps: Convert to circuit
4
5.2
RTL Design Method
5
RTL Design Method: “Preview” Example
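◼ Soda dispenser
❑ c: bit input, 1 when coin deposited
❑ a: 8-bit input having value of deposited coin
❑ s: 8-bit input having cost of a soda
❑ d: bit output, set to 1 to dispense soda
[Figure: soda dispenser processor block with inputs c, a, s and output d]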
7
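Preview Example
Inputs: c (bit), a (8 bits), s (8 bits)
[Figure: high-level state machine with states Init (d=0, tot=0), Wait, Add and Disp; Wait goes to Add on c, stays in Wait on c’*(tot<s), and goes to Disp on c’*(tot<s)’]
◼ Need 8-bit comparator
[Figure: datapath]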
8
Preview Example: Step 3 – Connect Datapath to a Controller
◼ Controller’s inputs
❑ External input c
❑ Input from datapath: comparator’s output, which we named tot_lt_s
◼ Controller’s outputs
❑ External output d (dispense soda)
❑ Outputs to datapath to load and clear the tot register: tot_ld, tot_clr
[Figure: controller (input c, output d) connected to the datapath (inputs s and a, 8 bits each) through tot_ld, tot_clr and tot_lt_s]
9
Preview Example: Step 4 – Derive the Controller’s FSM
◼ Same states and arcs as high-level state machine
◼ But set/read datapath control signals for all datapath operations and conditions
Inputs: c, tot_lt_s (bit)
Outputs: d, tot_ld, tot_clr (bit)
[Figure: controller FSM. Init: d=0, tot_clr=1; Wait; Add: tot_ld=1; Disp: d=1. From Wait: c goes to Add, c’*tot_lt_s stays in Wait, c’*tot_lt_s’ goes to Disp]
[Figure: datapath with the tot register (ld = tot_ld, clr = tot_clr), an 8-bit comparator (<) producing tot_lt_s, and an 8-bit adder; inputs s and a are 8 bits each]
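The controller/datapath split above can be tried out in a small behavioral simulation. This is only a sketch under stated assumptions: the function name `soda_dispenser` and the per-cycle `(c, a)` event format are made up for illustration, the Add state is assumed to compute tot = tot + a (as the 8-bit adder in the datapath suggests), and `a` is assumed to stay valid during the Add cycle.

```python
# Behavioral sketch of the soda dispenser: controller FSM plus datapath.
# Registers (state, tot) change only at the clock edge, i.e. at the end of each loop pass.

def soda_dispenser(s, events):
    """s: cost of a soda. events: list of (c, a) pairs, one per clock cycle.
    Returns the value of output d in each cycle."""
    state, tot = "Init", 0
    d_trace = []
    for c, a in events:
        tot_lt_s = tot < s                  # comparator: datapath -> controller
        # Controller outputs depend only on the current state
        d = 1 if state == "Disp" else 0
        tot_clr = state == "Init"
        tot_ld = state == "Add"
        # Next-state logic
        if state == "Init":
            nxt = "Wait"
        elif state == "Wait":
            nxt = "Add" if c else ("Wait" if tot_lt_s else "Disp")
        elif state == "Add":
            nxt = "Wait"
        else:                               # Disp
            nxt = "Init"
        # Clock edge: update the tot register and the state register
        if tot_clr:
            tot = 0
        elif tot_ld:
            tot = (tot + a) & 0xFF          # assumed: 8-bit adder computes tot + a
        state = nxt
        d_trace.append(d)
    return d_trace
```

With s = 60 and three 25-cent coins, d pulses for one cycle only after the third coin has been added, because the comparator result is read in Wait one cycle after tot is loaded.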
10
Preview Example: Completing the Design
◼ Implement the FSM as a state register and logic
❑ As in Ch3
❑ Table shown below
Inputs: c, tot_lt_s (bit)
Outputs: d, tot_ld, tot_clr (bit)
State  s1 s0 c  tot_lt_s | n1 n0 d  tot_ld tot_clr
Init   0  0  0  0        | 0  1  0  0      1
Init   0  0  0  1        | 0  1  0  0      1
Init   0  0  1  0        | 0  1  0  0      1
Init   0  0  1  1        | 0  1  0  0      1
Wait   0  1  0  0        | 1  1  0  0      0
Wait   0  1  0  1        | 0  1  0  0      0
Wait   0  1  1  0        | 1  0  0  0      0
Wait   0  1  1  1        | 1  0  0  0      0
Add    1  0  0  0        | 0  1  0  1      0
(remaining Add rows not shown)
Disp   1  1  0  0        | 0  0  1  0      0
(remaining Disp rows not shown)
[Figure: controller FSM. Init: d=0, tot_clr=1; Wait; Add: tot_ld=1; Disp: d=1]
11
Step 1: Create a High-Level State Machine
◼ Let’s consider each step of the RTL design process in more detail
◼ Step 1
Inputs: c (bit), a (8 bits), s (8 bits)
Outputs: d (bit)
Local registers: tot (8 bits)
12
Example: Laser-Based Distance Measurer
[Figure: a laser pulse travels to the object of interest at distance D and back to the sensor in T seconds]
2D = T sec * 3*10^8 m/sec
13
Example: Laser-Based Distance Measurer
[Figure: laser-based distance measurer block with B (from button), L (to laser), S (from sensor) and D (16 bits, to display)]
◼ Inputs/outputs
❑ B: bit input, from button, to begin measurement
❑ L: bit output, activates laser
❑ S: bit input, senses laser reflection
❑ D: 16-bit output, to display computed distance
14
Example: Laser-Based Distance Measurer
DistanceMeasurer
Inputs: B (bit), S (bit)
Outputs: L (bit), D (16 bits)
Local storage: Dreg (16 bits)
◼ S0 (required; first state usually initializes the system)
❑ L := '0' // laser off
❑ Dreg := 0 // distance is 0
◼ S1: leave on B (button pressed); next state still to be defined (?)
16
Example: Laser-Based Distance Measurer
DistanceMeasurer
◼ S0: L := '0', Dreg := 0
◼ S1: stay while B', go to S2 on B
◼ S2: L := '1' // laser on
◼ S3: L := '0' // laser off
17
Example: Laser-Based Distance Measurer
DistanceMeasurer
Inputs: B (bit), S (bit)
Outputs: L (bit), D (16 bits)
Local storage: Dreg, Dctr (16 bits)
◼ S0: L := '0', Dreg := 0
◼ S1: Dctr := 0 // reset cycle count; stay while B', go to S2 on B
◼ S2: L := '1'
◼ S3: L := '0', Dctr := Dctr + 1 // count cycles; stay while S' (no reflection), leave on S (reflection); next state still to be defined (?)
18
Example: Laser-Based Distance Measurer
DistanceMeasurer
Inputs: B (bit), S (bit)
Outputs: L (bit), D (16 bits)
Local storage: Dreg, Dctr (16 bits)
◼ S0: L := '0', Dreg := 0
◼ S1: Dctr := 0; stay while B', go to S2 on B
◼ S2: L := '1'
◼ S3: L := '0', Dctr := Dctr+1; stay while S', go to S4 on S
◼ S4: Dreg := Dctr/2 // calculate D
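The complete high-level state machine can be stepped through in software to check the counting. The sketch below is illustrative only: the function name and the per-cycle input lists are assumptions, and S4 is assumed to return to S1, which the slide text does not show. With the 300 MHz clock used later in this example, light travels 3*10^8 / (300*10^6) = 1 m per clock cycle, so Dctr/2 is the one-way distance in meters.

```python
# Behavioral sketch of the DistanceMeasurer high-level state machine.

def distance_measurer(button, sensor):
    """button, sensor: lists with one bit per clock cycle. Returns the final Dreg."""
    state, dctr, dreg = "S0", 0, 0
    for b, s in zip(button, sensor):
        if state == "S0":                    # L := 0, Dreg := 0
            dreg, nxt = 0, "S1"
        elif state == "S1":                  # Dctr := 0, wait for the button
            dctr, nxt = 0, ("S2" if b else "S1")
        elif state == "S2":                  # L := 1 (laser on for one cycle)
            nxt = "S3"
        elif state == "S3":                  # L := 0, count cycles until reflection
            dctr, nxt = (dctr + 1) & 0xFFFF, ("S4" if s else "S3")
        else:                                # S4: Dreg := Dctr / 2
            dreg, nxt = dctr >> 1, "S1"      # assumed: return to S1 afterwards
        state = nxt
    return dreg
```

For example, if the reflection is sensed in the tenth cycle spent in S3, Dctr is 10 and Dreg becomes 5, i.e. 5 m at 300 MHz.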
19
Step 2: Create a Datapath
◼ Datapath must
❑ Implement data storage
❑ Implement data computations
◼ Look at high-level state machine, do three substeps
❑ (a) Make data inputs/outputs be datapath inputs/outputs
❑ (b) Instantiate declared registers into the datapath (also instantiate a register for each data output)
❑ (c) Examine every state and transition, and instantiate datapath components and connections to implement any data computations
Instantiate: to introduce a new component into a design.
20
Step 2 Example: Laser-Based Distance Measurer
Inputs: B, S (1 bit each)
Outputs: L (bit), D (16 bits)
Local registers: Dctr (16 bits)
(a) Make data inputs/outputs be datapath inputs/outputs
(b) Instantiate declared registers into the datapath (also instantiate a register for each data output)
(c) Examine every state and transition, and instantiate datapath components and connections to implement any data computations
[Figure: high-level state machine. S0: L=0, D=0; S1: Dctr = 0, stays while B’; S2: L=1; S3: L=0, Dctr = Dctr + 1, stays while S’; S4: D = Dctr / 2 (calculate D)]
[Figure: datapath with Dctr, a 16-bit up-counter (clear = Dctr_clr, count = Dctr_cnt), and Dreg, a 16-bit register (clear = Dreg_clr, load = Dreg_ld) whose output Q drives D (16 bits)]
21
Step 2 Example: Laser-Based Distance Measurer
Inputs: B, S (1 bit each)
Outputs: L (bit), D (16 bits)
Local registers: Dctr (16 bits)
(c) (continued) Examine every state and transition, and instantiate datapath components and connections to implement any data computations
[Figure: high-level state machine as on the previous slide]
[Figure: datapath in which the 16-bit Dctr output passes through a right shift by one (>>1) into the input of Dreg (Dreg_clr, Dreg_ld); Dreg drives D (16 bits)]
22
Step 2 Example Showing Mux Use
Local registers: E, F, G, R (16 bits)
T0: R = E + F
T1: R = R + G
[Figure, panels (a) to (d): datapath options for T0 and T1, ending with a single adder whose A and B inputs come from 2×1 muxes (selects add_A_s0 and add_B_s0) and whose result is stored in R]
◼ Introduce mux when one component input can come from more than one source
23
Step 3: Connecting the Datapath to a Controller
◼ Laser-based distance measurer example
◼ Easy – just connect all control signals between controller and datapath
[Figure: controller (inputs B from button and S from sensor, output L to laser) connected to the datapath through Dreg_clr, Dreg_ld, Dctr_clr and Dctr_cnt; datapath output D (16 bits) goes to the display; 300 MHz clock]
[Figure: high-level state machine. S0: L=0, D=0; S1: Dctr = 0, stays while B’; S2: L=1; S3: L=0, Dctr = Dctr + 1, stays while S’; S4: D = Dctr / 2 (calculate D)]
Inputs: B, S
Outputs: L, Dreg_clr, Dreg_ld, Dctr_clr, Dctr_cnt
◼ FSM has same structure as high-level state machine
❑ Inputs/outputs all bits now
❑ Replace data operations by bit operations using datapath
State  L  Dreg_clr  Dreg_ld  Dctr_clr  Dctr_cnt  Comment
S0     0  1         0        0         0         laser off, clear D reg
S1     0  0         0        1         0         clear count
S2     1  0         0        0         0         laser on
S3     0  0         0        0         1         laser off, count up
S4     0  0         1        0         0         load D reg with Dctr/2, stop counting
Transitions: S0 to S1; S1 stays while B’, goes to S2 on B; S2 to S3; S3 stays while S’, goes to S4 on S
25
Step 4: Deriving the Controller’s FSM
[Figure: FSM with states S0 through S4 and transition conditions B’, B, S’, S]
26
Step 4
[Figure: controller and datapath connected as in Step 3; the datapath contains Dctr (16-bit up-counter), the >>1 shifter and Dreg (16-bit register); 300 MHz clock]
[Figure: controller FSM. S0: L=0, Dreg_clr = 1 (laser off, clear D reg); S1: Dctr_clr = 1 (clear count); S2: L=1 (laser on); S3: L=0, Dctr_cnt = 1 (laser off, count up); S4: Dreg_ld = 1, Dctr_cnt = 0 (load D reg with Dctr/2, stop counting)]
◼ Implement FSM as state register and logic (Ch3) to complete the design
27
RTL Design Examples and Issues
◼ We’ll use several more examples to illustrate RTL design
◼ Example: Bus interface
❑ Master processor can read register from any peripheral
◼ Each register has unique 4-bit address
◼ Assume 1 register/periph.
❑ Sets rd=1, A=address
❑ Appropriate peripheral places register data on 32-bit D lines
◼ Periph’s address provided on Faddr inputs (maybe from DIP switches, or another register)
[Figure: master processor connected by rd, D (32 bits) and A (4 bits) to peripherals Per0, Per1, ..., Per15]
[Figure: one peripheral, made of a bus interface (rd, D, A to/from processor bus; Faddr, 4 bits; Q, 32 bits) and a main part]
28
RTL Example: Bus Interface
Inputs: rd (bit); Q (32 bits); A, Faddr (4 bits)
Outputs: D (32 bits)
Local register: Q1 (32 bits)
◼ WaitMyAddress: D = “Z”, Q1 = Q
❑ Stay while ((A = Faddr) and rd)’
❑ Go to SendData when (A = Faddr) and rd
◼ SendData: D = Q1
❑ Stay while rd
❑ Return to WaitMyAddress when rd’
30
RTL Example: Bus Interface
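Inputs: rd (bit); Q (32 bits); A, Faddr (4 bits)
Outputs: D (32 bits)
Local register: Q1 (32 bits)
[Figure: high-level state machine. WaitMyAddress: D = “Z”, Q1 = Q; stays while ((A = Faddr) and rd)’, goes to SendData when (A = Faddr) and rd. SendData: D = Q1; stays while rd, returns when rd’]
[Figure: datapath. A and Faddr (4 bits each) feed a 4-bit equality comparator producing A_eq_Faddr; Q (32 bits) feeds register Q1 (ld = Q1_ld); Q1 drives D (32 bits) when D_en is set]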
31
RTL Example: Bus Interface
Inputs: rd (bit); Q (32 bits); A, Faddr (4 bits)
Outputs: D (32 bits)
Local register: Q1 (32 bits)
Controller FSM
Inputs: rd, A_eq_Faddr (bit)
Outputs: Q1_ld, D_en (bit)
◼ WaitMyAddress: D_en = 0, Q1_ld = 1
❑ Stay while (A_eq_Faddr and rd)’
❑ Go to SendData when A_eq_Faddr and rd
◼ SendData: D_en = 1, Q1_ld = 0
❑ Stay while rd
❑ Return to WaitMyAddress when rd’
[Figure: bus interface made of the controller and the datapath (4-bit comparator producing A_eq_Faddr, register Q1 loaded by Q1_ld, output D enabled by D_en)]
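The bus interface controller and datapath can be exercised with a short simulation. This is a sketch, not the textbook's implementation: the function name and the `(rd, A, Q)` per-cycle tuple format are assumptions, and the string "Z" stands for an undriven (high-impedance) D output.

```python
# Behavioral sketch of the bus interface: two-state controller plus datapath.

def bus_interface(faddr, cycles):
    """faddr: this peripheral's 4-bit address.
    cycles: list of (rd, A, Q) tuples, one per clock cycle.
    Returns what the interface drives on D in each cycle ('Z' = not driving)."""
    state, q1 = "WaitMyAddress", 0
    d_trace = []
    for rd, a, q in cycles:
        a_eq_faddr = (a == faddr)                # 4-bit comparator in the datapath
        d_en = state == "SendData"               # controller outputs
        q1_ld = state == "WaitMyAddress"
        d_trace.append(q1 if d_en else "Z")      # D is driven only when D_en = 1
        if state == "WaitMyAddress":
            nxt = "SendData" if (a_eq_faddr and rd) else "WaitMyAddress"
        else:
            nxt = "SendData" if rd else "WaitMyAddress"
        if q1_ld:
            q1 = q & 0xFFFFFFFF                  # clock edge: Q1 := Q
        state = nxt
    return d_trace
```

Because Q1 is loaded in every WaitMyAddress cycle, the value sent is the Q sampled in the cycle where the address matched, and it stays stable on D for as long as rd is held.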
32
RTL Example: Video Compression – Sum of Absolute Differences
[Figure: Frame 1 and Frame 2 of a video; only difference: ball moving]
33
RTL Example: Video Compression – Sum of Absolute Differences
[Figure: Frame 1 and Frame 2 being compared]
Each is a pixel, assume represented as 1 byte (actually, a color picture might have 3 bytes per pixel, for intensity of red, green, and blue components of pixel)
◼ Need to quickly determine whether two frames are similar
enough to just send difference for second frame
❑ Compare corresponding 16x16 “blocks”
◼ Treat 16x16 block as 256-byte array
❑ Compute the absolute value of the difference of each array item
❑ Sum those differences – if above a threshold, send complete frame
for second frame; if below, can use difference method (using
another technique, not described)
34
RTL Example: Video Compression – Sum of Absolute Differences
[Figure: SAD block with inputs A (256-byte array), B (256-byte array) and go, and output sad (integer)]
35
RTL Example: Video Compression – Sum of Absolute Differences
Inputs: A, B (256 byte memory); go (bit)
Outputs: sad (32 bits)
Local registers: sum, sad_reg (32 bits); i (9 bits)
◼ S0: wait for go
◼ S1: initialize sum and index (sum = 0, i = 0)
◼ S2: check if done (i>=256)
◼ S3: add difference to sum, increment index (sum=sum+abs(A[i]-B[i]), i=i+1)
◼ S4: done, write to output sad_reg (sad_reg = sum)
Transitions: S0 stays while !go, goes to S1 on go; S1 to S2; S2 goes to S3 on i<256 and to S4 on (i<256)’; S3 back to S2
36
RTL Example: Video Compression – Sum of Absolute Differences
Inputs: A, B (256 byte memory); go (bit)
Outputs: sad (32 bits)
Local registers: sum, sad_reg (32 bits); i (9 bits)
◼ Step 2: Create datapath
[Figure: high-level state machine with states S0 to S4 as on the previous slide]
[Figure: datapath. Counter i (9 bits; i_inc, i_clr) drives AB_addr and a <256 comparator producing i_lt_256; A_data and B_data (8 bits each) feed a subtractor followed by abs; an adder adds the result to sum (32 bits; sum_ld, sum_clr); sad_reg (32 bits; sad_reg_ld) drives sad]
37
Example: Video Compression – Sum of Absolute Differences
◼ Step 3: Connect to controller
◼ Step 4: Replace high-level state machine by FSM
Controller (high-level action → control signals):
◼ S0: wait while go’
◼ S1: sum=0 → sum_clr=1; i=0 → i_clr=1
◼ S2: test i<256 → read i_lt_256
◼ S3: sum=sum+abs(A[i]-B[i]) → sum_ld=1, AB_rd=1; i=i+1 → i_inc=1
◼ S4: sad_reg=sum → sad_reg_ld=1
[Figure: controller connected to the datapath (counter i, <256 comparator, subtractor, abs, adder, sum, sad_reg) with memory signals AB_rd, AB_addr, A_data, B_data]
38
RTL Example: Video Compression – Sum of Absolute Differences
◼ Comparing software and custom circuit SAD
❑ Circuit: Two states (S2 & S3) for each i, 256 i’s → 512 clock cycles
❑ Software: Loop (for i = 1 to 256), but for each i, must move memory to local registers, subtract, compute absolute value, add to sum, increment i – say about 6 cycles per array item → 256*6 = 1536 cycles
❑ Circuit is about 3 times faster (1536/512 = 3)
❑ Later, we’ll see how to build SAD circuit that is even faster
[Figure: loop part of the state machine. S2 goes to S3 on i<256 and exits on (i<256)’; S3: sum=sum+abs(A[i]-B[i]), i=i+1]
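The cycle counts above can be checked by stepping through the SAD state machine in software. The sketch below is illustrative: the function name is an assumption, go is taken as already asserted, and the 6 cycles per item for software is the slide's own estimate. Counting every visit to S2 and S3 gives 513 rather than 512, since the final S2 check that exits the loop also takes a cycle.

```python
# Behavioral sketch of the SAD high-level state machine, counting loop cycles.

def sad_hlsm(A, B):
    """A, B: sequences of 256 byte values. Returns (sad, cycles spent in S2 and S3)."""
    state, total, i, loop_cycles = "S1", 0, 0, 0
    while True:
        if state == "S1":                        # initialize sum and index
            total, i, state = 0, 0, "S2"
        elif state == "S2":                      # check if done
            loop_cycles += 1
            state = "S3" if i < 256 else "S4"
        elif state == "S3":                      # accumulate and increment
            loop_cycles += 1
            total, i, state = total + abs(A[i] - B[i]), i + 1, "S2"
        else:                                    # S4: sad_reg = sum
            return total, loop_cycles

sad, circuit_cycles = sad_hlsm([10] * 256, [7] * 256)
software_cycles = 256 * 6                        # slide's estimate: about 6 cycles per item
speedup = software_cycles / 512                  # using the slide's rounded 512 cycles
```

Here `sad` is 256 * 3 = 768 and `speedup` is 3.0, matching the "about 3 times faster" estimate.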
39
RTL Design Pitfalls and Good Practice
◼ Common pitfall: Assuming register is updated in the state it’s written
❑ Final value of Q?
❑ Final state?
❑ Answers may surprise you
◼ Value of Q unknown
◼ Final state is C, not D
❑ Why?
◼ State A: R=99 and Q=R happen
simultaneously
◼ State B: R not updated with R+1 until
next clock cycle, simultaneously with
state register being updated
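The timing behind this pitfall can be mimicked in a few lines. This is a sketch under assumptions, since the slide's state diagram is not reproduced in the text: state A is taken to perform R=99 and Q=R, state B to perform R=R+1, and B is assumed to go to C when R<100 and to D otherwise. The function name is also made up for illustration.

```python
# Sketch of "registers update at the clock edge", not in the state that writes them.

def run_pitfall(initial_R):
    """initial_R: whatever R held before state A. Returns (R, Q, final state)."""
    R = initial_R
    # State A: R=99 and Q=R happen simultaneously, so Q receives the OLD value of R.
    # Python evaluates the right-hand side before assigning, which mirrors this.
    R, Q = 99, R
    # State B: the transition condition reads R before the R+1 update is stored.
    next_state = "C" if R < 100 else "D"     # assumed condition; still sees R = 99
    R = R + 1                                # stored at the same edge as the new state
    return R, Q, next_state
```

Calling `run_pitfall(7)` returns `(100, 7, "C")`: Q holds the stale value 7 rather than 99, and the machine ends in C even though R is 100 afterwards.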
40
RTL Design Pitfalls and Good Practice
◼ Solutions
❑ Read register in
following state
(Q=R)
❑ Insert extra state
so that conditions
use updated value
❑ Other solutions are
possible, depends
on the example
41
RTL Design Pitfalls and Good Practice
◼ Common pitfall: Reading outputs
❑ Outputs can only be written
❑ Solution: Introduce additional register, which can be written and read
(a) Inputs: A, B (8 bits); Outputs: P (8 bits). State S: P=A. State T: P=P+B (reads output P)
(b) Inputs: A, B (8 bits); Outputs: P (8 bits); Local register: R (8 bits). State S: R=A, P=A. State T: P=R+B
42
RTL Design Pitfalls and Good Practice
◼ Good practice: Register all data outputs
❑ In fig (a), output P would show spurious values as addition computes
◼ Furthermore, longest register-to-register path, which determines clock period, is not known until that output is connected to another component
❑ In fig (b), spurious outputs reduced, and longest register-to-register path is clear
[Figure: (a) B and R feed an adder that drives output P directly; (b) the adder output is stored in register Preg, which drives P]
43
Control vs. Data Dominated RTL Design
◼ Filter concept
❑ Suppose X is data from a temperature sensor, and particular input sequence is 180, 180, 181, 240, 180, 181 (one per clock cycle)
❑ That 240 is probably wrong!
◼ Could be electrical noise
❑ Filter should remove such noise in its output Y
❑ Simple filter: Output average of last N values
◼ Small N: less filtering
◼ Large N: more filtering, but less sharp output
[Figure: digital filter block with 12-bit input X, 12-bit output Y and clk]
45
Data Dominated RTL Design Example: FIR Filter
◼ FIR filter
❑ “Finite Impulse Response”
❑ Simply a configurable weighted sum of past input values
❑ y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2)
◼ Above known as “3 tap”
◼ Tens of taps more common
◼ Very general filter – User sets the constants (c0, c1, c2) to define specific filter
❑ RTL design
◼ Step 1: Create high-level state machine
❑ But there really is none! Data dominated indeed.
◼ Go straight to step 2
[Figure: digital filter block with 12-bit input X, 12-bit output Y and clk]
46
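The 3-tap equation can be tried on the temperature sequence from the filter-concept slide. This is a minimal sketch: the function name is an assumption, past samples are assumed to start at 0, and setting c0 = c1 = c2 = 1 followed by a division by 3 turns the FIR filter into the simple "average of last N values" filter with N = 3.

```python
# Sketch of a 3-tap FIR filter: y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2).

def fir3(xs, c0, c1, c2):
    """xs: input samples, one per clock cycle. Returns the list of y(t) values."""
    xt1 = xt2 = 0                    # x(t-1) and x(t-2), assumed to start at 0
    ys = []
    for x in xs:
        ys.append(c0 * x + c1 * xt1 + c2 * xt2)
        xt2, xt1 = xt1, x            # shift the sample history by one position
    return ys

sums = fir3([180, 180, 181, 240, 180, 181], 1, 1, 1)
averages = [y // 3 for y in sums]    # integer average of the last three samples
```

Once three real samples are in the history, the outputs sit at 180 and then 200, so the 240 spike is spread out and reduced rather than passed straight through.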
Data Dominated RTL Design Example: FIR Filter
[Figure: sample input values (180, 181, 240) moving through the filter over successive clock cycles]
47
Data Dominated RTL Design Example: FIR Filter
[Figure: three multipliers contributing to output Y]
48
Data Dominated RTL Design Example: FIR Filter
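◼ Step 2: Create datapath
❑ Instantiate adders
y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2)
[Figure: 3-tap FIR filter datapath. X (12 bits) enters a chain of registers xt0, xt1, xt2 holding x(t), x(t-1), x(t-2); registers c0, c1, c2 hold the constants; three multipliers and two adders produce Y (12 bits); clk drives the registers]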
49
Data Dominated RTL Design Example: FIR Filter
[Figure: FIR datapath with three multipliers, two adders and an output register yreg driving Y]
50
Data Dominated RTL Design Example: FIR Filter
y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2)
◼ Step 3 & 4: Connect to controller, Create FSM
❑ No controller needed
❑ Extreme data-dominated example
❑ (Example of an extreme control-dominated design – an FSM, with no
datapath)
◼ Comparing the FIR circuit to a software implementation
❑ Circuit
◼ Assume adder has 2-gate delay, multiplier has 20-gate delay
◼ Longest path goes through one multiplier and two adders
❑ 20 + 2 + 2 = 24-gate delay
◼ 100-tap filter, following design on previous slide, would have about a 34-gate
delay: 1 multiplier and 7 adders on longest path
❑ Software
◼ 100-tap filter: 100 multiplications, 100 additions. Say 2 instructions per
multiplication, 2 per addition. Say 10-gate delay per instruction.
◼ (100*2 + 100*2)*10 = 4000 gate delays
❑ Circuit is more than 100 times faster (10,000% faster). Wow.
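The delay estimates above can be reproduced with a little arithmetic. The sketch assumes, as the "1 multiplier and 7 adders" figure implies, that the products are summed by a balanced adder tree of depth ceil(log2(taps)); the function names and default delays are illustrative only.

```python
# Sketch: gate-delay estimates for the FIR circuit versus a software loop.

def fir_circuit_delay(taps, mult_delay=20, add_delay=2):
    """Longest path: one multiplier, then an adder tree of depth ceil(log2(taps))."""
    tree_depth = (taps - 1).bit_length()     # integer form of ceil(log2(taps))
    return mult_delay + add_delay * tree_depth

def fir_software_delay(taps, instr_per_mult=2, instr_per_add=2, delay_per_instr=10):
    """One multiplication and one addition per tap, executed one after another."""
    return (taps * instr_per_mult + taps * instr_per_add) * delay_per_instr

circuit = fir_circuit_delay(100)             # 20 + 7*2 gate delays
software = fir_software_delay(100)           # (100*2 + 100*2)*10 gate delays
ratio = software / circuit
```

With these numbers the ratio comes out a little under 118, in line with "more than 100 times faster".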
51
5.4
Determining Clock Frequency
◼ Designers of digital circuits often want fastest performance
❑ Means want high clock frequency
◼ Frequency limited by longest register-to-register delay
❑ Known as critical path
❑ If clock is any faster, incorrect data may be stored into register
❑ Longest path in the figure is 2 ns
◼ Ignoring wire delays, and register setup and hold times, for simplicity
[Figure: registers a and b feed an adder with 2 ns delay whose output is stored in register c; all registers are clocked by clk]
52
Critical Path
◼ Example shows four paths
❑ a to c through +: 2 ns
❑ a to d through + and *: 7 ns
❑ b to d through + and *: 7 ns
❑ b to d through *: 5 ns
◼ Longest path is thus 7 ns: Max(2, 7, 7, 5) ns
◼ Fastest frequency
❑ 1 / 7 ns = 142 MHz
[Figure: registers a and b; adder (2 ns delay) and multiplier (5 ns delay); registers c and d]
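The same calculation can be written out as a tiny script. This is just a sketch of the arithmetic; the dictionary keys are descriptive labels, not names from the slide.

```python
# Sketch: critical path and maximum clock frequency for the four-path example.

paths_ns = {
    "a to c through +": 2,
    "a to d through + and *": 7,
    "b to d through + and *": 7,
    "b to d through *": 5,
}

critical_ns = max(paths_ns.values())      # longest register-to-register delay
f_max_mhz = 1000 / critical_ns            # 1/ns is GHz, so 1000/ns gives MHz
```

The result is about 142.86 MHz; the slide's 142 MHz rounds down, which is the safe direction because a faster clock would violate the 7 ns path.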
53
Critical Path Considering Wire Delays
◼ Real wires have delay too
❑ Must include in critical path
◼ Example shows two paths
❑ Each is 0.5 + 2 + 0.5 = 3 ns
◼ Trend
❑ 1980s/1990s: Wire delays were tiny compared to logic delays
❑ But wire delays not shrinking as fast as logic delays
◼ Wire delays may even be greater than logic delays!
◼ Must also consider register setup and hold times, also add to path
◼ Then add some time to the computed path, just to be safe
❑ e.g., if path is 3 ns, say 4 ns instead
[Figure: registers a and b connect through 0.5 ns wires to a 2 ns adder, whose output reaches register c through another 0.5 ns wire; each path is 3 ns]
54
A Circuit May Have Numerous Paths
◼ Paths can exist
❑ In the datapath
❑ In the controller
❑ Between the controller and datapath
❑ May be hundreds or thousands of paths
◼ Timing analysis helpful
[Figure: soda dispenser controller (combinational logic plus state register with n1, n0) and datapath (tot register, 8-bit comparator producing tot_lt_s, 8-bit adder), with example paths marked]
55
5.5
Behavioral Level Design: C to Gates
C code
int SAD (byte A[256], byte B[256]) // not quite C syntax
{
   uint sum; short uint i;
   sum = 0;
   i = 0;
   while (i < 256) {
      sum = sum + abs(A[i] - B[i]);
      i = i + 1;
   }
   return sum;
}
[Figure: corresponding high-level state machine. S0: wait while !go; S1: sum = 0, i = 0; S2: go to S3 on i<256, to S4 on (i<256)’; S3: sum=sum+abs(A[i]-B[i]), i=i+1; S4: sad_reg = sum]
56
Behavioral-Level Design: Start with C (or Similar
Language)
◼ Replace first step of RTL design method by two steps
❑ Capture in C, then convert C to high-level state machine
❑ How to convert from C to high-level state machine?
Step 1A: Capture in C
Step 1B: Convert to high-level state machine
57
Converting from C to High-Level State Machine
58
Converting from C to High-Level State Machine
◼ If-then-else
❑ Becomes state with condition check, transitioning to “then” statements if condition true, or to “else” statements if condition false
   if (cond) {
      // then stmts
   }
   else {
      // else stmts
   }
[Figure: condition state with arc cond to (then stmts) and arc !cond to (else stmts), both continuing to (end)]
◼ While loop
❑ Becomes state with condition check, transitioning to while statements if condition true
   while (cond) {
      // while stmts
   }
[Figure: condition state with arc cond to (while stmts)]
59
Simple Example of Converting from C to
High-Level State Machine
Inputs: uint X, Y
Outputs: uint Max
if (X > Y) {
   Max = X;
}
else {
   Max = Y;
}
[Figure: the if-then-else template (arc X>Y to then stmts, arc !(X>Y) to else stmts, then end) and the same machine with the statements filled in as Max=X and Max=Y]
60
Example: Converting Sum-of-Absolute-Differences C code to
High-Level State Machine
Inputs: byte A[256], B[256]
bit go;
Output: int sad
main()
{
   uint sum; short uint i;
   while (1) {
      while (!go);
      sum = 0;
      i = 0;
      while (i < 256) {
      ...
◼ Convert each construct to states
❑ Simplify when possible, e.g., merge states
◼ From high-level state machine, follow RTL design
[Figure, panels (b) to (f): step-by-step conversion. The outer while (1) loop, the while (!go) wait (arcs !go and go), the merged state sum=0, i=0, the inner while loop, and the final state sad = sum]
61
Register Files
◼ MxN register file component provides efficient access to M N-bit-wide registers
❑ If we have many registers but only need access one or two at a time, a register file is more efficient
❑ Ex: Above-mirror display (earlier example), but this time having 16 32-bit registers
◼ Too many wires, and big mux is too slow
[Figure: above-mirror display built from 16 separate 32-bit registers reg0 to reg15, loaded through a 4×16 decoder and read through a 32-bit 16×1 mux; annotations: huge mux, too much fanout, congestion]
62
Register File
◼ Instead, want component that has one data input and one data output,
and allows us to specify which internal register to write and which to read
[Figure: 16×32 register file with W_data (32 bits), W_addr (4 bits), W_en, R_data (32 bits), R_addr (4 bits), R_en]
63
Register File Timing Diagram
◼ Can write one register and read one register each clock cycle
❑ May be same register
[Figure: 4x32 register file with W_data (32 bits), W_addr (2 bits), W_en, R_data (32 bits), R_addr (2 bits), R_en]
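The register file's ports can be modelled with a small class. This is a behavioral sketch under assumptions, not the component's actual internals: the class and method names are made up, a read in the same cycle as a write to the same register is assumed to return the old value, and "Z" is assumed for R_data when R_en is 0.

```python
# Behavioral sketch of an MxN register file with one write port and one read port.

class RegisterFile:
    def __init__(self, m=16, n=32):
        self.regs = [0] * m                  # M registers
        self.mask = (1 << n) - 1             # each N bits wide

    def clock(self, w_data=0, w_addr=0, w_en=0, r_addr=0, r_en=0):
        """Simulate one clock cycle. Returns the value seen on R_data."""
        r_data = self.regs[r_addr] if r_en else "Z"   # read the current contents
        if w_en:
            self.regs[w_addr] = w_data & self.mask    # write takes effect at the clock edge
        return r_data

rf = RegisterFile(4, 32)                     # the 4x32 register file from the slide
first = rf.clock(w_data=9, w_addr=3, w_en=1)
second = rf.clock(w_data=22, w_addr=1, w_en=1, r_addr=3, r_en=1)
third = rf.clock(w_data=5, w_addr=1, w_en=1, r_addr=1, r_en=1)
fourth = rf.clock(r_addr=1, r_en=1)
```

The third call writes and reads register 1 in the same cycle, which the slide allows; in this model the read returns the previously stored 22 and the new value 5 is visible one cycle later.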
64
5.6
Memory Components
◼ Register-transfer level design instantiates datapath components to create datapath, controlled by a controller
❑ A few more components are often used outside the controller and datapath
◼ MxN memory
❑ M words, N bits wide each
◼ Several varieties of memory, which we now introduce
[Figure: M×N memory: M words, each N bits wide]
65
Random Access Memory (RAM)
◼ RAM – Readable and writable memory
❑ “Random access memory”
◼ Strange name – Created several decades ago to contrast with sequentially-accessed storage like tape drives
❑ Logically same as register file – Memory with address inputs, data inputs/outputs, and control
◼ RAM usually just one port; register file usually two or more
❑ RAM vs. register file
◼ RAM typically larger than roughly 512 or 1024 words
◼ RAM typically stores bits using a bit storage approach that is more efficient than a flip flop
◼ RAM typically implemented on a chip in a square rather than rectangular shape – keeps longest wires (hence delay) short
[Figure: register file from Chpt. 4: 16×32 register file with W_data, W_addr, W_en, R_data, R_addr, R_en]
[Figure: RAM block symbol: 1024 × 32 RAM with data (32 bits), addr (10 bits), rw, en]
66
RAM Internal Structure
[Figure: internal structure of the 1024x32 RAM (data 32 bits, addr 10 bits, rw, en). Let A = log2M. An AxM decoder, enabled by en, takes addr0 to addr(A-1) and drives the word enable lines d0 to d(M-1). Each word is a row of bit storage blocks (aka “cells”), one per data line wdata(N-1) to wdata0; rw goes to all cells]
68
Static RAM (SRAM)
[Figure: 1024x32 RAM block symbol with data (32 bits), addr (10 bits), rw, en]
◼ “Static” RAM cell
❑ Reading this cell
◼ Somewhat trickier
◼ When rw set to read, the RAM logic sets both data and data’ to 1
◼ The stored bit d will pull either the left line or the right line down slightly below 1
◼ “Sense amplifiers” detect which side is slightly pulled down
[Figure: SRAM cell with word enable = 1 and both data and data’ driven to 1; the stored bit d pulls one line slightly below 1 (<1); the lines continue to the sense amplifiers]
69
Dynamic RAM (DRAM)
[Figure: 1024x32 RAM block symbol with data (32 bits), addr (10 bits), rw, en]
◼ DRAM cell
❑ 1 transistor (rather than 6)
❑ Relies on large capacitor to store bit
◼ Write: Transistor conducts, data voltage level gets stored on top plate of capacitor
◼ Read: Just look at value of d
◼ Problem: Capacitor discharges over time
❑ Must “refresh” regularly, by reading d and then writing it right back
[Figure: (a) DRAM cell: data line, word enable, one transistor and a slowly discharging capacitor storing d; (b) waveforms of data, enable and d, showing d discharging after the write]
70
Comparing Memory Types
◼ Register file
❑ Fastest
[Figure: MxN memory implemented as a: (options compared on this slide)]
71
Reading and Writing a RAM
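[Figure (a): timing over clock cycles 1, 2, 3. addr = 9, 13, 9; data = 500, 999, then Z followed by 500 driven by the RAM; rw: 1 means write; after cycle 1 RAM[9] now equals 500, after cycle 2 RAM[13] now equals 999]
[Figure (b): setup time and hold time for addr, data and rw around the clock edge, and access time before read data is valid]
◼ Writing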
❑ Put address on addr lines, data on data lines, set rw=1, en=1
◼ Reading
❑ Set addr and en lines, but put nothing (Z) on data lines, set rw=0
❑ Data will appear on data lines
◼ Don’t forget to obey setup and hold times
❑ In short – keep inputs stable before and after a clock edge
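The write/read protocol and the three-cycle timing example can be sketched as a tiny behavioral model. The class and method names are assumptions for illustration; "Z" marks cycles in which the RAM itself drives nothing onto the data lines (disabled, or being written to), and setup and hold times are not modelled.

```python
# Behavioral sketch of a single-port RAM with addr, data, rw and en.

class RAM:
    def __init__(self, words=1024, width=32):
        self.mem = [0] * words
        self.mask = (1 << width) - 1

    def clock(self, addr, rw, en, data="Z"):
        """One clock cycle. rw=1 means write, rw=0 means read.
        Returns what the RAM drives onto the data lines."""
        if not en:
            return "Z"                           # disabled: RAM drives nothing
        if rw == 1:
            self.mem[addr] = data & self.mask    # write the value on the data lines
            return "Z"
        return self.mem[addr]                    # read: data appears on the data lines

ram = RAM()
ram.clock(addr=9, rw=1, en=1, data=500)          # cycle 1: RAM[9] now equals 500
ram.clock(addr=13, rw=1, en=1, data=999)         # cycle 2: RAM[13] now equals 999
read_back = ram.clock(addr=9, rw=0, en=1)        # cycle 3: read address 9
```

In cycle 3 nothing is placed on the data lines by the reader, and the RAM returns the 500 written in cycle 1.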
72
RAM Example: Digital Sound Recorder
[Figure: processor connected to a 4096×16 RAM (data, 16 bits; addr driven by Ra; rw by Rrw; en by Ren). A microphone feeds an analog-to-digital converter (ad_ld, ad_buf, 12 bits); a digital-to-analog converter (da_ld) feeds the speaker]
◼ Behavior
❑ Record: Digitize sound, store as series of 4096 12-bit digital values in RAM
◼ We’ll use a 4096x16 RAM (12-bit wide RAM not common)
❑ Play back later
❑ Common behavior in telephone answering machine, toys, voice recorders
◼ To record, processor should read a-to-d, store read values into successive RAM words
◼ To play, processor should read successive RAM words and enable d-to-a
73
RAM Example: Digital Sound Recorder
◼ RTL design of processor
[Figure: processor connected to the 4096x16 RAM]
74
RAM Example: Digital Sound Recorder
❑ Now create play behavior
❑ Use local register a again, create state machine that counts from 0 to 4095 again
◼ For each a
❑ Read RAM
[Figure: processor connected to the 4096x16 RAM over a 16-bit data bus (Ra, Rrw, Ren), with the analog-to-digital converter (ad_ld, ad_buf, 12 bits) and the digital-to-analog converter (da_ld)]
75
Read-Only Memory – ROM
◼ Memory that can only be read from, not written to
❑ Data lines are output only
❑ No need for rw input
[Figure: 1024× 32 RAM block symbol with data (32 bits), addr (10 bits), rw and en, shown for comparison]
76
Read-Only Memory – ROM
[Figure: ROM block symbol: 1024x32 ROM with data (32 bits), addr (10 bits), en]
[Figure: ROM internal structure. Let A = log2M. An AxM decoder, enabled by en, takes addr0 to addr(A-1) and drives the word enable lines d0 to d(M-1); each word is a row of bit storage blocks (aka “cells”) connected to the data lines]
77
ROM Types
◼ If a ROM can only be read, how
are the stored bits stored in the
first place?
❑ Storing bits in a ROM known as
programming
❑ Several methods
◼ Mask-programmed ROM
[Figure: two cells on data lines reading 1 and 0]
78
ROM Types
◼ Fuse-Based Programmable ROM
❑ Each cell has a fuse
❑ A special device, known as a programmer, blows certain fuses (using higher-than-normal voltage)
◼ Those cells will be read as 0s (involving some special electronics)
◼ Cells with unblown fuses will be read as 1s
◼ 2-bit word on right stores “10”
[Figure: two cells on one word enable line, each with a data line; the first cell has an intact fuse, the second a blown fuse]
79
ROM Types
◼ Erasable Programmable ROM (EPROM)
❑ Uses “floating-gate transistor” in each cell
❑ Special programmer device uses higher-than-normal voltage to cause electrons to tunnel into the gate
◼ Electrons become trapped in the gate
◼ Only done for cells that should store 0
◼ Other cells (without electrons trapped in gate) will be 1
❑ 2-bit word on right stores “10”
◼ Details beyond our scope – just general idea is necessary here
❑ To erase, shine ultraviolet light onto chip
◼ Gives trapped electrons energy to escape
◼ Requires chip package to have window
[Figure: two cells with floating-gate transistors on one word enable line; the cell reading 1 has no trapped electrons, the cell reading 0 has trapped electrons in its floating gate]
80
ROM Types
◼ Electronically-Erasable Programmable ROM (EEPROM)
❑ Similar to EPROM
◼ Uses floating-gate transistor, electronic programming to trap electrons in certain cells
❑ But erasing done electronically, not using UV light
❑ Erasing done one word at a time
◼ Flash memory
❑ Like EEPROM, but all words (or large blocks of words) can be erased simultaneously
❑ Became common relatively recently (late 1990s)
◼ Both types are in-system programmable
❑ Can be programmed with new stored bits while in the system in which the ROM operates
◼ Requires bi-directional data lines, and write control input
[Figure: 1024x32 EEPROM block symbol with data (32 bits), addr (10 bits), en, write and busy]
81
ROM Example: Digital Telephone Answering Machine
Using a Flash Memory
◼ Want to record the outgoing announcement
❑ When rec=1, record digitized sound in locations 0 to 4095
❑ When play=1, play those stored sounds to digital-to-analog converter
◼ What type of memory?
❑ Should store without power supply – ROM, not RAM
❑ Should be in-system programmable – EEPROM or Flash, not EPROM, OTP ROM, or mask-programmed ROM
❑ Will always erase entire memory when reprogramming – Flash better than EEPROM
[Figure: processor (inputs rec from the record button and play from the play button) connected to a 4096x16 Flash (16-bit data, Ra, Rrw, Ren, busy) holding the announcement “We’re not home.”; microphone through analog-to-digital converter (ad_ld, ad_buf, 12 bits); digital-to-analog converter (da_ld) to speaker]
82
ROM Example: Digital Telephone Answering
Machine Using a Flash Memory
◼ High-level state machine 4096x16 Flash
83
Blurring of Distinction Between ROM and RAM
◼ We said that
[Figure: spectrum from ROM to RAM, with Flash, EEPROM and NVRAM in between]
❑ RAM is readable and writable
❑ ROM is read-only
◼ But some ROMs act almost like RAMs
❑ EEPROM and Flash are in-system programmable
◼ Essentially means that writes are slow
❑ Also, number of writes may be limited (perhaps a few million times)
◼ And, some RAMs act almost like ROMs
❑ Non-volatile RAMs: Can save their data without the power supply
◼ One type: Built-in battery, may work for up to 10 years
◼ Another type: Includes ROM backup for RAM – controller writes RAM contents to ROM
before turning off
◼ New memory technologies evolving that merge RAM and ROM benefits
❑ e.g., MRAM
◼ Bottom line
❑ Lot of choices available to designer, must find best fit with design goals
84
Hierarchy and Abstraction
◼ Abstraction
❑ Hierarchy often involves not just grouping items into a new item, but also associating higher-level behavior with the new item, known as abstraction
◼ e.g., an 8-bit adder has an understandable high-level behavior – it adds two 8-bit binary numbers
❑ Frees designer from having to remember, or even from having to understand, the lower-level details
[Figure: 8-bit adder block with inputs a7..a0, b7..b0, ci and outputs co, s7..s0]
85
Hierarchy and Composing Larger
Components from Smaller Versions
◼ A common task is to compose smaller components into a larger one
❑ Gates: Suppose you have plenty of 3-input AND gates, but need a 9-input AND gate
◼ Can simply compose the 9-input gate from several 3-input gates
❑ Muxes: Suppose you have 4x1 and 2x1 muxes, but need an 8x1 mux
◼ s2 selects either top or bottom 4x1
◼ s1s0 select particular 4x1 input
◼ Implements 8x1 mux – 8 data inputs, 3 selects, one output
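Both compositions can be written out as small functions to check that they behave like the larger component. The function names are illustrative, and the assignment of inputs i0 to i3 to the "top" 4x1 mux and i4 to i7 to the "bottom" one is an assumption about the figure.

```python
# Sketch: composing larger components from smaller versions.

def and3(a, b, c):
    return a & b & c

def and9(bits):
    """9-input AND built from four 3-input AND gates."""
    return and3(and3(*bits[0:3]), and3(*bits[3:6]), and3(*bits[6:9]))

def mux2(i0, i1, s):
    return i1 if s else i0

def mux4(inputs, s1, s0):
    return inputs[(s1 << 1) | s0]

def mux8(inputs, s2, s1, s0):
    """8x1 mux built from two 4x1 muxes and one 2x1 mux."""
    top = mux4(inputs[0:4], s1, s0)          # s1s0 select a particular 4x1 input
    bottom = mux4(inputs[4:8], s1, s0)
    return mux2(top, bottom, s2)             # s2 selects top or bottom 4x1
```

For any select value k from 0 to 7, `mux8(inputs, (k >> 2) & 1, (k >> 1) & 1, k & 1)` returns `inputs[k]`, which is exactly the behavior of a single 8x1 mux.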
86
Hierarchy and Composing Larger Components
from Smaller Versions
◼ Composing memory very common
◼ Making memory words wider
❑ Easy – just place memories side-by-side until desired width obtained
❑ Share address/control lines, concatenate data lines
❑ Example: Compose 1024x8 ROMs into 1024x32 ROM
[Figure: four 1024x8 ROMs placed side by side. The 10-bit addr and the en input go to all four ROMs; their four 8-bit data outputs are concatenated into data(31..0). Together they form a 1024x32 ROM with addr (10 bits), en and data (32 bits)]
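Widening by placing memories side by side can be sketched as follows. This is an illustrative model only: the helper names are made up, each ROM is represented by a read function over a Python list, and putting the first ROM in the most significant byte is an assumption about the figure.

```python
# Sketch: composing four 1024x8 ROMs into one 1024x32 ROM.

def make_rom(contents):
    """Return a read function for a ROM holding the given list of words."""
    return lambda addr, en=1: contents[addr] if en else "Z"

def rom_1024x32(roms, addr, en=1):
    """roms: four 1024x8 read functions, most significant byte first (assumed order).
    All four share the same addr and en; their data outputs are concatenated."""
    if not en:
        return "Z"
    word = 0
    for rom in roms:
        word = (word << 8) | rom(addr, en)   # append this ROM's 8 bits
    return word

roms = [make_rom([byte] * 1024) for byte in (0x12, 0x34, 0x56, 0x78)]
word_at_7 = rom_1024x32(roms, 7)
```

Every address then reads back 0x12345678, one byte contributed by each of the four narrower ROMs.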
87
Hierarchy and Composing Larger Components from Smaller Versions
◼ Creating memory with more words
❑ Put memories on top of one another until the number of desired words is achieved
❑ Use decoder to select among the memories
◼ Can use highest order address input(s) as decoder input
◼ Although actually, any address line could be used
❑ Example: Compose 1024x8 memories into 2048x8 memory
[Figure: 2048x8 ROM (addr 11 bits, en, data 8 bits) built from two 1024x8 ROMs. Address lines a9..a0 go to both ROMs; a10 drives a 1x2 decoder (enabled by en) whose outputs d0 and d1 enable one ROM each; the ROMs share the 8-bit data output]
a10 a9 a8 ... a0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 1 0
...
0 1 1 1 1 1 1 1 1 1 0
0 1 1 1 1 1 1 1 1 1 1
1 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 1 0
...
1 1 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1 1 1 1
a10 just chooses which memory to access
◼ To create memory with more words and wider words, can first compose to enough words, then widen.
88
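The decoder-based composition can be sketched in the same style. The function name is an assumption, each 1024x8 ROM is modelled as a plain Python list, and a10 = 0 is assumed to select the first ROM.

```python
# Sketch: composing two 1024x8 ROMs into one 2048x8 ROM with a 1x2 decoder.

def rom_2048x8(rom_low, rom_high, addr, en=1):
    """rom_low, rom_high: lists of 1024 byte values. addr: 11-bit address."""
    a10 = (addr >> 10) & 1           # highest-order address bit feeds the decoder
    low_addr = addr & 0x3FF          # a9..a0 go to both ROMs
    d0 = bool(en) and a10 == 0       # decoder outputs: at most one ROM is enabled
    d1 = bool(en) and a10 == 1
    if d0:
        return rom_low[low_addr]
    if d1:
        return rom_high[low_addr]
    return "Z"                       # en = 0: neither ROM drives the data lines

rom_low = [x % 256 for x in range(1024)]
rom_high = [255 - (x % 256) for x in range(1024)]
```

Addresses 0 to 1023 then come from the first ROM and addresses 1024 to 2047 from the second, so a10 just chooses which memory to access.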
Chapter Summary