Digital Design With Implicit State Machines

Fengyun Liu
EPFL, Switzerland
[email protected]

Aleksandar Prokopec
Oracle Labs, Switzerland
[email protected]

Martin Odersky
EPFL, Switzerland
[email protected]
Abstract

Claude Shannon, in his famous thesis (1938), revolutionized circuit design by showing that Boolean
algebra subsumes all the ad-hoc methods that were used in designing switching circuits, or
combinational circuits as they are commonly known today. But what is the calculus for sequential
circuits? Finite-state machines (FSMs) come close, but not quite, as they do not support arbitrary
parallel and hierarchical composition like that of Boolean expressions. We propose an abstraction
called the implicit state machine (ISM) that supports parallel and hierarchical composition. We
formalize the concept and show that any system of parallel and hierarchical ISMs can be flattened
into a single flat FSM without exponential blowup. As one concrete application of implicit state
machines, we show that they serve as an attractive abstraction for digital design and logic synthesis.
1 Introduction

Claude Shannon [26] revolutionized circuit design by showing that Boolean algebra subsumes
all the ad-hoc methods that were used in designing switching circuits, or combinational circuits as
they are commonly known today. In contrast to combinational circuits, which only contain
stateless gates, sequential circuits may also contain stateful elements, like registers. But what
is the calculus for sequential circuits? Finite-state machines (FSMs) are close, but not quite.
A good abstraction in programming should be composable. In a Boolean expression
a ∨ b, the sub-expressions a and b can be arbitrary Boolean expressions. We may also
put two Boolean expressions side by side to achieve parallel composition. Essentially, any
combinational circuit design will eventually result in a Boolean expression, regardless of
whether the design language is VHDL, Verilog, or Chisel [1]. The composability of Boolean
expressions ensures that any combinational circuit can be represented.
If we turn to sequential circuits, which may contain state elements and cycles, what is the
calculus that all sequential circuits can compile to, like Boolean algebra for combinational
circuits? Finite-state machines come close to fulfilling this role, but not quite. Classic FSMs
support neither hierarchical nor parallel composition. The milestone paper
by Benveniste and Berry [2] mentions the lack of support for hierarchical design and
concurrency as a major drawback of FSMs.
Conceptually, we may compose FSMs side by side or in a nested way, which leads to
parallel and hierarchical FSMs. In a hierarchical FSM, the behavior of the outer FSM
depends on that of the inner FSM, and the inner FSM has privileged access to the current
state of the outer FSM. Parallel FSMs run side by side and respond to inputs concurrently.
If one FSM can be in states a and b, and the other can be in states c and d, then their parallel
composition may be in states ac, ad, bc, and bd.

There have been proposals for programming with hierarchical and parallel FSMs [7, 8, 12, 19],
but so far no proposal addresses the two problems below:
While experts in logic verification and synthesis usually work with flat FSMs for their
simplicity and expressiveness, digital designers primarily work with hierarchical FSMs to
decompose the complexity of a system. It is unknown how to support hierarchical and parallel
composition of FSMs in a language and then transform the result into a flat FSM to facilitate
formal verification, such as model checking [5], and optimizations, such as state encoding [10, 30].
Flattening hierarchical and parallel FSMs generally results in an exponential blowup
in the size of their representation; e.g., flattening 32 parallel 2-state FSMs would result in
a flat FSM with 2^32 states. Existing programming models with FSMs require one case for
each state in the code [7, 8, 12, 19]; consequently, the exponential blowup cannot be avoided
in such languages. This creates a gap between a complex system of parallel and hierarchical
FSMs and a flat FSM. Despite their simplicity and mathematical elegance, we still do not know
how to make FSMs a first-class construct for programming, optimization, and verification,
due to the lack of efficient composability and flattening.
To bridge the gap, we propose a novel abstraction, called the implicit state machine (ISM),
that supports arbitrary parallel and hierarchical composition of FSMs. Implicit state machines
do not mandate that states be explicitly specified in the program, which avoids the exponential
blowup when flattening a complex system of FSMs. This flexible composability makes
implicit state machines an elegant first-class programming construct for digital design, and
the avoidance of exponential blowup in flattening makes them an attractive
intermediate language for compilation, optimization, and verification.
From the perspective of circuit design, the flattening keeps the area and the delay, the two
optimization goals of logic synthesis, unchanged. The result implies that any synchronous
sequential circuit is equivalent to a circuit with all state elements at the boundary and a big
combinational core at the center. We conjecture this result will lead to more optimization
opportunities. For example, combinational techniques may now be used to optimize the
whole circuit, whereas previously it was convenient to optimize only combinational fragments
with such techniques. It may also give rise to novel hardware architectures. For
example, FPGAs would no longer need to scatter state elements (e.g., D flip-flops) across their layout.
Our contributions are listed below:

We introduce the concept of implicit state machines, and formalize the concept in a
declarative calculus. Implicit state machines support parallel and hierarchical composition,
and we may optimize and reason about the code by equational reasoning.

We show that any parallel and hierarchical FSMs can be flattened into a flat implicit state
machine in polynomial time and code size. As far as we know, this is the first abstraction
for hierarchical and parallel FSMs that avoids exponential blowup in flattening.

To the best of our knowledge, we are the first to theorize that any synchronous sequential
circuit is equivalent to a circuit with all state elements at the boundary and a big
combinational core at the center, with the same area and delay.

We create an embedded DSL in Scala based on implicit state machines, and our initial
experiments show positive results when implicit state machines are used as a programming
model and an intermediate representation for logic synthesis.
2.1 Introduction

Finite-state machines are widely used in the design and verification of reactive and real-time
systems, which include critical systems that control nuclear plants, airplanes, trains, cars,
etc. As a mathematical model, finite-state machines can precisely and succinctly characterize
the behaviors of such systems, which forms the basis for formally verifying that the systems
work reliably in accordance with their specification.

Mathematically, a finite-state machine is usually represented as a quintuple (I, S, s0, σ, O),
where I is the set of inputs, S the set of states, s0 ∈ S the initial state, σ the transition
function mapping the current input and state to the next state and output, and O the set
of outputs.
An FSM can also be represented graphically by a state-transition diagram, as the following
figure shows:

[State-transition diagram with states q1, q2, and q3, where q1 is the start state; each edge
is labeled with an input/output pair such as 0/1 or 1/0.]
In the state machine above, q1 is the initial state, and each edge denotes a transition:
the label 0/1 on an edge means that the transition happens when the input is 0, and that it
outputs 1 when the transition occurs.
Implicit state machines are based on a reflection on the essence of an FSM: a mapping
from input and state to the next state and output. The first insight towards implicit state
machines is that the mapping function does not have to be represented as a set whose
size correlates with the size of the state space, as is the case in existing languages for
programming with FSMs [12, 8, 7, 19]. In a declarative language, the mapping functionality
can be represented by any expression. This gives us a tentative representation as follows:

λx:I × S. (t1, t2)  :  I × S → S × O
The body (t1, t2) enforces that the output and next state are implemented as two functions.
This imposes unnecessary constraints. If we introduce tuples in the language, we can replace
(t1, t2) just by t:

λx:I × S. t  :  I × S → S × O
The second insight is that the state is neither an input to an FSM nor an output of an
FSM, but a self-reference. This leads us to the following representation with the state variable
s:

λx:I. fsm { s ⇒ t }  :  I → O
In the above, the term t still has the type S × O, but seen from the outside, a state machine
just maps input to output, which corresponds to our intuition.
The last insight is that the inputs do not need to be represented explicitly; they can be
captured from the lexical scope:

fsm { s ⇒ t }  :  O

We are still missing the initial state, so we use the value v to denote the initial state of the FSM:

fsm { v | s ⇒ t }  :  O
Voilà! Suppose we are working in the domain of digital circuits; a one-bit D flip-flop with
an input signal d can be represented as follows:

fsm { 0 | s ⇒ (d, s) }
It takes the value d as the next state, and outputs the last state on every clock. We may
compose several such flip-flops to implement a shift register for a given input d:
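The original listing is omitted in this excerpt; a minimal sketch of what such a composition
could look like in the calculus, assuming a 4-bit shift register built from four chained D
flip-flops, is:

let x1 = fsm { 0 | s ⇒ (d, s) } in
let x2 = fsm { 0 | s ⇒ (x1, s) } in
let x3 = fsm { 0 | s ⇒ (x2, s) } in
fsm { 0 | s ⇒ (x3, s) }

Each stage delays the output of the previous stage by one clock, so the final output is d
delayed by four clocks.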
An equivalent flat FSM that implements the 4-bit shift register is shown below:
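Again, the listing is omitted here; one plausible flat form (our sketch) keeps all four bits in a
single tuple-valued state, shifting d in at one end and outputting the other end:

fsm { (0, 0, 0, 0) | s ⇒ ((d, s.1, s.2, s.3), s.4) }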
Implicit state machines are just expressions, so they may appear anywhere an expression
is allowed. In particular, we may nest them to get another equivalent implementation
of the shift register:
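The nested listing is also omitted in this excerpt; a sketch of the idea, where the output of
each inner FSM becomes the next state of the enclosing one, is:

fsm { 0 | s4 ⇒
  (fsm { 0 | s3 ⇒
    (fsm { 0 | s2 ⇒
      (fsm { 0 | s1 ⇒ (d, s1) }, s2) }, s3) }, s4) }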
t ::=                        terms
    a, b, c                  external input
    x, y, z, s               variables
    let x = t in t           let binding
    β                        Boolean value
    t * t                    1-bit and
    t + t                    1-bit or
    !t                       1-bit not
    (t, ..., t)              tuple
    t.i                      projection
    fsm { v | s ⇒ t }        implicit state machine
Beyond the basic elements of Boolean algebra, we also introduce let-bindings, which are a
basic abstraction and reuse mechanism. Tuples and projections are introduced for parallel
composition and decomposition. In a projection t.i, the index i must be a statically known
number. For implicit state machines, we require that the initial state be a value.

A circuit usually has external inputs, which are represented by the variables a, b, c. By
convention, we use x, y, z for let-bindings, and s for the binding in implicit state machines.
We choose Boolean algebra as the domain theory, but it could also be another mathematical
structure, like a group or an abelian group. Our transform does not assume any properties of
the mathematical structure as long as we may substitute equals for equals [29].
                                                                    (E-Value)
    v −→(σ,ρ) v | ∅

    v = ρ(a)
    ──────────────────────                                          (E-Input)
    a −→(σ,ρ) v | ∅

    t1 −→(σ,ρ) v1 | σ′      [x ↦ v1]t2 −→(σ,ρ) v2 | σ″
    ─────────────────────────────────────────────────────           (E-Let)
    let x = t1 in t2 −→(σ,ρ) v2 | σ′ ∪ σ″

    t1 −→(σ,ρ) v1 | σ1      ...      tn −→(σ,ρ) vn | σn
    ─────────────────────────────────────────────────────           (E-Tuple)
    (t1, ..., tn) −→(σ,ρ) (v1, ..., vn) | σ1 ∪ ⋯ ∪ σn

    t −→(σ,ρ) (v1, ..., vi, ..., vn) | σ′
    ─────────────────────────────────────────                       (E-Project)
    t.i −→(σ,ρ) vi | σ′

    t1 −→(σ,ρ) β1 | σ′      t2 −→(σ,ρ) β2 | σ″      β = and(β1, β2)
    ─────────────────────────────────────────────────────────────── (E-And)
    t1 * t2 −→(σ,ρ) β | σ′ ∪ σ″

    t1 −→(σ,ρ) β1 | σ′      t2 −→(σ,ρ) β2 | σ″      β = or(β1, β2)
    ─────────────────────────────────────────────────────────────── (E-Or)
    t1 + t2 −→(σ,ρ) β | σ′ ∪ σ″

    t −→(σ,ρ) β | σ′      β′ = not(β)
    ─────────────────────────────────────                           (E-Not)
    !t −→(σ,ρ) β′ | σ′

    v = σ(s)      [s ↦ v]t −→(σ,ρ) (v1, v2) | σ′
    ─────────────────────────────────────────────────               (E-Fsm)
    fsm { v0 | s ⇒ t } −→(σ,ρ) v2 | {s ↦ v1} ∪ σ′
E-Or. Similar to the above, but uses the helper function or to compute the resulting value.

E-Not. Similar to the above, but uses the helper function not to compute the resulting value.

E-Fsm. First look up the value of the current state in the state map σ. Then evaluate
the body of the state machine to a pair value (v1, v2). The output is v2, and the next
state is v1.
The reduction relation only defines one-step semantics. The semantics of a system is
defined by its trace for a given input sequence ρ0, ρ1, ⋯. We define it formally below:

▶ Definition 1 (Trace). The trace of a system t with respect to an input sequence ρ0, ρ1, ⋯
is the sequence o0, o1, ⋯ such that

    t −→(σ0,ρ0) o0 | σ1
    ...
    t −→(σi,ρi) oi | σi+1
    ...
    Γ ⊢ t : Bool                        Γ ⊢ t1 : T1      Γ, x:T1 ⊢ t2 : T2
    ──────────────── (T-Not)            ──────────────────────────────────── (T-Let)
    Γ ⊢ !t : Bool                       Γ ⊢ let x = t1 in t2 : T2
In the above, α ranges over inputs a and state variables s, and ξ ranges over the input map
ρ and the state map σ.

The proof follows from the lemma below by induction on the length of the input
sequence:
For the purposes of the transformation, we first define the FSM-free fragment of the
language, which is represented by e. Lifting results in lifted normal form (N), where all
FSMs are nested at the top of the program, with an FSM-free fragment in the middle.

The relation t1 ;L t2 says that the term t1 takes a lifting step to t2. Lifting is defined
with the help of the lifting context L. The lifting context specifies that the transform proceeds
left-to-right and top-down. The actual lifting happens in the function ⟦·⟧, which
transforms the source program to the expected form. We explain the concrete transform
rules below:
fsm { v | s ⇒ e1 } * t2. The FSM absorbs t2 into its body (see the sketch after this list).
The symmetric case and the cases for the other Boolean operators are similar.

let x = fsm { v | s ⇒ e1 } in t2. The FSM pulls the let-binding into its body. The case in
which the FSM is in the body of the let-binding is similar.

fsm { v | s ⇒ e }.i. The FSM pulls the projection into its body.

(ē, fsm { v | s ⇒ e }, t̄). The FSM pulls the tuple into its body.
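As an illustration, the absorption for AND could take the following shape (our sketch under
the stated typing, not the paper's exact rule; the let-binding names the body's pair so that
its next-state and output components can be reused):

fsm { v | s ⇒ e1 } * t2   ;L   fsm { v | s ⇒ let x = e1 in (x.1, x.2 * t2) }

Here x.1 is the next state of the original body, and x.2 is its output, which is now combined
with t2 inside the machine.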
Once all FSMs are nested at the top level after lifting, flattening takes place. The relation
t1 ;F t2 says that the term t1 takes a flattening step to t2. Flattening is defined with
the help of the flattening context F. The flattening context specifies that flattening
proceeds from the inside out. The actual merging step is quite straightforward: it
combines the initial states v1 and v2, and merges the state variables s1 and s2 into a single s.

We use the notation t1 ; t2 to mean that t1 takes either a lifting step (;L) or a
flattening step (;F) to t2. We write t1 ;* t2 to mean zero or more such transform steps.
For simplicity of presentation, we omit the formal definitions.
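To give an idea of the merging step, one plausible rule for two nested FSMs (our sketch;
the formal definition is omitted) is:

fsm { v1 | s1 ⇒ fsm { v2 | s2 ⇒ e } }
    ;F   fsm { (v1, v2) | s ⇒ let x = [s1 ↦ s.1, s2 ↦ s.2]e in ((x.2.1, x.1), x.2.2) }

Here e evaluates to (inner next state, (outer next state, output)), so the merged machine
packs both states into one tuple and reshuffles the components accordingly.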
FSM-free fragment:      e
Lifted normal form:     N ::= e | fsm { v | s ⇒ N }

Lifting:
    ⟦t⟧ = fsm { v | s ⇒ t′ }
    ─────────────────────────────────────
    L[t] ;L L[fsm { v | s ⇒ t′ }]

Flattening:
    ⟦N⟧ = fsm { v | s ⇒ e }
    ─────────────────────────────────────
    F[N] ;F F[fsm { v | s ⇒ e }]
▶ Theorem 4 (Complexity). If the term t contains FSMs, then there exists e such that
t ;* fsm { v | s ⇒ e } in O(m ∗ n) steps, where m is the size of the term t and n is the
number of state machines in the code.

Sketch. During lifting, each step moves some code that pre-exists in t inside another FSM.
Thus, the worst case is O(m ∗ n). During flattening, each step eliminates one FSM, so
flattening takes n steps. Therefore, the complexity is O(m ∗ n). ◀
A tighter bound is O(d ∗ n), where d is the maximum depth of an FSM from the root (if we
view a term t as an abstract syntax tree) and n is the number of FSMs. However, since lifting
introduces let-bindings, which change the height of the tree, it is technically more complex
to establish this bound, so we leave it to future work.

Meanwhile, the complexity also bounds the resulting code size after
flattening: each lifting and flattening step increases the code size by a small constant
(usually an additional let-binding and tuple), so the code-size increase is also bounded by O(m ∗ n).
▶ Corollary 5 (Code Size). If the term t contains FSMs, and there exists e such that
t ;* fsm { v | s ⇒ e }, then the code-size increase of e compared to t is bounded by
O(m ∗ n), where m is the size of the term t and n is the number of state machines in the
code.
▶ Theorem 6 (Semantics Preservation). If t ; t′, then t and t′ have the same trace for any
given input sequence ρ0, ρ1, ⋯.
This follows from the lemmas below by induction on the length of the trace:

▶ Lemma 7. If t ;L t′ and t −→(σ,ρ) v | σ1, then t′ −→(σ,ρ) v | σ1.

Sketch. First perform induction on the lifting contexts, then perform case analysis on the
concrete transform rules. ◀
Sketch. Perform induction on the flattening contexts. Note that for the initial states σ0 and
σ′0 specified in N and N′ respectively, f(σ0) = σ′0 holds trivially. ◀
In the above, we use the when construct to define one transition for each state. We implement
when as syntactic sugar in our DSL and use it to decode controller instructions (Section 4).
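Since when is sugar over the built-in multiplexer (Section 4), a chain of the shape sketched
below, with hypothetical conditions c1, c2 and branches t1–t3, would elaborate to nested mux
selections:

when (c1) { t1 } .when (c2) { t2 } otherwise { t3 }   ≡   mux(c1, t1, mux(c2, t2, t3))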
Note that outside the setting of formal verification and the theory of computation, the term
finite-state machine is sometimes used in programming to loosely mean any machine that has
a finite set of states. In the rest of the paper, when there is no danger of misunderstanding,
we use the term FSM in this loose sense.
Most languages take the last assignment as effective in the case of double assignment.
That such code is supported at all is a little counter-intuitive, as all registers are refreshed
exactly once on each clock tick in synchronous digital circuits. Worse, double
assignments may be programmer mistakes, which the compiler is helpless to
address.
Such problems are inherent in imperative programming with state. However, a stateful
computation does not need to be in imperative style. The synchronous dataflow model of
Lustre [6] and Signal [4] is one piece of evidence for this. Yet it was unknown how to make
programming with FSMs declarative, as they are stateful computations by nature, and past
proposals for programming with FSMs are all in imperative style [19, 8]. With implicit state
machines, we show how to program with FSMs in declarative style.
It is reported that dataflow programming is a good fit for dataflow-dominated applications,
while FSM-based imperative programming is more suitable for control-dominated
applications [3, 8]. The FSM extension to Lustre [8] comes from the need to support both
styles in the same language, in which FSMs desugar to a core dataflow calculus. Our calculus
of implicit state machines can be seen as another synergy of dataflow programming and
imperative programming. The expression-oriented nature of the calculus makes dataflow
programming easy. Meanwhile, an implicit state machine with an explicit case for each state
is a good fit for control-dominated applications.
However, an explicit representation as a truth table would take several lines:

s   d   s′   Q
0   0   0    0
0   1   1    0
1   0   0    1
1   1   1    1
The D flip-flop is so simple that digital designers seldom think of it as an FSM when
programming. Programming with FSMs in Verilog and VHDL is just a design methodology;
with implicit state machines, it becomes a reality.
Figure 4 FSM composition. (A) An FSM in a circuit, where the combinational logic is acyclic. (B)
The connection of two FSMs results in a combinational cycle. (C) The connection does not result
in combinational cycles, as the feedback to the upper FSM only goes to the state element, which
breaks the loop.
Most logic synthesis tools require circuits without combinational cycles as input. In our
calculus, there are no combinational cycles by construction. To compose two FSMs as in
Figure 4B, a digital designer has to write the following code:
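The paper's listing is omitted in this excerpt; a sketch of the shape such code might take
(with placeholder initial values v1–v3 and bodies t1, t2) is:

fsm { v3 | s3 ⇒
  let o1 = fsm { v1 | s1 ⇒ t1 } in
  let o2 = fsm { v2 | s2 ⇒ t2 } in
  (o2, o1) }

Here t1 reads the shared state s3 in place of the direct feedback from o2, while t2 may read
o1 directly; the next value of s3 is o2, so the feedback passes through a state element rather
than a combinational path.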
In the code above, another FSM is created with the state name s3, which holds the shared
state that decouples the combinational loop.
In the case where the connection, as in Figure 4C, does not result in combinational cycles, i.e.,
the feedback only goes into the state elements but not into the output, there is no need to
create an additional FSM:
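The listing is again omitted here; a sketch consistent with the description below (with
placeholder bodies t1–t3) is:

fsm { v1 | s1 ⇒
  let o1 = t1 in
  let o2 = fsm { v2 | s2 ⇒ t2 } in
  (t3, o1) }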
In the above, the next state and output of the inner FSM, i.e., t2, may depend on
o1. Meanwhile, the next state of the outer FSM, i.e., t3, may depend on o2. The code is
guaranteed to be acyclic by construction.
[Figure: sub-circuit C1 (delay 3) feeds two registers; the register outputs go through an AND
gate (delay 1) into sub-circuit C2 (delay 5).]
The circuit above shows two outputs of the sub-circuit C1 going to two different registers.
The outputs of the two registers go to an AND gate, whose output in turn goes to the
sub-circuit C2. The critical path of the circuit has delay 6. The critical path is the path
in the circuit with the maximum delay between an input signal or a register read and
an output signal or a register write. The clock period of a synchronous circuit has to be
larger than the delay of the critical path.
Using retiming, we can push the two registers after the AND gate, which results in the
following network:

[Figure: sub-circuit C1 (delay 3) feeds the AND gate (delay 1) directly, with a single register
between the AND gate and sub-circuit C2 (delay 5).]
Now the critical path of the circuit has a delay of 5 instead of 6, and it saves one register.
If we represent the circuit C1 by the term t1, and the circuit C2 by the term t2, then the
circuit before the retiming optimization can be expressed as follows:

let x = t1 in
let y = fsm { (0, 0) | s =>
  (x, s.1 & s.2)
} in t2
In the above, x represents the two output signals of the circuit C1, and the input signal
to the circuit C2 is represented by the variable y. The circuit after the retiming optimization
can be expressed as follows:

let x = t1 in
let y = fsm { 0 | s =>
  (x.1 & x.2, s)
} in t2
If the AND gate in the original circuit were an XOR gate, then we would also need to change
the initial state of the transformed FSM above.
Seen from another perspective, retiming transforms are just applications of the laws of implicit
state machines. In addition to the transformations presented in lifting and flattening, the
following transformations may also serve as laws because they are semantics-preserving:

The essence of retiming is succinctly expressed by the last rule, except for a subtlety about
the initial state: it requires that t2 evaluate to a value v′ given the initial states σ0 of
all FSMs in the program. The empty environment enforces that t2 may not depend on
external inputs; otherwise, we do not see how to preserve semantics in the transform.
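The rules themselves are elided in this excerpt. Judging from the retiming example above,
the last law might take roughly the following shape (our reconstruction, not the paper's exact
statement), where t2[·] is a one-hole context that reads no external inputs:

t2[ fsm { v | s ⇒ (t1, s) } ]   ≡   fsm { v′ | s ⇒ (t2[t1], s) }    if t2[v] evaluates to v′
                                                                     under σ0 and the empty environment

In the example above, t2[·] is the AND gate ·.1 & ·.2, the initial state is v = (0, 0), and
v′ = 0 & 0 = 0.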
In the code above, the type Signal[Bit] means that a is a signal of 1 bit. The type
Signal[Vec[2]] means a signal of width 2. Here we take advantage of literal types in Scala,
which support using a literal constant as a type. The type Bit means the same as
Vec[1]:

type Bit = Vec[1]
The DSL supports common bit-wise operations like XOR (^), AND (&), OR (|), ADD
(+), SUB (-), SHIFT (<< and >>), and MUX (if/then/else). The operator ++ concatenates two bit
vectors to form a bigger bit vector. All these operations are supported in Verilog [28], and
they follow the same semantics as in Verilog.
We may compose two half adders to create a full adder, which takes a carry cin as input:

def full(a: Signal[Bit], b: Signal[Bit], cin: Signal[Bit]): Signal[Vec[2]] = {
  val ab = halfAdder(a, b)
  val s = halfAdder(ab(0), cin)
  val cout = ab(1) | s(1)
  cout ++ s(0)
}
In the above, we make two calls to halfAdder. Each call creates a copy of the half
adder circuit to be composed in the full adder. It returns the carry and the sum. We may
compose full adders further to create a 2-bit adder:
def adder2(a: Signal[Vec[2]], b: Signal[Vec[2]]): Signal[Vec[3]] = {
  val cs0 = full(a(0), b(0), 0)
  val cs1 = full(a(1), b(1), cs0(1))
  cs1(1) ++ cs1(0) ++ cs0(0)
}
To actually generate a representation of the circuit, we need to specify the input signals:

val a = variable[Vec[2]]("a")
val b = variable[Vec[2]]("b")
val circuit = adder2(a, b)
For testing purposes, we can call the interpreter to get the result for a specific input:

val add2 = circuit.eval(a, b)
val Value(c1, s1, s0) = add2(Value(1, 0) :: Value(0, 1) :: Nil)
assertEquals(c1, 0)
assertEquals(s1, 1)
assertEquals(s0, 1)
You might be wondering: what about a generic adder that generates circuits for a given
width? This can be implemented with a recursion on the number of bits:

 1  def adderN[N <: Num](lhs: Signal[Vec[N]], rhs: Signal[Vec[N]])
 2      : Signal[Bit ~ Vec[N]] = {
 3    val n: Int = lhs.size
 4    def recur(index: Int, cin: Signal[Bit], acc: Signal[Vec[_]]) =
 5      if (index >= n) cin ~ acc.as[Vec[N]]
 6      else {
 7        val cs: Signal[Vec[2]] = full(lhs(index), rhs(index), cin)
 8        recur(index + 1, cs(1), (cs(0) ++ acc.as[Vec[N]]).asInstanceOf)
 9      }
10
12  }
In the code above, the type Signal[Bit ~ Vec[N]] means a signal that is a pair: the left
part is one bit, and the right part is a bit vector of length N. To construct a signal of such a
type, we just connect two signals with ~, as is done at line 5. At line 8, we use several type
casts, due to the fact that Scala currently does not support arithmetic operations at the type
level.
For the input Xi, the output Yi also depends on the previous values Xi−1 and Xi−2 (the
circuit below computes Yi = (Xi + 2·Xi−1 + Xi−2) / 4). The FSM that delays a given signal
by one clock can be implemented as follows:

def delay[T <: Type](sig: Signal[T], init: Value): Signal[T] =
  fsm("delay", init) { (last: Signal[T]) =>
    sig ~ last
  }
In the code above, we declare an implicit state machine with the specified initial state
init. The body of the FSM is the pair sig ~ last, where the first part becomes the next state
and the second part becomes the output. This is exactly the D flip-flop.
541 Now we may create the circuit for the moving average:
542
543 1 def movingAverage(in: Signal[Vec[8]]): Signal[Vec[8]] = {
544 2 let(delay(in, 0.toValue(8))) { z1 =>
545 3 let(delay(z1, 0.toValue(8))) { z2 =>
546 4 (z2 + (z1 << 1) + in) >> 2.W[2]
547 5 }
548 6 }
549
550
7 }
In the code above, we first create an instance of the delay circuit and bind it to the
variable z1. Then we delay the signal z1 and bind the result to z2. Finally, the computation is
expressed on bit vectors.
Note that it is tempting to implement the same circuit without using the let-bindings:

def movingAverage(in: Signal[Vec[8]]): Signal[Vec[8]] = {
  val z1 = delay(in, 0.toValue(8))
  val z2 = delay(z1, 0.toValue(8))
  (z2 + (z1 << 1) + in) >> 2.W[2]
}
This circuit, though it functions the same, needs more gates to implement. The reason is
that, in our DSL, the variable definition z1 represents the D flip-flop circuit (not the signal),
and each usage of the variable z1 creates a copy of the circuit. Since it is used twice, the
circuit is duplicated. The way to avoid duplication is to use let-bindings, which serve the
same role as wires: a bound variable may be used multiple times, just like a wire
may forward the same signal to multiple gates.
The adder example in the previous section also suffers from this problem. However, to
our surprise, the version without let-bindings is optimized better by synthesis tools in our
testing. This problem is common in meta-programming, i.e., writing a program that generates
another program (possibly in another language). We believe linear type systems might be
useful in such settings to ensure that method call results are used linearly: as a method
usually synthesizes some piece of code, duplicate usage or no usage is usually a mistake.
Meanwhile, method arguments should be non-linear, i.e., they may be used multiple times.
As expected, a lot of unnecessary let-bindings are introduced, and the flattening of FSMs
introduces several more. To eliminate such bindings, we first transform the
code into A-normal form (ANF), then perform detupling, which reduces pairs to bit vectors,
and finally inline trivial let-bindings. In the end, we get the following compact code:

fsm { 0 | state =>
  a ++ state(15..8) ++ ((state(7..0) + (state(15..8) << 1) + a) >> 2)
}
Eventually, the generated Verilog code looks like the following:

module Filter (CLK, a, out);
  input CLK;
  input [7:0] a;
  output [7:0] out;
  wire [7:0] out;
  reg [15:0] state;
In the Verilog code above, only the following line updates the state of the FSM; the other
lines compute the next state and output:

always @ (posedge CLK)
  state <= { a, state[15:8] };
endmodule
This is typical of the code generated by our DSL compiler: all the code is combinational
except one line, no matter how complex the circuit is. Is the generated Verilog efficient? Out
of curiosity, we implemented the moving average filter in Chisel:
class MovingAverage3 extends Module {
  val io = IO(new Bundle {
    val in = Input(UInt(8.W))
    val out = Output(UInt(8.W))
  })
  val z1 = RegNext(io.in)
  val z2 = RegNext(z1)
  io.out := (io.in + (z1 << 1.U) + z2) >> 2.U
}
Chisel generates the following Verilog code after removing comments and the reset input:

module MovingAverage3(
  input clock,
  input [7:0] io_in,
  output [7:0] io_out
);
  reg [7:0] z1;
  reg [7:0] z2;
  wire [8:0] _GEN_0;
  wire [8:0] _T_12;
  wire [8:0] _GEN_1;
  wire [9:0] _T_13;
  wire [8:0] _T_14;
  wire [8:0] _GEN_2;
  wire [9:0] _T_15;
  wire [8:0] _T_16;
  wire [8:0] _T_18;
  assign _GEN_0 = {{1'd0}, z1};
  assign _T_12 = _GEN_0 << 1'h1;
  assign _GEN_1 = {{1'd0}, io_in};
  assign _T_13 = _GEN_1 + _T_12;
  assign _T_14 = _GEN_1 + _T_12;
  assign _GEN_2 = {{1'd0}, z2};
  assign _T_15 = _T_14 + _GEN_2;
  assign _T_16 = _T_14 + _GEN_2;
  assign _T_18 = _T_16 >> 2'h2;
  assign io_out = _T_18[7:0];
  always @(posedge clock) begin
    z1 <= io_in;
    z2 <= z1;
  end
endmodule
Now we run the synthesis tool Yosys (https://github.com/YosysHQ/yosys) on both files and
get the following result:

[Table comparing the two designs by wires, wire bits, public wires, public wire bits, and cells.]

For all columns, lower is better. The most important is the last column, cells, which gives
the number of gates required to implement the circuit. The column wires gives the total
number of wires in the synthesized design; the column wire bits gives the total number of
wires in bits, as wires may be wider than 1 bit. The column public wires counts the wires
that exist in the original design, i.e., those not created by Yosys; the column public wire bits
is similar.
The difference between the first two lines comes from the fact that Chisel handles << by
incrementing the width of the result, which increases wires and gates. Our DSL follows the
semantics of Verilog, i.e., it keeps the result the same width as the shifted bit vector. After
correcting for the semantics of <<, Chisel uses the same number of gates as our DSL, and
our DSL still performs better on wire bits. This shows that, at least for simple circuits, our
DSL compiler generates circuits on par with an industry-grade DSL.
NOP, ADD, ADDI, SUB, SUBI, SHL, SHR, LD, LDI, ST, AND, ANDI, OR, ORI,
XOR, XORI, BR, BRZ, BRNZ, EXIT
The controller interfaces with a bus, which makes the requested data available on the bus in
the next clock cycle:

type BusOut = Vec[8] ~ Bit ~ Bit ~ Vec[32] // addr ~ read ~ write ~ writedata
type BusIn  = Vec[32]                      // read data
It takes a program prog to store in an on-chip instruction memory, which is different from
the external memory connected via the bus. Note that the output type is BusOut ~ Debug,
where we add Debug for testing purposes:
type Debug = Vec[32] ~ Vec[_] ~ Vec[16] ~ Bit // acc ~ pc ~ instr ~ exit
Note that the width of the program counter PC is unspecified, because it depends on the
size of the given program. If the program size is 62, then the width is 6.
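A possible way to compute this width (our sketch; the paper only says that addrWidth is
computed from the program size) is to take the bit length of the largest address:

// Hypothetical helper: the number of bits needed to address progSize instructions.
// For progSize = 62, the largest address is 61, whose bit length is 6.
def addrWidthFor(progSize: Int): Int =
  math.max(1, 32 - Integer.numberOfLeadingZeros(progSize - 1))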
At a high level, the microcontroller is an FSM which contains three architectural states:

fsm("processor", pc0 ~ acc0 ~ pending0) { (state: Signal[PC ~ ACC ~ INSTR]) =>
  val pc ~ acc ~ pendingInstr = state
}
The variable pc refers to the program counter, acc is the accumulator register, and pendingInstr
is the instruction from the last cycle waiting for data from the external memory. The types
ACC and INSTR are aliases of Vec[32] and Vec[16] respectively. The type PC is an alias of
Vec[addrWidth.type], where addrWidth is a local variable computed from the program size.
The skeleton of the implementation is as follows:
let("pcNext", pc + 1.W[addrWidth.type]) { pcNext =>
let("instr", instrMemory(addrWidth, prog, pc)) { instr =>
let("stage2Acc", stage2(pendingInstr, acc, busIn)) { acc =>
  when (opcode === ADDI.W[8]) {
    val acc2 = acc + operand
    next(acc = acc2)
  } /* ... */ } }
It first increments the program counter pc and binds the result to pcNext. Then it binds
the current instruction to instr. Next, it gets the updated value of the accumulator from
the pending instruction. At the circuit level, the three operations are executed in parallel.
Finally, the instruction is decoded and executed in a series of when constructs. The when
construct is syntactic sugar created from the built-in multiplexer, which supports selecting
one of two n-bit inputs by a single-bit control. Eventually, each branch calls the local method
next with appropriate arguments:
def next(
    pc: Signal[PC] = pcNext,
    acc: Signal[ACC] = acc,
    pendingInstr: Signal[INSTR] = 0.W[16],
    out: Signal[BusOut] = defaultBusOut,
    exit: Boolean = false
): Signal[(PC ~ ACC ~ INSTR) ~ (BusOut ~ Debug)] = {
  val debug = acc ~ (pc.as[Vec[_]]) ~ instr ~ exit
  (pc ~ acc ~ pendingInstr) ~ (out ~ debug)
}
As can be seen above, the method next defines default values for all parameters, so
that each branch only needs to specify the arguments that differ. For example, the following
is the code for the unconditional jump BR and the indirect addition ADD:
} .when (opcode === BR.W[8]) {
  next(pc = jmpAddr)
} .when (opcode === ADD.W[8]) {
  next(out = loadBusOut, pendingInstr = instr)
}
The implementation of the method stage2 just checks the pending instruction and
computes the updated accumulator value from the bus input. If the pending instruction is
NOP, it simply returns the current value of the accumulator.
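The listing of stage2 is omitted in the paper; a sketch of what it might look like (our
assumption, with a hypothetical helper opcodeOf extracting the opcode byte from the pending
instruction) is:

def stage2(pendingInstr: Signal[INSTR], acc: Signal[ACC],
           busIn: Signal[BusIn]): Signal[ACC] =
  when[Vec[32]] (opcodeOf(pendingInstr) === ADD.W[8]) {
    acc + busIn   // the word requested by ADD arrives on the bus this cycle
  } .when (opcodeOf(pendingInstr) === LD.W[8]) {
    busIn         // indirect load: the accumulator takes the bus data
  } otherwise {
    acc           // NOP (or any non-memory instruction): keep the accumulator
  }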
The on-chip instruction memory is implemented by generating nested conditional expressions.
Each condition tests whether the input address is equal to a memory address; if so,
the instruction at that address is returned in the same clock cycle (the conditionals are
combinational circuits):
def instrMemory(addrWidth: Int, prog: Array[Int],
                addr: Signal[Vec[addrWidth.type]]): Signal[Vec[16]] = {
  val default: Signal[Vec[16]] = 0.W[16]
  (0 until (1 << addrWidth)).foldLeft(default) { (acc, curAddr) =>
    when[Vec[16]] (addr === curAddr.W[addrWidth.type]) {
      if (curAddr < prog.size) prog(curAddr).W[16]
      else default
    } otherwise {
      acc
    }
  }
}
We test the implementation with small assembly programs. Despite the allure of
successfully running simple assembly programs, we are aware that the microcontroller is still too
simple and may not match quality standards. Our next goal is to implement RISC-V cores
and compare them with state-of-the-art open-source implementations by standard metrics.
5 Related Work

Statecharts [12] is a visual formalism which supports hierarchical states and orthogonal states.
Its formal semantics is subtle, and was only given several years after its first introduction
[14, 24, 11, 13]. Hierarchical states do not automatically give rise to the hierarchical FSMs
required for hierarchical module composition in circuit design. In a sense, hierarchical states
and hierarchical FSMs are two orthogonal concepts, as hierarchical FSMs do not imply
hierarchical states either. Implicit state machines do not support hierarchical states natively,
but such an extension is conceptually possible, though what it should look like and
whether it is useful in digital design is open to debate. Implicit state machines simply
do not mandate one separate case for each state in the program, but do not forbid such cases,
hierarchical or not.
An extension with hierarchical FSMs [8] was experimented with in Lucid Synchrone [7] and
integrated into the declarative dataflow language Lustre [6]. The extension is in imperative
style, and it desugars to a core dataflow calculus. Since the state machines need to define a
transition for each state separately, their code representation suffers from exponential blowup
after flattening.
Caisson [19] is an imperative language for digital design which supports nested states and
parameterized states. The language contains both registers and FSMs as primitive constructs.
In contrast, our approach is more fundamental in that it makes implicit state machines
the only primitive construct.
Malik [23] proposed the use of combinational techniques to optimize sequential
circuits by pushing registers to the boundary of the circuit network and cutting loops when
needed. The approach is based on a technique called retiming [17], which changes the timing
behavior of the circuit by moving registers around in the circuit network. We achieve the
same goal without changing the timing behavior of the circuit. The retiming optimization can
be expressed on top of implicit state machines.
6 Conclusion

It is well known that Boolean algebra is the calculus for combinational circuits. In this paper,
we propose implicit state machines as the calculus for sequential circuits. Implicit state
machines do not mandate one separate case for each state in the specification of an FSM.
Compared to classic FSMs, implicit state machines support arbitrary parallel and hierarchical
composition, which is crucial for real-world programming.
Compared to explicit state machines, which require one separate case for each state, implicit
state machines enjoy a nice property: any system of parallel and hierarchical implicit state
machines may be flattened to a single implicit state machine without exponential blowup. For
digital circuits, this means that any sequential circuit can be transformed into an equivalent
circuit with state elements at the boundary and a big combinational core in the center. This
creates more optimization opportunities for digital circuits, and logic synthesis experts no
longer need to worry about combinational boundaries.
There are two directions for future work. First, implicit state machines, due to their
composability, will make integrated and compositional specification of complex systems
easier. Meanwhile, flattening may also flatten the specifications, which can then be fed into
off-the-shelf verification tools together with the flattened FSMs. In this sense, implicit state
machines bridge the gap between complex systems and verification tools.
Second, implicit state machines may lead to new hardware architectures. For example,
in FPGA architectures, state elements are currently scattered across the chip to support
different kinds of sequential circuits. This architecture is still not flexible enough, and it
wastes resources when the distribution of the state elements diverges too much from the circuit
to be implemented on the FPGA chip. A possibility is to centralize all state elements, as any
circuit is equivalent to a circuit with state elements at the boundary and a combinational
core, with the same delay and area.
References

1   Jonathan Bachrach, Huy Vo, Brian C. Richards, Yunsup Lee, Andrew Waterman, Rimas
    Avizienis, John Wawrzynek, and Krste Asanovic. Chisel: Constructing hardware in a Scala
    embedded language. In DAC Design Automation Conference 2012, pages 1212–1221, 2012.
2   A. Benveniste and G. Berry. The synchronous approach to reactive and real-time systems.
    Proceedings of the IEEE, 79(9), September 1991. doi:10.1109/5.97297.
3   A. Benveniste, P. Caspi, S. A. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone. The
    synchronous languages 12 years later. Proceedings of the IEEE, 91(1), January 2003.
    doi:10.1109/JPROC.2002.805826.
4   Albert Benveniste, Paul Le Guernic, and Christian Jacquemot. Synchronous programming
    with events and relations: the SIGNAL language and its semantics. Science of Computer
    Programming, 16(2), September 1991. doi:10.1016/0167-6423(91)90001-E.
5   Jerry R. Burch, Edmund M. Clarke, Kenneth L. McMillan, David L. Dill, and Lain-Jinn Hwang.
    Symbolic model checking: 10^20 states and beyond. Information and Computation, 98(2), 1992.
6   P. Caspi, D. Pilaud, N. Halbwachs, and J. A. Plaice. LUSTRE: A declarative language for
    real-time programming. In Proceedings of the 14th ACM SIGACT-SIGPLAN Symposium
    on Principles of Programming Languages, POPL '87, New York, NY, USA, 1987. ACM.
    doi:10.1145/41625.41641.
7   Paul Caspi, Grégoire Hamon, and Marc Pouzet. Synchronous functional programming: The
    Lucid Synchrone experiment. 2008.
8   Jean-Louis Colaço, Bruno Pagano, and Marc Pouzet. A conservative extension of synchronous
    data-flow with state machines. In Proceedings of the 5th ACM International Conference on
    Embedded Software (EMSOFT '05), Jersey City, NJ, USA, 2005. ACM Press.
    doi:10.1145/1086228.1086261.
9   G. De Micheli. Synchronous logic synthesis: algorithms for cycle-time minimization. IEEE
    Transactions on Computer-Aided Design of Integrated Circuits and Systems, 10(1), January
    1991. doi:10.1109/43.62792.
10  Giovanni De Micheli, Robert K. Brayton, and Alberto Sangiovanni-Vincentelli. Optimal
    state assignment for finite state machines. IEEE Transactions on Computer-Aided Design of
    Integrated Circuits and Systems, 4(3):269–285, 1985.
11  Willem-Paul de Roever, Gerald Lüttgen, and Michael Mendler. What is in a step: New
    perspectives on a classical question. In Zohar Manna and Doron A. Peled, editors, Time for
    Verification, volume 6200. Springer, Berlin, Heidelberg, 2010. doi:10.1007/978-3-642-13754-9_15.
12  David Harel. Statecharts: a visual formalism for complex systems. Science of Computer
    Programming, 8(3), June 1987. doi:10.1016/0167-6423(87)90035-9.
13  David Harel and Hillel Kugler. The Rhapsody semantics of statecharts (or, on the executable
    core of the UML). In Integration of Software Specification Techniques for Applications in
    Engineering, volume 3147. Springer, Berlin, Heidelberg, 2004. doi:10.1007/978-3-540-27863-4_19.
14  David Harel and Amnon Naamad. The STATEMATE semantics of statecharts. ACM
    Transactions on Software Engineering and Methodology, 5(4), October 1996.
    doi:10.1145/235321.235322.
15  A. Izraelevitz, J. Koenig, P. Li, R. Lin, A. Wang, A. Magyar, D. Kim, C. Schmidt, C. Markley,
    J. Lawson, and J. Bachrach. Reusability is FIRRTL ground: Hardware construction languages,
    compiler frameworks, and transformations. In 2017 IEEE/ACM International Conference on
    Computer-Aided Design (ICCAD), pages 209–216, November 2017. doi:10.1109/ICCAD.2017.8203780.
16  M. Keating. The Simple Art of SoC Design. 2011.
17  Charles E. Leiserson and James B. Saxe. Retiming synchronous circuitry. Algorithmica,
    6(1-6), June 1991. doi:10.1007/BF01759032.
18  Patrick S. Li, Adam M. Izraelevitz, and Jonathan Bachrach. Specification for the FIRRTL
    language. Technical Report UCB/EECS-2016-9, EECS Department, University of California,
    Berkeley, February 2016. URL: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-9.html.
19  Xun Li, Mohit Tiwari, Jason K. Oberg, Vineeth Kashyap, Frederic T. Chong, Timothy
    Sherwood, and Ben Hardekopf. Caisson: A hardware description language for secure
    information flow.
20  Dan Luu. Verilog is weird. https://danluu.com/why-hardware-development-is-hard/.
    Accessed: 2019-12-24.
21  Dan Luu. Writing safe Verilog. https://danluu.com/pl-troll/. Accessed: 2019-12-24.
22  S. Malik. Analysis of cyclic combinational circuits. In Proceedings of the 1993 International
    Conference on Computer Aided Design (ICCAD), pages 618–625, 1993.
23  Sharad Malik, Ellen M. Sentovich, and Robert K. Brayton. Retiming and resynthesis:
    Optimizing sequential networks with combinational techniques.
24  A. Pnueli and M. Shalev. What is in a step: On the semantics of statecharts. In Takayasu Ito
    and Albert R. Meyer, editors, Theoretical Aspects of Computer Software, Lecture Notes in
    Computer Science, Berlin, Heidelberg, 1991. Springer. doi:10.1007/3-540-54415-1_49.
25  Daniel Sanchez. Minispec reference guide. https://6004.mit.edu/web/_static/fall19/
    resources/references/minispec_reference.pdf, 2019. Accessed: 2019-12-24.
26  Claude E. Shannon. A symbolic analysis of relay and switching circuits. Transactions of the
    American Institute of Electrical Engineers, 57:713–723, 1938.
27  T. R. Shiple, V. Singhal, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. Analysis of
    combinational cycles in sequential circuits. In 1996 IEEE International Symposium on Circuits
    and Systems (ISCAS 96), volume 4, Atlanta, GA, USA, 1996. IEEE. doi:10.1109/ISCAS.1996.542093.
28  IEEE Computer Society. IEEE Standard for Verilog Hardware Description Language. IEEE,
    2005.
29  Harald Søndergaard and Peter Sestoft. Referential transparency, definiteness and unfoldability.
    Acta Informatica, 27:505–517, 1990.
30  Lin Yuan, Gang Qu, Tiziano Villa, and Alberto Sangiovanni-Vincentelli. An FSM reengineering
    approach to sequential circuit synthesis by state splitting. IEEE Transactions on Computer-Aided
    Design of Integrated Circuits and Systems, 27(6):1159–1164, 2008.