Essential Issues in Codesign
Daniel D. Gajski
Jianwen Zhu
Rainer Dömer

Technical Report ICS-97-26


June, 1997
Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425, USA
(714) 824-8059
[email protected]
[email protected]
[email protected]

Abstract
In this report we discuss the main models of computation, the basic types of architectures, and language features needed to specify systems. We also give an overview of a generic methodology for designing systems that include software and hardware parts, from executable specifications.
Contents

1 Models
  1.1 Model and architecture definition
  1.2 Model taxonomy
  1.3 Finite-state machine
  1.4 Finite-state machine with datapath
  1.5 Petri net
  1.6 Hierarchical concurrent finite-state machine
  1.7 Programming languages
  1.8 Program-state machine
2 Architectures
  2.1 Controller architecture
  2.2 Custom Datapath architecture
  2.3 FSMD architecture
  2.4 CISC architecture
  2.5 RISC architecture
  2.6 VLIW architecture
  2.7 Parallel architecture
3 Languages
  3.1 Introduction
  3.2 Characteristics of system models
  3.3 Concurrency
  3.4 State transitions
  3.5 Hierarchy
  3.6 Programming constructs
  3.7 Behavioral completion
  3.8 Exception handling
  3.9 Timing
  3.10 Communication
  3.11 Process synchronization
  3.12 SpecC+ Language description
4 Generic codesign methodology
  4.1 System specification
  4.2 Allocation
  4.3 Partitioning and the model after partitioning
  4.4 Scheduling and the scheduled model
  4.5 Communication synthesis and the communication model
  4.6 Analysis and validation flow
  4.7 Backend
5 Conclusion and Future Work
6 Index

List of Figures

1 Conceptual views of an elevator controller
2 Implementation architectures
3 FSM model for the elevator controller. (y)
4 State-based FSM model for the elevator controller. (y)
5 FSMD model for the elevator controller. (y)
6 A Petri net example. (y)
7 Petri net representations
8 Statecharts: hierarchical concurrent states. (y)
9 An example of program-state machine. (y)
10 A generic controller design
11 An example of a custom datapath. (z)
12 Simple datapath with one accumulator
13 Two different datapaths for FIR filter
14 Design model
15 CISC with microprogrammed control. (y)
16 RISC with hardwired control. (y)
17 An example of VLIW datapath. (y)
18 A heterogeneous multiprocessor
19 Some typical configurations
20 Data-driven concurrency
21 Pipelined concurrency
22 Control-driven concurrency
23 State transitions between arbitrarily complex behaviors. (y)
24 Structural hierarchy. (y)
25 Sequential behavioral decomposition
26 Behavioral decomposition types
27 Code segment for sorting
28 Behavioral completion
29 Exception types
30 Timing diagram
31 Communication model
32 Examples of communication
33 Integer channel
34 A simple synchronous bus protocol
35 Protocol description of the synchronous bus protocol
36 Control synchronization
37 Data-dependent synchronization in Statecharts
38 A graphical SpecC+ specification example
39 A textual SpecC+ specification example
40 Component wrapper specification
41 Source code of the component wrapper specification
42 Common configurations before and after channel inlining
43 Timing specification of the SRAM read protocol
44 Timing implementation of the SRAM read protocol
45 Generic methodology
46 Conceptual model of specification
47 Conceptual model after partitioning
48 Conceptual model after scheduling
49 Conceptual model after communication synthesis

Essential Issues in Codesign
Daniel D. Gajski, Jianwen Zhu, Rainer Dömer
Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425, USA

Abstract

In this report we discuss the main models of computation, the basic types of architectures, and language features needed to specify systems. We also give an overview of a generic methodology for designing systems that include software and hardware parts, from executable specifications.

1 Models

In the last ten years, VLSI design technology, and the CAD industry in particular, have been very successful, enjoying an exceptional growth that has been paralleled only by the advances in IC fabrication. Since the design problems at the lower levels of abstraction became humanly intractable earlier than those at higher abstraction levels, researchers and industry alike were forced to devote their attention first to lower-level problems such as circuit simulation, placement, routing and floorplanning. As these problems became more manageable, CAD tools for logic simulation and synthesis were developed successfully and introduced into the design process. As design complexities have grown and time-to-market requirements have shrunk drastically, both industry and academia have begun to focus on system levels of design, since they reduce the number of objects that a designer needs to consider by an order of magnitude and thus allow the design and manufacturing of complex application-specific integrated circuits (ASICs) in short periods of time.

The first step in designing a system is specifying its functionality. To help us understand and organize this functionality in a systematic manner, we can use a variety of conceptual models. In this chapter, we will survey the key conceptual models that are most commonly used for hardware and for software systems; in the next section we examine the various architectures that are used in implementing those models; in the third section we survey the language features needed for specifying systems; and in the fourth section we present a generic codesign methodology.

1.1 Model and architecture definition

System design is the process of implementing a desired functionality using a set of physical components. Clearly, then, the whole process of system design must begin with specifying the desired functionality. This is not, however, an easy task. For example, consider the task of specifying an elevator controller. How do we describe its functionality in sufficient detail that we could predict with absolute precision what the elevator's position would be after any sequence of pressed buttons? The problem with natural-language specifications is that they are often ambiguous and incomplete, lacking the capacity for detail that is required by such a task. Therefore, we need a more precise approach to specify functionality.

The most common way to achieve the level of precision we need is to think of the system as a collection of simpler subsystems, or pieces, together with the method or rules for composing these pieces to create the system functionality. We call such a method a model.

To be useful, a model should possess certain qualities. First, it should be formal, so that it contains no ambiguity. It should also be complete, so that it can describe the entire system. In addition, it should be comprehensible to the designers who need to use it, as well as easy to modify, since it is inevitable that, at some point, they will wish to change the system's functionality. Finally, a model should be natural enough to aid, rather than impede, the designer's understanding of the system.

It is important to note that a model is a formal system consisting of objects and composition rules, and is used for describing a system's characteristics. Typically, we would use a particular model to decompose a system into pieces, and then generate a specification by describing these pieces in a particular language.

A language can capture many different models, and a model can be captured in many different languages.

The purpose of a conceptual model is to provide an abstracted view of a system. Figure 1, for example, shows two different models of an elevator controller, whose English description is in Figure 1(a). The difference between these two models is that Figure 1(b) represents the controller as a set of programming statements, whereas Figure 1(c) represents the controller as a finite state machine in which the states indicate the direction of the elevator movement.

As you can see, each of these models represents a set of objects and the interactions among them. The state-machine model, for example, consists of a set of states and transitions between these states; the programming model, in contrast, consists of a set of statements that are executed under a control sequence that uses branching and looping. The advantage to having these different models at our disposal is that they allow designers to represent different views of a system, thereby exposing its different characteristics. For example, the state-machine model is best suited to represent a system's temporal behavior, as it allows a designer to explicitly express the modes and mode-transitions caused by external or internal events. The algorithmic model, on the other hand, has no explicit states. However, since it can specify a system's input-output relation in terms of a sequence of statements, it is well-suited to representing the procedural view of the system.

Designers choose different models in different phases of the design process, in order to emphasize those aspects of the system that are of interest to them at that particular time. For example, in the specification phase, the designer knows nothing beyond the functionality of the system, so he will tend to use a model that does not reflect any implementation information. In the implementation phase, however, when information about the system's components is available, the designer will switch to a model that can capture the system's structure.

Different models are also required for different application domains. For example, designers would model real-time systems and database systems differently, since the former focus on temporal behavior, while the latter focus on data organization.

Once the designer has found an appropriate model to specify the functionality of a system, he can describe in detail exactly how that system will work. At that point, however, the design process is not complete, since such a model has still not described exactly how that system is to be manufactured.

Figure 2: Architectures used in: (a) a register-level implementation (a State register with Next State logic and Output logic), (b) a system-level implementation (a processor, memory and interface connected by a bus). (z)

"If the elevator is stationary and the floor
requested is equal to the current floor, loop
then the elevator remains idle. if (rfloor = cfloor) then
d := idle;
If the elevator is stationary and the floor elsif (rfloor < cfloor) then
requested is less than the current floor, d := down;
then lower the elevator to the requested floor. elsif (rfloor > cfloor) then
d := up;
If the elevator is stationary and the floor end if;
requested is greater than the current floor, end loop;
then raise the elevator to the requested floor."
(a) (b)

(rfloor < cfloor) (rfloor = cfloor) (rfloor > cfloor)


/ d := down / d := idle / d := up

(rfloor < cfloor) (rfloor > cfloor)


/ d := down / d := up
Down Idle Up
(rfloor = cfloor) (rfloor = cfloor)
/ d := idle / d := idle
(rfloor < cfloor) / d := up
(rfloor < cfloor) / d := down

(c)

Figure 1: Conceptual views of an elevator controller: (a) desired functionality in English, (b) programming model,
(c) state-machine model. (y)

The next step, then, is to transform the model into an architecture, which serves to define the model's implementation by specifying the number and types of components as well as the connections between them. In Figure 2, for example, we see two different architectures, either of which could be used to implement the state-machine model of the elevator controller in Figure 1(c). The architecture in Figure 2(a) is a register-level implementation, which uses a state register to hold the current state and combinational logic to implement the state transitions and the values of the output signals. In Figure 2(b), we see a processor-level implementation that maps the same state-machine model into software, using a variable in a program to represent the current state and statements in the program to calculate the state transitions and the values of the output signals. In this architecture, the program is stored in the memory and executed by the processor.

Models and architectures are conceptual and implementation views on the highest level of abstraction. Models describe how a system works, while architectures describe how it will be manufactured. The design process, or methodology, is the set of design tasks that transform a model into an architecture. At the beginning of this process, only the system's functionality is known. The designer's job, then, is to describe this functionality in some language which is based on the most appropriate models. As the design process proceeds, an architecture will begin to emerge, with more detail being added at each step in the process. Generally, designers will find that certain architectures are more efficient in implementing certain models. In addition, design and manufacturing technology will have a great influence on the choice of an architecture. Therefore, designers have to consider many different implementation alternatives before the design process is complete.

1.2 Model taxonomy

System designers use many different models in their various hardware or software design methodologies. In general, though, these models fall into five distinct categories: (1) state-oriented; (2) activity-oriented; (3) structure-oriented; (4) data-oriented; and (5) heterogeneous. A state-oriented model, such as a finite-state machine, is one that represents the system as a set of states and a set of transitions between them, which are triggered by external events. State-oriented models are most suitable for control systems, such as real-time reactive systems, where the system's temporal behavior is the most important aspect of the design. An activity-oriented model, such as a dataflow graph, is one that describes a system as a set of activities related by data or execution dependencies.

This model is most applicable to transformational systems, such as digital signal processing systems, where data passes through a set of transformations at a fixed rate. Using a structure-oriented model, such as a block diagram, we would describe a system's physical modules and the interconnections between them. Unlike state-oriented and activity-oriented models, which primarily reflect a system's functionality, structure-oriented models focus mainly on the system's physical composition. Alternatively, we can use a data-oriented model, such as an entity-relationship diagram, when we need to represent the system as a collection of data related by their attributes, class membership and interactions. This model is most suitable for information systems, such as databases, where the function of the system is less important than the data organization of the system. Finally, a designer could use a heterogeneous model, one that integrates many of the characteristics of the previous four models, whenever he needs to represent a variety of different views in a complex system.

In the rest of this section we will describe some frequently used models.

1.3 Finite-state machine

A finite-state machine (FSM) is an example of a state-oriented model. It is the most popular model for describing control systems, since the temporal behavior of such systems is most naturally represented in the form of states and transitions between states. Basically, the FSM model consists of a set of states, a set of transitions between states, and a set of actions associated with these states or transitions.

The finite state machine can be defined abstractly as the quintuple

    <S, I, O, f, h>

where S, I, and O represent a set of states, a set of inputs and a set of outputs, respectively, and f and h represent the next-state and the output functions. The next-state function f is defined abstractly as a mapping S × I → S. In other words, f assigns to every pair of state and input symbols another state symbol. The FSM model assumes that transitions from one state to another occur only when the input symbols change. Therefore, the next-state function f defines what the state of the FSM will be after the input symbols change.

The output function h determines the output values in the present state. There are two different types of finite state machine, which correspond to two different definitions of the output function h. One type is state-based or Moore-type, for which h is defined as a mapping S → O. In other words, an output symbol is assigned to each state of the FSM and output during the time the FSM is in that particular state. The other type is an input-based or Mealy-type FSM, for which h is defined as the mapping S × I → O. In this case, an output symbol in each state is defined by a pair of state and input symbols, and it is output while the state and the corresponding input symbols persist.

According to our definition, each of the sets S, I, and O may have any number of symbols. However, in reality we deal only with binary variables, operators and memory elements. Therefore, S, I, and O must be implemented as a cross-product of binary signals or memory elements, whereas the functions f and h are defined by Boolean expressions that will be implemented with logic gates.

Figure 3: FSM model for the elevator controller. (y) [The diagram shows the states S1, S2 and S3, one per floor, with transitions such as r2/u1 from S1, r1/d1 from S2, and the self-loop r3/n on S3.]

In Figure 3, we see an input-based FSM that models the elevator controller in a building with three floors, as described in Section 1.1. In this model, the set of inputs I = {r1, r2, r3} represents the floor requested. For example, r2 means that floor 2 is requested. The set of outputs O = {d2, d1, n, u1, u2} represents the direction and the number of floors the elevator should go. For example, d2 means that the elevator should go down 2 floors, u2 means that the elevator should go up 2 floors, and n means that the elevator should stay idle. The set of states represents the floors. In Figure 3, we can see that if the current floor is 2 (i.e., the current state is S2), and floor 1 is requested, then the output will be d1.

In Figure 4 we see the state-based model for the same elevator controller, in which the value of the output is indicated in each state. Each state has been split into three states representing each of the output signals that the state machine in Figure 3 will output when entering that particular state.

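As a concrete illustration of the quintuple, the following C sketch (ours, not from the report) encodes the input-based FSM of Figure 3: f and h are the next-state and output functions, and the enum encodings of S, I and O are assumptions made for this example.

    #include <stdio.h>

    /* Input-based (Mealy) FSM of Figure 3. States are the three
       floors; inputs are floor requests; outputs give direction
       and distance. */
    typedef enum { S1, S2, S3 } state_t;          /* S: current floor   */
    typedef enum { R1, R2, R3 } input_t;          /* I: requested floor */
    typedef enum { D2, D1, N, U1, U2 } output_t;  /* O: movement        */

    /* Next-state function f: S x I -> S; the elevator ends up on
       the requested floor. */
    static state_t f(state_t s, input_t r) { (void)s; return (state_t)r; }

    /* Output function h: S x I -> O, written as a table indexed by
       the (state, input) pair. */
    static output_t h(state_t s, input_t r) {
        static const output_t table[3][3] = {
            /*        r1  r2  r3 */
            /* S1 */ { N,  U1, U2 },
            /* S2 */ { D1, N,  U1 },
            /* S3 */ { D2, D1, N  }
        };
        return table[s][r];
    }

    int main(void) {
        static const char *name[] = { "d2", "d1", "n", "u1", "u2" };
        state_t s = S2;              /* on floor 2...          */
        input_t r = R1;              /* ...floor 1 is requested */
        printf("output %s\n", name[h(s, r)]);   /* prints "d1" */
        s = f(s, r);
        return 0;
    }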
Figure 4: State-based FSM model for the elevator controller. (y) [The diagram has nine states with their outputs, S11/d2, S21/d1, S31/n, S12/d1, S22/n, S32/u1, S13/n, S23/u1 and S33/u2, connected by transitions on the requests r1, r2 and r3.]

In practical terms, the primary difference between these two models is that the state-based FSM may require quite a few more states than the input-based model. This is because in an input-based model, there may be multiple arcs pointing to a single state, each arc having a different output value; in the state-based model, however, each different output value would require its own state, as is the case in Figure 4.

1.4 Finite-state machine with datapath

In cases when a FSM must represent integer or floating-point numbers, we could encounter a state-explosion problem, since, if each possible value for a number requires its own state, then the FSM could require an enormous number of states. For example, a 16-bit integer can represent 2^16 or 65,536 different states. There is a fairly simple way to eliminate the state-explosion problem, however, as it is possible to extend a FSM with integer and floating-point variables, so that each variable replaces thousands of states. The introduction of a 16-bit variable, for example, would reduce the number of states in the FSM model by a factor of 65,536.

In order to formally define a FSMD [Gaj97], we must extend the definition of a FSM introduced in the previous section by introducing sets of datapath variables, inputs and outputs that will complement the sets of FSM states, inputs and outputs.

As we mentioned in the previous section, an FSM is a quintuple

    <S, I, O, f, h>

Each of the state, input and output symbols is defined by a cross-product of Boolean variables. More precisely,

    I = A1 × A2 × ... × Ak
    S = Q1 × Q2 × ... × Qm
    O = Y1 × Y2 × ... × Yn

where Ai, 1 ≤ i ≤ k, is an input signal, Qi, 1 ≤ i ≤ m, is a flip-flop output, and Yi, 1 ≤ i ≤ n, is an output signal.

In order to include a datapath, we must extend the above FSM definition by adding the set of datapath variables, inputs and outputs. More formally, we define a variables set

    V = V1 × V2 × ... × Vq

which defines the state of the datapath by defining the values of all variables in each state.

In the same fashion, we can separate the set of FSMD inputs into a set of FSM inputs IC and a set of datapath inputs ID. Thus,

    I = IC × ID

where IC = A1 × A2 × ... × Ak as before and ID = B1 × B2 × ... × Bp.

Similarly, the output set consists of FSM outputs OC and datapath outputs OD. In other words,

    O = OC × OD

where OC = Y1 × Y2 × ... × Yn as before and OD = Z1 × Z2 × ... × Zr. However, note that the Ai, Qj and Yk represent Boolean variables, while the Bi, Vi and Zi are Boolean vectors, which in turn represent integers, floating-point numbers and characters. For example, in a 16-bit datapath, Bi, Vi and Zi would be 16 bits wide, and if they were positive integers, they would be able to assume values from 0 to 2^16 − 1.

Except for very trivial cases, the size of the datapath variables and ports makes specification of the functions f and h in a tabular form very difficult. In order to be able to specify variable values in an efficient and understandable way in the definition of an FSMD, we will specify variable values with arithmetic expressions.

More formally,

    Expr(V) = K ∪ V ∪ {(ei ◦ ej) | ei, ej ∈ Expr(V), ◦ is an acceptable operator}

where K is a set of constants of the same type as the variables in V.

Using Expr(V) we can define the values of the status signals as well as the transformations in the datapath. Let

    STAT = {statk = ei Δ ej | ei, ej ∈ Expr(V), Δ ∈ {≤, <, =, ≠, >, ≥}}

be the set of all status signals, which are described as relations between variables or expressions of variables. Examples of status signals are Data ≠ 0, (a − b) > (x + y) and (counter = 0) AND (x > 10). The relations defining status signals are either true, in which case the status signal has value 1, or false, in which case it has value 0.

With this formal definition of expressions and relations over a set of variables, we can simplify the function f : (S × V) × I → S × V by separating it into two parts, fC and fD. The function fC defines the next state of the control unit,

    fC : S × IC × STAT → S

while the function fD defines the values of the datapath variables in the next state,

    fD : S × V × ID → V

In other words, for each state si ∈ S we compute a new value for each variable Vj ∈ V in the datapath by evaluating an expression ej ∈ Expr(V). Thus, the function fD is represented by a set of simpler functions, in which each function in the set defines the variable values for the state si:

    fD = {fDi : V × ID → V : {Vj = ej | Vj ∈ V, ej ∈ Expr(V × ID)}}

In other words, the function fD is decomposed into a set of functions fDi, where each fDi assigns one expression ek to each variable Vj in the datapath in state si. Therefore, new values for all variables in the datapath are computed by evaluating the expressions ej, for all j such that 1 ≤ j ≤ q.

Similarly, we can decompose the output function h : S × V × I → O into two different functions, hC and hD, where hC defines the external control outputs OC as in the definition of an FSM, and hD defines the external datapath outputs. Therefore,

    hC : S × IC × STAT → OC

and

    hD : S × V × ID → OD

Note again that the variables in OC are Boolean variables and that the variables in OD are Boolean vectors.

Using this kind of FSMD, we could model the elevator controller example in Figure 3 with only one state, as shown in Figure 5. This reduction in the number of states is possible because we have designated a variable cfloor to store the state value of the FSM in Figure 3 and rfloor to store the values of r1, r2 and r3.

Figure 5: FSMD model for the elevator controller. (y) [A single state S1 with the transitions "(cfloor != rfloor) / cfloor := rfloor; output := rfloor − cfloor" and "(cfloor = rfloor) / output := 0".]

In general, the FSM is suitable for modeling control-dominated systems, while the FSMD can be suitable for both control- and computation-dominated systems. However, it should be pointed out that neither the FSM nor the FSMD model is suitable for complex systems, since neither one explicitly supports concurrency and hierarchy. Without explicit support for concurrency, a complex system will precipitate an explosion in the number of states. Consider, for example, a system consisting of two concurrent subsystems, each with 100 possible states. If we try to represent this system as a single FSM or FSMD, we must represent all possible states of the system, of which there are 100 × 100 = 10,000. At the same time, the lack of hierarchy would cause an increase in the number of arcs. For example, if there are 100 states, each requiring its own arc to transition to a specific state for a particular input value, we would need 100 arcs, as opposed to the single arc required by a model that can hierarchically group those 100 states into one state. The problem with such models, of course, is that once they reach several hundred states or arcs, they become incomprehensible to humans.

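Rendered as code, the single-state FSMD amounts to one guarded update of the datapath variable. The C sketch below is our illustration of Figure 5; the signed encoding of the output (negative means down, positive means up, 0 means idle) is an assumption.

    #include <stdio.h>

    /* One-state FSMD of Figure 5: cfloor is a datapath variable,
       so the three control states of Figure 3 collapse into one. */
    static int cfloor = 1;   /* datapath variable */

    static int elevator_step(int rfloor) {
        int output;
        if (cfloor != rfloor) {        /* (cfloor != rfloor)            */
            output = rfloor - cfloor;  /* output := rfloor - cfloor     */
            cfloor = rfloor;           /* cfloor := rfloor              */
        } else {
            output = 0;                /* (cfloor = rfloor) / output:=0 */
        }
        return output;
    }

    int main(void) {
        int requests[] = { 3, 1, 1 };
        for (int i = 0; i < 3; i++)
            printf("request floor %d -> move %+d\n",
                   requests[i], elevator_step(requests[i]));
        return 0;
    }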
1.5 Petri net

The Petri net model [Pet81, Rei92] is another type of state-oriented model, specifically defined to model systems that comprise interacting concurrent tasks. The Petri net model consists of a set of places, a set of transitions, and a set of tokens. Tokens reside in places, and circulate through the Petri net by being consumed and produced whenever a transition fires.

More formally, a Petri net is a quintuple

    <P, T, I, O, u>                                    (1)

where P = {p1, p2, ..., pm} is a set of places, T = {t1, t2, ..., tn} is a set of transitions, and P and T are disjoint. Further, the input function, I : T → P+, defines all the places providing input to a transition, while the output function, O : T → P+, defines all the output places for each transition. In other words, the input and output functions specify the connectivity of places and transitions. Finally, the marking function u : P → N defines the number of tokens in each place, where N is the set of non-negative integers.

    Net = (P, T, I, O, u)
    P = {p1, p2, p3, p4, p5}
    T = {t1, t2, t3, t4}
    I: I(t1) = {p1}           O: O(t1) = {p5}       u: u(p1) = 1
       I(t2) = {p2, p3, p5}      O(t2) = {p3, p5}      u(p2) = 1
       I(t3) = {p3}              O(t3) = {p4}          u(p3) = 2
       I(t4) = {p4}              O(t4) = {p2, p3}      u(p4) = 0
                                                       u(p5) = 1

Figure 6: A Petri net example. (y)

In Figure 6, we see a graphic and a textual representation of a Petri net. Note that there are five places (graphically represented as circles) and four transitions (graphically represented as solid bars) in this Petri net. In this instance, the places p2, p3, and p5 provide inputs to transition t2, and p3 and p5 are the output places of t2. The marking function u assigns one token to p1, p2 and p5 and two tokens to p3, as denoted by u(p1, p2, p3, p4, p5) = (1, 1, 2, 0, 1).

As mentioned above, a Petri net executes by means of firing transitions. A transition can fire only if it is enabled, that is, if each of its input places has at least one token. A transition is said to have fired when it has removed all of its enabling tokens from its input places and then deposited one token into each output place. In Figure 6, for example, after transition t2 fires, the marking u will change to (1, 0, 2, 0, 1).

Petri nets are useful because they can effectively model a variety of system characteristics. Figure 7(a), for example, shows the modeling of sequencing, in which transition t1 fires after transition t2. In Figure 7(b), we see the modeling of non-deterministic branching, in which two transitions are enabled but only one of them can fire. In Figure 7(c), we see the modeling of synchronization, in which a transition can fire only after both input places have tokens. Figure 7(d) shows how one would model resource contention, in which two transitions compete for the same token, which resides in the place in the center. In Figure 7(e), we see how we could model concurrency, in which two transitions, t2 and t3, can fire simultaneously. More precisely, Figure 7(e) models two concurrent processes, a producer and a consumer; the token located in the place at the center is produced by t2 and consumed by t3.

Petri net models can be used to check and validate certain useful system properties such as safeness and liveness. Safeness, for example, is the property of Petri nets that guarantees that the number of tokens in the net will not grow indefinitely. In fact, we cannot construct a Petri net in which the number of tokens is unbounded. Liveness, on the other hand, is the property of Petri nets that guarantees a deadlock-free operation, by ensuring that there is always at least one transition that can fire.

Figure 7: Petri net representing: (a) sequencing, (b) branching, (c) synchronization, (d) contention, (e) concurrency. (y)

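The enabling and firing rules translate directly into code. The following C sketch (our illustration) encodes the net of Figure 6 as incidence matrices and reproduces the firing of t2 described above.

    #include <stdio.h>

    #define PLACES 5
    #define TRANS  4

    /* Input and output functions of the Figure 6 net as incidence
       matrices: in[t][p] = 1 iff place p is an input of transition t. */
    static const int in [TRANS][PLACES] = {
        {1,0,0,0,0},   /* I(t1) = {p1}         */
        {0,1,1,0,1},   /* I(t2) = {p2, p3, p5} */
        {0,0,1,0,0},   /* I(t3) = {p3}         */
        {0,0,0,1,0}    /* I(t4) = {p4}         */
    };
    static const int out[TRANS][PLACES] = {
        {0,0,0,0,1},   /* O(t1) = {p5}         */
        {0,0,1,0,1},   /* O(t2) = {p3, p5}     */
        {0,0,0,1,0},   /* O(t3) = {p4}         */
        {0,1,1,0,0}    /* O(t4) = {p2, p3}     */
    };
    static int u[PLACES] = {1, 1, 2, 0, 1};   /* initial marking */

    /* A transition is enabled iff every input place holds a token. */
    static int enabled(int t) {
        for (int p = 0; p < PLACES; p++)
            if (in[t][p] && u[p] == 0) return 0;
        return 1;
    }

    /* Firing consumes one token per input place and produces one
       token per output place. */
    static void fire(int t) {
        for (int p = 0; p < PLACES; p++) u[p] -= in[t][p];
        for (int p = 0; p < PLACES; p++) u[p] += out[t][p];
    }

    int main(void) {
        if (enabled(1)) fire(1);   /* fire t2 */
        printf("marking after t2: (%d,%d,%d,%d,%d)\n",
               u[0], u[1], u[2], u[3], u[4]);   /* (1,0,2,0,1) */
        return 0;
    }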
Although a Petri net does have many advantages in modeling and analyzing concurrent systems, it also has limitations that are similar to those of an FSM: it can quickly become incomprehensible with any increase in system complexity.

1.6 Hierarchical concurrent finite-state machine

The hierarchical concurrent finite-state machine (HCFSM) is essentially an extension of the FSM model, which adds support for hierarchy and concurrency, thus eliminating the potential for state and arc explosion that occurred when describing hierarchical and concurrent systems with FSM models.

Like the FSM, the HCFSM model consists of a set of states and a set of transitions. Unlike the FSM, however, in the HCFSM each state can be further decomposed into a set of substates, thus modeling hierarchy. Furthermore, each state can also be decomposed into concurrent substates, which execute in parallel and communicate through global variables. The transitions in this model can be either structured or unstructured, with structured transitions allowed only between two states on the same level of hierarchy, while unstructured transitions may occur between any two states regardless of their hierarchical relationship.

One language that is particularly well-adapted to the HCFSM model is Statecharts [Har87], since it can easily support the notions of hierarchy, concurrency and communication between concurrent states. Statecharts uses unstructured transitions and a broadcast communication mechanism, in which events emitted by any given state can be detected by all other states.

The Statecharts language is a graphic language. Specifically, we use rounded rectangles to denote states at any level, and encapsulation to express a hierarchical relation between these states. Dashed lines between states represent concurrency, and arrows denote the transitions between states, each arrow being labeled with an event and, optionally, with a parenthesized condition and/or action.

Figure 8 shows an example of a system represented by means of Statecharts. In this figure, we can see that state Y is decomposed into two concurrent states, A and D; the former consists of two further substates, B and C, while the latter comprises substates E, F, and G. The bold dots in the figure indicate the starting points of states. According to the Statecharts language, when event b occurs while in state C, A will transfer to state B. If, on the other hand, event a occurs while in state B, A will transfer to state C, but only if condition P holds at the instant of occurrence. During the transfer from B to C, the action c associated with the transition will be performed.

Figure 8: Statecharts: hierarchical concurrent states. (y)

Because of its hierarchy and concurrency constructs, the HCFSM model is well-suited to representing complex control systems. The problem with this model, however, is that, like any other state-oriented model, it concentrates exclusively on modeling control, which means that it can only associate very simple actions, such as assignments, with its transitions or states. As a result, the HCFSM is not suitable for modeling certain characteristics of complex systems, which may require complex data structures or may perform in each state an arbitrarily complex activity. For such systems, this model alone would probably not suffice.

1.7 Programming languages

Programming languages provide a heterogeneous model that can support data, activity and control modeling. Unlike the structure chart, programming languages are presented in a textual, rather than a graphic, form.

There are two major types of programming languages: imperative and declarative. The imperative class includes languages like C and Pascal, which use a control-driven model of execution, in which statements are executed in the order written in the program. LISP and PROLOG, by contrast, are examples of declarative languages, since they model execution through demand-driven or pattern-driven computation. The key difference here is that declarative languages specify no explicit order of execution, focusing instead on defining the target of the computation through a set of functions or logic rules.

In the aspect of data modeling, imperative programming languages provide a variety of data structures.
These data structures include, for example, basic data types, such as integers and reals, as well as composite types, like arrays and records. A programming language would model small activities by means of statements, and large activities by means of functions or procedures, which can also serve as a mechanism for supporting hierarchy within the system. These programming languages can also model control flow, by using control constructs that specify the order in which activities are to be performed. These control constructs can include sequential composition (often denoted by a semicolon), branching (if and case statements), looping (while, for, and repeat), as well as subroutine calls.

The advantage of using an imperative programming language is that this paradigm is well-suited to modeling computation-dominated behavior, in which some problem is solved by means of an algorithm, as, for example, in a case when we need to sort a set of numbers stored in an array.

The main problem with programming languages is that, although they are well-suited for modeling the data, activity, and control mechanisms of a system, they do not explicitly model the system's states, which is a disadvantage in modeling embedded systems.

1.8 Program-state machine

A program-state machine (PSM) [GVN94] is an instance of a heterogeneous model that integrates an HCFSM with a programming language paradigm. This model basically consists of a hierarchy of program-states, in which each program-state represents a distinct mode of computation. At any given time, only a subset of the program-states will be active, i.e., actively carrying out their computations.

Within its hierarchy, the model would consist of both composite and leaf program-states. A composite program-state is one that can be further decomposed into either concurrent or sequential program-substates. If they are concurrent, all the program-substates will be active whenever the program-state is active, whereas if they are sequential, the program-substates are only active one at a time when the program-state is active. A sequentially decomposed program-state will contain a set of transition arcs, which represent the sequencing between the program-substates. There are two types of transition arcs. The first, a transition-on-completion arc (TOC), will be traversed only when the source program-substate has completed its computation and the associated arc condition evaluates to true. The second, a transition-immediately arc (TI), will be traversed immediately whenever the arc condition becomes true, regardless of whether the source program-substate has completed its computation. Finally, at the bottom of the hierarchy, we have the leaf program-states, whose computations are described through programming language statements.

When we are using the program-state machine as our model, the system as an entity can be graphically represented by a rectangular box, while the program-states within the entity will be represented by boxes with curved corners. A concurrent relation between program-substates is denoted by the dotted line between them. Transitions are represented with directed arrows. The starting state is indicated by a triangle, and the completion of individual program-states is indicated by a transition arc that points to the completion point, represented as a small square within the state. TOC arcs are those that originate from a square inside the source substate, while TI arcs originate from the perimeter of the source substate.

    variable A: array[1..20] of integer;
    variable i, max: integer;

    max = 0;
    for i = 1 to 20 do
        if (A[i] > max) then
            max = A[i];
        end if;
    end for

Figure 9: An example of program-state machine. (y) [The root state Y contains concurrent substates A and D; A contains sequential substates B and C, connected by the arcs e1, e2 and e3; the program above is the body of leaf state D.]

Figure 9 shows an example of a program-state machine, consisting of a root state Y, which itself comprises two concurrent substates, A and D. State A, in turn, contains two sequential substates, B and C. Note that states B, C, and D are leaf states, though the figure shows the program only for state D. According to the graphic symbols given above, we can see that the arcs labeled e1 and e3 are TOC arcs, while the arc labeled e2 is a TI arc. This configuration of arcs means that when state B finishes and condition e1 is true, control will transfer to state C. If, however, condition e2 is true while in state C, control will transfer to state B regardless of whether C finishes or not.

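The difference between the two arc types is easiest to see in executable form. The C sketch below is our own illustration (the conditions and the completion test are invented for the example): the TI guard e2 is checked on every step, while the TOC guard e1 is consulted only once the current substate has completed.

    #include <stdio.h>

    typedef enum { B, C } substate_t;

    /* Hypothetical arc conditions, here simple functions of time. */
    static int e1(int t) { return t > 2; }   /* TOC guard on B -> C */
    static int e2(int t) { return t == 6; }  /* TI guard on C -> B  */

    int main(void) {
        substate_t s = B;                    /* starting substate */
        for (int t = 0; t < 8; t++) {
            /* TI arc: taken immediately, whether or not C finished. */
            if (s == C && e2(t)) {
                s = B;
                printf("t=%d: TI arc e2, back to B\n", t);
                continue;
            }
            /* Stand-in for "the substate completed its computation". */
            int done = (t % 3 == 2);
            /* TOC arc: examined only at B's completion point. */
            if (s == B && done && e1(t)) {
                s = C;
                printf("t=%d: TOC arc e1, on to C\n", t);
            }
        }
        return 0;
    }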
Since PSMs can represent a system's states, data, and activities in a single model, they are more suitable than HCFSMs for modeling systems which have complex data and activities associated with each state. A PSM can also overcome the primary limitation of programming languages, since it can model states explicitly. It allows a modeler to specify a system using hierarchical state-decomposition until he/she feels comfortable using program constructs. The programming language model and the HCFSM model are just two extremes of the PSM model: a program can be viewed as a PSM with only one leaf state containing language constructs, while a HCFSM can be viewed as a PSM with all its leaf states containing no language constructs.

In this section we presented the main models used to capture systems. Obviously, there are more models used in codesign, mostly targeted at specific applications. For example, the codesign finite state machine (CFSM) model [CGH+93], which is based on communicating FSMs using event broadcasting, is targeted at small, reactive real-time systems and can be used to formally define and verify a system's properties.

2 Architectures

To this point, we have demonstrated how various models can be used to describe a system's functionality, data, control and structure. An architecture is intended to supplement these descriptive models by specifying how the system will actually be implemented. The goal of an architecture, then, is to describe the number of components, the type of each component, and the type of each connection among these various components in a system.

Architectures can range from simple controllers to parallel heterogeneous processors. Despite this variety, however, architectures nonetheless fall into a few distinct classes, namely: (1) application-specific architectures, such as DSP systems; (2) general-purpose processors, such as RISCs; and (3) parallel processors, such as VLIW, SIMD and MIMD machines.

2.1 Controller architecture

The simplest of the application-specific architectures is the controller variety, which is a straightforward implementation of the finite-state machine model presented in Section 1.3 and defined by the quintuple <S, I, O, f, h>. A controller consists of a register and two combinational blocks, as shown in Figure 10. The register, usually called the State register, is designed to store the states in S, while the two combinational blocks, referred to as the Next-state logic and the Output logic, implement the functions f and h. Inputs and Outputs are representations of the Boolean signals that are defined by the sets I and O.

Figure 10: A generic controller design: (a) state-based, (b) input-based. (z)

As mentioned in Section 1.3, there are two distinct types of controllers, those that are input-based and those that are state-based. These types of controllers differ in how they define the output function, h. For input-based controllers, h is defined as a mapping S × I → O, which means that the Output logic depends on two parameters, namely, the State register and the Inputs. For state-based controllers, on the other hand, h is defined as the mapping S → O, which means the Output logic depends on only one parameter, the State register. Since the inputs and outputs are Boolean signals, in either case this architecture is well-suited to implementing controllers that do not require complex data manipulation.

Controller synthesis consists of state minimization and encoding, Boolean minimization and technology mapping for the Next-state and Output logic.

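In software form, the controller of Figure 10 is a state register updated by f on each clock edge, with h computed combinationally. The sketch below is ours (the concrete logic functions are placeholders); it shows that the state-based and input-based variants differ only in the signature of the Output logic.

    #include <stdio.h>

    typedef unsigned state_t;
    typedef unsigned inputs_t;
    typedef unsigned outputs_t;

    /* Next-state logic, implementing f: S x I -> S (placeholder). */
    static state_t next_state_logic(state_t s, inputs_t in) {
        return (s + in) & 0x3;
    }
    /* State-based (Moore) Output logic, h: S -> O (placeholder). */
    static outputs_t output_logic_moore(state_t s) {
        return s << 1;
    }
    /* Input-based (Mealy) Output logic, h: S x I -> O (placeholder). */
    static outputs_t output_logic_mealy(state_t s, inputs_t in) {
        return (s << 1) | (in & 1);
    }

    int main(void) {
        state_t s = 0;
        for (inputs_t in = 0; in < 4; in++) {
            /* Outputs are combinational in the current state... */
            printf("moore=%u mealy=%u\n",
                   output_logic_moore(s), output_logic_mealy(s, in));
            /* ...and the State register is loaded at the clock edge. */
            s = next_state_logic(s, in);
        }
        return 0;
    }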
2.2 Custom Datapath architecture

In a custom datapath we compute arbitrary expressions. In a datapath we use a varying number of counters, registers, register-files and memories, with a varied number of ports, connected by several buses. Note that these same buses can be used to supply operands to functional units as well as to supply results back to storage units. It is also possible for the functional units to obtain operands from several buses, though this would require the use of a selector in front of each input. It is also possible for each unit to have input and output latches, which are used to temporarily store the input operands or results. Such latching can significantly shorten the amount of time that the buses are occupied for operand and result transfer, and thus can increase the traffic over these buses.

On the other hand, input and output latching requires a more complicated control unit, since each operation requires more than one clock cycle. Namely, at least one clock cycle is required to fetch operands from registers, register files or memories and store them into input latches, at least one clock cycle to perform the operation and store the result into an output latch, and at least one clock cycle to store the result from the output latch back to a register or memory.

An example of such a custom datapath is shown in Figure 11. Note that it has a counter, a register, a 3-port register-file and a 2-port memory. It also has four buses and three functional units: two ALUs and a multiplier. As you can see, ALU1 does not have any latches, while ALU2 has latches at both the inputs and the outputs, and the single multiplier has only the inputs latched. With this arrangement, ALU1 can receive its left operand from buses 2 and 4, while the multiplier can receive its right operand from buses 1 and 4. Similarly, the storage units can also receive data from several buses. Such custom datapaths are frequently used in application-specific designs to obtain the best performance-cost ratio.

Figure 11: An example of a custom datapath. (z)

Datapaths are also used in all standard processor implementations to perform numerical computations or data manipulations. A processor datapath consists of some temporary storage, in addition to arithmetic, logic and shift units. Let's consider, for example, how we might perform the summation of a hundred numbers by declaring the sum to be a temporary variable, initially set to zero, and executing the following loop statement:

    sum = 0
    loop:
        for i = 1 to 100
            sum = sum + xi
        end loop

The above loop body could be executed on a datapath consisting of one register, called an Accumulator, and an ALU. The variable sum would be stored in the Accumulator, and in each clock cycle the new xi would be added to the sum in the ALU, so that the new value of sum could again be stored in the Accumulator.

Generally speaking, the majority of digital designs work in the same manner. The variable values and constants are stored in registers or memories; they are fetched from the storage components after the rising edge of the clock signal; they are transformed in combinatorial components during the time between two rising edges of the clock; and the results are stored back into the storage components at the next rising edge of the clock signal.

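The accumulator scheme can be written out cycle by cycle. The C sketch below is our illustration of the datapath just described (and of the 102-cycle schedule discussed next): the selector feeds either 0 or outside data to the ALU, whose right operand is always the Accumulator.

    #include <stdio.h>

    int main(void) {
        int x[100];
        for (int i = 0; i < 100; i++) x[i] = i + 1;  /* sample data */

        int acc;                       /* the Accumulator register */

        /* cycle 1: clear the Accumulator (selector picks 0) */
        acc = 0;

        /* cycles 2..101: selector picks outside data; the ALU adds
           it to the Accumulator, which is reloaded with the sum */
        for (int i = 0; i < 100; i++)
            acc = x[i] + acc;

        /* cycle 102: enable the tri-state output driver */
        printf("sum = %d\n", acc);     /* 5050 for 1..100 */
        return 0;
    }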
In Figure 12, we have shown a simple datapath that could perform the above summation. This datapath contains a selector, which selects either 0 or some outside data as the left operand for the ALU. The right operand will always be the content of the Accumulator, which could also be output through a tri-state driver. The Accumulator is a shift register with a parallel load. This datapath's schematic is shown in Figure 12(a), and in Figure 12(b) we have shown the 9-bit control word that specifies the values of the control signals for the Selector, the ALU, the Accumulator and the output drivers.

Figure 12: Simple datapath with one accumulator: (a) datapath schematic, (b) control word, with fields (bits 8 down to 0) for the Input select, the ALU controls, the Shift values, the Accumulator controls and the Out enable. (z)

On each clock cycle, a specific control word would define the operation of the datapath. In order to compute the sum of 100 numbers, we would need 102 clock cycles. In this case, the control words would be the same for all the clock cycles except the first and the last. In the first clock cycle, we would have to clear the Accumulator; in the next 100 clock cycles, we would add the new data to the accumulated sum; finally, in the last clock cycle, we would output the accumulated sum.

Datapaths are also used in many applications where a fixed computation must be performed repeatedly on different sets of data, as is the case in the digital signal processing (DSP) systems used for digital filtering, image processing, and multimedia. A datapath architecture often consists of high-speed arithmetic units, connected in parallel, and heavily pipelined in order to achieve a high throughput.

In Figure 13, we can see two different datapaths, both of which are designed to implement a finite-impulse-response (FIR) filter, which is defined by the expression

    y(i) = Σ_{k=0}^{N-1} x(i − k) b(k)

where N is 4. Note that the datapath in Figure 13(a) performs all its multiplications concurrently, and adds the products in parallel by means of a summation tree. The datapath in Figure 13(b) also performs its multiplications concurrently, but it will then add the products serially. Further, note that the datapath in Figure 13(a) has three pipeline stages, each indicated by a dashed line, whereas the datapath in Figure 13(b) has four similarly indicated pipeline stages. Although both datapaths use four multipliers and three adders, the datapath in Figure 13(b) is regular and easier to implement in ASIC technologies.

Figure 13: Two different datapaths for a FIR filter: (a) with three pipeline stages, (b) with four pipeline stages. (y)

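Functionally, both datapaths of Figure 13 compute the same four-tap convolution. The C sketch below (our illustration; the coefficients and samples are invented) evaluates the FIR expression directly, without modeling the pipeline registers.

    #include <stdio.h>

    #define N 4   /* number of taps */

    /* y(i) = sum over k = 0..N-1 of x(i-k) * b(k); samples before
       the start of the stream are taken to be zero. */
    static double fir(const double *x, int i, const double b[N]) {
        double y = 0.0;
        for (int k = 0; k < N; k++)
            if (i - k >= 0)
                y += x[i - k] * b[k];
        return y;
    }

    int main(void) {
        const double b[N] = { 0.25, 0.25, 0.25, 0.25 };
        const double x[8] = { 1, 2, 3, 4, 4, 3, 2, 1 };
        for (int i = 0; i < 8; i++)
            printf("y(%d) = %g\n", i, fir(x, i, b));
        return 0;
    }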
In this kind of architecture, as long as each operation in an algorithm is implemented by its own unit, as in Figure 13, we do not need a control for the system, since data simply flows from one unit to the next, and the clock is used to load pipeline registers. Sometimes, however, it may be necessary to use fewer units to save silicon area, in which case we would need a simple controller to steer the data among the units and registers, and to select the appropriate arithmetic function for those units that can perform different functions at different times. Another situation would be to implement more than one algorithm with the same datapath, with each algorithm executing at a different time. In this case, since each algorithm requires a unique flow of data through the datapath, we would need a controller to regulate the flow. Such controllers are usually simple and without conditional branches.

2.3 FSMD architecture

An FSMD architecture implements the FSMD model by combining a controller with a datapath. As shown in Figure 14(a), the datapath has two types of I/O ports. One type of I/O ports are data ports, which are used by the outside environment to send and receive data to and from the ASIC. The data could be of type integer, floating-point, or characters, and it is usually packed into one or more words. The data ports are usually 8, 16, 32 or 64 bits wide. The other type of I/O ports are control ports, which are used by the control unit to control the operations performed by the datapath and receive information about the status of selected registers in the datapath.

Figure 14: Design model: (a) high-level block diagram, (b) register-transfer-level block diagram. (z)

As shown in Figure 14(b), the datapath takes the operands from storage units, performs the computation in the combinatorial units and returns the results to storage units during each state, which is usually equal to one clock cycle.

As mentioned in the previous section, the selection of operands, operations and the destination for the result is controlled by the control unit by setting proper values of the datapath control signals. The datapath also indicates through status signals when a particular value is stored in a particular storage unit or when a particular relation between two data values stored in the datapath is satisfied.

Similar to the datapath, a control unit has a set of input and a set of output signals. Each signal is a Boolean variable that can take a value of 0 or 1. There are two types of input signals: external signals and status signals. External signals represent the conditions in the external environment to which the FSMD architecture must respond. On the other hand, the status signals represent the state of the datapath. Their value is obtained by comparing values of selected variables stored in the datapath. There are also two types of output signals: external signals and datapath control signals. External signals identify to the environment that an FSMD architecture has reached a certain state or finished a particular computation. The datapath controls, as mentioned before, select the operation for each component in the datapath.

FSMD architectures are used for various ASIC designs. Each ASIC design consists of one or more FSMD architectures, although two implementations may differ in the number of control units and datapaths, the number of components and connections in the datapath, the number of states in the control unit and the number of I/O ports. The FSM controller and DSP datapath mentioned above are two special cases of this kind of architecture. In addition, the FSMD is also the basic architecture for general-purpose processors, since each processor includes both a control unit and a datapath.
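The FSMD combination of a state register, next-state logic and register-transfer actions is easy to picture in software terms. The following C fragment is a minimal sketch, not from the paper, of an FSMD-style behavior that sums inputs until a zero is read; all names are hypothetical.

    /* States of the control unit. */
    enum State { S_CLEAR, S_ADD, S_OUT, S_DONE };

    int fsmd(const int *input)
    {
        enum State state = S_CLEAR;     /* state register */
        int acc = 0, data = 0, out = 0; /* datapath storage */

        while (state != S_DONE) {
            switch (state) {            /* one iteration = one clock cycle */
            case S_CLEAR: acc = 0;          state = S_ADD;  break;
            case S_ADD:   data = *input++;  /* status signal: data == 0 */
                          if (data == 0)    state = S_OUT;
                          else { acc += data; state = S_ADD; }
                          break;
            case S_OUT:   out = acc;        state = S_DONE; break;
            default:                        state = S_DONE; break;
            }
        }
        return out;
    }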
2.4 CISC architecture

The primary motivation for developing an architecture of complex-instruction-set computers (CISC)
was to reduce the number of instructions in compiled code, which would in turn minimize the number of memory accesses required for fetching instructions. The motivation was valid in the past, since memories were expensive and much slower than processors. The secondary motivation for CISC development was to simplify compiler construction, by including in the processor instruction set complex instructions that mimic programming language constructs. These complex instructions would reduce the semantic gap between programming and machine languages and simplify compiler construction.

Figure 15: CISC with microprogrammed control. (y)

In order to support a complex instruction set, a CISC usually has a complex datapath, as well as a controller that is microprogrammed, shown in Figure 15, which consists of a Microprogram memory, a Microprogram counter (MicroPC), and the Address selection logic. Each word in the microprogram memory represents one control word, such as the one shown in Figure 12, that contains the values of all the datapath control signals for one clock cycle. This means that each bit in the control word represents the value of one datapath control line, used for loading a register or selecting an operation in the ALU, for example. Furthermore, each processor instruction consists of a sequence of control words. When such an instruction is fetched from the Memory, it is stored first in the Instruction register, and then used by the Address selection logic to determine the starting address of the corresponding control-word sequence in the Microprogram memory. After this starting address has been loaded into the MicroPC, the corresponding control word will be fetched from the Microprogram memory, and used to transfer the data in the datapath from one register to another. Since the MicroPC is concurrently incremented to point to the next control word, this procedure will be repeated for each control word in the sequence. Finally, when the last control word is being executed, a new instruction will be fetched from the Memory, and the entire process will be repeated.
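The fetch-and-sequence behavior of the microprogrammed controller can be summarized in a short C sketch. This is only an illustration of the mechanism just described, with hypothetical helper names (start_address, apply_to_datapath, is_last_word), not an actual microarchitecture.

    typedef unsigned short control_word;   /* e.g., the 9-bit word of Figure 12 */

    extern control_word microprogram_memory[];       /* Microprogram memory     */
    extern unsigned start_address(unsigned opcode);  /* Address selection logic */
    extern void apply_to_datapath(control_word cw);  /* drive the control lines */
    extern int is_last_word(control_word cw);        /* end of the sequence?    */

    void execute_instruction(unsigned opcode)  /* opcode from Instruction reg. */
    {
        unsigned micro_pc = start_address(opcode);   /* load the MicroPC */
        control_word cw;
        do {
            cw = microprogram_memory[micro_pc++];    /* fetch word, bump MicroPC */
            apply_to_datapath(cw);                   /* one clock cycle of work  */
        } while (!is_last_word(cw));                 /* then fetch the next instruction */
    }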
From this description, we can see that the number of control words, and thus the number of clock cycles, can vary for each instruction. As a result, instruction pipelining can be difficult to implement in CISCs. In addition, the relatively slow microprogram memory requires a clock cycle to be longer than necessary. Since instruction pipelines and short clock cycles are necessary for fast program execution, CISC architectures may not be well-suited for high-performance processors.

Although a variety of complex instructions could be executed by a CISC architecture, program-execution statistics have shown that the instructions used most frequently tend to be simple, with only a few addressing modes and data types. Statistics have also shown that the most complex instructions were seldom or never used. This low usage of complex instructions can be attributed to the slight semantic differences between programming language constructs and available complex instructions, as well as the difficulty in mapping language constructs into such complex instructions. Because of this difficulty, complex instructions are seldom used in optimizing compilers for CISC processors, thus reducing the usefulness of CISC architectures.

The steadily declining prices of memories and their increasing speeds have made compactly-coded programs and complex instruction sets unnecessary for high-performance computing. In addition, complex instruction sets have made construction of optimizing compilers for the CISC architecture too costly. For these two reasons, the CISC architecture was displaced in favor of the RISC architecture.

2.5 RISC architecture

In contrast to the CISC architecture, the architecture of a reduced-instruction-set computer (RISC) is optimized to achieve short clock cycles, small numbers of cycles per instruction, and efficient pipelining of instruction streams. As shown in Figure 16, the datapath of a RISC processor generally consists of a large register file and an ALU. A large register file is necessary since it contains all the operands and the results for program computation. The data is brought to the register file by load instructions and returned to the memory by store instructions.
The larger the register file is, the smaller the number of load and store instructions in the code. When the RISC executes an instruction, the instruction pipeline begins by fetching an instruction into the Instruction register. In the second pipeline stage the instruction is then decoded and the appropriate operands are fetched from the Register file. In the third stage, one of two things occurs: the RISC either executes the required operation in the ALU, or, alternatively, computes the address for the Data cache. In the fourth stage the data is stored in either the Data cache or in the Register file. Note that the execution of each instruction takes only approximately four clock cycles, which means that the instruction pipeline is short and efficient, losing very few cycles in the case of data or branch dependencies.

Figure 16: RISC with hardwired control. (y)

We should also note that, since all the operands are contained in the register file, and only simple addressing modes are used, we can simplify the design of the datapath as well. In addition, since each operation can be executed in one clock cycle and each instruction in four, the control unit remains simple and can be implemented with random logic, instead of microprogrammed control. Overall, this simplification of the control and datapath in the RISC results in a short clock cycle and, ultimately, higher performance.

It should also be pointed out, however, that the greater simplicity of RISC architectures requires a more sophisticated compiler. For example, a RISC design does not stop the instruction pipeline whenever instruction dependencies occur, which means that the compiler is responsible for generating dependency-free code, either by delaying the issue of instructions or by reordering them. Furthermore, due to the fact that the number of instructions is reduced, the RISC compiler will need to use a sequence of RISC instructions in order to implement complex operations. At the same time, of course, although these features require more sophistication in the compiler, they also give the compiler a great deal of flexibility in performing aggressive optimization.

Finally, we should note that RISC programs tend to require 20% to 30% more program memory, due to the lack of complex instructions. However, since simpler instruction sets can make compiler design and running time much shorter, the efficiency of the compiled code is ultimately much higher. In addition, because of these simpler instruction sets, RISC processors tend to require less silicon area and a shorter design cycle than their CISC counterparts.

2.6 VLIW architecture

A very-long-instruction-word computer (VLIW) exploits parallelism by using multiple functional units in its datapath, all of which execute in a lock-step manner under one centralized control. A VLIW instruction contains one field for each functional unit, and each field of a VLIW instruction specifies the addresses of the source and destination operands, as well as the operation to be performed by the functional unit. As a result, a VLIW instruction is usually very wide, since it must contain approximately one standard instruction for each functional unit.
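A C sketch of such an instruction word might look as follows; this is purely illustrative, since real VLIW encodings are bit-packed fields rather than structs.

    /* One field per functional unit, as described above. */
    struct vliw_field {
        unsigned op;          /* operation for this functional unit   */
        unsigned src1, src2;  /* register-file addresses of operands  */
        unsigned dst;         /* register-file address of the result  */
    };

    /* A VLIW instruction for a datapath like that of Figure 17:
     * two ALUs and two multipliers, all issued in lock step. */
    struct vliw_instruction {
        struct vliw_field alu[2];
        struct vliw_field mul[2];
    };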
Figure 17: An example of VLIW datapath. (y)

In Figure 17, we see an example of a VLIW datapath, consisting of four functional units, namely two ALUs and two multipliers, a register file and a memory. In order to utilize all the four functional units, the register file in this example has 16 ports: eight output ports, which supply operands to the functional units, four input ports, which store the results obtained from the functional units, and four input/output ports, designed to allow communication with the memory.
What is interesting to note here is that, ideally, the VLIW in Figure 17 would provide four times the performance we could get from a processor with a single functional unit, under the assumption that the code executing on the VLIW had four-way parallelism, which enables the VLIW to execute four independent instructions in each clock cycle. In reality, however, most code has a large amount of parallelism interleaved with code that is fundamentally serial. As a result, a VLIW with a large number of functional units might not be fully utilized. The ideal conditions would also require us to assume that all the operands were in the register file, with eight operands being fetched and four results stored back on every clock cycle, in addition to four new operands being brought from the memory to be available for use in the next clock cycle. It must be noted, however, that this computation profile is not easy to achieve, since some results must be stored back to memory and some results may not be needed in the next clock cycle. Under these conditions, the efficiency of a VLIW datapath might be less than ideal.

Finally, we should point out that there are two technological limitations that can affect the implementation of a VLIW architecture. First, while register files with 8-16 ports can be built, the efficiency and performance of such register files tend to degrade quickly when we go beyond that number. Second, since VLIW program and data memories require a high communication bandwidth, these systems tend to require expensive high-pin packaging technology as well. Overall, these are the reasons why VLIW architectures are not as popular as RISC architectures.

2.7 Parallel architecture

In the design of parallel processors, we can take advantage of spatial parallelism by using multiple processing elements (PEs) that work concurrently. In this type of architecture, each PE may contain its own datapath with registers and a local memory. Two typical types of parallel processors are the SIMD (single instruction multiple data) and the MIMD (multiple instruction multiple data) processors.

In SIMD processors, usually called array processors, all of the PEs execute the same instruction in a lock-step manner. To broadcast the instructions to all the PEs and to control their execution, we generally use a single global controller. Usually, an array processor is attached to a host processor, which means that it can be thought of as a kind of hardware accelerator for tasks that are computationally intensive. In such cases, the host processor would load the data into each PE, and then collect the results after the computations are finished. When it is necessary, PEs can also communicate directly with their nearest neighbors.

The primary advantage of array processors is that they are very convenient for computations that can be naturally mapped on a rectangular grid, as in the case of image processing, where an image is decomposed into pixels on a rectangular grid, or in the case of weather forecasting, where the surface of the globe is decomposed into n-by-n-mile squares. Programming one grid point in the rectangular array processor is quite easy, since all the PEs execute the same instruction stream. However, programming any data routing through the array is very difficult, since the programmer would have to be aware of the positions of all the data on every clock cycle. For this reason, problems like matrix triangulations or inversions are difficult to program on an array processor.

Array processors, then, are easy to build and easy to program, but only when the natural structure of the problem matches the topology of the array processor. As a result, they cannot be considered general-purpose machines, because users have difficulty writing programs for general classes of problems.

An MIMD processor, usually called a multiprocessor system, differs from an SIMD in that each PE executes its own instruction stream. In this kind of architecture, the program can be loaded by a host processor, or each processor can load its own program from a shared memory. Each processor can communicate with every other processor within the multiprocessor system, using one of two communication mechanisms. In a shared-memory multiprocessor, all the processors are connected to a shared memory through an interconnection network, which means that each processor can access any data in the shared memory. In a message-passing multiprocessor, on the other hand, each processor tends to have a large local memory, and sends data to other processors in the form of messages through an interconnection network. The interconnection network for a shared memory must be fast, since it is very frequently used to communicate small amounts of data, like a single word. In contrast, the interconnection network used for message passing tends to be much slower, since it is used less frequently and communicates long messages, including many words of data. Finally, it should be noted that multiprocessors are much easier to program, since they are task-oriented instead of instruction-oriented. Each task runs independently and can be synchronized after completion, if necessary.
Thus, multiprocessors make program and data partitioning, code parallelization and compilation much simpler than array processors.
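The difference between the two communication mechanisms can be sketched in C. The functions below are hypothetical placeholders (shared_mem, msg_send and msg_receive are not a real API); they only contrast a word-sized shared-memory access with an explicit long-message transfer.

    /* Shared-memory style: any PE reads or writes single words directly. */
    extern volatile int shared_mem[];   /* global memory behind the network */

    int  get_flag(void)   { return shared_mem[0]; }  /* frequent, small */
    void set_flag(int v)  { shared_mem[0] = v; }

    /* Message-passing style: PEs exchange long messages explicitly. */
    extern void msg_send(int dest_pe, const int *buf, int nwords);
    extern void msg_receive(int src_pe, int *buf, int nwords);

    void exchange_block(int peer, int *block, int nwords)
    {
        msg_send(peer, block, nwords);      /* infrequent, many words */
        msg_receive(peer, block, nwords);
    }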
Such a multiprocessor, in which the interconnection network consists of several buses, is shown in Figure 18. Each processing element (PE) consists of a processor or ASIC and a local memory connected by the local bus. The shared or global memory may be either single-port, dual-port, or special-purpose memory such as a FIFO. The PEs and global memories are connected by one or more system buses via corresponding interfaces. The system bus is associated with a well-defined protocol which the components on the bus have to respect. The protocol may be standard, such as VME, or custom. An interface bridges the gap between a local bus of a PE/memory and the system buses.

Figure 18: A heterogeneous multiprocessor

The heterogeneous architecture is a superset of all previous architectures, and it can be customized for a particular application to achieve the best cost-performance trade-off. Figure 19 shows some typical configurations.

Figure 19(a) shows a simple embedded processor system with an IO device. The IO device directly communicates with the processor on the processor bus via an interface. Figure 19(b) shows a shared-memory system where two PEs are connected to a global memory via a system bus. The PEs can be either processor systems or ASIC systems, each of which may contain its own local bus and memory subsystem. Figures 19(c) and (d) show two types of message-passing systems where two PEs communicate via a channel. The former can perform asynchronous communication given dedicated devices such as a FIFO. The latter can perform synchronous communication if proper handshaking between the two PEs is performed via the system bus.

Figure 19: Some typical configurations: (a) standard processor, (b) shared memory, (c) non-blocking message passing, (d) blocking message passing.

3 Languages

3.1 Introduction

A system can be described at any one of several distinct levels of abstraction, each of which serves a particular purpose. By describing a system at the logic level, for example, designers can verify detailed timing as well as functionality. Alternatively, at the architectural level, the complex interaction among system components such as processors, memories, and ASICs can be verified. Finally, at the conceptual level, it is possible to describe the system's functionality without any notion of its components.
Descriptions at such a level can serve as the specification of the system for designers to work on. Increasingly, designers need to conceptualize the system using an executable specification language, which is capable of capturing the functionality of the system in a machine-readable and simulatable form.

Such an approach has several advantages. First, simulating an executable specification allows the designer to verify the correctness of the system's intended functionality. In the traditional approach, which started with a natural-language specification, such verification would not be possible until enough of the design had been completed to obtain a simulatable system description (usually gate-level schematics). The second advantage of this approach is that the specification can serve as an input to codesign tools, which, in turn, can be used to obtain an implementation of the system, ultimately reducing design times by a significant amount. Third, such a specification can serve as comprehensive documentation, providing an unambiguous description of the system's intended functionality. Finally, it also serves as a good medium for the exchange of design information among various users and tools. As a result, some of the problems associated with system integration can be minimized, since this approach would emphasize well-defined system components that could be designed independently by different designers.

The increasing design complexity associated with systems-on-a-chip also makes an executable modeling language extremely desirable, so that an intermediate implementation can be represented and validated before proceeding to the next synthesis step. For the same reason, we need such a modeling language to be able to describe design artifacts from previous designs and intellectual properties (IP) provided by other sources.

Since different conceptual models possess different characteristics, any given specification language can be well or poorly suited for that model, depending on whether it supports all or just a few of the model's characteristics. To find the language that can capture a given conceptual model directly, we would need to establish a one-to-one correlation between the characteristics of the model and the constructs in the language.

3.2 Characteristics of system models

In this section, we will present some of the characteristics most commonly found in modeling systems. In presenting these characteristics, part of our goal will be to assess how useful each characteristic is in capturing one or more types of system behavior.

3.3 Concurrency

Any system can be decomposed into chunks of functionality called behaviors, each of which can be described in several ways, using the concepts of processes, procedures or state machines. In many cases, the functionality of a system is most easily conceptualized as a set of concurrent behaviors, simply because representing such systems using only sequential constructs would result in complex descriptions that can be difficult to comprehend. If we can find a way to capture concurrency, however, we can usually obtain a more natural representation of such systems. For example, consider a system with only two concurrent behaviors that can be individually represented by the finite-state machines F1 and F2. A standard representation of the system would be a cross product of the two finite-state machines, F1 x F2, potentially resulting in a large number of states.
A more elegant solution, then, would be to use a conceptual model that has two or more concurrent finite-state machines, as do the Statecharts [Har87] and many other concurrent languages.

Concurrency representations can be classified into two groups, data-driven or control-driven, depending on how explicitly the concurrency is indicated. Furthermore, a special class of data-driven concurrency called pipelined concurrency is of particular importance to signal processing applications.

Data-driven concurrency: Some behaviors can be clearly described as sets of operations or statements without specifying any explicit ordering for their execution. In a case like this, the order of execution would be determined only by data dependencies between them. In other words, each operation will perform a computation on input data, and then output new data, which will, in turn, be input to other operations. Operation executions in such dataflow descriptions depend only upon the availability of data, rather than upon the physical location of the operation or statement in the specification. Dataflow representations can be easily described from programming languages using the single assignment rule, which means that each variable can appear exactly once on the left hand side of an assignment statement.

Figure 20: Data-driven concurrency: (a) dataflow statements (1: q = a + b; 2: y = p + x; 3: p = (c - d) * q), (b) dataflow graph generated from (a). (y)

Consider, for example, the single assignment statements in Figure 20(a). As in any other data-driven execution, it is of little consequence that the assignment to p follows the statement that uses the value of p to compute the value of y. Regardless of the sequence of the statements, the operations will be executed solely as determined by the availability of data, as shown in the dataflow graph of Figure 20(b). Following this principle, we can see that, since a, b, c and d are primary inputs, the add and subtract operations in statements 1 and 3 will be carried out first. The results of these two computations will provide the data required for the multiplication in statement 3. Finally, the addition in statement 2 will be performed to compute y.

Pipelined concurrency: The dataflow description in the previous section can be viewed as a set of operations which consume data from their inputs and produce data on their outputs. Since the execution of each operation is determined by the availability of its input data, the degree of concurrency that can be exploited is limited by data dependencies. However, when the same dataflow operations are applied to a stream of data samples, we can use pipelined concurrency to improve the throughput, that is, the rate at which the system is able to process the data stream. Such throughput improvement is achieved by dividing the operations into groups, called pipeline stages, which operate on different data sets in the stream. By operating on different data sets, pipeline stages can run concurrently. Note that each stage will take the same amount of time, called a cycle, to compute its results.

For example, Figure 21(a) shows a dataflow graph operating on the data set a(n), b(n), c(n), d(n) and x(n), while producing the data set q(n), p(n) and y(n), where the index n indicates the nth data in the stream, called data sample n. Figure 21(a) can be converted into a pipeline by partitioning the graph into three stages, as shown in Figure 21(b).

In order for the pipeline stages to execute concurrently, storage elements such as registers or FIFO queues have to be inserted between the stages (indicated by thick lines in Figure 21(b)). In this way, while the second stage is processing the results produced by the first stage at the previous cycle, the first stage can simultaneously process the next data sample in the stream. Figure 21(c) illustrates the pipelined execution of Figure 21(b), where each row represents a stage and each column represents a cycle. In the third column, for example, while the first stage is adding a(n+2) and b(n+2), and subtracting c(n+2) and d(n+2), the second stage is multiplying (a(n+1) + b(n+1)) and (c(n+1) - d(n+1)), and the third stage is finishing the computation of the nth sample by adding ((a(n) + b(n)) * (c(n) - d(n))) to x(n).

Figure 21: Pipelined concurrency: (a) original dataflow, (b) pipelined dataflow, (c) pipelined execution.
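The pipelined execution of Figure 21(c) can be mimicked in C. Below is a small sketch, not from the paper: the inter-stage registers are modeled as struct members that are read at the start of a cycle and rewritten at its end, so three samples are in flight at once.

    /* Inter-stage registers (the thick lines in Figure 21(b)). */
    struct pipe_regs {
        int sum1, diff1;   /* stage 1 -> stage 2: a+b and c-d      */
        int p2, x2;        /* stage 2 -> stage 3: product and x    */
        int x1;            /* x delayed by one cycle alongside stage 2 */
    };

    /* Execute one cycle: sample n+2 enters stage 1 while stage 3
     * completes sample n.  Returns y(n). */
    int pipeline_cycle(struct pipe_regs *r,
                       int a, int b, int c, int d, int x)
    {
        int y = r->p2 + r->x2;          /* stage 3 */
        r->p2   = r->sum1 * r->diff1;   /* stage 2 */
        r->x2   = r->x1;
        r->sum1  = a + b;               /* stage 1 */
        r->diff1 = c - d;
        r->x1    = x;
        return y;
    }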
Control-driven concurrency: The key concept in control-driven concurrency is the control thread, which can be defined as a set of operations in the system that must be executed sequentially. As mentioned above, in data-driven concurrency, it is the dependencies between operations that determine the execution order. In control-driven concurrency, by contrast, it is the control thread or threads that determine the order of execution. In other words, control-driven concurrency is characterized by the use of explicit constructs that specify multiple threads of control, all of which execute in parallel.

Control-driven concurrency can be specified at the task level, where constructs such as fork-joins and processes can be used to specify concurrent execution of operations. Specifically, a fork statement creates a set of concurrent control threads, while a join statement waits for the previously forked control threads to terminate. The fork statement in Figure 22(a), for example, spawns three control threads A, B and C, all of which execute concurrently. The corresponding join statement must wait until all three threads have terminated, after which the statements in R can be executed. In Figure 22(b), we can see how process statements are used to specify concurrency. Note that, while a fork-join statement starts from a single control thread and splits it into several concurrent threads as shown in Figure 22(c), a process statement represents the behavior as a set of concurrent threads, as shown in Figure 22(d). For example, the process statements of Figure 22(b) create three processes A, B and C, each representing a different control thread. Both fork-join and process statements may be nested, and both approaches are equivalent to each other in the sense that a fork-join can be implemented using nested processes and vice versa.

    (a) sequential behavior X        (b) concurrent behavior X
        begin                            begin
          Q();                             process A();
          fork A(); B(); C(); join;        process B();
          R();                             process C();
        end behavior X;                  end behavior X;

Figure 22: Control-driven concurrency: (a) fork-join statement, (b) process statement, (c) control threads for fork-join statements, (d) control threads for process statement. (y)
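As an illustration (not from the paper), the fork-join of Figure 22(a) maps naturally onto POSIX threads in C, with pthread_create playing the role of fork and pthread_join the role of join:

    #include <pthread.h>

    void Q(void); void R(void);                        /* sequential parts */
    void *A(void *); void *B(void *); void *C(void *); /* forked threads   */

    void behavior_X(void)
    {
        pthread_t ta, tb, tc;
        Q();
        pthread_create(&ta, NULL, A, NULL);  /* fork A, B and C */
        pthread_create(&tb, NULL, B, NULL);
        pthread_create(&tc, NULL, C, NULL);
        pthread_join(ta, NULL);              /* join: wait for all three */
        pthread_join(tb, NULL);
        pthread_join(tc, NULL);
        R();                                 /* runs only after the join */
    }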
3.4 State transitions

Systems are often best conceptualized as having various modes, or states, of behavior, as in the case of controllers and telecommunication systems. For example, a traffic-light controller [DH89] might incorporate different modes for day and night operation, for manual and automatic functioning, and for the status of the traffic light itself.

In systems with various modes, the transitions between these modes sometimes occur in an unstructured manner, as opposed to a linear sequencing through the modes. Such arbitrary transitions are akin to the use of goto statements in programming languages. For example, Figure 23 depicts a system that transitions between modes P, Q, R, S and T, with the sequencing determined solely by certain conditions. Given a state machine with N states, there can be N x N possible transitions among them.

Figure 23: State transitions between arbitrarily complex behaviors. (y)

In systems like this, transitions between modes can be triggered by the detection of certain events or certain conditions. For example, in Figure 23, the transition from state P to state Q will occur whenever event u happens while in P. In some systems, actions can be associated with each transition, and a particular mode or state can have an arbitrarily complex behavior or computation associated with it. In the case of the traffic-light controller, for example, in one state it may simply be sequencing between the red, yellow and green lights, while in another state it may be executing an algorithm to determine which lane of traffic has a higher priority based on the time of the day and the traffic density. In simple (Section 1.3) and hierarchical (Section 1.6) finite-state machine models, simple assignment statements, such as x = y + 1, can be associated with a state. In the PSM model (Section 1.8), any arbitrary program with iteration and branching constructs can be associated with a state.

3.5 Hierarchy

One of the problems we encounter with large systems is that they can be too complex to be considered in their entirety. In such cases, we can see the advantage of hierarchical models. First, since hierarchical models allow a system to be conceptualized as a set of smaller subsystems, the system modeler is able to focus on one subsystem at a time. This kind of modular decomposition of the system greatly simplifies the development of a conceptual view of the system. Furthermore, once we arrive at an adequate conceptual view, the hierarchical model greatly facilitates our comprehension of the system's functionality. Finally, a hierarchical model provides a mechanism for scoping objects, such as declaration types, variables and subprogram names. Since a lack of hierarchy would make all such objects global, it would be difficult to relate them to their particular use in the model, and it could hinder our efforts to reuse these names in different portions of the same model.

There are two distinct types of hierarchy, structural hierarchy and behavioral hierarchy, both of which are commonly found in conceptual views of systems.

Structural hierarchy: A structural hierarchy is one in which a system specification is represented as a set of interconnected components. Each of these components, in turn, can have its own internal structure, which is specified with a set of lower-level interconnected components, and so on. Each instance of an interconnection between components represents a set of communication channels connecting the components. The advantage of a model that can represent a structural hierarchy is that it can help the designer to conceptualize new components from a set of existing components.

Figure 24: Structural hierarchy. (y)

This kind of structural hierarchy in systems can be specified at several different levels of abstraction. For example, a system can be decomposed into a set of processors and ASICs communicating over buses in a parallel architecture. Each of these chips may consist of several blocks, each representing an FSMD architecture. Finally, each RT component in the FSMD architecture can be further decomposed into a set of gates, while each gate can be decomposed into a set of transistors. In addition, we should note that different portions of the system can be conceptualized at different levels of abstraction, as in Figure 24, where the processor has been structurally decomposed into a datapath represented as a set of RT components, and into its corresponding control logic represented as a set of gates.
Behavioral hierarchy: The specification of a behavioral hierarchy is defined as the process of decomposing a behavior into distinct subbehaviors, which can be either sequential or concurrent.

The sequential decomposition of a behavior may be represented as either a set of procedures or a state machine. In the first case, a procedural sequential decomposition of a behavior is defined as the process of representing the behavior as a sequence of procedure calls. Even in the case of a behavior that consists of a single set of sequential statements, we can still think of that behavior as comprising a procedure which encapsulates those statements. A procedural sequential decomposition of behavior P is shown in Figure 25(a), where behavior P consists of a sequential execution of the subbehaviors represented by procedures Q and R. Behavioral hierarchy would be represented here by nested procedure calls. Recursion in procedures allows us to specify a dynamic behavioral hierarchy, which means that the depth of the hierarchy will be determined only at run time.

Figure 25: Sequential behavioral decomposition: (a) procedures ("behavior P variable x, y; begin Q(x); R(y); end P;"), (b) state-machines. (y)

Figure 25(b) shows a state-machine sequential decomposition of behavior P. In this diagram, P is decomposed into two sequential subbehaviors Q and R, each of which is represented as a state in a state-machine. This state-machine representation conveys hierarchy by allowing a subbehavior to be represented as another state-machine itself. Thus, Q and R are state-machines, so they are decomposed further into sequential subbehaviors. The behaviors at the bottom level of the hierarchy, including Q1, ... R2, are called leaf behaviors.

In a sequentially decomposed behavior, the subbehaviors can be related through several types of transitions: simple transitions, group transitions and hierarchical transitions. A simple transition is similar to that which connects states in an FSM model, in that it causes control to be transferred between two states that both occupy the same level of the behavioral hierarchy. In Figure 25(b), for example, the transition triggered by event e1 transfers control from behavior Q1 to Q2. Group transitions are those which can be specified for a group of states, as is the case when event e5 causes a transition from any of the subbehaviors of Q to the behavior R. Hierarchical transitions are those (simple or group) transitions which span several levels of the behavioral hierarchy. For example, the transition labeled e6 transfers control from behavior Q3 to behavior R1, which means that it must span two hierarchical levels. Similarly, the transition labeled e7 transfers control from Q to state R2, which is at a lower hierarchical level.

For a sequentially decomposed behavior, we must explicitly specify the initial subbehavior that will be activated whenever the behavior is activated. In Figure 25(b), for example, R is the first subbehavior that is active whenever its parent behavior P is activated, since a solid triangle points to this first subbehavior. Similarly, Q1 and R1 would be the initial subbehaviors of behaviors Q and R, respectively.

The concurrent decomposition of behaviors allows subbehaviors to run in parallel or in pipelined fashion.

Figure 26: Behavioral decomposition types: (a) sequential, (b) parallel, (c) pipelined.

Figure 26 shows a behavior X consisting of three subbehaviors A, B and C. In Figure 26(a) the subbehaviors are running sequentially, one at a time, in the order indicated by the arrows.
In Figure 26(b), A, B and C run in parallel, which means that they will start when X starts, and when all of them finish, X will finish, just like the fork-join construct discussed in Section 3.3. In Figure 26(c), A, B and C run in pipelined mode, which means that they represent pipeline stages which run concurrently, where A supplies data to B and B to C, as discussed in Section 3.3.

3.6 Programming constructs

Many behaviors can best be described as sequential algorithms. Consider, for example, the case of a system intended to sort a set of numbers stored in an array, or one designed to generate a set of random numbers. In such cases, if the system designer manages to decompose the behavior hierarchically into smaller and smaller subbehaviors, he will eventually reach a stage where the functionality of a subbehavior can be most directly specified by means of an algorithm.

The advantage of using such programming constructs to specify a behavior is that they allow the system modeler to specify an explicit sequencing for the computations in the system. Several notations exist for describing algorithms, but programming language constructs are most commonly used. These constructs include assignment statements, branching statements, iteration statements and procedures. In addition, data types such as records, arrays and linked lists are usually helpful in modeling complex data structures.

    1  int buf[10], i, j;
    2
    3  for( i = 0; i < 10; i ++ )
    4    for( j = 0; j < i; j ++ )
    5      if( buf[i] > buf[j] )
    6        swap( &buf[i], &buf[j] );

Figure 27: Code segment for sorting.

Figure 27 shows how we would use programming constructs to specify a behavior that sorts a set of ten integers into descending order. Note that the procedure swap exchanges the values of its two parameters.

3.7 Behavioral completion

Behavioral completion refers to a behavior's ability to indicate that it has completed, as well as to the ability of other behaviors to detect this completion. A behavior is said to have completed when all the computations in the behavior have been performed, and all the variables that have to be updated have had their new values written into them.

In the finite-state machine model, we usually designate an explicitly defined set of states as final states. This means that, for a state machine, completion will have occurred when control flows to one of these final states, as shown in Figure 28(a).

In cases where we use programming language constructs, a behavior will be considered complete when the last statement in the program has been executed. For example, whenever control flows to a return statement, or when the last statement in the procedure is executed, a procedure is said to be complete.

Figure 28: Behavioral completion: (a) finite-state machine, (b) program-state machine, (c) a single level view of the program-state X, (d) decomposition into sequential subbehaviors. (y)

The PSM model denotes completion using a special predefined completion point. When control flows to this completion point, the program-state enclosing it is said to have completed, at which point the transition-on-completion (TOC) arc, which can be traversed only when the source program-state has completed, could now be traversed.
For example, consider the program-state machine in Figure 28(b). In this diagram, the behavior of leaf program-states such as X1 has been described with programming constructs, which means that their completion will be defined in terms of their execution of the last statement. The completion point of the program-state machine for X has been represented as a bold square. When control flows to it from program-state X2 (i.e., when the arc labeled by event e2 is traversed), the program-state X will be said to have completed. Only then can event e5 cause a TOC transition to program-state Y. Similarly, program-state B will be said to have completed whenever control flows along the TOC arc labeled e4 from program-state Y to the completion point for B.

The specification of behavioral completion has two advantages. First, in hierarchical specifications, completion helps designers to conceptualize each hierarchical level, and to view it as an independent module, free from interference from inter-level transitions. Figure 28(c), for example, shows how the program-state X in Figure 28(b) would look by itself, in isolation from the larger system. Having decomposed the functionality of X into the program-substates X1, X2 and X3, the system modeler does not have to be concerned with the effects of the completion transition labeled by event e5. From this perspective, the designer can develop the program-state machine for X independently, with its own completion point (the transition labeled e2 from X2). The second advantage of specifying behavioral completion is that the concept allows the natural decomposition of a behavior into subbehaviors which are then sequenced by the "completion" transition arcs. For example, Figure 28(d) shows how we can split an application which sorts a list of numbers into three distinct, yet meaningful subbehaviors: ReadList, SortList and OutputList. Since TOC arcs sequence these behaviors, the system requires no additional events to trigger the transitions between them.

3.8 Exception handling

Often, the occurrence of a certain event can require that a behavior or mode be interrupted immediately, thus prohibiting the behavior from updating values further. Since the computations associated with any behavior can be complex, taking an indefinite amount of time, it is crucial that the occurrence of the event, or exception, should terminate the current behavior immediately rather than having to wait for the computation to complete. When such exceptions arise, the next behavior to which control will be transferred is indicated explicitly.

Figure 29: Exception types: (a) abortion, (b) interrupt.

Depending on the direction of transferred control, the exceptions can be further divided into two groups: (a) abortion, where the behavior is terminated, and (b) interrupt, where control is temporarily transferred to other behaviors. An example of an abortion is shown in Figure 29(a), where behavior X is terminated after the occurrence of events e1 or e2. An example of an interrupt is shown in Figure 29(b), where control from behavior X is transferred to Y or Z after the occurrence of e1 or e2 and is returned after their completion.

Examples of such exceptions include resets and interrupts in many computer systems.

3.9 Timing

On many occasions in system specifications, there may be a need to specify detailed timing relations, where a component receives or generates events in specific time ranges, which are measured in real time units such as nanoseconds.

In general, a timing relation can be described by a 4-tuple T = (e1, e2, min, max), where event e1 precedes e2 by at least min time units and at most max time units. When such a timing relation is used with real components it is called a timing delay; when it is used with component specifications it is called a timing constraint.

Such timing information is especially important for describing parts of the system which interact extensively with the environment according to a predefined protocol. The protocol defines the set of timing relations between signals, which both communicating parties have to respect.

A protocol is usually visualized by a timing diagram, such as the one shown in Figure 30 for the read cycle of a static RAM. Each row of the timing diagram shows the waveform of a signal, such as Address, Read, Write and Data in Figure 30. Each dashed vertical line designates an occurrence of an event, such as t1, t2 through t7. There may be timing delays or timing constraints associated with pairs of events, indicated by an arrow annotated with x/y, where x stands for the min time and y stands for the max time. For example, the arrow between t1 and t3 designates a timing delay, which says that Data will be valid at least 10, but no more than 20, nanoseconds after Address is valid.

Figure 30: Timing diagram
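The 4-tuple can be captured directly in C. The following sketch is illustrative only (the type and function names are not from the paper); it checks whether two observed event times satisfy a timing relation such as the t1-to-t3 delay above.

    /* T = (e1, e2, min, max): e1 precedes e2 by min..max time units. */
    struct timing_relation {
        const char *e1, *e2;    /* event names, e.g. "t1", "t3" */
        long min, max;          /* bounds in nanoseconds        */
    };

    /* Returns 1 if the observed occurrence times respect the relation. */
    int satisfied(const struct timing_relation *t, long time_e1, long time_e2)
    {
        long delta = time_e2 - time_e1;
        return delta >= t->min && delta <= t->max;
    }

    /* Example: Data valid 10..20 ns after Address valid. */
    static const struct timing_relation t1_t3 = { "t1", "t3", 10, 20 };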
The timing information is very important for the subset of embedded systems known as real-time systems, whose performance is measured in terms of how well the implementation respects the timing constraints. A favorite example of such systems would be an aircraft controller, where failure to respond to an abnormal event within a predefined timing limit will lead to disaster.

3.10 Communication

In general, systems consist of several interacting behaviors which need to communicate with each other to be cooperative. Thus a general communication model is necessary for system specification.

In traditional programming languages, the standard forms of communication between functions are shared variable access and parameter passing in procedure calls. These mechanisms provide communication in an abstract form. The way the communication is performed is predefined and hidden from the programmer. For example, functions communicate through global variables, which share a common memory space, or via parameter passing. In the case of local procedure calls, parameter passing is implemented by exchanging information on the stack or through processor registers. In the case of remote procedure calls, parameters are passed via the complex protocol of marshaling/unmarshaling and sending/receiving data through a network.

While these mechanisms are sufficient for standard programming languages, they poorly address the needs of systems-on-a-chip descriptions, where the way the communication is performed is often custom and impossible to predefine. For example, in telecommunication applications the major task of modeling the system is actually describing custom communication procedures. Hence, it is very important for a system description language to provide the ability to redefine or extend the standard form of communication.

As an analogy, the need for a general mechanism to specify communication is the same as the need in computation to generalize operators like + and * into functions, which provide a general mechanism to define custom computation. In the absence of such a mechanism, designers tend to mix the behavior responsible for communication with the behavior for computation, which results in the loss of modularity and reusability. It follows that we need:

(a) a mechanism to separate the specification of computation and communication;

(b) a mechanism to declare abstract communication functions in order to describe what they are and how they can be used;

(c) a mechanism to define a custom communication implementation which describes how the communication is actually performed.

In order to find a general communication model, the structure of a system must be defined. A system's structure consists of a set of blocks which are interconnected through a set of communication channels. While the behavior in the blocks specifies how the computation is performed and when the communication is started, the channels encapsulate the communication implementation. In this way blocks and channels effectively separate the specification of communication and computation.

Each block in a system contains a behavior and a set of ports through which the behavior can communicate. Each channel contains a set of communication functions and a set of interfaces. An interface declares a subset of the functions of the channel, which can be used by the connected behaviors. So while the declaration of the communication functions is given in the interfaces, the implementation of these functions is specified in the channel.

Figure 31: Communication model.
For example, the system shown in Figure 31 contains two blocks B1 and B2, and a channel C. Block B1 communicates with the left interface I1 of channel C via its port P1. Similarly, block B2 accesses the right interface I2 of channel C through its port P2. Note that blocks B1 and B2 can be easily replaced by other blocks as long as the port types stay the same. Similarly, channel C can be exchanged with any other channel that provides compatible interfaces.

More specifically, a channel serves as an encapsulator of a set of communication media in the form of variables, and a set of methods in the form of functions that operate on these variables. The methods specify how data is transferred over the channel. All accesses to the channel are restricted to these methods.

For example, Figure 32 shows three communication examples. Figure 32(a) shows two behaviors communicating via a shared variable M. Figure 32(b) shows a similar situation using the channel model. In fact, communication through shared memory is just a special case of the general channel model. The channel C from Figure 32(b) can be implemented as a shared memory channel as shown in Figure 33. Here the methods receive and send provide the restricted accesses to the variables M and valid.

Figure 32: Examples of communication: (a) shared memory, (b) channel, (c) hierarchical channel.
    1  channel C {
    2    bool valid;
    3    int M;
    4
    5    int receive( void ) {
    6      while( valid == 0 );
    7      return M;
    8    }
    9    void send( int a ) {
    10     M = a;
    11     valid = 1;
    12   }
    13 };

Figure 33: Integer channel.

A channel can also be hierarchical, as shown in Figure 32(c). In the example, channel C1 implements a high-level communication protocol which breaks a stream of data packets into a byte stream at the sender side, or assembles the byte stream into a packet stream at the receiver side. C1 in turn uses a lower-level channel C2, for example a synchronous bus, which transfers the byte stream produced by C1.

The adoption of the mechanisms discussed above achieves information hiding, since the media and the way the communication is implemented are hidden. Also the modeling complexity is reduced, since the user only needs to make function calls to the methods. This model also encourages the separation of computation and communication, since the functionality responsible for communication can be confined in the channel specification and will not be mixed with the description used for computation.
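For illustration (our own sketch, following the notation of Figure 32(b)), the two behaviors would use the channel C of Figure 33 only through its methods, keeping the media M and valid hidden:

    /* behavior B1 (sender)  */       /* behavior B2 (receiver) */
    int x;                            int y;
    ...                               ...
    C.send( x );                      y = C.receive();
    /* sets M = x, valid = 1 */       /* spins until valid == 1, reads M */

Note that in this simple channel, valid is never reset; a channel intended for repeated transfers would also need to clear it inside receive.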
B1 B2 B1 C B2
int x; void send(int d) int y;
int x; int y; ... { ... } ...
... int M; ... C.send(x); int receive(void) y=C.receive();
M = x; y = M; ... { ... } ...
... ...

(a) (b)

B1 B2
C1
P1 C2 P2

(c)

Figure 32: Examples of communication: (a) shared memory, (b) channel, (c) hierarchical channel.

Consider, for example, a simple synchronous bus protocol: communication is initiated by asserting the start and rw signals in the first cycle, supplying the address in the second cycle, and then supplying data in the following cycles. The communication will terminate when the start signal is deasserted. The description of the protocol is shown in Figure 35. The description encapsulates the communication media, in this case the signals clk, start, rw and AD, and a set of methods, in this case read_cycle and write_cycle, which implement the communication protocol for reading and writing the data as described in the timing diagram. The call clk.wait() synchronizes the signal assignments with the bus clock clk.

1  channel bus() {
2    clock clk;
3    clocked bit start;
4    clocked bit rw;
5    clocked bit AD;
6
7    word read_cycle( word addr ) {
8      word d;
9      start = 1, rw = 1, clk.wait();
10     AD = addr, clk.wait();
11     d = AD, clk.wait();
12     start = 0, rw = 0, clk.wait();
13     return d;
14   }
15   void write_cycle( word a, word d ) {
16     start = 1, rw = 0, clk.wait();
17     AD = a, clk.wait();
18     AD = d, clk.wait();
19     start = 0, clk.wait();
20   }
21 }

Figure 35: Protocol description of the synchronous bus protocol.

3.11 Process synchronization

In a system that is conceptualized as several concurrent processes, the processes are rarely completely independent of each other. Each process may generate data and events that need to be recognized by other processes. In cases like these, when the processes exchange data or when certain actions must be performed by different processes at the same time, we need to synchronize the processes in such a way that one process is suspended until the other reaches a certain point in its execution. Common synchronization methods fall into two classifications, namely control-dependent and data-dependent schemes.

Control-dependent synchronization: In control-dependent synchronization techniques, it is the control structure of the behavior that is responsible for synchronizing two processes in the system. For example, the fork-join statement introduced in Section 3.5 is an instance of such a control construct. Figure 36(a) shows a behavior X which forks into three concurrent subprocesses, A, B and C. In Figure 36(b) we see how these distinct execution streams for the behavior X are synchronized by a join statement, which ensures that the three processes spawned by the fork statement all complete before R is executed. Another example of control-dependent synchronization is the technique of initialization, in which processes are synchronized to their initial states either the first time the system is initialized, as is the case with most HDLs, or during the execution of the processes. In the Statecharts [DH89] of Figure 36(c), we can see how the event e, associated with a transition arc that reenters the boundary of ABC, is designed to synchronize all the orthogonal states A, B and C into their default substates. Similarly, in Figure 36(d), event e causes B to initialize to its default substate B1 (since AB is exited and then reentered), at the same time transitioning A from A1 to A2.
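Mirroring the pseudocode of Figure 36(a) (behavior X: Q; fork A, B, C; join; R), a fork-join could be sketched in C++ with threads, where join provides the synchronization point; the behavior bodies below are empty placeholders:

    #include <thread>

    // Placeholder behaviors; in a real model these carry actual computation.
    void Q() {}  void A() {}  void B() {}  void C() {}  void R() {}

    void X() {
        Q();
        std::thread ta(A), tb(B), tc(C);   // fork: three concurrent streams
        ta.join(); tb.join(); tc.join();   // join: all three must complete...
        R();                               // ...before R is executed
    }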
Data-dependent synchronization: In addition to these techniques of control-dependent synchronization, processes may also be synchronized by means of one of the methods for interprocess communication: shared memory or message passing, as mentioned in Section 3.10.

Shared-memory based synchronization works by making one of the processes suspend until the other process has updated the shared memory with an appropriate value. In such cases, the variable in the shared memory might represent an event, a data value or the status of another process in the system, as is illustrated in Figure 37 using the Statecharts language.

Synchronization by common event requires one process to wait for the occurrence of a specific event, which can be generated externally or by another process. In Figure 37(a), we can see how event e is used for synchronizing states A and B into substates A2 and B2, respectively. Another method is that of synchronization by common variable, which requires one of the processes to update the variable with a suitable value. In Figure 37(b), B is synchronized
into state B2 when we assign the value "1" to variable x in state A2.

Still another method is synchronization by status detection, in which a process checks the status of other processes before resuming execution. In a case like this, the transition from A1 to A2, precipitated by event e, would cause B to transition from B1 to B2, as shown in Figure 37(c).

Figure 36: Control synchronization: (a) behavior X with a fork-join, (b) synchronization of execution streams by join statement, (c) and (d) synchronization by initialization in Statecharts. (y)

Figure 37: Data-dependent synchronization in Statecharts: (a) synchronization by common event, (b) synchronization by common data, (c) synchronization by status detection. (y)
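In programming-language terms, synchronization by common variable amounts to one process updating a shared variable and the other suspending until the variable holds the expected value. A minimal C++ sketch of the scenario in Figure 37(b) (function names invented for illustration):

    #include <condition_variable>
    #include <mutex>

    std::mutex mtx;
    std::condition_variable cv;
    int x = 0;                        // the common variable

    void on_enter_A2() {              // A's action in state A2: x := 1
        { std::lock_guard<std::mutex> lock(mtx); x = 1; }
        cv.notify_all();
    }

    void B_transition_guard() {       // B waits for the condition (x = 1)
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [] { return x == 1; });
        // B may now enter state B2
    }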

3.12 SpecC+ Language description

In this section, we present an example of a specification language called SpecC+, which was specifically developed to capture directly a conceptual model possessing all the characteristics discussed above.

The SpecC+ view of the world is a hierarchical network of actors. Each actor possesses

(a) a set of ports through which the actor communicates with the environment;

(b) a set of state variables;

(c) a set of communication channels;

(d) a behavior which defines how the actor will change its state and perform communication through its ports when it is invoked.

Figure 38: A graphical SpecC+ specification example.

Figure 39 shows the textual representation of the example in Figure 38, in which an actor is represented by a rectangular box with curved corners.

There is an actor construct which captures all the information for an actor. An actor construct looks like a C++ class which exports a main method. The ports are declared in the parameter list. The state variables, channels and child actor instances are declared as typed variables, and the behavior is specified by the methods, or functions, starting from main. An actor construct can be used as a type to instantiate actor instances.
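As a rough C++ analogy (SpecC+ itself is not C++; Port below is an invented stand-in for a SpecC+ port), an actor corresponds to a class whose constructor parameters play the role of ports and whose main method carries the behavior, much like actor Y of Figure 39:

    #include <array>

    template <typename T> struct Port { T value{}; };   // invented port type

    class MaxActor {                     // an "actor": a class exporting main()
        Port<std::array<int, 16>>& in;   // input port
        Port<int>& out;                  // output port
    public:
        MaxActor(Port<std::array<int, 16>>& i, Port<int>& o) : in(i), out(o) {}
        void main() {                    // the actor's behavior
            int max = 0;
            for (int v : in.value)
                if (v > max) max = v;
            out.value = max;
        }
    };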
SpecC+ supports both behavioral hierarchy and structural hierarchy, in the sense that it captures a system as a hierarchy of actors. Each actor is either a composite actor or a leaf actor.

Composite actors are decomposed hierarchically into a set of child actors. For structural hierarchy, the child actors are interconnected via the communication channels by child actor instantiation statements, similar to component instantiation in VHDL. For example, actor X is instantiated in line 57 of Figure 39 by mapping its ports a and c to the port (p) and communication channel (ch) defined in its parent actor B. For behavioral hierarchy, the child actors can either be concurrent, in which case all child actors are active whenever the parent actor is active, or sequential, in which case the child actors are active only one at a time. In Figure 38, actors B and X are composite actors. Note that while B consists of concurrent child actors X and Y, X consists of sequential child actors X1 and X2.

Leaf actors are those that exist at the bottom of the hierarchy; their functionality is specified with imperative programming constructs. In Figure 38, for example, Y is a leaf actor.

1  typedef int TData[16];
2
3  interface IData( void ) {
4    TData read( void );
5    void write( TData d );
6  };
7
8  channel CData( void ) implements IData {
9    bool valid;
10   event s;
11   TData storage;
12
13   TData read( void ) {
14     if( !valid ) s.wait();
15     return storage;
16   }
17   void write( TData d ) {
18     storage = d; valid = 1; s.notify();
19   }
20 };
21
22 actor X1( in TData i, out TData o ) { ... };
23 actor X2( in TData i, IData o ) {
24   void main( void ) {
25     ...
26     o.write(...);
27   }
28 };
29
30 actor X( in int a, IData c ) {
31   TData s;
32   X1 x1( a, s );
33   X2 x2( s, c );
34
35   psm main( void ) {
36     x1 : ( TI, cond1, x2 );
37     x2 : ( TOC, cond2, complete );
38   }
39 };
40
41 actor Y ( IData c, out int m ) {
42   void main( void ) {
43     int max, j;
44     TData array;
45
46     array = c.read();
47     max = 0;
48     for( j = 0; j < 16; j ++ )
49       if( array[j] > max )
50         max = array[j];
51     m = max;
52   }
53 };
54
55 actor B( in TData p, out int q ) {
56   CData ch;
57   X x( p, ch );
58   Y y( ch, q );
59
60   csp main( void ) {
61     par { x.main(); y.main(); }
62   }
63 };

Figure 39: A textual SpecC+ specification example.

SpecC+ also supports state transitions, in the sense that we can represent the sequencing between child actors by means of a set of transition arcs. In this language, an arc is represented as a 3-tuple <T, C, N>, where T represents the type of the transition, C represents the condition triggering the transition, and N represents the next actor to which control is transferred by the transition. If no condition is associated with the transition, it is assumed to be "true" by default.

SpecC+ supports two types of transition arcs. A transition-on-completion arc (TOC) is traversed whenever the source actor has completed its computation and the associated condition evaluates to true. A leaf actor is said to have completed when its last statement has been executed. A sequentially decomposed actor is said to be complete only when it makes a transition to a special predefined completion point, indicated by the name complete in the next-actor field of a transition arc. In Figure 38, for example, we can see that actor X completes only when child actor X2 completes and control flows from X2 to the complete point when cond2 is true (as specified by the arc <TOC, cond2, complete> in line 37 of Figure 39). Finally, a concurrently decomposed actor is said to be completed when all of its child actors have completed. In Figure 38, for example, actor B completes when all the concurrent child actors X and Y have completed.
Unlike the TOC arc, a transition-immediately arc (TI) is traversed instantaneously whenever the associated condition becomes true, regardless of whether the source actor has completed its computation. For example, in Figure 38, the arc <TI, cond1, x2> terminates X1 whenever cond1 is true and transfers control to actor X2. In other words, a TI arc effectively terminates all lower-level child actors of the source actor.
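As an illustration of the execution semantics, a sequentially decomposed actor can be pictured as interpreting a table of arcs. The C++ sketch below (types invented) handles TOC arcs only; a faithful TI arc would additionally require the ability to preempt a running child actor, which plain sequential code cannot express:

    #include <functional>
    #include <vector>

    struct Actor { virtual void main() = 0; virtual ~Actor() = default; };

    struct TocArc {
        Actor* source;               // child actor the arc leaves
        std::function<bool()> cond;  // triggering condition
        Actor* next;                 // nullptr models the completion point
    };

    void run_sequential(Actor* initial, const std::vector<TocArc>& arcs) {
        Actor* current = initial;
        while (current) {
            current->main();                 // run the child to completion
            Actor* next = nullptr;
            for (const auto& a : arcs)       // find a satisfied TOC arc
                if (a.source == current && a.cond()) { next = a.next; break; }
            current = next;                  // nullptr: the actor completes
        }
    }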
Transitions are represented in Figure 38 with directed arrows. In the case of a sequentially decomposed actor, an inverted bold triangle points to the first child actor. An example of such an initial child actor is X1 of actor X. The completion of sequentially decomposed actors is indicated by a transition arc pointing to the completion point, represented as a bold square within the actor. Such a completion point is found in actor X (transition from X2 labeled e2). TOC arcs originate from a bold square inside the source child actor, as does the arc labeled e2. TI arcs, in contrast, originate from the perimeter of the source child actor, as does the arc labeled e1.

SpecC+ supports both data-dependent synchronization and control-dependent synchronization. In the first method, actors can synchronize using a common event. For example, in Figure 38, actor Y is the consumer of the data produced by actor X via channel c, which is of type CData (Figure 39). In the implementation of CData at line 8 of Figure 39, an event s is used to make sure Y can get valid data from X: the wait function over s will suspend Y if the data is not ready. In the second method, we could use a TI arc from actor B back to itself in order to synchronize all the concurrent child actors of B to their initial states. Furthermore, the fact that X and Y are concurrent actors enclosed in B automatically implements a barrier, since, by the semantics, B finishes only when the execution of both X and Y has finished.

Communication in SpecC+ is achieved through the use of communication channels. Channels can be primitive channels such as variables and signals (like variable s of actor X in Figure 38), or complex channels such as object channels (like variable ch in Figure 38), which directly support the hierarchical communication model discussed in Section 3.10.

The specification of an object channel is separated into the interface declaration and the channel definition, each of which can be used as a data type for channel variables. The interface defines a set of function prototype declarations without the actual function bodies. For example, the interface IData in Figure 38 defines the function prototypes read and write. The channel encapsulates the communication media and provides a set of function implementations. For example, the channel CData encapsulates the media s and storage and an implementation of the methods read and write. The interface and the channel are related by the implements keyword. A channel related to an interface in this way is said to implement this interface, meaning the channel is obligated to implement the set of functions prescribed by the interface. For example, CData has to implement read and write since they appear in IData. Several channels can implement the same interface, which implies that they can provide different implementations of the same set of functions.

Interfaces are usually used as port data types in the port declarations of an actor (as port c of actor Y at line 41 of Figure 39). A port of an interface type will be bound to a particular channel which implements that interface during actor instantiation. For example, port c of actor Y is mapped to channel ch of actor B when actor Y is instantiated.

The fact that a port of an interface type is not bound to a real channel until actor instantiation is called late binding. Such a late binding mechanism helps to improve the reusability of an actor description, since it is possible to plug in any channel as long as it implements the same interface.
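The interface/channel split maps naturally onto abstract classes in C++. The invented sketch below shows late binding: the consumer is written against the interface only, and either channel implementation can be plugged in at instantiation time:

    struct IData {                            // interface: prototypes only
        virtual int  read() = 0;
        virtual void write(int d) = 0;
        virtual ~IData() = default;
    };

    struct CSimple : IData {                  // one channel implementing IData
        int storage = 0;
        int  read() override { return storage; }
        void write(int d) override { storage = d; }
    };

    struct CTraced : IData {                  // another implementation of IData
        int storage = 0;
        int  read() override { return storage; }
        void write(int d) override { storage = d; /* could also log access */ }
    };

    struct Consumer {
        IData& port;                          // port typed by the interface
        explicit Consumer(IData& c) : port(c) {}   // late binding happens here
        void main() { int v = port.read(); (void)v; }
    };

Consumer can be wired to a CSimple or a CTraced without touching its own code, which is precisely the reusability argument made above.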
Figure 40: Component wrapper specification.

Consider the example in Figure 40, as specified in Figure 41. The system described in this example contains an ASIC (actor AAsic) talking to a memory. The interface IRam specifies the possible transactions to access memories: read a word via read_word and write a word via write_word. The description of AAsic can use IRam as its port, so that its behavior can make function calls to the methods read_word and write_word without knowing how these methods are actually implemented. There are two types of memories available in the library, represented by the actors ASram and ADram, respectively, the descriptions of which provide their behavioral models. Obviously, the static RAM ASram and the dynamic RAM ADram have different pins and timing protocols to access them, which can be encapsulated, together with the component actors themselves, in channels called wrappers, such as CSramWrapper and CDramWrapper in Figure 40. When the actor AAsic is instantiated in actor ASystem (lines 52 and 53 in Figure 41), its port of type IRam will be resolved to either CSramWrapper or CDramWrapper.

1  interface IRam( void ) {
2    void read_word( word a, word *d );
3    void write_word( word a, word d );
4  };
5
6  actor AAsic( IRam ram ) {
7    word reg[8];
8
9    void main( void ) {
10     ...
11     ram.read_word( 0x0001, &reg[0] );
12     ...
13     ram.write_word( 0x0002, reg[4] );
14   }
15 };
16
17 actor ASram( in signal<word> addr,
18              inout signal<word> data,
19              in signal<bit> rd, in signal<bit> wr ) {
20   ...
21 };
22
23 actor ADram( in signal<word> addr,
24              inout signal<word> data,
25              in signal<bit> cs, in signal<bit> we,
26              out signal<bit> ras, out signal<bit> cas ) {
27   ...
28 };
29
30 channel CSramWrapper( void ) implements IRam {
31   signal<word> addr, data;   // address, data
32   signal<bit> rd, wr;        // read/write select
33   ASram sram( addr, data, rd, wr );
34
35   void read_word( word a, word *d ) { ... }
36   void write_word( word a, word d ) { ... }
37   ...
38 };
39
40 channel CDramWrapper( void ) implements IRam {
41   signal<word> addr, data;   // address, data
42   signal<bit> cs, we;        // chip select, write enable
43   signal<bit> ras, cas;      // row, col address strobe
44   ADram dram( addr, data, cs, we, ras, cas );
45
46   void read_word( word a, word *d ) { ... }
47   void write_word( word a, word d ) { ... }
48   ...
49 };
50
51 actor ASystem( void ) {
52   CSramWrapper ram;          // can be replaced by
53                              // CDramWrapper ram;
54   AAsic asic( ram );
55
56   void main( void ) { ... }
57 };

Figure 41: Source code of the component wrapper specification.

The improvement in reusability from this style of specification is twofold. First, the encapsulation of communication protocols into the channel specification makes these channels highly reusable, since they can be stored in the library and instantiated at will. If these channel descriptions are provided by component vendors, the error-prone effort spent on understanding data sheets and interfacing the components can be greatly reduced. Secondly, actor descriptions such as AAsic can be stored in the library and easily reused without any change when the components with which they interface change.

It should be noted that while the methods of an actor represent the behavior of the actor itself, the methods of a channel represent the behavior of their callers. In other words, when the described system is implemented, the methods of the channels will be inlined into the connected actors. When a channel is inlined, the encapsulated media get exposed and its methods are moved to the caller. In the case of a wrapper, the encapsulated actors also get exposed.
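Channel inlining can be pictured as a source-to-source transformation in which a call through a channel method is replaced by the method's body, exposing the encapsulated media as wires. A hedged C++ sketch (all names invented):

    // Before inlining: the actor calls through the channel object.
    struct Channel {
        bool valid = false; int M = 0;        // encapsulated media
        void send(int a) { M = a; valid = true; }
    };

    void actor_before(Channel& ch, int x) {
        ch.send(x);                           // media hidden in the channel
    }

    // After inlining: the body of send() is substituted at the call site,
    // and the media become exposed wires of the surrounding structure.
    void actor_after(bool& valid_wire, int& M_wire, int x) {
        M_wire = x;                           // inlined body of send()
        valid_wire = true;
    }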
Figure 42 shows some typical configurations. In Figure 42(a), two synthesizable components A and B (e.g. actors to be implemented on an ASIC) are interconnected via a channel C, for example a standard bus. Figure 42(b) shows the situation after inlining: the methods of the channel C are inserted into the actors and the bus wires are exposed. In Figure 42(c), a synthesizable component A communicates with a fixed component B (e.g. an off-the-shelf component) through a wrapper W. When W is inlined, as shown in Figure 42(d), the fixed component B and the signals get exposed. In Figure 42(e), again a synthesizable component A communicates with a fixed component B using a predefined protocol that is encapsulated in the channel C. However, B has its own built-in protocol, which is encapsulated in the wrapper W. A protocol transducer T has to be inserted between the channel C and the wrapper W in order to translate all transactions between the two protocols. Figure 42(f) shows the final situation, when both channels C and W are inlined.

Figure 42: Common configurations before and after channel inlining: (a)/(b) two synthesizable actors connected by a channel, (c)/(d) synthesizable actor connected to a fixed component, (e)/(f) protocol transducer.

SpecC+ supports the specification of timing explicitly and distinguishes two types of timing specifications, namely timing constraints and timing delays, as discussed in Section 3.9. At the specification level, timing constraints are used to specify time limits that have to be satisfied. At the implementation level, computational delays have to be noted.

Consider, for example, the timing diagram of the read protocol of an SRAM, as shown earlier in Figure 30. The protocol visualized by the timing diagram can be used to define the read_word method of the SRAM channel above (line 35 in Figure 41). The code segment in Figure 43 shows the specification of the read access to the SRAM.

1  void read_word( word a, word *d ) {
2    do {
3      t1: { addr = a; }
4      t2: { rd = 1; }
5      t3: { }
6      t4: { *d = data; }
7      t5: { addr.disconnect(); }
8      t6: { rd = 0; }
9      t7: { break; }
10   }
11   timing {
12     range( t1; t2; 0; );
13     range( t1; t3; 10; 20 );
14     range( t2; t3; 10; 20 );
15     range( t3; t4; 0; );
16     range( t4; t5; 0; );
17     range( t5; t7; 10; 20 );
18     range( t6; t7; 5; 10 );
19   }
20 };

Figure 43: Timing specification of the SRAM read protocol.

The do-timing statement effectively describes all the information contained in the timing diagram. The first part lists all the events of the diagram. Events are specified as a label and an associated piece of code, which describes the change in signal values. The second part is a list of range statements, which specify the timing constraints or timing delays using the 4-tuples described in Section 3.9.

This style of timing description is used at the specification level. In order to get an executable model of the protocol, scheduling has to be performed for each do-timing statement. Figure 44 shows an implementation of the read_word method which follows an ASAP scheduling, where all timing constraints are replaced by delays, specified using the waitfor function.

1 void read_word( word a, word *d ) {
2   addr = a;
3   rd = 1;
4   waitfor( 10 );
5   *d = data;
6   addr.disconnect();
7   rd = 0;
8   waitfor( 10 );
9 };

Figure 44: Timing implementation of the SRAM read protocol.
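Operationally, a 4-tuple range(e1, e2, min, max) constrains the skew between two recorded event times. A small C++ sketch (the helper is invented) that checks such a constraint after simulation:

    #include <map>
    #include <string>

    using EventTimes = std::map<std::string, long>;  // event label -> time

    bool range_ok(const EventTimes& t, const std::string& e1,
                  const std::string& e2, long min, long max) {
        long skew = t.at(e2) - t.at(e1);      // observed delay e1 -> e2
        return skew >= min && skew <= max;    // must lie within [min, max]
    }

    // Usage: with t1 recorded at time 0 and t3 at time 15,
    //   range_ok(times, "t1", "t3", 10, 20) yields true,
    // matching range( t1; t3; 10; 20 ) in Figure 43.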
In this section we presented the characteristics most commonly found in modeling systems and discussed their usefulness in capturing system behavior. We also presented SpecC+, an example of a specification language which was specifically developed to capture these characteristics directly. The next section presents a generic codesign methodology based on the SpecC+ language.

4 Generic codesign methodology

As shown in Figure 45, codesign usually starts from a formal specification which specifies the functionality as well as the performance, power, cost, and other constraints of the intended design. During the codesign process, the designer goes through a series of well-defined design steps which eventually map the functionality of the specification onto the target architecture. These design steps include allocation, partitioning, scheduling and communication synthesis, which form the synthesis flow of the methodology.

The result of the synthesis flow is then fed into the backend tools, shown in the lower part of Figure 45. Here, a compiler is used to implement the functionality mapped to processors, a high-level synthesizer is used to implement the functionality mapped to ASICs, and an interface synthesizer is used to implement the functionality of the interfaces.

During each design step, the design model is statically analyzed to estimate certain quality metrics and how well they satisfy the constraints. The design model is also used to generate a simulation model, which is used to validate the functional correctness of the design. In case the validation fails, a debugger can be used to locate and fix the errors. Simulation is also used to collect profiling information, which in turn improves the accuracy of the quality metrics estimation. This set of tasks forms the analysis and validation flow of the methodology (see Figure 45).

The following sections describe the tasks of the generic methodology in more detail.

4.1 System specification

We have described the characteristics needed for specifying systems in Section 3.1. The system specification should describe the functionality of the system without premature engagement in the implementation. It should be kept logically as close as possible to the conceptual model of the system, so that it is easy to maintain and modify, and it should be executable, so that the specified functionality is verifiable. The behavior model described in Section 3.12 is a good candidate, since it is a simple model which meets these requirements.

Figure 46: Conceptual model of specification: (a) control-flow view, (b) atomic behaviors (for example, B6: { int local; ... shared = local + 1; signal( sync ); } and B4: { int local; wait( sync ); local = shared - 1; ... }).
Figure 45: Generic methodology.
In the example shown in Figure 46, the system itself is specified as the top behavior B0, which contains an integer variable shared and a boolean variable sync. There are three child behaviors, B1, B2 and B3, with sequential ordering, in behavior B0. While B1 and B3 are atomic behaviors specified by a sequence of imperative statements, B2 is a composite behavior consisting of two concurrent behaviors, B4 and B5. B5 in turn consists of B6 and B7 in sequential order. While most of the actual behavior of an atomic behavior is omitted in the figure for space reasons, we do show a producer-consumer example relevant for the later discussion: B6 computes a value for variable shared, and B4 consumes this value by reading shared. Since B6 and B4 are executed in parallel, they synchronize with the variable sync using signal/wait primitives to make sure that B4 accesses shared only after B6 has produced the value for it.

4.2 Allocation

Given a library of system components such as processors, memories and custom IP modules, the task of allocation is defined as the selection of the type and number of these components, as well as the determination of their interconnection, in such a way that the functionality of the system can be implemented, the constraints satisfied, and the objective cost function minimized. The result of the allocation task can be a customization of the generic architecture discussed in Section 2.2. Allocation is usually carried out manually by designers and is the starting point of the design exploration process.

4.3 Partitioning and the model after partitioning

The task of partitioning defines the mapping between the set of behaviors in the specification and the set of allocated components in the selected architecture. The quality of such a mapping is determined by how well the result can meet the design constraints and minimize the design cost.

The system model after partitioning must reflect the partitioning decision and must be complete in order for us to perform validation. More specifically, the partitioned model refines the specification in the following way:

(a) An additional level of hierarchy is inserted which describes the selected architecture. Figure 47 shows the partitioned model of the example in Figure 46. Here, the added level of hierarchy includes two concurrent behaviors, PE0 and PE1.

(b) In general, controlling behaviors are needed and must be added for child behaviors assigned to different PEs than their parents. For example, in Figure 47, the behaviors B1_ctrl and B4_ctrl are inserted in order to control the execution of B1 and B4, respectively.

(c) In order to maintain the functional equivalence between the partitioned model and the original specification, synchronization between the PEs is inserted. In Figure 47, the synchronization variables B1_start, B1_done, B4_start and B4_done are added so that the execution of B1 and B4, which are assigned to PE1, can be controlled by their controlling behaviors B1_ctrl and B4_ctrl through inter-PE synchronization.

Figure 47: Conceptual model after partitioning: (a) partitioning decision, (b) conceptual model, (c) atomic behaviors.

However, the model after partitioning is still far from implementation for two reasons:

(a) there are concurrent behaviors in each PE that have to be serialized;

(b) different PEs communicate through global variables which have to be localized.

These issues will be addressed in the following two sections.

4.4 Scheduling and the scheduled model

Given a set of behaviors and possibly a set of performance constraints, the scheduling task determines a total order in invocation time of the behaviors running on the same PE, while respecting the partial order imposed by dependencies in the functionality, as well as minimizing the synchronization overhead between the PEs and the context switching overhead within the PEs.

Depending upon how much information on the partial order of the behaviors is available at compile time, there are different strategies for scheduling.

At one extreme, where ordering information is unknown until runtime, the system implementation often relies on the dynamic scheduler of an underlying runtime system. In this case, the model after scheduling is not much different from the model after partitioning, except that a runtime system is added to carry out the scheduling. This strategy suffers from context switching overhead whenever a running task is blocked and a new task is scheduled.

At the other extreme, if the partial order is completely known at compile time, a static scheduling strategy can be taken, provided a good estimation of
the execution time of each behavior can be obtained. This strategy eliminates the context switching overhead completely, but may suffer from inter-PE synchronization overhead, especially in the case of inaccurate performance estimation. The strategy based on dynamic scheduling, on the other hand, does not have this problem, because whenever a behavior is blocked for inter-PE synchronization, the scheduler will select another behavior to execute. Therefore, the selection of the scheduling strategy should be based on the trade-off between context switching overhead and CPU utilization.

The model generated after static scheduling removes the concurrency among behaviors inside the same PE. As shown in Figure 48, all child behaviors in PE0 are now sequentially ordered. In order to maintain the partial order across the PEs, synchronization between them must be inserted. For example, B6 is synchronized by B6_start, which will be asserted by B1 when it finishes.

Figure 48: Conceptual model after scheduling: (a) scheduling decision, (b) conceptual model, (c) atomic behaviors.

Note that B1_ctrl and B4_ctrl from the model after partitioning are eliminated by the optimization carried out by static scheduling. It should also be mentioned that in this section we define the tasks, rather than the algorithms, of codesign. Good algorithms are free to combine several tasks. For example, an algorithm can perform partitioning and static scheduling at the same time, in which case intermediate results, such as B1_ctrl and B4_ctrl, are not generated at all.

4.5 Communication synthesis and the communication model

Up to this stage, the communication and synchronization between concurrent behaviors are accomplished through shared variable accesses. The task of this stage is to resolve the shared variable accesses into an appropriate inter-PE communication scheme at the implementation level. Several communication schemes exist:

(a) The designer can choose to assign a shared variable to a shared memory. In this case, the communication synthesizer will determine the location of the variables assigned to the shared memory. Given the location of the shared variables, the synthesizer then has to change all accesses to the shared variables in the model into statements that read or write the corresponding addresses. The synthesizer also has to insert interfaces for the PEs and shared memories to adapt to the different protocols on the buses.

(b) The designer may also choose to assign a shared variable to the local memory of one particular PE. In this case, accesses to this shared variable in the models of the other PEs have to be changed into function calls to message passing primitives such as send and receive. Again, interfaces have to be inserted to make the message passing possible.

(c) Another option is to maintain a copy of the shared variable in all the PEs that access it. In this case, all statements that perform a write on this variable have to be modified to implement a broadcasting scheme, so that all copies of the shared variable remain consistent. The necessary interfaces also need to be inserted to implement the broadcasting scheme.

The model generated after communication synthesis, as shown in Figure 49, differs from the previous models in the following way:

(a) New behaviors for interfaces, shared memories and arbiters are inserted at the highest level of the hierarchy. In Figure 49 the added behaviors are IF0, IF1, IF2, Shared_mem and Arbiter.

(b) The shared variables from the previous model are all resolved: they exist either in a shared memory or in the local memory of one or more PEs. The communication channels between the PEs now become the local buses and system buses. In Figure 49, we have chosen to put all the global variables in Shared_mem, and hence all the global declarations in the top behavior are moved into the behavior Shared_mem. The new global variables in the top behavior are the buses lbus0, lbus1, lbus2 and sbus.

(c) If necessary, a communication layer is inserted into the runtime system of each PE. The communication layer is composed of a set of inter-PE communication primitives in the form of driver routines or interrupt service routines, each of which contains a stream of I/O instructions which in turn talk to the corresponding interfaces. The accesses to the shared variables in the previous model are transformed into function calls to these communication primitives. For the simple case of Figure 49, the communication synthesizer will determine the addresses of all global variables, for example shared_addr for variable shared, and all accesses to the variables are exchanged with reads and writes to the corresponding addresses. For example, shared = local + 1 becomes *shared_addr = local + 1.
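For scheme (a), the transformation can be pictured as replacing a variable access by an access through a pointer to the synthesizer-chosen address. A C++ sketch (the address 0x4000 is purely illustrative):

    // Before communication synthesis: behaviors share an ordinary variable.
    int shared;
    void B6_before(int local) { shared = local + 1; }

    // After: 'shared' lives in shared memory at a fixed address, and every
    // access goes through that address (the memory-mapped I/O idiom).
    volatile int* const shared_addr = reinterpret_cast<volatile int*>(0x4000);
    void B6_after(int local) { *shared_addr = local + 1; }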
Figure 49: Conceptual model after communication synthesis: (a) communication synthesis decision, (b) conceptual model, (c) atomic behaviors.
4.6 Analysis and validation flow

Before each design step, which takes an input design model and generates a more detailed design model, the input design model has to be functionally verified. It also needs to be analyzed, either statically or dynamically with the help of the simulator or estimator, in order to obtain an estimation of the quality metrics, which will be evaluated by the synthesizer to make good design decisions. This motivates the set of tools to be used in the analysis and validation flow of the methodology. An example of such a tool set consists of

(a) a static analyzer,

(b) a simulator,

(c) a debugger,

(d) a profiler, and

(e) a visualizer.

The static analyzer associates each behavior with quality metrics such as program size and program performance in case it is to be implemented as software, or metrics of hardware area and hardware performance if it is to be implemented as an ASIC. To achieve a fast estimation with satisfactory accuracy, the analyzer relies on probabilistic techniques and on knowledge of the backend tools, such as the compiler and the high-level synthesizer.

The simulator serves the dual purpose of functional validation and dynamic analysis. Simulation is achieved by generating an executable simulation model from the design model. The simulation model runs on a simulation engine which, in the form of a runtime library, provides an implementation for simulation tasks such as simulation-time advance and synchronization among concurrent behaviors.

Simulation can be performed at different accuracy levels. Common accuracy models are functional, cycle-based, and discrete-event simulation. A functionally accurate simulation compiles and executes the design model directly on a host machine without paying special attention to simulation time. A clock-cycle-accurate simulation executes the design model in a clock-by-clock fashion. A discrete-event simulation incorporates an even more sophisticated timing model of the components, such as gate delays. Obviously, there is a trade-off between simulation accuracy and simulator execution time.

It should be noted that, while most design methodologies adopt a fixed-accuracy simulation at each design stage, applying a mixed-accuracy model is also possible. For example, consider a behavior representing a piece of software that performs some computation and then sends the result to an ASIC. While the part of the software which communicates with the ASIC needs to be simulated at the cycle level so that tricky timing problems become visible, it is not necessary to simulate the computation part with the same accuracy.

The debugger extends the simulation with breakpoint and single-step abilities, which make it possible to examine the state of a behavior dynamically. A visualizer can graphically display the hierarchy tree of the design model, as well as make dynamic data visible in different views and keep them synchronized at all times. All these capabilities are invaluable for quickly locating and fixing design errors.

The profiler is a good complement to the static analyzer for obtaining dynamic information such as branching probabilities. Traditionally, this is achieved by instrumenting the design description, for example by inserting a counter at every conditional branch to keep track of the number of branch executions.
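The classic instrumentation is one counter per branch direction, as in this small C++ sketch (the counter array and function are invented for illustration):

    long branch_count[2] = {0, 0};    // one counter per branch direction

    int abs_value(int v) {
        if (v < 0) {
            ++branch_count[0];        // inserted by the profiler
            return -v;
        } else {
            ++branch_count[1];        // inserted by the profiler
            return v;
        }
    }

    // After simulation, branch_count[0] / double(branch_count[0] +
    // branch_count[1]) estimates the branching probability needed by
    // the static analyzer.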
4.7 Backend

At the backend stage, shown in the lower part of Figure 45, the leaf behaviors of the design model are fed into different tools in order to obtain their implementations. If a behavior is assigned to a standard processor, it is fed into a compiler for this processor. If a behavior is to be mapped onto an ASIC, it is synthesized by a high-level synthesis tool. If a behavior is an interface, it is fed into an interface synthesis tool.

A compiler translates the design description into machine code for the target processor. A crucial component of a compiler is its code generator, which emits machine code from the intermediate representation generated by the parser part of the compiler. A retargetable compiler is a compiler whose code generator can emit code for a variety of target processors. An optimizing compiler is a compiler whose code generator fully exploits the architecture of the target processor, in addition to applying standard optimization techniques such as constant propagation. Modern RISC processors, DSP processors, and VLIW processors depend heavily on optimizing compilers to take advantage of their specific architectures.

The high-level synthesizer translates the design model into a netlist of register-transfer level (RTL) components, as defined in Section 2.3 for the FSMD architecture. The tasks involved in high-level synthesis include allocation, scheduling and binding. Allocation selects the number and type of the RTL components from the library. Scheduling assigns time steps to the operations in the behavioral description. Binding maps variables in the description to storage elements, operators to functional units, and data transfers to interconnect units. All these tasks try to optimize appropriate quality metrics subject to the design constraints.

We define an interface as a special type of ASIC which links the PE with which it is associated (via its native bus) to the other components of the system (via the system bus). Such an interface implements the behavior of a communication task, which is generated by a communication synthesis tool to implement the shared variable accesses. Note that a transducer, which translates a transaction on one bus into one or a series of transactions on another bus, is just a special case of this interface definition; an example of such a transducer translates a read cycle on a processor bus into a read cycle on the system bus. The communication tasks between different PEs are implemented jointly by the driver routines and interrupt service routines implemented in software and the interface circuitry implemented in hardware. While the partitioning of a communication task into software and hardware, and the model generation for the two parts, is the job of communication synthesis, the task of generating an RTL design from the interface model is the job of interface synthesis. Thus interface synthesis is a special case of high-level synthesis. The characteristic that distinguishes an interface circuit from a normal ASIC is that its ports have to conform to predefined protocols. These protocols are often specified in the form of timing diagrams in vendors' data sheets. This poses new challenges to the interface synthesizer for two reasons:

(a) the protocols impose a set of timing constraints on the minimum and maximum skews between events that the interface produces and other, possibly external, events, which the interface has to satisfy;

(b) the protocols provide a set of timing delays on the minimum and maximum skews between external events and other events, of which the interface may take advantage.

5 Conclusion and Future Work

Codesign represents the methodology for specification and design of systems that include hardware and software components. A codesign methodology consists of design tasks for refining the design and of the models representing those refinements.

In this chapter we presented the essential issues in codesign. System codesign starts by specifying the system in one of the specification languages based on some conceptual model. Conceptual models were defined in Section 1, implementation architectures in Section 2, while the features needed in executable specifications were given in Section 3. After a specification is obtained, the designer must select an architecture, allocate components, and perform partitioning, scheduling and communication synthesis to generate the architectural behavioral description. After each of the above tasks the designer may validate her decisions by generating appropriate simulation models and validating the quality metrics, as explained in Section 4.

Presently, very little research has been done in the codesign field. The current CAD tools are mostly simulator backplanes. Future work must include the definition of specification languages, automatic refinement of different system descriptions and models, and the development of tools for architectural exploration, algorithms for partitioning, scheduling, and synthesis, and backend tools for custom software and hardware synthesis, including IP creation and reuse.

Acknowledgements

We would like to acknowledge the support provided by UCI grant #TC20881 from Toshiba Inc. and grants #95-D5-146 and #96-D5-146 from the Semiconductor Research Corporation.

We would also like to acknowledge Prentice-Hall Inc., Upper Saddle River, NJ 07458, for permission to reprint figures from [GVNG94] (annotated by y) and figures from [Gaj97] (annotated by z), and for the partial use of text appearing in Chapter 2 and Chapter 3 of [GVNG94] and Chapter 6 and Chapter 8 of [Gaj97].

We would also like to thank Jie Gong, Sanjiv Narayan and Frank Vahid for valuable insights and early discussions about models and languages. Furthermore, we want to acknowledge Jon Kleinsmith, En-shou Chang and Tatsuya Umezaki for contributions in language requirements and model development.

References
[Ag90] G. Agha. "The Structure and Semantics of Actor Languages". Lecture Notes in Computer Science, Foundation of Object-Oriented Languages. Springer-Verlag, 1990.

[AG96] K. Arnold, J. Gosling. The Java Programming Language. Addison-Wesley, 1996.

[BCJ+97] F. Balarin, M. Chiodo, A. Jurecska, H. Hsieh, A. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, B. Tabbara. Hardware-Software Co-Design of Embedded Systems: A Polis Approach. Kluwer Academic Publishers, 1997.

[COB95] P. Chou, R. Ortega, G. Boriello. "Interface Co-synthesis Techniques for Embedded Systems". In Proceedings of the International Conference on Computer-Aided Design, 1995.

[CGH+93] M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, A. Sangiovanni-Vincentelli. "A formal specification model for hardware/software codesign". Technical Report UCB/ERL M93/48, U.C. Berkeley, June 1993.

[DH89] D. Drusinsky, D. Harel. "Using Statecharts for hardware description and synthesis". In IEEE Transactions on Computer-Aided Design, 1989.

[EHB93] R. Ernst, J. Henkel, T. Benner. "Hardware-software cosynthesis for microcontrollers". In IEEE Design and Test, Vol. 12, 1993.

[FLLO95] R. French, M. Lam, J. Levitt, K. Olukotun. "A General Method for Compiling Event-Driven Simulation". In Proceedings of the 32nd Design Automation Conference, June 1995.

[FH92] C. W. Fraser, D. R. Hanson, T. A. Proebsting. "Engineering a Simple, Efficient Code Generator Generator". In ACM Letters on Programming Languages and Systems, 1, 3, Sept. 1992.

[Gaj97] D. D. Gajski. Principles of Digital Design. Prentice Hall, 1997.

[GCM92] R. K. Gupta, C. N. Coelho Jr., G. De Micheli. "Synthesis and simulation of digital systems containing interacting hardware and software components". In Proceedings of the 29th ACM/IEEE Design Automation Conference, 1992.

[GDWL91] D. D. Gajski, N. D. Dutt, C. H. Wu, Y. L. Lin. High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, Boston, Massachusetts, 1991.

[GVN94] D. D. Gajski, F. Vahid, S. Narayan. "A system-design methodology: Executable-specification refinement". In Proceedings of the European Conference on Design Automation, 1994.

[GVNG94] D. Gajski, F. Vahid, S. Narayan, J. Gong. Specification and Design of Embedded Systems. Prentice Hall, New Jersey, 1994.

[Har87] D. Harel. "Statecharts: A visual formalism for complex systems". Science of Computer Programming 8, 1987.

[HHE94] D. Henkel, J. Herrmann, R. Ernst. "An approach to the adaption of estimated cost parameters in the COSYMA system". Third International Workshop on Hardware/Software Codesign, Grenoble, 1994.

[Hoa85] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall International, Englewood Cliffs, New Jersey, 1985.

[HP96] J. L. Hennessy, D. A. Patterson. Computer Architecture: A Quantitative Approach, 2nd edition. Morgan-Kaufmann, 1996.

[KL95] A. Kalavade, E. A. Lee. "The extended partitioning problem: Hardware/software mapping and implementation-bin selection". In Proceedings of the 6th International Workshop on Rapid Systems Prototyping, 1995.

[Lie97] C. Liem. Retargetable Compilers for Embedded Core Processors: Methods and Experiences in Industrial Applications. Kluwer Academic Publishers, 1997.

[LM87] E. A. Lee, D. G. Messerschmidt. "Static Scheduling of Synchronous Data Flow Graphs for Digital Signal Processors". In IEEE Transactions on Computer-Aided Design, 1987, pp. 24-35.
[LMD94] B. Landwehr, P. Marwedel, R. Dömer. "OSCAR: Optimum Simultaneous Scheduling, Allocation and Resource Binding Based on Integer Programming". In Proceedings of the European Design Automation Conference, 1994.

[LS96] E. A. Lee, A. Sangiovanni-Vincentelli. "Comparing Models of Computation". In Proceedings of the International Conference on Computer Design, San Jose, CA, Nov. 10-14, 1996.

[MG95] P. Marwedel, G. Goosens. Code Generation for Embedded Processors. Kluwer Academic Publishers, 1995.

[NM97] R. Niemann, P. Marwedel. "An Algorithm for Hardware/Software Partitioning Using Mixed Integer Linear Programming". In Design Automation for Embedded Systems, 2, Kluwer Academic Publishers, 1997.

[Pet81] J. L. Peterson. Petri Net Theory and the Modeling of Systems. Prentice-Hall, Englewood Cliffs, New Jersey, 1981.

[PK93] Z. Peng, K. Kuchcinski. "An Algorithm for partitioning of application specific systems". In Proceedings of the European Conference on Design Automation, 1993.

[Rei92] W. Reisig. A Primer in Petri Net Design. Springer-Verlag, New York, 1992.

[Stau94] J. Staunstrup. A Formal Approach to Hardware Design. Kluwer Academic Publishers, 1994.

[Str87] B. Stroustrup. The C++ Programming Language. Addison-Wesley, Reading, 1987.

[TM91] D. E. Thomas, P. R. Moorby. The Verilog Hardware Description Language. Kluwer Academic Publishers, 1991.

[YW97] T. Y. Yen, W. Wolf. Hardware-software Co-synthesis of Distributed Embedded Systems. Kluwer Academic Publishers, 1997.