
Control flow and data flow in processors

Patrik Müller
TU Kaiserslautern, Embedded Systems Group
p [email protected]

Abstract
This paper reviews the inter- and intrablock scheduling of hybrid dataflow/von-Neumann architecture organizations as described in the work of Yazdanpanah et al. [10]. For this purpose, control flow and data flow in processors are first explained. Subsequently, a simple code example is played through on selected architectures with hybrid attributes. This elaboration is intended as a basis for understanding and choosing possible processor architectures.

1 Introduction
Computing power has increased considerably in recent decades. Although it is not a law of nature, Moore's Law, formulated in 1965, has remained an astonishingly accurate prediction of the regular doubling of the number of transistors on a chip. It is, however, only a rule of thumb and offers limited predictive power over computational performance itself, because a pure increase in transistor count says nothing about how efficiently this potential is used. The question therefore arises of how, and for what, the transistors are used. In the past they were used as additional storage elements (such as caches) to shorten memory accesses. In code processing, techniques such as the pipeline, a separation of execution into several stages, were developed. However, several problems arise with the underlying von-Neumann architecture. How are data dependencies between instructions handled? Commonly known here are the conflicts read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW). Again, solutions have been found that improve processor performance (e.g., branch prediction and forwarding between the stages of a pipeline). On the developer side, too, optimizing code made it easier to use computing time more efficiently, although such manual optimization is no longer reasonable for a person beyond a certain program complexity. In this context a rethink of processor architecture became necessary, and several concepts have been following a different path for some years. From the programmer's point of view it would be easiest to hand the processor the authority over the most timely execution. For this reason, in addition to the basic form of the von-Neumann architecture, which follows the sequence of the code very strictly and thus depends on the control flow, there are other architectures based on the data flow of the code. There are also a number of hybrid architectures that use both concepts depending on the application.

The focus of this work is the consideration of different hybrid variants based on the model of Yazdanpanah et al. [10]. A detailed description of the model as well as a short explanation of the terms control flow and data flow can be found in section 2. Section 3 describes how the HPS, Task Superscalar, TRIPS, and WaveScalar architectures work. A contribution of this work is the playing through of an exemplary code segment on the four different platforms. Finally, a brief review is given in the Conclusions, Results, Discussion section.

2 Control flow and data flow


Before we go into the selection of processor architectures, we first introduce the terms control flow and data flow. We then take a look at the model presented by Yazdanpanah et al. [10], which is fundamental to understanding the choice of architectures.

2.1 Control flow


Control flow (due to its concept also often referred to as the von-Neumann model) describes, in our understanding, the temporal sequence of the individual instructions of a computer program [4]. This order is usually given by the program itself. In the classic von-Neumann model the memory holds both the program and the data. The central element of this model, however, is the program counter (PC), which is incremented as the individual instructions are executed, or takes on target values for special instructions such as branches or jumps. Because every instruction, including loads and stores, is fetched under the control of the PC, delays can arise in the execution of the entire program. The conflicts RAW, WAR and WAW mentioned in the introduction are the main drawback of this model. Even with a very large number of transistors, some of the loaded instructions will have to wait until the particular value that caused the instruction congestion has been written to memory and thus made available to further instructions. This has two effects: on the one hand, the execution of the program is delayed because the instruction can only be executed later; on the other hand, valuable space is occupied that could serve other operations. Time and space resources are therefore wasted due to data hazards.
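As a brief, hedged illustration (the variables below stand in for registers and are our own choice, not taken from the paper), the three hazards can be seen in a short C fragment:

void hazards(void) {
    int r1 = 1, r2 = 2, r3 = 3, r4;   /* pseudo-registers for this sketch only */
    r1 = r2 + r3;   /* A: writes r1, reads r2 and r3                          */
    r4 = r1 * 2;    /* RAW: reads r1 and must wait for A's result             */
    r2 = 5;         /* WAR: overwrites r2, which A still needs to read        */
    r1 = 7;         /* WAW: overwrites r1; must not complete before A's write */
    (void)r4;       /* silence the unused-variable warning                    */
}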
It can be seen from this presentation that the bottleneck of the control flow concept lies in the one-dimensionality of the sequential execution of instructions. An answer to this problem is provided, to some extent, by parallelization at different levels. At the lowest level (and most interesting for our work) is instruction level parallelism (ILP) [10], which, as the name implies, allows multiple instructions to be executed at the same time. Noteworthy implementations are pipelined processors and superscalar processors. Pipelined processors split the work needed to execute an instruction into several stages. The number of stages depends on the perspective of the developer; in general, however, there are the stages Instruction Fetch (IF), Instruction Decode (ID), Execute (EX) and Memory Access or Writeback (MA) [4]. The purpose of the split is that multiple instructions can be in the pipeline at once. For example, a new instruction may be fetched while the previous instruction is still being decoded. It should be noted that a stage does not necessarily take a single cycle; in particular, the arithmetic logic unit (ALU) in the EX stage may need several cycles for certain types of instruction. The more stages there are, the more instructions tend to be in flight. The problem is that the number of already executed but uncommitted instructions grows accordingly in the event of an interruption. Superscalar processors extend the pipeline concept by running multiple pipelines in parallel. Thus, several instructions can logically be located in the same stage, only within different pipelines. But here again the problem of data dependency emerges: as soon as one instruction depends on another, the pipeline of the affected instruction stalls and cannot be used to process further independent instructions. One answer to this problem is the use of so-called out-of-order processors, which are described in more detail in section 3 using a code example. At this point, it is sufficient to know that this architecture makes it possible to detect dependencies within the scope of the control flow and to reorder the execution of the instructions if this leads to an improvement in performance.


The next level is data level parallelism (DLP) [10], in which a single instruction operates on multiple independent data elements. This form is used, for example, in vector calculations, as is the case within graphics processing units. At the top level is thread level parallelism (TLP), where multiple threads are executed in parallel, possibly on several processing units. However, this adds the problem of synchronizing data between threads.
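As a rough sketch (the function below is our own illustration, not taken from the report), DLP can be pictured as a loop whose iterations touch independent data, so that a vector unit could process several elements with a single instruction:

/* Element-wise vector addition: the iterations are independent, so a
   SIMD/vector unit could compute several c[i] with one instruction. */
void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}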

2.2 Data flow


If a program is processed according to the data flow principle, the individual instructions are executed depending on their data dependencies [4]. This means that the program is not necessarily executed in the original order in which it would run under control flow. The PC provided by the von-Neumann model as well as the global memory are omitted here. Two important consequences follow from this: the instructions must be checked for possible dependencies before the actual execution, and there must be a unit for buffering instructions so that the ALU is continuously supplied with further work. The data flow can be displayed as a data flow graph, whose nodes are the instructions and whose arcs are the dependencies. In data flow, too, the execution time can be improved by parallelizing the instructions. The basic form is called the static architecture, or single-token-per-arc [10]. Due to its simplicity, it does not take into account loops whose iterations could be executed in parallel. A more complex variant is the dynamic architecture, also called multiple-tagged-token-per-arc. There the individual invocations of a loop are tagged, which enables a unique assignment of the necessary operands. However, the time saving is a double-edged sword, because assigning the corresponding tags to the tokens itself takes time.
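A minimal sketch of the tagged-token idea (struct and field names are our own assumptions): every value travelling along an arc carries the tag of the loop invocation it belongs to, and an operator may only fire on operands whose tags match.

/* Hypothetical tagged token of a dynamic data flow machine: the tag
   identifies the loop invocation (iteration) the value belongs to. */
struct token {
    int tag;     /* invocation / iteration identifier */
    int value;   /* data value carried on the arc     */
};

/* An operator fires only on two operands from the same invocation. */
int can_fire(struct token left, struct token right) {
    return left.tag == right.tag;
}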
Data flow and control flow are two concepts that could not be more different in their basic idea. At this point we must ask ourselves: which of the two procedures is really better? There is no general answer, because each concept follows a certain motivation as to how a program should be executed. For example, for a code segment that has no data dependencies, or such a high degree of dependencies that nothing can be overlapped, one can assume that the data flow graph does not save any time compared to control flow. Rather, the additional effort of building the graph and then executing in data flow order would take more time than executing the program from the start using control flow structures. However, if a code segment contains both independent and dependent instructions, execution in data flow order makes more sense.

2.3 Hybrid control flow/data flow architectures


Fortunately, in addition to the previous concepts, a number of hybrid architectures are available. In general, these variants can be divided into four classes. The categorization is based on the behavior between and within individual program blocks. Figure 1 shows the four classes of hybrid architectures, which are described in more detail in the following.
First we take a look at the enhanced control flow architectures. Control flow is present both within and between the blocks. What matters here is that the instructions can be executed out of order within the blocks. This does not contradict the concept of control flow, since the instructions are still fetched and decoded according to the control flow (i.e., in order) of the program. The execution itself in the ALU takes place out of order, and the results are returned to order after processing. Architectures following this concept exploit parallelization at instruction level.

Secondly, the next group of architectures works at block level in control flow, but within the blocks the instructions are processed in data flow. One can imagine that a block is called from the program and its data flow graph is then determined. Based on the graph, the independent instructions can be executed first, followed by instructions whose operands become available later. This also makes clear that when several blocks are executed in parallel, dependencies between the blocks are not taken into account. It is therefore assumed that the chosen block size does not create or contain any external dependencies. An example of such an application would be loops without dependencies between their invocations.
Third is the class of architectures with data flow interblock and control flow intrablock scheduling. This means that the code is checked from the beginning for data dependencies and blocks are created from them that follow the data flow of the program. Within the blocks, only a short sequence of instructions is used in order to save hardware resources. Possible tags therefore only need to be created locally, which greatly simplifies the process flow within the blocks.
The fourth and last class are the architectures with enhanced data flow. Their data flow graph consists of two layers: the upper one forms the schedule between the blocks and the lower one the schedule within the blocks. These processors are highly complex because, in contrast to the previous variants, they do not require any control flow and therefore no PC at all.
The report by Yazdanpanah et al. [10] classifies a series of architectures according to the model just presented, i.e., by the scheduling of program execution. For this report, representatives from each of the four classes were therefore selected in order to gain the best possible insight into the solution strategies for various obstacles and to describe the behavior of the processors when executing a code example. The choice was made particularly under the aspect of topicality and the prominence of the processor architectures. In the next section 3, the architectures High Performance Substrate (HPS), Task Superscalar, the Tera-op Reliable Intelligently Adaptive Processing System (TRIPS) and WaveScalar are briefly described in their structure, and an example code segment is played through on each.

Figure 1: Inter- and intrablock scheduling of organizations of hybrid dataflow/von-Neumann architectures. (a) Enhanced control flow, (b) control flow/dataflow, (c) dataflow/control flow, and (d) enhanced dataflow. Blocks are squares and big circles [10].

3 Implementations on hybrid architectures


3.1 The code example
The core of this chapter is to describe representatives of the different classes in more detail and, with the help of that explanation, to play a code fragment through on them. The example should therefore be as simple as possible and at the same time flexible to use because, as we established in the previous chapter, the areas of application for control flow and data flow always depend on the task the program has to perform. The following C code (Algorithm 1) was selected for this work.
// x, y in registers
summary() {
    if (x > 0) {
        x = x + 3;
        y = y * x;
        z = 4;
    }
    z = z + y;
}
Algorithm 1: A simple example written in C
The summary method first checks the if condition. If it is true, three arithmetic operations are performed within the block: two additions and one multiplication. If x is less than or equal to 0, only the final addition takes place. In addition, the example was translated into RISC assembler language (Algorithm 2).
// R0 contains 0
// R1 contains x
// R2 contains y

000     ble  R1, R0, L1
001     addi R1, R1, #3
002     mul  R2, R2, R1
003     addi R3, R0, #4
004 L1: add  R3, R3, R2
Algorithm 2: The code fragment in RISC assembler language
The data flow graph therefore looks like this:

Figure 2: The Data Flow Graph of the example.
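Since the drawing itself is not reproduced in this text, the following comment block sketches the dependence edges we read out of Algorithm 2 (our own rendering, not a copy of Figure 2):

/* Dependence edges as we read them from Algorithm 2:
 *   ble  R1, R0, L1   --(controls)-->  addi #3, mul, addi #4
 *   addi R1, R1, #3   --(R1)-------->  mul  R2, R2, R1
 *   mul  R2, R2, R1   --(R2)-------->  add  R3, R3, R2
 *   addi R3, R0, #4   --(R3)-------->  add  R3, R3, R2
 * The two addi instructions are independent of each other and
 * could therefore execute in parallel.                          */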



3.2 HPS
The architecture high performance substrate (HPS) can best be described as an out-of-order design. It
uses a modified version of the Tomasulo algorithm to tolerate occurring latencies. The algorithm can be
divided into four steps.

1. Rename registers to eliminate false dependencies and enables the connection between producers
and consumers 1 .
2. Buffering allows the pipeline to perform independent operations.
3. Broadcasting the tag allows communication between instructions
4. Wakeup and select enable out-of-order dispatch

Figure 3: An abstract view on the HPS design [5].

The algorithm already reveals that this variant is an enhanced control flow design. Out-of-order execution can also be described as restricted data flow because, instead of the entire program, only a small region, the so-called active window, is checked for data dependencies. Enlarging the active window would cause a number of problems: on the one hand, larger tables containing the operations, results and renamed aliases would be needed; on the other hand, more time would be spent finding the right tags in these tables.
Let us now examine the implementation of the algorithm in the case of HPS. From a dynamic instruction stream (it is assumed that the branch predictor is not the core of the design and is therefore neglected), the instructions are loaded into the active window and decoded or merged there.
1 In this context, a consumer is an instruction whose operands are not yet available. The corresponding counterpart is the producer, an instruction whose output serves as the input of another instruction. With this definition, an instruction can be both producer and consumer.

The merger receives a data flow graph for each instruction and can recognize data dependencies. From there, several steps are executed: first, the instructions are tagged and entered in a register alias table, which holds a ready bit next to each tag indicating whether the associated value is ready. In addition, the instructions are entered in the node table. A node table entry consists of the operation type, the result tag for the assignment to the value buffer and, for each of the two operands, a tag and a ready bit stating whether the operand is available. An instruction is not fired until both of its operands are ready. This ensures that the functional units (F.U.) are used optimally. Once the instructions have been executed, the results are distributed so that the value buffer is updated and potential consumers are informed immediately. Once an instruction is complete, it is retired from the node table to make room for further instructions. This concept covers parallelization at instruction level: depending on the possibilities (e.g., whether enough independent instructions are available), several functional units can be used simultaneously.
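As a rough sketch of this bookkeeping (a simplified model under names of our own choosing, not the actual HPS structures from [5]), a node table entry, its firing condition and the result distribution could look like this:

#include <stdbool.h>

/* Simplified node table entry: operation, result tag, and per-operand
   tag/value/ready-bit triplets (names are ours, not from the HPS papers). */
struct node_entry {
    int  op;            /* operation type, e.g. ADD or MUL    */
    int  result_tag;    /* tag used to index the value buffer */
    int  src_tag[2];    /* tags of the two source operands    */
    int  src_value[2];  /* operand values, valid once ready   */
    bool src_ready[2];  /* ready bits of the two operands     */
};

/* A node may only fire once both of its operands are ready. */
bool ready_to_fire(const struct node_entry *e) {
    return e->src_ready[0] && e->src_ready[1];
}

/* Result distribution: wake up waiting consumers whose operand tag matches. */
void broadcast(struct node_entry *e, int tag, int value) {
    for (int i = 0; i < 2; i++) {
        if (!e->src_ready[i] && e->src_tag[i] == tag) {
            e->src_value[i] = value;
            e->src_ready[i] = true;
        }
    }
}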
With the help of this architecture, conflicts can be avoided without producing false results, which is made possible by register renaming. At the same time, tracking the data dependencies saves valuable time, because nodes that are already executable are fired directly. Another variant of HPS is HPSm (minimal functionality) [3], in which the design became more concrete (e.g. by adding a pipeline), but also more complex. For our purposes it is sufficient to deal with the design of HPS. Figure 3 shows its abstract structure.

3.2.1 Behaviour of HPS


Let us start from our specific example. For this we have a register alias table, a node table each for MUL and ADD, and a result buffer. As an assumption and for simplification, we take it that branch prediction has chosen the correct case (x > 0). Consequently, the first instruction addi R1, R1, #3 is fetched. While the first instruction is being decoded, the next instruction mul R2, R2, R1 is fetched. In general, with out-of-order execution a new instruction is fetched and then decoded in each cycle, since there are no stalls in this arrangement. The addi instruction is entered in the node table during decoding. Since it can be assumed here that both operands are ready, only the target register (in this case R1) receives a tag from the node table (e.g. tag = x). In the next step, the second instruction is decoded and the third is fetched, while the first instruction is now executed. The second instruction also goes into its corresponding node table (i.e. for MUL). Since the first instruction has not yet completed at this point, the value of its operand is still missing, so the x tag is also transferred to the node table and the ready bit for this operand is set to 0.

Figure 4: The execution of the active Window

Now we jump two steps further, to where the third instruction addi R3, R0, #4 is decoded. When it is entered in the node table, both of its operands are already available. This instruction is therefore executed in the next cycle. Since the first addi instruction has still not completed, the second instruction must wait. Working with the node table thus makes it possible to execute and distribute an instruction that was issued later. If the process is continued, the next instruction is loaded from the instruction stream into the active window after the first instruction retires. Figure 4 shows the individual steps of the code. The pipeline stages in this example are fetch (F), decode/merge (D), execute (E), result distribution (R) and writeback/retirement (W). The execution of MUL is assumed to take 3 cycles and ADD 2 cycles. Through the out-of-order mechanism, a saving of approximately 9% of the cycles could be achieved.

3.3 TRIPS
TRIPS (Tera-op, Reliable, Intelligently adaptive Processing System) belongs to the control flow/data
flow architectures and tries with its design to counter the problem that pipeline scaling for the previous
architectures is no time saving and therefore no efficient solution in the long run. The developers specify
four essential characteristics for future architectures [1]:
1. Other fine grained mechanisms must be available, as the depth of the pipeline is limited.
2. With the increase of the clock frequency the limitations of the power supply are quickly reached.
Processors must therefore also be able to work power-efficiently.
3. Future ISAs should be accessible for on-chip communication dominant execution.
4. ISAs should support polymorphism. Polymorphism is the ability to use execution and memory
units in different ways and variations.
The TRIPS architecture uses the EDGE ISA, which has a special feature: direct instruction communication. This means that a producer's output is delivered directly to the consumer's input. This inevitably results in execution in data flow order, since the instructions only fire when all operands are ready. TRIPS was developed on the basis of the four characteristics mentioned above and fulfils them with the following considerations. Parallelism is made possible by an array of ALUs running in parallel, which is scalable in size. High and power-efficient performance is supported by working with blocks of 100 or more instructions. Delays are minimized by executing interdependent instructions physically close together. To continue supporting languages like C or C++, the architecture uses block-atomic execution: the compiler combines a large number of instructions into blocks, which are then fetched, executed and committed as one unit. A block can only be committed if it executes completely, otherwise it is rolled back. No parts of a block can be committed individually, which is why blocks are said to be treated atomically. In TRIPS, instructions specify only the location of their consumers. One can imagine, for instance, an ADD instruction that, instead of naming the output register and the two input operands, only names the target address for its output. An instruction can also address several consumers in this way. Exceptions are loads and stores, since they address the cache or memory. Figure 5 shows the processor core (left). It consists of a 4x4 array of execution nodes, which are connected via a network. The execution nodes are ALUs with a buffer (Figure 5, right) to hold further instructions. In addition, there are four register banks as well as four instruction and data cache banks. The global control tile is responsible for fetching the instruction blocks.

3.3.1 Behaviour of TRIPS


Keeping in mind the data flow graph (Figure 2), a big step towards the final compilation of the code is already done. We will now take a look at the TRIPS instruction placement and at the final instruction block, which is executed in data flow order.

Figure 5: One of the hyperblocks of the TRIPS processor (left) with one Execution node (right) [1].

As mentioned in the description, every instruction receives a distinct location. In the example there are seven instructions in total. Since a location bit can only be 0 or 1, three bits are needed, giving 2^3 = 8 possible locations. An example allocation could look as shown in figure 6 on the left. Read and write instructions do not receive a location, since they are not executed on ALUs.
Then we can generate the instruction block shown in figure 6 on the right. As can be seen, instructions only name locations as outputs and do not need any information about where their inputs come from. An instruction simply fires whenever its inputs are shipped to it.

Figure 6: TRIPS instruction placement (top) and the final instruction block (bottom).
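To make the "fire when the inputs arrive" behaviour concrete, the following self-contained sketch models a block of target-form instructions in plain C (our own simplification with invented structure names, not TRIPS/EDGE code). The branch is assumed taken, as in the HPS walk-through, so only the four ALU instructions of Algorithm 2 appear; each one records only where its result goes.

#include <stdio.h>
#include <stdbool.h>

#define NONE -1

/* Simplified target-form instruction: it names the consumer of its result,
   not the sources of its operands (our own model, not the EDGE encoding). */
struct insn {
    char op;       /* '+' for add/addi, '*' for mul              */
    int  need;     /* operands still missing before it fires     */
    int  in[2];    /* operand values delivered so far            */
    int  target;   /* index of the consumer instruction, or NONE */
    int  slot;     /* operand slot of the consumer to fill       */
};

static void deliver(struct insn *b, int idx, int slot, int value) {
    if (idx == NONE) return;
    b[idx].in[slot] = value;
    b[idx].need--;
}

int main(void) {
    struct insn block[4] = {
        { '+', 1, {3, 0}, 1, 1 },    /* 0: x + 3        -> operand 1 of insn 1 */
        { '*', 2, {0, 0}, 3, 1 },    /* 1: y * (x+3)    -> operand 1 of insn 3 */
        { '+', 0, {0, 4}, 3, 0 },    /* 2: 0 + 4 (z=4)  -> operand 0 of insn 3 */
        { '+', 2, {0, 0}, NONE, 0 }  /* 3: z + y*(x+3)  -> block output        */
    };
    deliver(block, 0, 1, 5);   /* ship x = 5 to instruction 0 */
    deliver(block, 1, 0, 2);   /* ship y = 2 to instruction 1 */

    bool progress = true;
    while (progress) {         /* fire every instruction whose operands arrived */
        progress = false;
        for (int i = 0; i < 4; i++) {
            if (block[i].need == 0) {
                int r = (block[i].op == '+') ? block[i].in[0] + block[i].in[1]
                                             : block[i].in[0] * block[i].in[1];
                block[i].need = -1;                      /* mark as fired */
                deliver(block, block[i].target, block[i].slot, r);
                if (block[i].target == NONE)
                    printf("block output z = %d\n", r);  /* 4 + 2*(5+3) = 20 */
                progress = true;
            }
        }
    }
    return 0;
}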

3.4 Task Superscalar


The Task Superscalar architecture belongs to the class with data flow/control flow organization. The main aspect of this design is the abstraction of tasks as single instructions, which also have input and output operands [2].

Figure 7: Frontend of the Task Superscalar design [2].

One can imagine that the inputs of the individual instructions within a task that depend on outputs of other tasks are abstracted to inputs at task level; the same happens with the outputs of all instructions. This abstraction becomes visible in the operands, because tasks have memory objects and scalar values as operands. Within the tasks, data flow graphs can be used to determine the data flow. At the thread level, a single instruction stream is assumed. Consequently, the tasks are still decoded in order, but fired when ready, which corresponds to the data flow principle. With these considerations the frontend of Task Superscalar can now be addressed.
The frontend is organized as a pipeline and consists of four kinds of modules. First there is the gateway, which controls the flow of tasks into the pipeline. Task reservation stations (TRS) store the task information and track the readiness of the operands of a task. They are located together on a bus, which enables the exchange of operands between the TRSs; they can be compared with the reservation stations of an out-of-order processor. The object renaming tables (ORT) correspond to the register renaming tables and are used to map each operand to its latest version and its producer. The last module are the object versioning tables (OVT), which track the operand versions generated by decoding a new data producer, similar to the result buffer of an out-of-order architecture. Each OVT is connected to exactly one ORT. Of course, the performance of such a design depends on the number of ORTs, OVTs and TRSs and on their size.
With this concept, the pipeline forms a variant of an out-of-order pipeline that operates at task level. Tasks are decoded in order and stored in the TRS until all their operands are available. Only then are the tasks placed in the ready queue and processed in the backend. As soon as a task is completed, the consumers of its data are informed, and the OVTs are also notified so that they can adjust the operand versions accordingly.
Figure 7 shows the frontend of the Task Superscalar design. The presentation as tiles makes it easier to understand which tiles can communicate on one layer and which are deliberately kept apart. To understand the instruction flow better, the modules already mentioned are explained further. For each incoming task, the gateway sends an allocation request to the first free TRS. The gateway only knows which TRSs still have free space, since the actual space organization is left to each TRS. Once the gateway receives a response from the TRS, it can assign the operands to the ORTs. The task reservation stations store all the metadata of the incoming tasks together with the IDs of the data consumers. One problem is that the number of consumers of an operand can vary greatly, so it is not possible to estimate how much space should be reserved in the TRS for this information.

This is solved by passing the operand along a linked list of tasks instead of storing all of its consumers explicitly: the operand is passed to the next task as an input, and once that task has finished it is passed on to the task after it. The object renaming tables map each operand to the task that last accessed the same memory object. If an ORT is already full and further allocation requests arrive from the gateway, the ORT stalls the gateway in order to keep the existing allocations. Finally, the object versioning tables (OVT) manage data anti- and output dependencies between the tasks by renaming the operands and by chaining the different in-out operands.
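As a very small sketch of the ORT role (our own simplification under assumed names and sizes, not the tile implementation from [2]), decoding a task input looks up the last producer of the operand, and decoding a task output registers a new version:

#define ORT_SIZE 64

/* One object renaming table entry: maps a memory object (operand) to the
   task that produces its latest version. Names and size are assumptions. */
struct ort_entry {
    void *object;          /* address of the memory object   */
    int   producer_task;   /* task id of the latest producer */
    int   valid;
};

static struct ort_entry ort[ORT_SIZE];

/* Decoding a task input: return the producer to wait for, or -1 if the
   operand has no registered producer and is therefore already available. */
int lookup_producer(void *object) {
    for (int i = 0; i < ORT_SIZE; i++)
        if (ort[i].valid && ort[i].object == object)
            return ort[i].producer_task;
    return -1;
}

/* Decoding a task output: record this task as the newest producer of the
   operand (a real OVT additionally keeps the older versions alive). */
void register_producer(void *object, int task_id) {
    int free_slot = -1;
    for (int i = 0; i < ORT_SIZE; i++) {
        if (ort[i].valid && ort[i].object == object) {
            ort[i].producer_task = task_id;   /* newer version of the operand */
            return;
        }
        if (!ort[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0)
        ort[free_slot] = (struct ort_entry){ object, task_id, 1 };
    /* else: table full -- the real ORT would stall the gateway here */
}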

3.4.1 Behaviour of Task Superscalar


For Task Superscalar, a simple example like the one used for the first two architectures would not show what this design is capable of. Therefore, another piece of code is used here to exploit parallelism using tasks; it is presented and briefly explained in Algorithm 3.
void sum(int A[4], int B[4]) {
    for (int i = 0; i < 4; i++) {
        A[i] = A[i] + B[i];
    }
}
Algorithm 3: Sum of two arrays in C
Here two arrays of length 4 are added element-wise. For the method, this means that the loop body must be executed a total of four times. There are no data dependencies between the individual loop iterations, so each iteration can be processed on its own. Each iteration is therefore considered an individual task and transferred to the reservation stations and the renaming tables.
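Under a hypothetical task-spawning interface (spawn_task is a placeholder of our own, not part of the Task Superscalar design), the decomposition of Algorithm 3 into four independent tasks could be sketched as follows:

/* Hypothetical task interface; implemented sequentially here only so that the
   sketch compiles. In Task Superscalar the frontend would buffer each of these
   tasks in a TRS and fire it as soon as its operands are ready. */
static void spawn_task(void (*fn)(int *, const int *), int *out, const int *in) {
    fn(out, in);   /* stand-in: execute immediately */
}

static void add_elem(int *a, const int *b) { *a = *a + *b; }

void sum(int A[4], int B[4]) {
    for (int i = 0; i < 4; i++) {
        /* No iteration touches another iteration's data, so all four tasks
           are independent and become ready as soon as they are decoded.  */
        spawn_task(add_elem, &A[i], &B[i]);
    }
}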
If there were dependencies between the individual tasks, an assignment between producers and consumers would be possible using the object versioning tables. In the chosen example, all tasks have their operands from the start and are ready to fire. The execution itself takes place in control flow order, which is controlled via the backend of the pipeline.
A more suitable example is considered in the article "Task Superscalar: An out-of-order task pipeline" [2], since it exploits parallelism more clearly, but it is far more complex than our simple one.

3.5 WaveScalar
WaveScalar describes an execution model motivated by the goal of developing a decentralized superscalar processor core. To bypass the program counter, which belongs to the control flow world, the processor fetches in purely data-driven order. As before, the rule is that an instruction only fires when all of its operands are available. Normally, values in a data flow machine carry a tag with which they can be distinguished from other values. The WaveScalar ISA represents a new model whose smallest unit is the data flow graph of an execution. It is stored in memory as a collection of intelligent instruction words; intelligent here means that each instruction is dedicated to a functional unit. In practice, the WaveCache (Figure 8) is used, an intelligent cache that holds a set of instructions and executes them accordingly. In WaveScalar, the instructions are optimized for data flow, which means that they explicitly send their data to the instructions that need it as input. However, of the consumers determined at compile time, not all actually need these values at runtime. For example, in an if-then-else condition only one of the two cases becomes true, so that only one of the two cases actually has to be executed.

Figure 8: A simple Wavecache. The processing elements (left) are clustered with data caches and store
buffers (right) [8].

In WaveScalar such a case is solved using the conditional selector instruction Φ or the conditional split instruction Φ⁻¹. With the conditional selector Φ, both cases are first computed and given as inputs to the instruction; a selector input then decides which of the two results is used and which is discarded. The conditional split Φ⁻¹ works in principle like a branch instruction and is used for the implementation of loops. The WaveScalar compiler breaks the control flow graph of a program into several parts called waves. The processor then executes one wave after the other. A wave has three important characteristics:
1. If the wave is executed, each instruction is executed no more than once.
2. There are no loops within a wave.
3. The control can enter only at one point.
Due to the growing complexity, a new kind of token is used. These tokens are called wave numbers and are responsible for tagging the waves. For example, the assignment can be retained across loops, since each iteration, in the form of a wave, has a corresponding tag. Since modern systems rely on object linking and shared memory, WaveScalar provides an INDIRECT-SEND instruction. It receives three inputs: the data value, an address and an offset. The data value is then sent to the address plus the offset. Finally, memory ordering needs to be described in more detail. WaveScalar brings load and store ordering to data flow computation using wave-ordered memory. The wave-ordered memory records each memory operation with its location in the wave and its ordering relationships with other memory operations in the same wave. This allows the storage system to access memory in the correct order.

3.5.1 Behaviour of WaveScalar


For this model, too, it makes sense to modify the given example a little (Algorithm 4). The if condition now becomes an if-then-else condition, which results in different calculations in the respective cases. Since WaveScalar does not rely on branch prediction, another mechanism must be used to implement the code. Therefore, a new data flow graph is designed in which the condition is implemented using the Φ instruction. This graph (Figure 9) can now be understood as a single wave.
With loops, for example, there would be several waves, but the complexity and scope of a concrete code example would again be too extensive. We therefore look at this one wave and can, for instance, make the assumption that another wave follows which uses the values further. For this, the so-called WAVE-ADVANCE instruction would have to be used, which takes the data values as input, increases the wave number and generates this assignment as output.

This means that the version of the data can be clearly determined. Within the wave, several instructions can now be executed in parallel thanks to the way the Φ instruction works: the if condition on x, the increment of x by 3 and the assignment of the value 4 to z. In addition, the multiplication of x and y can already be calculated, so that all inputs are available for the conditional selector Φ.
// x, y in registers
summary() {
    if (x > 0) {
        x = x + 3;
        y = y * x;
    } else {
        z = 4;
    }
    z = z + 1;
}
Algorithm 4: Modified example written in C, having an if-then-else condition

Depending on the value that the comparison on x produces, one of the two incoming paths is discarded and the other is used for further processing. For our example, this also means that at least some of the computing power is wasted. This is acceptable, however, because the parallel execution saves time: spatial cost is accepted in exchange for improved temporal performance.

Figure 9: A data flow graph being implemented for WaveScalar
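A compact model of this behaviour in ordinary C (our own sketch, not WaveScalar code): both arms of Algorithm 4 are evaluated eagerly, and the comparison result only selects which values are kept.

#include <stdio.h>

/* Model of the conditional selector Phi: both candidate values already
   exist; the selector merely picks one of them. */
static int phi(int cond, int if_true, int if_false) {
    return cond ? if_true : if_false;
}

int main(void) {
    int x = 5, y = 2, z = 0;

    /* These can all be computed in parallel inside the wave: */
    int cond   = (x > 0);      /* the comparison on x   */
    int x_then = x + 3;        /* then-arm: x + 3       */
    int y_then = y * x_then;   /* then-arm: y * (x + 3) */
    int z_else = 4;            /* else-arm: z = 4       */

    /* Phi keeps one arm, the other is discarded. */
    x = phi(cond, x_then, x);
    y = phi(cond, y_then, y);
    z = phi(cond, z, z_else);
    z = z + 1;

    printf("x=%d y=%d z=%d\n", x, y, z);   /* x=8 y=16 z=1 for the chosen inputs */
    return 0;
}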



4 Conclusions, Results, Discussion


This report was written with the task of studying control flow and data flow in processors. This initially included the motivation behind the development of these technologies. On the one hand, there is no getting around the question of performance, be it the efficiency with which an architecture can execute code or the time saving, which will play an ever greater role in our everyday life with the increasing complexity and sheer mass of applications. On the other hand, when it comes to architecture, the question of the code always arises: depending on how a program is structured, there are architectures that are better or worse suited to it, if we again include the performance aspect.
The model of the four hybrid variants of architectures was therefore used as a basis for further
comparisons. This resulted in four representatives, one from each class. After an introduction to the
structure and functionality of the hardware, a deeper insight into the behavior was given with the help
of an example. The difficulty was to design the example so that each design could show its best side.
This was only partially successful, since in some cases only more complex solutions would have better
illustrated the principle. Unfortunately, this would have gone beyond the scope and depth of this report
and had to be deliberately reduced. However, the functionality of the processor architectures has already
become partially clearer with the selected code fragment.
A special feature that attracted attention during this work is the direct comparison of the four architectures with respect to their tags. While HPS, Task Superscalar and WaveScalar each use tags in their own way to match data values, TRIPS dispenses with the need to give instructions a tag for their input values. For data consumers this means that they hold no information about where the values come from; instead, it is enough for the producers to know the physical location at which the consuming instructions execute.
The merging of two principles that could not be more contradictory has shown what architectures are
currently capable of. And developments are still in their infancy. It will be interesting to see what the
future holds for hybrid architectures.

5 Bibliography

References
[1] Doug Burger, Stephen W Keckler, Kathryn S McKinley, Mike Dahlin, Lizy K John, Calvin Lin, Charles R
Moore, James Burrill, Robert G McDonald & William Yoder (2004): Scaling to the End of Silicon with
EDGE Architectures. Computer 37(7), pp. 44–55.
[2] Yoav Etsion, Felipe Cabarcas, Alejandro Rico, Alex Ramirez, Rosa M Badia, Eduard Ayguade, Jesus Labarta
& Mateo Valero (2010): Task superscalar: An out-of-order task pipeline. In: Microarchitecture (MICRO),
2010 43rd Annual IEEE/ACM International Symposium on, IEEE, pp. 89–100.
[3] Wen-mei Hwu & Yale N Patt (1986): HPSm, a high performance restricted data flow architecture having minimal functionality 14(2), pp. 297–306.
[4] Holger Kreißl (2015): Rechnerarchitektur. Available at http://www.kreissl.info/ra. Accessed:
10.09.2018.
[5] Yale N Patt, Wen-mei Hwu & Michael Shebanow (1985): HPS, a new microarchitecture: rationale and
introduction 16(4), pp. 103–108.
[6] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug
Burger, Stephen W Keckler & Charles R Moore (2003): Exploiting ILP, TLP, and DLP with the polymorphous
TRIPS architecture , pp. 422–433.
[7] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ran-
ganathan, Doug Burger, Stephen W Keckler, Robert G McDonald & Charles R Moore (2004): Trips: A
polymorphous architecture for exploiting ilp, tlp, and dlp. ACM Transactions on Architecture and Code
Optimization (TACO) 1(1), pp. 62–93.
[8] Steven Swanson, Ken Michelson, Andrew Schwerin & Mark Oskin (2003): WaveScalar. In: Proceedings of
the 36th annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, p. 12.
[9] Steven Swanson, Andrew Schwerin, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Ken Michelson,
Mark Oskin & Susan J Eggers (2007): The wavescalar architecture. ACM Transactions on Computer Sys-
tems (TOCS) 25(2), p. 54.
[10] Fahimeh Yazdanpanah, Carlos Alvarez-Martinez, Daniel Jimenez-Gonzalez & Yoav Etsion (2014): Hybrid
dataflow/von-Neumann architectures. IEEE Transactions on Parallel and Distributed Systems 25(6), pp.
1489–1509.
