Control Flow and Data Flow
Patrik Müller
TU Kaiserslautern, Embedded Systems Group
p [email protected]
Abstract
This paper reviews the interblock and intrablock scheduling of hybrid dataflow/von-Neumann architectures
as described in the work of Yazdanpanah et al. [10]. For this purpose, control flow and data flow in
processors are explained first. Subsequently, a simple code example is played through on selected
architectures with hybrid attributes. The cited survey served as the basis for understanding and selecting
the processor architectures considered here.
1 Introduction
In the field of computer science, computing power has increased considerably in recent decades.
Although it is not a law of nature, Moore's Law, formulated in 1965, has remained an astonishingly
accurate prediction of the doubling of transistor counts roughly every one to two years. Strictly speaking,
however, it is only a rule of thumb with limited predictive power about computational power itself, because
merely increasing the number of transistors says nothing about how efficiently this potential is used. The
question therefore arises how and for what the transistors are used. In the past they were used as additional
storage elements (such as caches) to shorten memory access times. In the processing of code, techniques
such as the pipeline, a separation of execution into several stages, were also developed. However, the
underlying von-Neumann architecture brings several problems with it. How are data dependencies between
instructions handled? Commonly known here are the read-after-write (RAW), write-after-read (WAR) and
write-after-write (WAW) conflicts, illustrated in the sketch after this paragraph. Again, solutions have been
found that improve processor performance (e.g., branch prediction and forwarding between the stages of a
pipeline). On the developer side, too, hand-optimizing code made it possible to use the available computing
time more efficiently. One must take into account, however, that beyond a certain program complexity this
work is no longer reasonable for a person. In this context, a rethinking of processor architecture became
necessary, and several concepts have been following a different path for some years. From the programmer's
point of view it would be easiest to hand the processor control over the most timely execution. For this
reason, in addition to the basic form of the von Neumann architecture, which follows the sequence of the
code very strictly and thus depends on the control flow, other architectures based on the data flow of the
code can be found. There are also a number of hybrid architectures that use both concepts depending on
the application.
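As a brief illustration (a hypothetical snippet, not taken from the reviewed paper), the three classical
hazards can already be seen in a few lines of straight-line code:

int a, b, c, d, e, f, g, h;

void hazards(void) {
    a = b + c;   /* (1) writes a                                      */
    d = a + e;   /* (2) reads a   -> RAW: must wait for (1)           */
    e = f + g;   /* (3) writes e  -> WAR: must not overtake (2)       */
    d = h + 1;   /* (4) writes d  -> WAW: must not finish before (2)  */
}

Only the RAW dependency is a true data dependency; WAR and WAW are false dependencies that
register renaming can remove.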
The focus of this work is the consideration of different hybrid variants based on a model by Yazdan-
panah et al. [10]. The detailed description of the model, as well as a short explanation of the terms
control flow and data flow already used above, can be found in section 2. Section 3 describes how the HPS,
Task Superscalar, TRIPS, and WaveScalar architectures work. A contribution of this work is the playing
through of an exemplary code segment on the four different platforms. Finally, a brief review is given
in the conclusions and discussion of this work.
Secondly, the next group of architectures works at block level in the control flow, but within the
blocks the instructions are processed according to the data flow. One can imagine that a block is called
from the program and its data flow graph is then determined. Based on this graph, the independent
instructions can be executed first, followed by instructions whose operands become available later. This
also makes it clear that when several blocks are executed in parallel, dependencies between the blocks are
not taken into account. It is therefore assumed that the chosen block size within the code does not create
or contain any external dependencies. A typical application, sketched below, would be loops whose
iterations have no dependencies on one another.
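A minimal sketch of such a situation (a hypothetical example, not taken from the reviewed paper): every
iteration of the loop body reads and writes only its own array element, so blocks formed from the loop
body could be scheduled in parallel without external dependencies.

#include <stddef.h>

/* Each iteration depends only on a[i] and b[i]; there are no
   dependencies between iterations, so blocks formed from the loop
   body could be executed in parallel.                              */
void scale_and_add(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = 2.0f * a[i] + b[i];
    }
}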
Third is the class of architectures with data flow interblock and control flow intrablock scheduling.
This means that the code is checked for data dependencies from the beginning and blocks are created
from them that follow the data flow of the program. Within the blocks, only a short sequence of in-
structions is used in order to save hardware resources. Possible tags therefore only have to be created
locally, which greatly simplifies the process flow in the blocks.
The fourth and last class are the architectures with enhanced data flow. The data flow graph here
consists of two layers: the upper one forms the schedule between the blocks and the lower one the schedule
within the blocks. These processors are highly complex because, in contrast to the previous variants, they
do not require any control flow and therefore no program counter at all.
The report by Yazdanpanah et al. [10] classifies a series of architectures according to the model just
presented, i.e., according to the schedule of program execution. For this report, representatives from each
of the four classes were therefore selected in order to gain the best possible insight into the solution
strategies for various obstacles and to describe the behavior of the processors when executing a code
example. The choice was made particularly with regard to the topicality and prominence of the processor
architectures. In the following section 3, the architectures High Performance Substrate (HPS), Task
Superscalar, the Tera-op Reliable Intelligently adaptive Processing System (TRIPS) and WaveScalar are
briefly described in their structure, and an example code segment is played through on each.
000     ble  R1, R0, L1
001     addi R1, R1, #3
002     mul  R2, R2, R1
003     addi R3, R0, #4
004 L1: add  R3, R3, R2
Algorithm 2: The code fragment in RISC assembly language
The data flow graph therefore looks like this:
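The original figure is not reproduced here; the graph can, however, be read off from the registers in
Algorithm 2: the mul consumes the R1 produced by the first addi, the final add consumes the R2
produced by the mul and the R3 produced by the second addi, and the ble decides via the branch to L1
whether the three instructions before L1 execute at all.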
3.2 HPS
The High Performance Substrate (HPS) architecture can best be described as an out-of-order design. It
uses a modified version of the Tomasulo algorithm to tolerate occurring latencies. The algorithm can be
divided into four steps.
1. Register renaming eliminates false dependencies and enables the connection between producers
and consumers 1 .
2. Buffering allows the pipeline to perform independent operations.
3. Broadcasting the tag allows communication between instructions.
4. Wakeup and select enable out-of-order dispatch.
1 In this context, the consumer is an instruction whose operands are not yet available. The corresponding
counterpart is the producer: an instruction whose output serves as the input of another instruction. With this
definition, an instruction can be both producer and consumer.
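As an illustration of step 1 (a minimal hypothetical sketch, not the actual HPS mechanism), renaming
assigns every new result a fresh physical tag, so WAR and WAW conflicts between instructions disappear
and only the true RAW producer/consumer links remain:

#include <stdio.h>

#define NUM_ARCH_REGS 8

/* Current physical tag (alias) for each architectural register. */
int alias_table[NUM_ARCH_REGS];
int next_tag = NUM_ARCH_REGS;      /* first free physical tag */

/* Rename one instruction "dst = src1 op src2": sources read the
   current aliases (linking them to their producers), the destination
   gets a fresh tag so later writers cannot conflict with it.        */
void rename_insn(int dst, int src1, int src2) {
    int s1 = alias_table[src1];
    int s2 = alias_table[src2];
    alias_table[dst] = next_tag++;
    printf("p%d <- p%d op p%d\n", alias_table[dst], s1, s2);
}

int main(void) {
    for (int r = 0; r < NUM_ARCH_REGS; r++) alias_table[r] = r;
    rename_insn(1, 1, 0);   /* addi R1, R1, #3                          */
    rename_insn(2, 2, 1);   /* mul  R2, R2, R1  (RAW on R1 is kept)     */
    rename_insn(3, 0, 0);   /* addi R3, R0, #4                          */
    rename_insn(3, 3, 2);   /* add  R3, R3, R2  (WAW on R3 disappears)  */
    return 0;
}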
The algorithm already reveals that this variant is an enhanced control flow design. Out-of-order execution
can also be described as restricted data flow, because instead of the entire program, only a small region,
the so-called active window, is checked for data dependencies. Enlarging the active window would cause
a number of problems: on the one hand, larger tables containing the operations, results and renamed
aliases would be needed; on the other hand, more time would be spent finding the right tags in those
tables.
Let us now examine the implementation of the algorithm in the case of HPS. From a dynamic in-
struction stream (it is assumed that the branch predictor is not the core of the design and is therefore
neglected), the instructions are loaded into the active window and decoded or merged there. The merger
receives a data flow graph for each instruction and can recognize data dependencies. From there, several
steps are executed: firstly, the instructions are tagged and transferred to a register, also known as a table.
Next to the tag, this table holds a ready bit indicating whether the instruction is ready to be executed.
Secondly, the instructions are entered in the node table. A node table entry consists of the operation type,
the result tag for the assignment to the value buffer and, for each of the two operands, a tag and a ready
bit indicating whether that operand is available. The instruction is not fired until both operand tags are
marked ready. This ensures that the functional units (F.U.) are used optimally. Once the instructions have
been executed, the results are distributed so that the value buffer is updated and potential consumers are
informed immediately. Once an instruction is complete, it is retired from the node table to make room
for further instructions. This concept covers parallelization at the instruction level. Depending on the
possibilities (e.g. whether enough independent instructions are available), several functional units can be
used simultaneously.
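A minimal sketch of such a node table entry and its firing condition (hypothetical field names; the real
HPS node tables are more elaborate):

#include <stdbool.h>
#include <stdint.h>

/* One entry of the node table: an operation waits here until both
   of its operand tags have been marked ready.                       */
typedef struct {
    uint8_t  op;            /* operation type, e.g. ADD or MUL          */
    uint16_t result_tag;    /* slot in the value buffer for the result  */
    uint16_t src_tag[2];    /* tags of the two source operands          */
    bool     src_ready[2];  /* set once the tagged value is distributed */
} NodeEntry;

/* Fire the node as soon as both operands are available. */
bool can_fire(const NodeEntry *n) {
    return n->src_ready[0] && n->src_ready[1];
}

/* Result distribution: every waiting node that listens for this tag
   marks the corresponding operand as ready.                         */
void distribute(NodeEntry *nodes, int count, uint16_t tag) {
    for (int i = 0; i < count; i++)
        for (int s = 0; s < 2; s++)
            if (nodes[i].src_tag[s] == tag)
                nodes[i].src_ready[s] = true;
}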
With the help of this architecture, the conflicts described above can be avoided without producing false
results, which is made possible by the renaming. At the same time, following the data dependencies saves
valuable time, because nodes that are already executable are fired directly. Another variant of HPS is
HPSm (minimal functionality) [3], where the design became more concrete (e.g. by means of a pipeline),
but also more complex. For understanding the principle it is therefore sufficient to deal with the design of
HPS. Figure 3 shows the abstract structure of HPS.
Now we jump two steps further, to the point where the third instruction, addi R3, R0, #4, is decoded.
When it is entered in the node table, it is noticeable that both of its operands are already available. This
instruction is therefore executed in the next cycle. Since the first addi instruction has still not been
executed, the second instruction must wait. By working with the node table, it is therefore possible to
execute and distribute an instruction that was issued later. If the process is continued, the next instruction
is loaded from the instruction stream into the active window after the first instruction has retired. Figure 4
shows the individual steps for the code.
The pipeline stages in this example are fetch (F), decode/merge (D), execute (E), result distribution (R)
and writeback/retirement (W). The execution of MUL is assumed to take 3 cycles and ADD 2 cycles.
Through the out-of-order mechanism, a saving of about 9% of the cycles could be achieved.
3.3 TRIPS
TRIPS (Tera-op, Reliable, Intelligently adaptive Processing System) belongs to the control flow/data
flow architectures and tries with its design to counter the problem that pipeline scaling in previous
architectures no longer saves time and is therefore not an efficient solution in the long run. The developers
specify four essential characteristics for future architectures [1]:
1. Since the depth of the pipeline is limited, other fine-grained mechanisms for concurrency must be
available.
2. With increasing clock frequency, the limitations of the power supply are quickly reached.
Processors must therefore also be able to work power-efficiently.
3. Future ISAs should accommodate execution that is dominated by on-chip communication.
4. ISAs should support polymorphism, i.e. the ability to use execution and memory units in different
ways and variations.
The TRIPS architecture uses the EDGE ISA, which has a special feature: direct instruction communica-
tion. This means that a producer's output is delivered directly to the consumer's input. This inevitably
results in execution in data flow order, since an instruction only fires when all of its operands are ready.
TRIPS was developed on the basis of the four characteristics mentioned above and fulfils them with the
following considerations. Parallelism is made possible by an array of ALUs running in parallel, which is
scalable in size. High and power-efficient performance is supported by working with blocks of 100 or
more instructions. Delays are minimized by executing interdependent instructions physically close to-
gether. To continue supporting languages like C or C++, the architecture uses block-atomic execution:
the compiler combines a large number of instructions into blocks, which are then fetched, executed
and committed as one unit. A block can only be committed if it has executed completely, otherwise it is
rolled back. No partial results of a block become visible, which is why the blocks are treated as atomic.
In TRIPS, instructions are specified only by the locations of their consumers. One can imagine, for
instance, an ADD instruction that, instead of naming a destination and two source operands, only encodes
the target location for its result; an instruction can also address several consumers this way (a small
sketch follows after Figure 5). Exceptions are loads and stores, since they address the cache or memory.
Figure 5
shows the processor core (left). It consists of a 4x4 array of execution nodes, which are connected via a
network. The execution nodes are ALUs, each with a buffer (Figure 5, right) to hold further instructions.
In addition, there are four register banks, as well as four instruction and data cache banks. The global
control tile is responsible for fetching the instruction blocks.
Figure 5: The TRIPS processor core (left) with one execution node (right) [1].
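As a minimal sketch of the target form described above (hypothetical field names, not the actual TRIPS
encoding), an instruction carries only its operation and the location of its consumer; it fires as soon as its
own operand slots have been filled by its producers:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical target-form instruction: no source registers,
   only the location (node number and operand slot) it feeds.      */
typedef struct {
    uint8_t opcode;        /* operation, e.g. ADD                   */
    uint8_t target_node;   /* execution node of the consumer        */
    uint8_t target_slot;   /* which operand slot of the consumer    */
} TargetFormInsn;

/* Firing rule: an instruction executes once both of its own
   operand slots have received values from its producers.          */
typedef struct {
    int32_t operand[2];
    bool    valid[2];
} OperandSlots;

bool ready(const OperandSlots *s) {
    return s->valid[0] && s->valid[1];
}

In this form the conventional add R3, R3, R2 would roughly be written as "add, send result to the node
holding the consumer" rather than naming its own inputs.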
As mentioned in the description, every instruction receives a distinct location. In the example there
are seven instructions in total. Since each bit can only take the states 0 and 1, three bits are needed,
yielding 2³ = 8 possible locations. An example allocation could look like the one shown in figure 6 on
the left side. Read and write instructions do not receive a location, since they are not mapped onto ALUs.
Then the instruction block shown in figure 6 on the right side can be generated. As can be seen, the
instructions carry locations only as outputs and do not need any information about which inputs are
required. An instruction simply fires whenever its inputs are shipped to it.
Figure 6: TRIPS instruction placement (top) and the final instruction block (bottom).
3.4 Task Superscalar
output operands [2]. One can imagine that the inputs of those instructions within a task that depend
on outputs of other tasks are abstracted to inputs at the task level; the same happens with the outputs of
all instructions. This summary becomes visible in the operands, because tasks operate on memory objects
and scalar values. Within the tasks, data flow graphs can then be used to determine the data flow. At the
thread level, a single instruction stream is assumed. Consequently, the tasks are still decoded in order, but
fired when ready, which corresponds to the data flow part of the design. With these considerations, the
frontend of Task Superscalar can now be addressed.
The frontend is organized in the form of a pipeline and consists of four modules. First there is the
pipeline gateway, which controls the flow of tasks into the pipeline. Task Reservation Stations (TRSs)
store the task information and track the readiness of the operands of a task. They are located together
on a bus, which enables the exchange of operands between the TRSs; they can be compared with the
reservation stations of an out-of-order processor. The Object Renaming Tables (ORTs) correspond to the
renaming tables for registers and are used to map operands to their most recent version and its producer.
The last module are the Object Versioning Tables (OVTs), which track the operand versions generated by
decoding a new data producer, similar to the result buffer of an out-of-order architecture. Each OVT is
connected to exactly one ORT. Of course, the performance of such a design depends on the number of
ORTs, OVTs and TRSs and their sizes.
With this concept, the pipeline forms a variant of an out-of-order pipeline that operates at the task level.
Tasks are decoded in order and stored in the TRS until all their operands are available.
Only then are the tasks placed in the ready queue and processed in the backend. As soon as a task
is completed, the consumers of its data are informed and the OVTs are also notified, which adjust the
operand versions accordingly.
Figure 7 shows the frontend of the Task Superscalar design. The presentation as tiles makes it easier
to understand which tiles are able to communicate on one layer and which are split on purpose.
To understand the instruction flow better, the modules already mentioned are explained further. For
each incoming task, the gateway sends an allocation request to the first free TRS. The gateway only
knows which TRSs still have free space, since the actual space organization is left to each TRS. If the
gateway receives a response from the TRS, it can assign the operands to the ORTs. The Task Reservation
Stations store all metadata of the incoming tasks together with the IDs of the data consumers. One
problem is that the number of consumers of an operand can vary greatly; consequently, it is not possible
to estimate how much space should be reserved in the TRS for this information. This is solved by passing
the operands along a linked list instead of storing all data consumers explicitly: the operand is passed to
the next task as input and, after this task has finished, it is passed on to the following task (a small sketch
of this chaining follows below). The object renaming tables map the operands to the tasks that accessed
the same memory object. If an ORT is already full and more allocation requests arrive from the gateway,
the ORT stalls the gateway in order to keep the existing allocations. Finally, the object versioning tables
(OVTs) manage data anti- and output-dependencies between the tasks by renaming the operands and by
chaining the different operand versions.
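A rough sketch of this operand chaining (hypothetical structures, not the actual Task Superscalar data
layout): each operand version carries a link to the next consumer, so a TRS entry never has to reserve
space for an unknown number of consumers.

#include <stdbool.h>
#include <stddef.h>

/* One operand version: instead of listing all consumers, it stores
   only the first one; each consumer links to the next.              */
typedef struct Consumer {
    int              task_id;     /* task waiting for this operand    */
    struct Consumer *next;        /* next consumer in the chain       */
} Consumer;

typedef struct {
    int       version;            /* version number kept by the OVT   */
    bool      ready;              /* producer task has finished       */
    Consumer *first_consumer;     /* head of the consumer chain       */
} Operand;

/* When the producing task finishes, walk the chain and notify each
   consumer in turn (here simply via a callback; a real TRS would
   update its readiness state and possibly enqueue the task).        */
void notify_consumers(Operand *op, void (*wake)(int task_id)) {
    op->ready = true;
    for (Consumer *c = op->first_consumer; c != NULL; c = c->next)
        wake(c->task_id);
}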
3.5 WaveScalar
WaveScalar describes an execution model based on the motivation to develop a decentralized superscalar
processor core. To do without a program counter, which belongs to control flow, the processor fetches
instructions in a purely data-driven order. As before, the rule holds that an instruction only fires when all
of its operands are available. Normally, values in a data flow machine carry a tag with which they can be
distinguished from other values. The WaveScalar ISA represents a new model whose smallest unit is the
data flow graph of an execution. It is stored in memory as a collection of intelligent instruction words;
intelligent means here that each instruction is dedicated to a functional unit. In practice, the WaveCache
(Figure 8) is used, an intelligent cache that holds a set of instructions and executes them accordingly.
In WaveScalar, the instructions are optimized for the data flow, which means that they explicitly send
their data to the instructions that need it as input. However, only some of the instructions whose
dependencies were already determined at compile time actually need these values at runtime. For
example, in an if-then-else construct only one of the two cases becomes true, so only one of the two
cases actually has to be executed.
Figure 8: A simple WaveCache. The processing elements (left) are clustered with data caches and store
buffers (right) [8].
In WaveScalar such a case is handled using the conditional
selector Φ instruction or the conditional split Φ⁻¹ instruction. With the conditional selector Φ, both cases
are calculated first and given as inputs to the instruction; a selector input then determines which of the
two values is used and which is discarded. The conditional split Φ⁻¹ works in principle like a branch
instruction and is used for the implementation of loops. The compiler in WaveScalar breaks the control
flow graph of a program into several parts called waves. The processor then executes one wave after the
other. A wave has three important characteristics:
1. When a wave is executed, each of its instructions is executed at most once.
2. There are no loops within a wave.
3. Control can enter the wave only at a single point.
Due to the growing complexity, a new kind of token is used: wave numbers, which are responsible for
tagging the waves. In loops, for example, values can still be assigned correctly, since each iteration, in
the form of a wave, carries a corresponding tag. Since modern systems rely on object linking and
shared memory, WaveScalar provides an INDIRECT-SEND instruction. This instruction receives three
inputs: the data value, the address and an offset; the data value is then sent to the address plus the offset.
Finally, memory ordering is described in more detail. WaveScalar brings load and store ordering to data
flow computation using wave-ordered memory. Wave-ordered memory records each memory operation
together with its location in the wave and its relationships to the other memory operations in the same
wave (see the sketch below). This allows the memory system to access memory in the correct order.
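A rough sketch of the idea (simplified field names; the published WaveScalar scheme also handles
unknown neighbors across branches): each load or store carries its own sequence number within the wave
plus those of its predecessor and successor, which lets the memory system establish the intended order.

#include <stdbool.h>

/* Simplified wave-ordered memory annotation for one memory operation. */
typedef struct {
    int  wave_number;   /* which dynamic wave the operation belongs to */
    int  prev_seq;      /* sequence number of the preceding memory op  */
    int  seq;           /* this operation's position within the wave   */
    int  next_seq;      /* sequence number of the following memory op  */
    bool is_store;
} MemOp;

/* The memory system may issue operation b directly after a when the
   annotations say they are adjacent in the same wave.                 */
bool may_follow(const MemOp *a, const MemOp *b) {
    return a->wave_number == b->wave_number &&
           a->next_seq == b->seq &&
           b->prev_seq == a->seq;
}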
number and generates this assignment as output. This means that the version of the data can be clearly
determined. Within the wave, several instructions can now be executed in parallel thanks to the functioning
of the Φ instruction: the if-condition on x, the addition of 3 to x and the assignment of the value 4 to z.
In addition, the multiplication of x and y can already be calculated, so that all inputs are available for the
conditional selector Φ.
// x, y in registers
summary() {
    if (x > 0) {
        x = x + 3;
        y = y * x;
    } else {
        z = 4;
    }
    z = z + 1;
}
Algorithm 4: Modified example written in C, with an if-then-else condition
Depending on the result of the comparison on x, one of the two incoming paths is discarded and the
other is used for further processing (a small sketch follows below). For our example this also means that
at least some of the computing power is wasted. This is acceptable, however, because the parallel
execution saves time: spatial effort is traded for improved temporal performance.
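A minimal sketch of this selection in plain C (illustrative only; WaveScalar expresses it as a Φ instruction
in the data flow graph, not as sequential code): both candidate values are computed, and the comparison
result selects which one is kept.

// Both branches of Algorithm 4 are evaluated speculatively; the
// comparison acts as the selector input of the Φ instruction and
// one of the two results is simply discarded.
void summary_phi(int *x, int *y, int *z) {
    int p      = (*x > 0);        /* selector                         */
    int x_then = *x + 3;          /* then-path: x = x + 3             */
    int y_then = *y * x_then;     /* then-path: y = y * x             */
    int z_else = 4;               /* else-path: z = 4                 */

    *x = p ? x_then : *x;         /* Φ: keep then-value or old value  */
    *y = p ? y_then : *y;
    *z = p ? *z : z_else;
    *z = *z + 1;                  /* executed on both paths           */
}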
5 Bibliography
[1] Doug Burger, Stephen W Keckler, Kathryn S McKinley, Mike Dahlin, Lizy K John, Calvin Lin, Charles R
Moore, James Burrill, Robert G McDonald & William Yoder (2004): Scaling to the End of Silicon with
EDGE Architectures. Computer 37(7), pp. 44–55.
[2] Yoav Etsion, Felipe Cabarcas, Alejandro Rico, Alex Ramirez, Rosa M Badia, Eduard Ayguade, Jesus Labarta
& Mateo Valero (2010): Task superscalar: An out-of-order task pipeline. In: Microarchitecture (MICRO),
2010 43rd Annual IEEE/ACM International Symposium on, IEEE, pp. 89–100.
[3] Wen-mei Hwu & Yale N Patt (1986): HPSm, a high performance restricted data flow architecture having
minimal functionality. ACM SIGARCH Computer Architecture News 14(2), pp. 297–306.
[4] Holger Kreißl (2015): Rechnerarchitektur. Available at http://www.kreissl.info/ra. Accessed:
10.09.2018.
[5] Yale N Patt, Wen-mei Hwu & Michael Shebanow (1985): HPS, a new microarchitecture: rationale and
introduction. ACM SIGMICRO Newsletter 16(4), pp. 103–108.
[6] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug
Burger, Stephen W Keckler & Charles R Moore (2003): Exploiting ILP, TLP, and DLP with the polymorphous
TRIPS architecture. In: Proceedings of the 30th Annual International Symposium on Computer Architecture
(ISCA), pp. 422–433.
[7] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ran-
ganathan, Doug Burger, Stephen W Keckler, Robert G McDonald & Charles R Moore (2004): Trips: A
polymorphous architecture for exploiting ilp, tlp, and dlp. ACM Transactions on Architecture and Code
Optimization (TACO) 1(1), pp. 62–93.
[8] Steven Swanson, Ken Michelson, Andrew Schwerin & Mark Oskin (2003): WaveScalar. In: Proceedings of
the 36th annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, p. 12.
[9] Steven Swanson, Andrew Schwerin, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Ken Michelson,
Mark Oskin & Susan J Eggers (2007): The wavescalar architecture. ACM Transactions on Computer Sys-
tems (TOCS) 25(2), p. 54.
[10] Fahimeh Yazdanpanah, Carlos Alvarez-Martinez, Daniel Jimenez-Gonzalez & Yoav Etsion (2014): Hybrid
dataflow/von-Neumann architectures. IEEE Transactions on Parallel and Distributed Systems 25(6), pp.
1489–1509.